Materials for 4-day UT CCBB machine learning course

Tahmin/maclearn

Principles of Machine Learning for Bioinformatics

This four-day course introduces a selection of machine learning methods used in bioinformatic analyses, with a focus on RNA-seq gene expression data. Topics covered include unsupervised learning, dimensionality reduction, and clustering; feature selection and extraction; and supervised learning methods for classification (e.g., random forests, SVM, LDA, kNN) and regression (with an emphasis on regularization methods appropriate for high-dimensional problems). Participants have the opportunity to apply these methods, as implemented in R and Python, to publicly available data.

Lecture notes are provided in four slide decks.

The directories microarray, pcr, and rnaseq contain example data sets. Most of the remaining files in the repository are R or Python scripts (most scripts are available in essentially equivalent form in both languages).

Suggested prerequisites

Recommended for students with some prior knowledge of either R or Python. Participants are expected to provide their own laptops with a recent version of R and/or Python installed. Students will be instructed to download several free software packages (including R packages and/or Python libraries such as pandas and sklearn).

R packages

from CRAN

The command below can be run within an R session to install most of the required packages from CRAN. Some of these may take a while to install, so installation prior to class is recommended if you intend to run the R scripts.

install.packages(c('ada', 'caret', 'devtools', 'e1071', 'ggplot2',
                   'ggrepel', 'GGally', 'glmnet', 'MASS', 'matrixStats',
                   'pheatmap', 'randomForest', 'rpart', 'Rtsne', 'tidyr'))

from Bioconductor

The package genefilter can be installed from Bioconductor using the following code, again run within an R session.

install.packages('BiocManager')
BiocManager::install('genefilter')

from GitHub

The package sparsediscrim can be installed from GitHub using the following code, again run within an R session.

devtools::install_github('ramhiser/sparsediscrim')

Python modules

The following Python modules are used in the included scripts; again, I recommend installing them prior to class if you intend to run the Python scripts:

  • numpy
  • scipy
  • pandas
  • scikit-learn
  • matplotlib
  • plotnine
  • seaborn
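
A quick way to confirm that the modules listed above are installed is to ask Python whether each one can be found. The sketch below is not part of the course scripts; the function name `missing_modules` and the `REQUIRED` mapping are hypothetical names chosen for illustration. It relies only on the standard library and does not actually import the packages.

```python
import importlib.util

# Map each package name from the list above to the name used in `import`.
# scikit-learn is the one case where the two differ (it is imported as sklearn).
REQUIRED = {
    "numpy": "numpy",
    "scipy": "scipy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "plotnine": "plotnine",
    "seaborn": "seaborn",
}


def missing_modules(required=REQUIRED):
    """Return the package names whose import name cannot be located."""
    return [pkg for pkg, mod in required.items()
            if importlib.util.find_spec(mod) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

Any package reported as missing can then be installed with whichever package manager you normally use.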

Scripts to study by day

Day 1: loading data, normalization, clustering

| R | Python | Notes |
|---|---|---|
| LoadData.R | LoadData.py | |
| NormalizeData.R | NormalizedData.py | RLE- and mean-center-normalization |
| Clustering.R | Clustering.py | k-means and hierarchical clustering |
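
To give a rough idea of what the normalization step involves, here is a minimal dependency-free sketch. It assumes that "RLE" refers to relative log expression (median-of-ratios) size factors and that mean-centering is applied per gene after a log transform; the course scripts may differ in detail. The function names and the toy count matrix are made up for illustration.

```python
import math
from statistics import mean, median


def rle_size_factors(counts):
    """Median-of-ratios size factors; counts[g][s] is the count for gene g, sample s."""
    n_samples = len(counts[0])
    # Use only genes with all-positive counts so the log geometric mean is finite.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [mean(math.log(c) for c in row) for row in usable]
    factors = []
    for s in range(n_samples):
        # Log ratio of each gene's count to that gene's geometric mean across samples.
        log_ratios = [math.log(row[s]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize_and_center(counts):
    """Scale by size factors, take log2(x + 1), then center each gene at mean 0."""
    factors = rle_size_factors(counts)
    logged = [[math.log2(c / f + 1) for c, f in zip(row, factors)]
              for row in counts]
    return [[x - mean(row) for x in row] for row in logged]


# Toy data (hypothetical): the second sample is sequenced exactly twice as deeply.
toy_counts = [[10, 20], [100, 200], [5, 10]]
size_factors = rle_size_factors(toy_counts)
centered = normalize_and_center(toy_counts)
print(size_factors)
print(centered)
```

In this toy case the two samples differ only in sequencing depth, so the size factors differ by a factor of two and the centered values are all (numerically) zero.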

Day 2: pca, knn classification, overfitting, cross-validation, feature selection

| R | Python | Notes |
|---|---|---|
| PCA_intro.R | | |
| PCA.R | PCA.py | |
| KnnSim.R | KnnSim.py | compare resub vs. test performance on simulated data |
| KnnSimCV.R | KnnSimCV.py | show cross-validation (cv) removes resub bias |
| BadFeatSel.R | BadFeatSel.py | supervised feature selection must be done under cv |
| KnnGrid.R | KnnGrid.py | compare cv acc for varying k parameter on real data |
| KnnReal.R | KnnReal.py | t-test feature selection/extraction + knn on real data |
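
The gap between resubstitution and cross-validated accuracy can be illustrated with a small sketch. This is not the code in KnnSim or KnnSimCV; it is a dependency-free illustration under the assumption that the labels are unrelated to the simulated features, so a fair accuracy estimate should sit near 0.5. All function names are hypothetical.

```python
import random


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points nearest to `query`."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), i)
        for i, x in enumerate(train_x)
    )
    votes = [train_y[i] for _, i in dists[:k]]
    return max(set(votes), key=votes.count)


def accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def kfold_indices(n, n_folds, rng):
    """Shuffle 0..n-1 and deal the indices into n_folds disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def cv_accuracy(xs, ys, k, n_folds, rng):
    """Predict each point using only the points outside its own fold."""
    hits = 0
    for fold in kfold_indices(len(xs), n_folds, rng):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        hits += sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                    for i in fold)
    return hits / len(xs)


rng = random.Random(0)
n, p = 40, 5
xs = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
ys = [i % 2 for i in range(n)]  # labels carry no information about xs

resub = accuracy(xs, ys, xs, ys, k=1)          # test on the training data
cv = cv_accuracy(xs, ys, k=1, n_folds=5, rng=rng)
print(f"resubstitution accuracy: {resub:.2f}")
print(f"5-fold cv accuracy:      {cv:.2f}")
```

With k = 1 every point is its own nearest neighbor, so resubstitution accuracy is perfect even on pure noise, while the cross-validated estimate is not inflated in this way. The same reasoning applies to supervised feature selection: if it is done once on all the data rather than inside each fold, the held-out points have already influenced the model.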

Day 3: linear models, regularization, naive bayes

| R | Python | Notes |
|---|---|---|
| TTesting.R | TTesting.py | |
| PredictingGeneExpression.R | PredictionGeneExpression.py | |
| WhyRegularize.R | WhyRegularize.py | |
| LogisticReal.R | LogisticReal.py | |
| LdaIsLikeLogistic.R | | |

Day 4: svm, bootstrap, trees, random forests, boosting

| R | Python | Notes |
|---|---|---|
| SvmReal.R | SvmReal.py | |
| bootstrap_examples.R | | mostly taken from package bootstrap examples |
| KnnSimBoot.R | | |
| RandomForestReal.R | RandomForestReal.py | |
| AdaBoostReal.R | AdaBoostReal.py | |
| CompareModelStrats.R | CompareModelStrats.py | |
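
The bootstrap scripts for this day are provided in R only. For Python users, the basic idea can be sketched as follows: resample the data with replacement many times, recompute the statistic on each resample, and use the spread of those replicates as an estimate of its standard error. This is a generic illustration, not a port of bootstrap_examples.R; the function name `bootstrap_se` and the data values are made up.

```python
import random
from statistics import mean, stdev


def bootstrap_se(data, statistic, n_boot=1000, seed=0):
    """Estimate the standard error of `statistic` by resampling `data` with replacement."""
    rng = random.Random(seed)
    replicates = [
        statistic(rng.choices(data, k=len(data)))  # one bootstrap resample
        for _ in range(n_boot)
    ]
    return stdev(replicates), replicates


values = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9]  # made-up illustrative data
se, reps = bootstrap_se(values, mean)
print(f"bootstrap SE of the mean: {se:.3f}")
```

The same function works for statistics without a simple standard-error formula, for example by passing `statistics.median` as the `statistic` argument.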
