This four-day course introduces a selection of machine learning methods used in bioinformatic analyses, with a focus on RNA-seq gene expression data. Topics covered include: unsupervised learning, dimensionality reduction and clustering; feature selection and extraction; and supervised learning methods for classification (e.g., random forests, SVM, LDA, kNN) and regression (with an emphasis on regularization methods appropriate for high-dimensional problems). Participants have the opportunity to apply these methods as implemented in R and Python to publicly available data.
Lecture notes are provided in the four slide decks:
The directories microarray, pcr, and rnaseq contain example data sets. Most of the remaining files in the repository are R or Python scripts (most scripts are available in essentially equivalent form in both languages).
This course is recommended for students with some prior knowledge of either R or Python. Participants are expected to provide their own laptops with recent versions of R and/or Python installed. Students will be instructed to download several free software packages (including R packages and/or Python libraries such as pandas and sklearn).
The command below can be run within an R session to install most of the required packages from CRAN. Some of these may take a while to install, so installing them prior to class is recommended if you intend to run the R scripts.
install.packages(c('ada', 'caret', 'devtools', 'e1071', 'ggplot2',
'ggrepel', 'GGally', 'glmnet', 'MASS', 'matrixStats',
'pheatmap', 'randomForest', 'rpart', 'Rtsne', 'tidyr'))
The package genefilter can be installed from Bioconductor using the following code, again run within an R session.
install.packages('BiocManager')
BiocManager::install('genefilter')
The package sparsediscrim can be installed from GitHub using the following code, again run within an R session.
devtools::install_github('ramhiser/sparsediscrim')
The following Python modules are used in the included scripts; again, I recommend installing them prior to class if you intend to run the Python scripts:
- numpy
- scipy
- pandas
- scikit-learn
- matplotlib
- plotnine
- seaborn
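
One way to confirm that the modules listed above are available before class is a short check like the one below. This is only a sketch: the helper name `missing_modules` is hypothetical (it is not part of this repository), and the list uses import names, so scikit-learn appears as `sklearn`.

```python
from importlib.util import find_spec

# Import names for the modules listed above; note that scikit-learn
# is imported as "sklearn".
REQUIRED_MODULES = ["numpy", "scipy", "pandas", "sklearn",
                    "matplotlib", "plotnine", "seaborn"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment.

    Hypothetical helper for illustration; not part of the course scripts.
    """
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules were found.")
```

Any name reported as missing still needs to be installed with whichever package manager you normally use for Python.
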
R | Python | Notes |
---|---|---|
LoadData.R | LoadData.py | |
NormalizeData.R | NormalizedData.py | RLE- and mean-center-normalization |
Clustering.R | Clustering.py | k-means and hierarchical clustering |
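
As a rough illustration of the mean-centering and clustering steps named in the notes above, the sketch below runs k-means and hierarchical clustering on simulated data. It is not the content of Clustering.py: the data, the group structure, the samples-in-rows orientation, and the choice of average linkage are all assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Simulated log-expression matrix (assumption): 20 samples in rows,
# 50 genes in columns, two groups of 10 samples each.
n_per_group, n_genes = 10, 50
x = rng.normal(size=(2 * n_per_group, n_genes))
groups = np.repeat([0, 1], n_per_group)
x[groups == 1, :10] += 5.0  # second group is shifted in the first 10 genes

# Mean-center each gene (column) across samples.
x_centered = x - x.mean(axis=0)

# k-means with k = 2.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x_centered)

# Hierarchical (agglomerative) clustering with average linkage,
# cut so that at most 2 clusters remain.
tree = linkage(x_centered, method="average", metric="euclidean")
hier_labels = fcluster(tree, t=2, criterion="maxclust")

# Agreement between each clustering and the simulated groups.
ari_kmeans = adjusted_rand_score(groups, kmeans_labels)
ari_hier = adjusted_rand_score(groups, hier_labels)
print(f"adjusted Rand index, k-means:      {ari_kmeans:.2f}")
print(f"adjusted Rand index, hierarchical: {ari_hier:.2f}")
```

The adjusted Rand index is used here only because the simulated group labels are known; with real data there is usually no such reference.
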
R | Python | Notes |
---|---|---|
PCA_intro.R | ||
PCA.R | PCA.py | |
KnnSim.R | KnnSim.py | compare resub vs. test performance on simulated data |
KnnSimCV.R | KnnSimCV.py | show cross-validation (cv) removes resub bias |
BadFeatSel.R | BadFeatSel.py | supervised feature selection must be done under cv |
KnnGrid.R | KnnGrid.py | compare cv acc for varying k parameter on real data |
KnnReal.R | KnnReal.py | t-test feature selection/extraction + knn on real data |
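
The note that supervised feature selection must be done under cv can be illustrated with a small simulation. The sketch below is not the content of BadFeatSel.py; it assumes pure-noise data and uses scikit-learn's `SelectKBest` with `f_classif` (for two classes, the F statistic is the square of the equal-variance t statistic) followed by kNN. Selecting features on the full data before cross-validating leaks label information into the held-out folds, whereas wrapping the selection in a pipeline refits it on each training fold only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Pure-noise data (assumption): 60 samples, 5000 features, and labels
# that are unrelated to the features, so true accuracy is 50%.
x = rng.normal(size=(60, 5000))
y = np.repeat([0, 1], 30)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)

# Biased: features are selected once using all labels, then cross-validated.
x_selected = SelectKBest(f_classif, k=10).fit_transform(x, y)
biased_acc = cross_val_score(knn, x_selected, y, cv=cv).mean()

# Honest: selection is refit inside each training fold via a pipeline.
pipeline = make_pipeline(SelectKBest(f_classif, k=10), knn)
honest_acc = cross_val_score(pipeline, x, y, cv=cv).mean()

print(f"selection outside cv: {biased_acc:.2f}")
print(f"selection inside cv:  {honest_acc:.2f}")
```

Because the labels carry no signal, the estimate with selection inside cv should sit near chance, while the estimate with selection outside cv tends to be optimistically high.
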
R | Python | Notes |
---|---|---|
TTesting.R | TTesting.py | |
PredictingGeneExpression.R | PredictionGeneExpression.py | |
WhyRegularize.R | WhyRegularize.py | |
LogisticReal.R | LogisticReal.py | |
LdaIsLikeLogistic.R | | |
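
To make the motivation for regularization in high-dimensional problems concrete, the sketch below compares ordinary least squares with the lasso when there are more features than training samples. It is an illustration on simulated data, not the content of WhyRegularize.py; the sparse coefficient vector, the sample sizes, and the use of `LassoCV` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)

# Simulated data (assumption): 200 features, only the first 5 of which
# affect the response; 50 training samples and 200 test samples.
n_train, n_test, n_features = 50, 200, 200
x = rng.normal(size=(n_train + n_test, n_features))
beta = np.zeros(n_features)
beta[:5] = 2.0
y = x @ beta + rng.normal(size=n_train + n_test)

x_train, y_train = x[:n_train], y[:n_train]
x_test, y_test = x[n_train:], y[n_train:]

# Unregularized least squares: with more features than samples it can
# fit the training data exactly.
ols = LinearRegression().fit(x_train, y_train)

# Lasso with the penalty strength chosen by 5-fold cross-validation.
lasso = LassoCV(cv=5).fit(x_train, y_train)

ols_train_r2 = ols.score(x_train, y_train)
ols_test_r2 = ols.score(x_test, y_test)
lasso_test_r2 = lasso.score(x_test, y_test)

print(f"OLS   train R^2: {ols_train_r2:.3f}")
print(f"OLS   test  R^2: {ols_test_r2:.3f}")
print(f"lasso test  R^2: {lasso_test_r2:.3f}")
```

The gap between the training and test scores of the unregularized fit is the overfitting that a penalty is meant to control.
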
R | Python | Notes |
---|---|---|
SvmReal.R | SvmReal.py | |
bootstrap_examples.R | | mostly taken from package bootstrap examples |
KnnSimBoot.R | ||
RandomForestReal.R | RandomForestReal.py | |
AdaBoostReal.R | AdaBoostReal.py | |
CompareModelStrats.R | CompareModelStrats.py | |