Skip to content
Daniel N edited this page Apr 13, 2016 · 53 revisions

Introduction and Instructions

Welcome to the Affymetrix Tutorial. This tutorial will guide you through a basic Affymetrix experiment using freely available Affymetrix microarray data from Gene Expression Omnibus (GEO). Below is the recommended progression for advancing through this tutorial:

(1) Go through the Shared Features page to familiarize yourself with the methods shared across all microarray experiments
(2) Return to this page to go through the annotated code along with accompanying tutorial slides to become acquainted with a typical Affymetrix microarray experiment
(3) Download Affymetrix.R to work through the tutorial directly on your own computer/server
(4) Use Affymetrix.R as a template for your own Affymetrix microarray experiment, and refer to the annotated code and tutorial slides as needed

This tutorial is a great place to start learning about Affymetrix microarray experiments. However, it was designed as a 'starting point' and not as an encyclopedia of all of the problems you might encounter in your experiment. After becoming comfortable with the basics discussed here, you should have the knowledge and skills to overcome your own roadblocks in experimentation. Please refer to the Shared Features page for a discussion of how to approach unforeseen problems that can (and will!) come up in your experiments.


Tutorial Slides

Here is the link to the full presentation: Affymetrix Tutorial Slides

As you work through the annotated code, look at these slides to see examples of the figures produced during the different processing steps.


Annotated Code

Experiment Progression

Here are the steps for a basic Affymetrix microarray experiment:

1. Get Data
2. QC on Raw Data
3. Normalization
4. Batch Correction
5. Outlier Removal
6. QC on Normalized Data
7. Covariate Analysis
8. Annotating Probes
9. Collapse Rows
10. Differential Expression Analysis

1. Get Data

For this tutorial we will be using data from GEO, experiment GSE20295. This is a Parkinson's disease multi-region brain study done with human post-mortem brain tissue. In the tutorial, we will only be looking at the prefrontal cortex.

We will need the affy R library for your Affymetrix experiment, since it contains functions for reading the raw data into your R session (ReadAffy) and normalization of Affymetrix arrays (rma).

For your Affymetrix experiment, you have two choices for getting your data into your R session:

(a) You load your data directly from a source (such as GEO or ArrayExpress) into your R session
(b) You download your data to your local computer or server before loading it into your R session

Option (a) can fully utilize tools such as the GEOquery R library. Specific sources often have R libraries for downloading microarray data directly into your R session. In the code tutorial, we use GEOquery for option (a).

Option (b) is more suitable if you are using data from your own experiment, OR you need to select specific parts of an entire dataset to analyze. Sometimes with large datasets (> 500 Mb) it is better to 'pre-filter', or only extract/download data that you need - in this experiment, we only want prefrontal cortex.

You only need to use option (a) OR option (b) to get data into your R session.

(a) Load your data directly from a source into your R session (using GEOquery R library)

Use getGEO() to get phenotype data from the entire experiment, and create your datMeta.

gse <- getGEO("GSE20295", GSEMatrix =TRUE,getGPL=FALSE)
datMeta <- pData(gse[[1]])
rownames(datMeta) <- datMeta[,2]
datMeta$title <- gsub(" ","_",datMeta$title)
idx <- which(datMeta$source_name_ch1 == "Postmortem brain prefrontal cortex")

Create an index idx to later use in keeping only prefrontal cortex samples.
Next, use getGEOSuppFiles() to directly download .CEL.gz files onto your computer from the whole experiment.

getGEOSuppFiles("GSE20295")

Use software such as 7-zip to extract your new "GSE20295_RAW" folder. Also, take care to manually change GSM506039_1_1364_BA9_Pm.CEL.gz to GSM506039_1364_BA9_Pm.CEL.gz before moving on. This is an upload error in GEO - this kind of mistake definitely happens all the time.

filesPFC <- paste(datMeta$geo_accession,"_",datMeta$title,".CEL.gz",sep="")[idx]
data.affy <- ReadAffy(celfile.path = "./GSE20295/GSE20295_RAW", filenames = filesPFC)
datExpr <- exprs(data.affy)
datMeta <- datMeta[idx,]   

ReadAffy() created an affy object data.affy that only contains the prefrontal cortex samples that we are interested in. It is important for your affy object to only contain the samples you are investigating, since the normalization function rma() needs this affy object as input.

Finally, check datMeta/datExpr ordering and reformat as necessary.

GSM <- rownames(pData(data.affy))
GSM <- substr(GSM,1,9)
idx <- match(GSM, datMeta$geo_accession)
datMeta <- datMeta[idx,]  

datMeta <- datMeta[,-c(3:7,14:36)]
colnames(datMeta)[5:8] <- c("Dx","Sex","Age","Region")  
datMeta$Dx <- gsub("disease state: control","CTL",datMeta$Dx)
datMeta$Dx <- gsub("disease state: Parkinson's disease","PKD",datMeta$Dx)
datMeta$Dx <- as.factor(datMeta$Dx)
datMeta$Sex <- gsub("gender: male","M",datMeta$Sex)
datMeta$Sex <- gsub("gender: female","F",datMeta$Sex)
datMeta$Age <- gsub("age: ","",datMeta$Age)
datMeta$Region <- gsub("brain region: ","",datMeta$Region)

datMeta and datExpr Example
datMeta = (samples)x(covariates/phenotype information)
datExpr = (probes)x(samples)

(b) Download your data to your local computer or server before loading it into your R session

Use getGEO() to get phenotype data from the entire experiment, and create your datMeta.

gse <- getGEO("GSE20295",GSEMatrix =TRUE,getGPL=FALSE)
datMeta <- pData(gse[[1]])
rownames(datMeta) <- datMeta[,2]
datMeta$title <- gsub(" ","_",datMeta$title)
idx <- which(datMeta$source_name_ch1 == "Postmortem brain prefrontal cortex")

Create an index idx to later use in keeping only prefrontal cortex samples.

You can download the prefrontal cortex expression data directly from the GEO page GSE20295. For GSE20295_RAW.tar choose 'custom' download, and only select files with 'BA9' and '.CEL.gz' in the name. Download these files into a new directory 'GSE20295_PFC'. Use software such as 7-zip to extract your new "GSE20295_PFC/GSE20295_RAW" folder. Then, take care to manually change GSM506039_1_1364_BA9_Pm.CEL.gz to GSM506039_1364_BA9_Pm.CEL.gz before moving on. This is an upload error in GEO - this kind of mistake definitely happens all the time.

Next, simply use ReadAffy() to read in your pre-selected prefrontal cortex samples from your directory.

data.affy <- ReadAffy(celfile.path = "./GSE20295_PFC/GSE20295_RAW")
datExpr <- exprs(data.affy)
datMeta <- datMeta[idx,]

ReadAffy() created an affy object data.affy that only contains the prefrontal cortex samples that we are interested in. It is important for your affy object to only contain the samples you are investigating, since the normalization function rma() needs this affy object as input.

Finally, check datMeta/datExpr ordering and reformat as necessary.

GSM <- rownames(pData(data.affy))
GSM <- substr(GSM,1,9)
idx <- match(GSM, datMeta$geo_accession)
datMeta <- datMeta[idx,] 
colnames(datExpr)=rownames(datMeta)   

datMeta <- datMeta[,-c(3:7,14:36)]
colnames(datMeta)[5:8] <- c("Dx","Sex","Age","Region")  
datMeta$Dx <- gsub("disease state: control","CTL",datMeta$Dx)
datMeta$Dx <- gsub("disease state: Parkinson's disease","PKD",datMeta$Dx)
datMeta$Dx <- as.factor(datMeta$Dx)
datMeta$Sex <- gsub("gender: male","M",datMeta$Sex)
datMeta$Sex <- gsub("gender: female","F",datMeta$Sex)
datMeta$Age <- gsub("age: ","",datMeta$Age)
datMeta$Region <- gsub("brain region: ","",datMeta$Region)

datMeta and datExpr Example
datMeta = (samples)x(covariates/phenotype information)
datExpr = (probes)x(samples)

2. QC on Raw Data

Before normalizing our data, we want to take a look at the raw data to make sure that it looks OK. We do Quality Control (QC) tests to do this. First, we log2 normalize our data just for these tests (and check out the dimensions):

datExpr <- log2(datExpr)
dim(datExpr) 

There are three simple and effective tests to do:

(a) Boxplot
(b) Histogram
(c) MDS Plot

(a) Boxplot

A boxplot gives us an idea of how much gene expression each sample has as a whole. We want to make sure that the means of the samples are generally in a line.

boxplot(datExpr,range=0, col = as.numeric(datMeta$Dx), xaxt='n', xlab = "Array", main = "Boxplot", ylab = "Intensity")
legend("topright",legend = levels(datMeta$Dx),fill = as.numeric(as.factor(levels(datMeta$Dx))))  

Boxplot Example Pre-Normalization

(b) Histogram

A histogram shows us the distribution of amount of expression for each probe within a sample. The x-axis is the amount of expression, and the y-axis is what percentage of probes had this x-value of expression. Each line is a different sample. We want to make sure that the histograms of the samples are generally overlapping.

i=1 
plot(density((datExpr[,i]),na.rm=T),col = as.numeric(datMeta$Dx)[i],
     main = "Histogram", xlab="log2 exp",xlim=c(4,16),ylim=c(0,0.5))
for(i in 2:dim(datExpr)[2]){
  lines(density((datExpr[,i]),na.rm=T), col = as.numeric(datMeta$Dx)[i],)
}
legend("topright",legend = levels(datMeta$Dx),fill = as.numeric(as.factor(levels(datMeta$Dx))))

Histogram Example Pre-Normalization

(c) MDS Plot

An MDS (Multi-Dimensional Scaling) plot uses principal coordinate analysis to find the two vectors that capture the most amount of variance in a dataset. It is a visual representation of how similar samples are in their expression. We just want to make sure that there isn't one sample way off by itself away from all of the other samples. Usually samples will cluster based on batch at this stage, which we verify later on in the analysis.

mds = cmdscale(dist(t(datExpr)),eig=TRUE)
plot(mds$points,col=as.numeric(datMeta$Dx),pch=19,main="MDS")
legend("topright",legend = levels(datMeta$Dx),fill = as.numeric(as.factor(levels(datMeta$Dx))))

MDS Example Pre-Normalization

3. Normalization

Normalization helps to remove experimental error from microarray experiments. RMA (robust multichip average) normalization is a tool designed specifically for Affymetrix arrays. It performs background correction, probe summarization, quantile normalization and log2 transformation. Background correction eliminates microarray 'noise', ie. removes the probes that bind to nothing but emit signal. Probe summarization takes the expression from each probe and summarizes it to its 'probe set' using median polishing. Each row in datExpr will be a probe set after normalization.

datExpr <- rma(data.affy, background=T, normalize=T, verbose=T)
datExpr <- exprs(datExpr)

4. Batch Correction

Sequencing run batches usually contribute the most to variance within an Affymetrix microarray dataset. It is important to remove the 'batch effects' before analyzing the dataset for phenotypic differences. First, we extract batch information (the sequencing run date) from our affy object.

batch <- protocolData(data.affy)$ScanDate
batch <- substr(batch,1,8)
batch <- as.factor(batch)
table(batch)
datMeta$Batch <- batch

plot(mds$points,col = as.numeric(datMeta$Batch),pch=19,main="MDS Batch")
legend("topright",legend = levels(datMeta$Batch),fill = as.numeric(as.factor(levels(datMeta$Batch))))

MDS Plot Post-Normalization by Batch Example

As can be seen in the MDS plot, the samples are clearly separating by batch at this stage.

Next, create a datAll expression set object which stores both datMeta and datExpr. This object will be useful going forward when we need to remove samples from our analysis. See Shared Features for more information on expression set objects.

datMeta_proc <- new("AnnotatedDataFrame",data=datMeta)
colnames(datExpr) <- rownames(datMeta)

datAll <- new("ExpressionSet",exprs=datExpr,phenoData=datMeta_proc)
rm(datExpr,datMeta,datMeta_proc)  ## we don't need them anymore!

We use ComBat from the sva R library to remove batch effects from our dataset. It uses either parametric or non-parametric empirical Bayes frameworks for adjusting data for batch effects while protecting phenotypic effects (like those of Parkinson's disease). We need to remove all singular batches before moving on, since batch effects cannot be properly measured (and consequently removed) from a single batch. From our table(batch) command, we can see that we have one singular batch, "09/04/03".

to_remove <- (pData(datAll)$Batch == "09/04/03")
datAll <- datAll[,!to_remove]
pData(datAll)$Batch <- droplevels(pData(datAll)$Batch)

We can now remove batch contributions from our expression data.

mod <- model.matrix(~pData(datAll)$Dx)   
batch <- as.factor(pData(datAll)$Batch)
datExpr.combat <- ComBat(dat = exprs(datAll),batch = batch,mod = mod)

exprs(datAll) <- datExpr.combat

mds = cmdscale(dist(t(exprs(datAll))),eig=TRUE)
plot(mds$points,col=as.numeric(pData(datAll)$Dx),pch=19,main="MDS Diagnosis Post-ComBat")
legend("topright",legend = levels(pData(datAll)$Dx),fill = as.numeric(as.factor(levels(pData(datAll)$Dx))))
plot(mds$points,col=as.numeric(pData(datAll)$Batch),pch=19,main="MDS Batch Post-ComBat")
legend("topright",legend = levels(pData(datAll)$Batch),fill = as.numeric(as.factor(levels(pData(datAll)$Batch))))

MDS Plot Post-ComBat Batch Example

MDS Plot Post-ComBat Dx Example

As can be seen in the MDS plots, ComBat effectively removes the variance due to batch from our dataset, which allows the variance due to phenotype (Control v. Parkinson's Disease) to come to the forefront.

5. Outlier Removal

Outliers can detract from 'true' phenotypic signal in a dataset. An outlier will commonly be a sample that doesn't have much variance in gene expression at all. We identify these outliers with network connectivity based statistics, and then remove these outliers from the data. This approach is especially effective when preparing data for a gene co-expression network experiment, which utilize tools such as Weighted Gene Co-Expression Network Analysis (WGCNA). In fact, we use the WGCNA R library to extract connectivity data and determine our outliers.

First, we do hierarchical clustering on our samples to visually identify potential outliers.

tree <- hclust(dist(t(exprs(datAll))), method="average")
plot(tree)

Sample Tree Example

Evidently, sample GSM506064 does not cluster with the other samples. Let's see if network statistics can confirm that this sample is an outlier.

normadj <- (0.5 + 0.5*bicor(exprs(datAll)))^2
netsummary <- fundamentalNetworkConcepts(normadj)
C <- netsummary$Connectivity
Z.C <- (C-mean(C))/sqrt(var(C))

datLabel <- pData(datAll)$Dx
plot(1:length(Z.C),Z.C,main="Outlier Plot",xlab = "Samples",ylab="Connectivity Z Score")
text(1:length(Z.C),Z.C,label=datLabel,pos=3,cex=0.6)
abline(h= -2, col="red")

to_keep <- abs(Z.C) < 2
table(to_keep)
colnames(exprs(datAll))[!to_keep]

Outliers Plot Example

Network analysis indicates that sample GSM506044 is an outlier in our dataset (because it has a connectivity z-score greater than the absolute value of two). Even though this sample was not directly implicated in our hierarchical clustering tree, we rely on this connectivity analysis to determine outliers since the connectivity measure is much more indicative of a sample that does not contribute to the group variance. The hierarchical clustering measure is more indicative of similarity in expression, which is not as important to a microarray experiment as variance across phenotypes.

So we choose to remove sample GSM506044 from our dataset.

datAll <- datAll[,to_keep]

6. QC on Normalized Data

We now perform the same three QC tests as before on our normalized, batch corrected, and outlier removed dataset.

(a) Boxplot
boxplot(exprs(datAll),range=0, col = as.numeric(pData(datAll)$Dx), xaxt='n', xlab = "Array", main = "Boxplot", ylab = "Intensity")  
legend("topright",legend = levels(pData(datAll)$Dx),fill = as.numeric(as.factor(levels(pData(datAll)$Dx))))   

Boxplot Finalized Dataset Example

The medians/means of the samples should be about the same.

(b) Histogram
i=1 
plot(density((exprs(datAll)[,i]),na.rm=T),col = as.numeric(pData(datAll)$Dx[i]),
          main = "Histogram", xlab="log2 exp")
for(i in 2:dim(exprs(datAll))[2]){
  lines(density((exprs(datAll)[,i]),na.rm=T), col = as.numeric(pData(datAll)$Dx)[i],)
}
legend("topright",legend = levels(pData(datAll)$Dx),fill = as.numeric(as.factor(levels(pData(datAll)$Dx))))

Histogram Finalized Dataset Example

The different sample lines should overlap much more strongly than before.

(c) MDS Plot
mds = cmdscale(dist(t(exprs(datAll))),eig=TRUE)
plot(mds$points,col=as.numeric(pData(datAll)$Dx),pch=19,main="MDS")
legend("topright",legend = levels(pData(datAll)$Dx),fill = as.numeric(as.factor(levels(pData(datAll)$Dx))))  

MDS Plot Finalized Dataset Example

The control and Parkinson's disease samples should clearly separate from each other.

If these three plots look similar to your raw data plots, or they simply do not follow the trends illustrated in the example plots above, then you will need to return to steps (3) - (5) to adjust your decisions (ie. remove other outliers or batches, choose to remove batch effects without protecting disease effects, etc.) and/or use alternative tools to try and draw out the phenotypic variance in your dataset. If that doesn't work, you might just have a poor dataset and you will need to either rerun your experiment or look for an alternate dataset to use. However, the steps we have gone through should work most of the time given a decently run Affymetrix microarray.

7. Covariate Analysis

We don’t want the other factors (ie. sex, age) to be correlated with disease state (Dx), since this will confound our analysis (we won't be able to tell the difference between covariate effects). Thus, we want the p-value of the t-test (or ANOVA F-test, if you have multiple disease states) of the coefficients from a linear model predicting factors such as age, gender, batch, etc. from disease state to be greater than 0.05. If this is not the case for one of your factors, then you will need to take steps to remove confounders. This step of the analysis is discussed in more detail in Shared Features.

plot(pData(datAll)$Dx, ylab="Number", main="Subjects")
for(i in c("Sex","Age","Batch")){ 
  if( i == "Sex" || i == "Batch" ){
    print(paste(i,"Character Graph",sep=" "))
    A = anova(lm(as.numeric(as.factor(pData(datAll)[,i])) ~ pData(datAll)$Dx)); p = A$"Pr(>F)"[1]   
    plot(as.factor(pData(datAll)[,i]) ~ pData(datAll)$Dx, main=paste(i," p=", signif(p,2)), ylab="", xlab="")
  }
  else{
    print(paste(i,"Number Graph",sep=" "))
    A = anova(lm(as.numeric(pData(datAll)[,i]) ~ pData(datAll)$Dx)); p = A$"Pr(>F)"[1]   
    plot(as.numeric(as.character(pData(datAll)[,i])) ~ pData(datAll)$Dx, main=paste(i," p=", signif(p,2)), ylab="", xlab="")
  }
}

Covariate Plot Example

Lucky for us, we do not have any obvious confounds in our dataset.

8. Annotating Probes

Our datExpr is still in the form of (probe sets)x(samples). We now need to match probe sets to genes. We use the biomaRt R library to annotate probes with ensembl gene IDs. We use the most recent genome build for this experiment, but you can choose to use older builds if you are attempting to compare your dataset to older datasets (which use the older builds). f and a below help us choose our filter (the microarray probe set IDs) and our attributes (our ensembl gene IDs and any other gene annotations we are interested in).

ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl",host="www.ensembl.org")

f <- listFilters(ensembl)
a <- listAttributes(ensembl)

identifier <- "affy_hg_u133a"
getinfo <- c("affy_hg_u133a", "ensembl_gene_id", "entrezgene", "external_gene_name")
geneDat <- getBM(attributes = getinfo, filters=identifier, values = rownames(exprs(datAll)),mart=ensembl)

We choose affy_hg_u133a to identify probes because this was the microarray that was used for the experiment (which we could find on the original GEO experiment page GSE20295).

idx <- match(rownames(exprs(datAll)),geneDat$affy_hg_u133a)
geneDat <- geneDat[idx,]
table(is.na(geneDat$ensembl_gene_id))  ## Output: FALSE - 20050 , TRUE - 2233

We will lose some probe sets (2,233) that could not be matched to an ensembl gene ID.

to_keep <- (is.na(geneDat$ensembl_gene_id) == FALSE)
geneDat <- geneDat[to_keep,]
datAll <- datAll[to_keep,]

dim(datAll)
dim(geneDat)

geneDat Example

The geneDat object contains annotation information matching the original probe set IDs.

9. Collapse Rows

table(duplicated(geneDat$affy_hg_u133a))     ## Output: FALSE - 20050
table(duplicated(geneDat$ensembl_gene_id))   ## Output: FALSE - 13263, TRUE - 6787

It is clear from the output of these two commands that many probe sets match the same ensembl gene. This step 'summarizes' all of the probe sets that match to the same ensembl gene, so that we will have unique ensembl genes for our rows of datExpr. Essentially, we are updating our Probe Summarization step from our prior RMA Normalization with the latest discoveries in genetics. We return to the WGCNA R library and use their CollapseRows() function to perform this step. You can choose how exactly to summarize probes, but the default setting works well in most cases. The default setting is to take the probe set with the maximum mean (across samples) as the final ensembl gene row for your datExpr.

CR <- collapseRows(exprs(datAll), rowGroup = geneDat$ensembl_gene_id, rowID = geneDat$affy_hg_u133a)
exprs(datAll) <- CR$datETcollapsed
idx <- match(CR$group2row[,"selectedRowID"], geneDat$affy_hg_u133a)
geneDat <- geneDat[idx,]
rownames(geneDat) <- geneDat$ensembl_gene_id 

We are now done processing our data! Just double check that the dimensions match up for your 'dats' and save them for future use.

dim(datAll)
dim(geneDat)

write.csv(exprs(datAll), file = "datExpr.csv")
write.csv(pData(datAll), file = "datMeta.csv")
write.csv(geneDat, file = "geneDat.csv")
save(datAll,file="datAll.RData")

10. Differential Expression Analysis

We will use the limma R library to calculate microarray differential expression. We simply construct a basic linear model and examine the coefficient (or coefficients, if you have multiple phenotypes of interest) relating to Disease State(s). For this example, we will be looking at the second coefficient, which indicates whether or not the sample has Parkinson's disease. This process is better described in the Shared Features tutorial.

pData(datAll)$Sex<-factor(pData(datAll)$Sex,levels=c("M","F"))
mod <- model.matrix(~pData(datAll)$Dx+pData(datAll)$Sex+as.numeric(pData(datAll)$Age))
fit <- lmFit(exprs(datAll),mod)
fit <- eBayes(fit)
tt <- topTable(fit,coef = 2,n = Inf,genelist = geneDat)  

The tt data frame will contain information indicating which genes are differentially expressed between the Controls and the Parkinson's disease patients.

hist(tt$P.Value)

P-value Histogram Example

It is good practice to make sure that the p-values are not evenly distributed. In an experiment with real differential expression, there will be more genes with p-values less than 0.05.