-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy path3_Permutations.Rmd
382 lines (282 loc) · 15.8 KB
/
3_Permutations.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
---
title: "Basic Linear Regression Models"
author: "Robert I. Colautti"
output: html_document
---
```{r, echo=F}
source("theme_pub.R")
library(ggplot2); theme_set(theme_pub())
```
# Introduction
This tutorial covers bootstrap, permutation and cross-validation.
# Bootstrap
## Basics
A bootstrap model gets its name from the old 19th century expression to "lift yourself up by your own bootstraps", as a metaphor for accomplishing a task without external forces. In the context of data analysis, the term bootstrap refers to the generation of 'new' data from existing data.
A very common use of a bootstrap model is to generate a null distribution from the data, to test whether the observed mean or variance is different than what we would expect by chance.
You may not realize it, but you have already taken the first step towards a bootstrap model in R by using the `sample()` function.
Let's look at the data from the Global Garlic Mustard Field Survey. The data are available [here](https://colauttilab.github.io/EcologyTutorials/GGMFS_Teaching_Data.csv). Details on this dataset are covered in a [tutorial introducting climate models(https://colauttilab.github.io/EcologyTutorials/GGMFS_climate.html).
```{r}
GMdat<-read.csv("GGMFS_Teaching_Data.csv",header=T)
GMsubDat<-GMdat[,c("Longitude","Latitude","Region","Pop_Size")] # Subset to columns of interest
GMsubDat<-GMsubDat[complete.cases(GMsubDat),] # Remove missing data
```
## Parametric Assumptions
Let's calculate the __Mean__ and __95% CI__ for `Pop_Size`. In classic frequentist statistics, we might assume a normal distribution, and then calculate the __Mean__ and __Standard Error (SE)__. You may recall from intro stats that the mean of a random sample drawn from a Gaussian distribution will fall within ~1.96x the standard error of the mean, 95% of the time. This is a complicated way of saying that the parametric 95% CI of a random sample is Mean +/- 1.96*SE. You may also recall that the SE of a sample mean is __Standard Deviation (SD)__ divided by the square root of the sample size:
$\frac{SD}{\sqrt{(N-1)}}$
```{r}
# Calculate mean, SE and CI from parametric model
X<-mean(GMsubDat$Pop_Size)
SE<-sd(GMsubDat$Pop_Size)/sqrt(nrow(GMsubDat)-1)
CI<-c(X-SE*1.96,X+SE*1.96)
# Round and print values
X<-round(X)
CI<-round(CI)
paste("Raw mean is:",X)
paste("Raw 95% CI is:",CI[1],"to",CI[2])
```
This assumes a normal distribution Let's take a look at the distribution of `Pop_Size`, and then add the mean and 95% CI
```{r}
library(ggplot2)
qplot(Pop_Size,data=GMsubDat)+geom_vline(aes(xintercept=X),colour="red")+geom_vline(aes(xintercept=CI[1]),colour="blue")+geom_vline(aes(xintercept=CI[2]),colour="blue")
```
The data highly right-skewed, ranging from near-zero to almost 1,000,000. They are clearly not normal, violating the assumption of the parametric estimate.
> How can we reduce the skew and improve assumptions of normality in the data?
## Transform
We can try to apply a transformation to make the data look more like a normal distribution:
```{r}
# Calculate mean and 95% CI on transformed data
GMsubDat$log_Size<-log(GMsubDat$Pop_Size)
Xlog<-mean(GMsubDat$log_Size)
SElog<-sd(GMsubDat$log_Size)/sqrt(nrow(GMsubDat)-1)
CIlog<-c(Xlog-SElog*1.96,Xlog+SElog*1.96)
# Plot data + mean + 95% CI
qplot(log_Size,data=GMsubDat)+geom_vline(aes(xintercept=Xlog),colour="red")+geom_vline(aes(xintercept=CIlog[1]),colour="blue")+geom_vline(aes(xintercept=CIlog[2]),colour="blue")
```
Much better, but still a bit skewed. Let's back-transform the mean and CI calculated on the log-transformed data, and compare with the original
```{r}
# Original
cat("Raw mean is:",X,"\n",
"Raw 95% CI is:",CI[1],"to",CI[2],"\n")
# Back-Transform from log-scale to original measurement scale
Xlog<-exp(Xlog)
CIlog<-exp(CIlog)
# Round
Xlog<-round(Xlog)
CIlog<-round(CIlog)
cat("Transformed mean is:",Xlog,"\n",
"Transformed 95% CI is:",CIlog[1],"to",CIlog[2],"\n")
```
Compare the raw vs. log-transformed values. That's more than an order of magnitude difference in the mean, and a much smaller 95% CI!
## Bootstrap Mean
We can also use a bootstrap model estimate the mean and 95% CI directly from the data. The beauty of this approach is that it is not biased by the distribution of the data. To do this we:
1. Resample the data, with replacement
2. Calculate the mean
3. Repeat N times
4. Calculate the mean of the N replicated sample means. This is the __bootstrap mean__.
5. The 0.025*Nth smallest mean is the lower 95% CI
6. The 0.975*Nth smallest mean is the upper 95% CI
We can do this fairly easily in R using `sample()` in a `for` loop and saving the output in a vector
```{r}
Iters<-999 # Number of iterations == the number ot times the mean is resampled
BootOut<-rep(NA,Iters) # Make an empty vector with # cells = number of iterations
for (i in 1:Iters){
# Resample the data, calculate the mean, and save it in a vector; repeat
BootOut[i]<-mean(sample(GMsubDat$Pop_Size,nrow(GMsubDat),replace=T))
}
head(BootOut)
```
Now calculate the mean and CI from the sorted data (steps 4-6). Note that the values are drawn directly from the data.
```{r}
Xboot<-mean(BootOut)
CIlow<-sort(BootOut)[floor(length(BootOut)*0.025)] # Lower 2.5% CI
CIhigh<-sort(BootOut)[ceiling(length(BootOut)*0.975)] # Upper 2.5% CI
CIboot<-c(CIlow,CIhigh)
```
Plot for comparison.
```{r}
# Plot
qplot(BootOut)+geom_vline(aes(xintercept=Xboot),colour="red")+geom_vline(aes(xintercept=CIlow),colour="blue")+geom_vline(aes(xintercept=CIhigh),colour="blue")
```
> Why does the 95% CI seem so much wider on this plot than the previous two?
This is a plot of estimated population means from the bootstrap iterations, whereas the previous graphs plot the raw data. Therefore the x-axis scales are quite different.
## Compare methods
Let's look at all of them together:
```{r}
# Transform bootstrap values
Xboot<-round(Xboot)
CIboot<-round(CIboot)
cat("Raw mean is:",X,"\n",
"Raw 95% CI is:",CI[1],"to",CI[2],"\n",
"Transformed mean is:",Xlog,"\n",
"Transformed 95% CI is:",CIlog[1],"to",CIlog[2],"\n",
"Bootstrap mean is:",Xboot,"\n",
"Bootstrap 95% CI is:",CIboot[1],"to",CIboot[2],"\n")
```
> Which method is most accurate and why?
## Bootstrap: Effect Size
We can also use a bootstrap simulation as a non-parametric statistic. In the dataset on _Alliaria petiolata_ (garlic mustard) from the Global Garlic Mustard Field Survey there is a factor called __Region__, which tells us which continent the population is on. Let's test if the North American populations are larger than populations in Europe.
## Parametric Models
```{r}
# Raw data
RawMod<-lm(Pop_Size~Region,data=GMsubDat)
(RawCoef<-summary(RawMod)$coefficients)
# Log-transformed
LogMod<-lm(log_Size~Region,data=GMsubDat)
(LogCoef<-summary(LogMod)$coefficients)
```
Recall: The 'RegionNorthAm' parameter is the estimated difference in the mean of North American population size relative to Europe. The 2nd model is on a log-scale so it needs to be back-transformed.
```{r}
# Calculate effect size and parametric 95%CI, and round
RawES<-round(RawCoef[2])
RawCI<-round(c(RawCoef[2]-1.96*RawCoef[4],RawCoef[2]+1.96*RawCoef[4]))
# Calculate for log-transformed
LogES<-exp(LogCoef[2])
LogES<-round(LogES)
LogCI<-exp(c(LogCoef[2]-1.96*LogCoef[4],LogCoef[2]+1.96*LogCoef[4]))
LogCI<-round(LogCI)
cat("The effect size based on raw data is:",RawES,"\n",
"with parametric 95% CI of",RawCI[1],"to",RawCI[2],"\n",
"The effect size based on transformed data is:","\n",
LogES,"with parametric 95% CI of",LogCI[1],"to",LogCI[2],"\n")
```
> What does the range of CI values tell you about the hypothesis that introduced populations (North America) are larger than populations in the native range (Europe)?
## A Bootstrap model
We sample, with replacement, just as we did with the above bootstrap model. However, this time we sample separately for each region and calculate the difference in the mean each time
1. Resample the data __from each region__, with replacement
2. Calculate the __difference between means__
3. Repeat N times
4. Calculate the mean of the N replicated samples. This is the __bootstrap effect size__.
5. The 0.025*Nth smallest mean is the lower 95% CI
6. The 0.975*Nth smallest mean is the upper 95% CI
```{r}
Iters<-999 # Number of iterations == the number ot times the mean is resampled
BootESOut<-rep(NA,Iters) # Make an empty vector with # cells = number of iterations
# Make separate EU/NA groups for resampling
EUpops<-GMsubDat$Pop_Size[grep("Europe",GMsubDat$Region)]
NApops<-GMsubDat$Pop_Size[grep("NorthAm",GMsubDat$Region)]
for (i in 1:Iters){
# Resample the data, calculate the mean, and save it in a vector; repeat
BootESOut[i]<-
mean(sample(NApops,length(NApops),replace=T)) -
mean(sample(EUpops,length(EUpops),replace=T))
}
head(BootESOut)
```
Look at the frequency histogram.
```{r}
qplot(BootESOut)
```
Compare the results
```{r}
# Calculate mean & 95% CI
XbootES<-round(mean(BootESOut))
CIlowES<-sort(BootESOut)[floor(length(BootESOut)*0.025)] # Lower 2.5% CI
CIhighES<-sort(BootESOut)[ceiling(length(BootESOut)*0.975)] # Upper 2.5% CI
CIbootES<-round(c(CIlow,CIhigh))
# Compare results
cat("The effect size based on raw data is:",RawES,"\n",
"with parametric 95% CI of",RawCI[1],"to",RawCI[2],"\n",
"The effect size based on transformed data is:",LogES,"\n",
"with parametric 95% CI of",LogCI[1],"to",LogCI[2],"\n",
"The effect size of the bootstrap model is:",XbootES,"\n",
"with 95% CI of",CIbootES[1],"to",RawCI[2],"\n")
```
> What do the x- and y-axis represent, and why does the distribution have this shape?
## Bootstrap Null Models
In the above model, we know our effect size is significant -- i.e. North American populations are significantly bigger than native populations. But, there are two problems:
1. We don't know the exact P-value.
2. The above model wouldn't work for __constrained data__. For example, what if we wanted to test the significance of eigenvalues from a principal components model? Eigenvalues are __constrained__ to be greater than zero. Since we calculate 95% CI from resampled data, rather than a parametric model, we can't just check if our 95% CI overlap zero.
Here's where a null model comes in handy. We first define and simulate a null model, then compare our observed value(s) to the null model.
Defining the null model is a little more tricky -- you have to think in terms of the null hypothesis. In our example, the null hypothesis is that there is no difference in the size of European and North American populations. To generate a null distribution:
1. Resample the __region codes__, __with replacement__
2. Calculate the __difference between means__
3. Repeat N times
4. Compare the observed effect size to the null distribution
5. The p-value is proportion of values in the null model greater/less than the observed effect size (or divide by 2 for a 2-tailed test)
```{r}
Iters<-999 # Number of iterations == the number ot times the mean is resampled
PermOut<-rep(NA,Iters) # Make an empty vector with # cells = number of iterations
SimDat<-GMsubDat # Make a new dataset for resampling
for (i in 1:Iters){
# 1. Reshuffle Region labels (sample, without replacement)
SimDat$Region<-sample(SimDat$Region,replace=F)
# 2. Calculate effect size and save in output vector
PermOut[i]<-
mean(SimDat$Pop_Size[grep("NorthAm",SimDat$Region)]) -
mean(SimDat$Pop_Size[grep("Europe",SimDat$Region)])
}
head(PermOut)
```
## Plot Null
```{r}
# Calculate mean difference of raw data
(RawX<-aggregate(GMsubDat$Pop_Size,list(GMsubDat$Region),mean))
RawDiff<-RawX$x[2]-RawX$x[1]
qplot(PermOut)+geom_vline(aes(xintercept=RawDiff),colour="red")
```
## Calculate p-value
The p-value is the probability of obtaining the observed value by chance. In this case the observed value is the observed difference in the mean `RawDiff`, above. The null model simulates 'observed values obtained by chance'. So the estimated p-value is the proportion of values less than the observed value __for a 1-tailed test__.
```{r}
p<-sum(PermOut>RawDiff)/length(PermOut)
paste("The estimated p-value is:",p)
```
If the observed value is lower than all of the simulated values, then the best we can say is:
$$p < \frac{1}{N_{iters}}$$
where `Niters` is the total number of iterations.
If we did not have an _a priori_ prediction about the direction of the effect, then it is a 2-tailed test. This is easy to do. We just divide alpha by two, or multiply the p-value by 2.
In our case, our null hypothesis was no difference between European vs. North American populations (rather than testing whether North American populations are bigger).
So p < 0.01 for 1001 iterations or p < 0.001 if there is no overlap in 2001 iterations.
## Bootstrap Caveats
It's important to think carefully about your bootstrap and null models and the implicit assumptions you make when you run them. For example, we assume that each observation has the same probability of being chosen in each iteration. This wouldn't be true if there was some sort of inherent correlation among data points. For example, the data we are working with were sampled non-randomly.
```{r}
qplot(Longitude,Latitude,data=GMdat,alpha=I(0.3))
```
We may also have two or more factors in a model that are correlated. For example, multiple traits measured on the same individuals or the same traits measured on the same individuals but at different time points. In this kind of 'repeated measures' experiments, should run the bootstrap and null model tests at the level of individual, not observation. That is, reshuffle ID tags for a null model, leaving the rest of the row/column data untouched.
## Make it faster
Once you understand the basic coding for bootraps and permutation tests, you can look into available packages to reduce the amount of code you need to write and have them run faster.
The `lmPerm` package is the permutation equivalent of `lm()`
The `boot` package has tools for writing bootstrap models
The `foreach` and `doParallel` packages allow you to use multithreading in your functions and for-loops.
# Cross-validation
In contrast to the bootstrap, which resamples data, cross-validation provides an alternative approach that is not as computationally demanding. Rather than resampling randomly, in each iteration, k rows of data (typically k=5 or k=10) are removed from the dataset in each iteration.
Cross-validation is useful for identifying the best-fit statistical models while avoiding over-fitting. The `cv.glm` function from the `boot` package does this quickly:
```{r}
library(boot)
mod1<-glm(Pop_Size~Latitude*Region,data=GMsubDat)
summary(mod1)
cv.err<-cv.glm(GMsubDat,mod1)
cv.err$K
cv.err$delta
```
Our `cv.err$delta` parameters are the Cross-Validation estimates of the model Mean Squared Error (MSE), which is just an average of the individual model MSEs:
$CV = \frac{\sum_{i=1}^k MSE_i}{k}$
Where MSE is just the average squared deviation of each point $y_i$ from the model prediction $\hat f(x_i)$ :
$MSE = \frac{\sum_{i=1}^n [(y_i - f(x_i)]^2}{n}$
We can try higher-order polynomials with the continuous term (Latitude)
```{r}
library(boot)
Plvl<-8
cv.err<-rep(NA,Plvl)
for (i in 1:Plvl){
mod1<-glm(Pop_Size~poly(Latitude,i),data=GMsubDat)
cv.err[i]<-cv.glm(GMsubDat,mod1)$delta[1]
mod1<-NA
}
cv.err
qplot(x=1:Plvl,cv.err)
```
And include interactions of each polynomial with Region:
```{r}
library(boot)
Plvl<-8
cv.err<-rep(NA,Plvl)
for (i in 1:Plvl){
mod1<-glm(Pop_Size~poly(Latitude,i)*Region,data=GMsubDat)
cv.err[i]<-cv.glm(GMsubDat,mod1)$delta[1]
mod1<-NA
}
cv.err
qplot(x=1:Plvl,cv.err)
```
# Assignment
Analyze a biological dataset to address any biological question (ideally one of your own or from your lab). Analyze the data using `glm()` with at least one continuous (i.e. numeric) predictor variable. Use a cross-validation approach to determine the 'best' mode. Write a short report using R markdown that explains your approach (methods), hypothesis, and inference.