---
title: |-
R Crash Course \
Part 3 -- Customizing *ggplot()*
author: "Rob Colautti"
---
# Getting Started
Before following this tutorial, you should be familiar with the [qplot() tutorial]().
This [ggplot cheat sheet](https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) is a downloadable pdf that provides a good summary and quick-reference guide.
Load the ggplot2 library and a custom theme
```{r}
library(ggplot2)
source("http://bit.ly/theme_pub")
theme_set(theme_pub())
```
The `source` function loads an external file, in this case from the internet. The file is just a .R file with a custom function defining different aspects of the graph (e.g. text size, line width, etc.). You can open the link in a web browser, or download the file and open it in a text editor, to see what it contains.
The `theme_set()` command sets our custom theme (`theme_pub`) as the default plotting theme. Since the theme is a function in R, we need the extra brackets: `theme_pub()`
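As a small aside, `theme_set()` invisibly returns the theme that was active before the call, which makes it easy to switch back later. The sketch below is only an illustration: it uses the built-in `theme_classic()` as a stand-in, and the object name `old_theme` is a placeholder of our own.

```{r}
library(ggplot2)

# theme_set() invisibly returns the previously active theme
old_theme <- theme_set(theme_classic())

# Any plot drawn at this point would use theme_classic() by default

# Restore whatever theme was active before
theme_set(old_theme)
```

Because the previous theme is restored at the end, running this chunk should leave your default theme unchanged.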
<br>
# Graphical concepts
Begin with [this presentation](https://colauttilab.github.io/RCrashCourse/Graphics_small.pdf)
# Rules of thumb
Published graphs in professional journals can vary depending on format and discipline, but there are a number of useful 'rules of thumb' to keep in mind. These are not hard and fast rules but helpful for new researchers who aren't sure how or where to start.
## 1. Minimize 'ink'
In the old days, when most papers were actually printed and mailed to journal subscribers, black ink was expensive and printing in colour was very expensive. Printing is still expensive but of course most research articles are available online where there is no additional cost to colour or extra ink. However, the concept of minimizing ink (or pixels) can go a long way toward keeping a graph free from clutter and unnecessary distraction.
## 2. Use space wisely
Empty space is not necessarily bad, but ask yourself if it is necessary and what you want the reader to take away. Consider the next two graphs:
```{r, echo=F}
Y<-rnorm(100)+60
X<-rbinom(100,1,0.5)
Y<-Y+(10*X)
X<-as.factor(X)
qplot(X,Y)
```
> Above: Y-axis scaled to the data
```{r, echo=F}
qplot(X,Y) + scale_y_continuous(limits=c(0,100))
```
> Above: Y-axis scaled between 0 and 100
What are the benefits/drawbacks of scaling the axes? When might you choose to use one over the other?
## 3. Choose a colour palette
Colour has three basic components
a. Hue -- the amount of red vs green vs blue light
b. Saturation -- how vivid the colour is
c. Brightness -- the amount of white (vs black) in the colour
In R, a colour can be defined with the `rgb()` function, which takes the intensities of red, green and blue light, each on a scale from 0 to 1. For example:
`rgb(1,0,0)` -- a saturated red
`rgb(0.1,0,0)` -- a dark red (low brightness)
`rgb(1,0.9,0.9)` -- a light red (high brightness, low saturation)
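To check the hue, saturation and brightness of a colour numerically, one option is to convert it with `rgb2hsv()` from base R (the grDevices package). This is a minimal sketch; the object names `reds` and `red_hsv` are made up for the illustration.

```{r}
# Three reds defined by their red, green and blue intensities (0 to 1)
reds <- c(saturated = rgb(1, 0, 0),
          dark      = rgb(0.1, 0, 0),
          light     = rgb(1, 0.9, 0.9))
reds  # rgb() returns hexadecimal colour codes

# Convert to hue (h), saturation (s) and value/brightness (v);
# each column is one colour, in the same order as 'reds'
red_hsv <- rgb2hsv(col2rgb(reds))
round(red_hsv, 2)
```

All three colours share the same hue (row `h`), so they differ only in the `s` and `v` rows.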
Don't underestimate the impact of choosing a good colour palette, especially for presentations. Colour theory can get a bit overwhelming but here are a few good websites to help:
* Quickly generate your own palette using [coolors](https://coolors.co)
* Use a colour wheel to find complementary colours using [Adobe](https://color.adobe.com/create)
* Browse some pre-made palettes or create one from a picture [colorfavs](http://www.colorfavs.com)
## 4. Colours have meaning
What's wrong with this graph?
```{r, echo=F}
X<-rnorm(100)
Y<-X+seq_along(X)
D<-data.frame(Lat=X,Long=Y,Temp=Y/3)
qplot(Lat,Long,colour=Temp, data=D) + scale_color_gradient(high="blue", low="red")
```
Humans naturally associate colours with particular feelings. Be mindful of these associations when choosing a colour palette.
Another important consideration is that not everyone sees colour the same way. About 5% to 10% of the population has colour blindness. In order to make colour graphs readable to everyone, you can use different
## 5. Use high contrast
Colours that are too similar will be hard to distinguish:
```{r, echo=F}
X<-rnorm(100)
Y<-X+seq_along(X)
D<-data.frame(Lat=X,Long=Y,Precip=rnorm(100))
qplot(Lat,Long,colour=Precip, data=D) + scale_color_gradient(high="#56B4E9", low="#56B499")
```
## 6. Keep relevant information
Make sure to include proper axis **labels** (i.e. names) and **tick marks** (i.e. numbers or categories showing the different values). These labels, along with the figure caption, should act as a stand-alone unit. The reader should be able to understand the figure without having to read through the rest of the paper.
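As a quick sketch of what this looks like in practice, `labs()` can set axis names that include units. The data below are simulated purely for the illustration, and the names `LabDat` and `LabPlot` are placeholders.

```{r}
library(ggplot2)

# Hypothetical data, simulated only for this illustration
set.seed(1)
LabDat <- data.frame(Height = rnorm(50, mean = 170, sd = 10))
LabDat$Mass <- 0.5 * LabDat$Height + rnorm(50, sd = 5)

# labs() sets the axis names; including units helps the figure stand alone
LabPlot <- ggplot(LabDat, aes(x = Height, y = Mass)) +
  geom_point() +
  labs(x = "Height (cm)", y = "Body mass (kg)")
print(LabPlot)
```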
## 7. Choose the right graph
Often the same data can be presented in different ways, but some are easier to interpret than others. Think carefully about the story you want to present and the main ideas you want your reader to get from your figures. Look at these two graphs that show the same data and see which one is more intuitive.
```{r}
X<-rnorm(100)
Y<-X+rnorm(100)
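Dat<-data.frame(Value=c(X,Y),Variable=rep(c("X","Y"),each=100)) # stack X and Y in one column
ggplot(Dat,aes(x=Value,fill=Variable)) + geom_histogram(position="dodge") # side-by-side histograms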
qplot(X,Y)
```
# Example
To get to know ggplot better, let's work step by step through a figure published in a [paper by Colautti & Lau](https://doi.org/10.1111/mec.13162) in the journal Molecular Ecology (2015).
## 1. Setup
### Import data
Download the selection dataset from Colautti & Lau (2015). These data are archived on Dryad:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.gt678
Extract the file called `Selection_Data.csv` and save it to your working directory. Then use `read.csv` to import the dataset:
```{r}
SelData<-read.csv("Selection_Data.csv",header=T)
```
#### Change column names
To make them more intuitive in R
```{r}
names(SelData)<-c("Collector","Author","Year","Journal","Vol","Species","Native","N","Fitness.measure","Trait","s","s.SE","s.P","B","B.SE","B.P")
```
#### Replace s with its absolute value
We are interested in the magnitude, not the direction, in the meta-analysis
```{r}
SelData$s<-abs(SelData$s)
```
#### Add random variables
We'll use these later to explore some additional ggplot options
```{r}
SelData$Rpoint<-rnorm(nrow(SelData)) # Random, normally distributed
SelData$Rgroup<-sample(c(0,1),nrow(SelData),replace=T) # Random binary value
```
#### A quick look at the data
```{r}
head(SelData)
```
#### One more thing...
Note the missing data (denoted NA)
```{r, eval=FALSE}
print(SelData$s)
```
We can subset to remove missing data
```{r}
SelData<-SelData[!is.na(SelData$s),]
```
Recall from the intro tutorial that `!` means 'not' or 'invert'.
Similarly, we could use `filter` from dplyr:
```{r, eval=F}
library(dplyr)
SelData<-SelData %>%
  filter(!is.na(s))
```
The `tidyr` package also has a convenient `drop_na` function:
```{r, eval=F}
library(tidyr)
SelData<-SelData %>%
  drop_na(s)
```
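To see what the logical subsetting with `!is.na()` is doing, here is a toy sketch in base R. The vector `ToyS` is made up for this example and is not part of the selection dataset.

```{r}
# A toy vector with two missing values (made up for this sketch)
ToyS <- c(0.12, NA, 0.40, 0.03, NA)

is.na(ToyS)    # TRUE where a value is missing
!is.na(ToyS)   # inverted: TRUE where a value is present

# Logical indexing keeps only the elements where the index is TRUE
ToyClean <- ToyS[!is.na(ToyS)]
ToyClean
```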
<br>
***
<br>
## 2. *ggplot()* vs *qplot()*
We can create the same graph using qplot and ggplot, just the syntax changes:
### Histogram
#### qplot
```{r, error=TRUE}
BarPlot<-qplot(s, data=SelData, fill=Native, geom="bar")
print(BarPlot)
```
#### ggplot
```{r, error=TRUE}
BarPlot <-ggplot(aes(s, fill=Native), data=SelData)
print(BarPlot)
```
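No layers! So far we have only passed the data and aesthetic mappings to `ggplot()`, so there is nothing to draw yet (depending on your ggplot2 version, you will see either an error or an empty plot).
We have to specify which geom(s) we want: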
```{r}
BarPlot<- BarPlot + geom_bar() # info from ggplot() passed to geom_bar()
BarPlot
```
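One way to convince yourself that geoms are added as layers is to count them. The sketch below uses invented toy data (`ToyDat`) and assumes that the plot object exposes its layers as a list through `$layers`, which is how ggplot2 objects have generally been structured.

```{r}
library(ggplot2)

# Toy data, invented for this sketch
ToyDat <- data.frame(Grp = c("a", "a", "b", "b", "b"))

Base <- ggplot(ToyDat, aes(x = Grp))   # data + mapping only
length(Base$layers)                    # no layers yet

WithBars <- Base + geom_bar()          # add one geom layer
length(WithBars$layers)                # one layer
```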
Explore the components of our BarPlot object:
```{r}
summary(BarPlot)
```
For more information on geom_bar()
```{r, eval=FALSE}
?geom_bar
```
### Bivariate geom
```{r}
BivPlot<-ggplot(data=SelData, aes(x=s, y=Rpoint)) + geom_point()
print(BivPlot)
```
Looks like a classic log-normal variable, so let's log-transform x
```{r}
BivPlot<-ggplot(data=SelData, aes(x=log(s+1), y=Rpoint)) + geom_point()
print(BivPlot)
```
Add linear regression
```{r}
BivPlot + geom_smooth(method="lm",colour="steelblue",size=2)
```
Add separate regression lines for each group
```{r}
BivPlot + geom_smooth(method="lm",size=2,aes(group=Native,colour=Native))
```
<br>
***
<br>
## 3. Full ggplot
Recreate the selection histograms from Colautti & Lau:
1. Create separate data for native vs. introduced species
2. Use a bootstrap to estimate non-parametric mean and 95% confidence intervals
3. Plot all of the components on a single graph
### 3.1. Separate data
```{r}
NatSVals<-SelData$s[SelData$Native=="yes"] # s values for Native species
IntSVals<-SelData$s[SelData$Native=="no"] # s values for Introduced species
```
### 3.2. Bootstrap
#### 3.2a. Setup
```{r}
IterN<-100 # Number of iterations
NatSims<-{} # Dummy objects to hold output
IntSims<-{}
```
#### 3.2b. For loop
* Sample with replacement, and calculate the average
* Store average in NatSims or IntSims
```{r}
for (i in 1:IterN){
  NatSims[i]<-mean(sample(NatSVals,length(NatSVals),replace=T))
  IntSims[i]<-mean(sample(IntSVals,length(IntSVals),replace=T))
}
```
#### 3.2c. Calculate 95% confidence intervals
Sort from low to high
```{r}
NatSims<-sort(NatSims)
IntSims<-sort(IntSims)
```
Calculate the 2.5th and 97.5th percentiles from the simulations
```{r}
CIs<-c(sort(NatSims)[round(IterN*0.025,0)], # Native, lower 2.5%
sort(NatSims)[round(IterN*0.975,0)], # Native, upper 97.5%
sort(IntSims)[round(IterN*0.025,0)], # Intro, lower 2.5%
sort(IntSims)[round(IterN*0.975,0)]) # Intro, upper 97.5%
```
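The bootstrap steps above can also be written more compactly with `replicate()` and `quantile()`. This is a sketch of an alternative, not the code used for the published figure: the data are simulated, the names `ToyVals`, `ToyBoot` and `ToyCI` are placeholders, and `quantile()` interpolates between values, so its result can differ slightly from indexing a sorted vector.

```{r}
set.seed(42)                      # for a reproducible sketch
ToyVals <- rexp(200, rate = 5)    # made-up, right-skewed 'data'

# Bootstrap the mean: resample with replacement, 1000 times
ToyBoot <- replicate(1000, mean(sample(ToyVals, replace = TRUE)))

# Percentile confidence interval straight from the bootstrap means
ToyCI <- quantile(ToyBoot, probs = c(0.025, 0.975))
ToyCI
```

With more iterations the interval becomes more stable, at the cost of a longer run time.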
### 3.3. Plot components
#### Combine output for plotting
```{r}
HistData<-data.frame(s=SelData$s,Native=SelData$Native)
```
*NOTE:* This creates a 'stacked' dataset for plotting
```{r}
p <- ggplot() + theme_classic()
p <- p + geom_freqpoly(data=HistData[HistData$Native=="yes",], aes(s,y=(..count..)/sum(..count..)),alpha = 0.6,colour="#1fcebd",size=2)
print(p) # native species histogram
p <- p + geom_freqpoly(data=HistData[HistData$Native=="no",], aes(s,y=(..count..)/sum(..count..)),alpha = 0.5,colour="#f53751",size=2)
print(p) # introduced species histogram
p <- p + geom_rect(aes(xmin=CIs[1],xmax=CIs[2],ymin=0,ymax=0.01),colour="white",fill="#1fcebd88")
print(p) # native species 95% CI bar
p <- p + geom_line(aes(x=mean(NatSims),y=c(0,0.01)),colour="#1d76bf",size=1)
print(p) # native species bootstrap mean
p <- p + geom_rect(aes(xmin=CIs[3],xmax=CIs[4],ymin=0,ymax=0.01),colour="white",fill="#f5375188")
print(p) # introduced species 95% CI bar
p <- p + geom_line(aes(x=mean(IntSims),y=c(0,0.01)),colour="#f53751",size=1)
print(p) # introduced species bootstrap mean
p <- p + ylab("Frequency") + scale_x_continuous(limits = c(0, 1.5))
print(p) # labels added, truncated x-axis
```
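To the best of our knowledge, recent versions of ggplot2 (3.4.0 and later) give a deprecation warning for the `..count..` notation and recommend `after_stat()` instead, which has been available since version 3.3.0. The sketch below shows the same 'proportion of counts' idea with the newer syntax, using made-up data (`ToyHist`); the number of bins is an arbitrary choice.

```{r}
library(ggplot2)

set.seed(7)
ToyHist <- data.frame(s = rexp(300, rate = 4))  # made-up values

# after_stat() refers to variables computed by the stat (here, bin counts)
PropPlot <- ggplot(ToyHist, aes(x = s)) +
  geom_freqpoly(aes(y = after_stat(count / sum(count))), bins = 30) +
  ylab("Frequency")
print(PropPlot)
```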
<br>
***
<br>
## 4. Custom theme
You can customize various aspects such as font size, line widths, colours, etc.
This is already done in the custom theme at http://bit.ly/theme_pub. However, you could save this (or the text below) as a file called "MyTheme.R" in your project directory. Then you could edit the parameters and use `source("MyTheme.R")` at the beginning of your code/markdown/notebook to load your own theme.
```{r, eval=F}
# Clean theme for presentations & publications used in the Colautti Lab
theme_pub <- function (base_size = 12, base_family = "") {
  theme_classic(base_size = base_size, base_family = base_family) %+replace%
    theme(
      axis.text = element_text(colour = "black"),
      axis.title.x = element_text(size=16, margin=margin(t=5)),
      axis.text.x = element_text(size=10),
      axis.title.y = element_text(size=16,angle=90, margin=margin(r=5)),
      axis.text.y = element_text(size=10),
      axis.ticks = element_blank(),
      panel.background = element_rect(fill="white"),
      panel.border = element_blank(),
      plot.title=element_text(face="bold", size=20),
      legend.position="none"
    )
}
```
<br>
***
<br>
## 5. Multi-graph
In the qplot tutorial, we looked at facets. Facets allow us to plot the same graph types but separated by category. Sometimes you might want to produce a multi-panel figure with different plots in each panel. There is a package for that...
### Setup
Install 'gridExtra' with `install.packages("gridExtra")`
```{r}
library(gridExtra)
```
### grid.arrange()
#### Combine multiple plots
Prints graphs in rows, then columns, from top left to bottom right
Use ***nrow =*** and ***ncol =*** to control layout
```{r, warning=F, message=F}
grid.arrange(p,BivPlot,BarPlot,ncol=1)
grid.arrange(p,BivPlot,BarPlot,nrow=2)
```
> Note: You might get some warnings based on missing values or wrong binwidth. You will also see some weird things with different text sizes in the graphs. Normally, you would want to fix these for a final published figure but here we are just focused on showing what is possible with the layouts.
Apply consistent formatting
```{r, warning=F, message=F}
HistPlot<-p
BarPlot<-BarPlot
BivPlot<-BivPlot
grid.arrange(HistPlot,BivPlot,BarPlot,HistPlot,nrow=2)
```
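Besides `nrow` and `ncol`, `grid.arrange()` passes `widths` and `heights` on to the layout, so panels do not have to be the same size. The sketch below uses two throwaway plots of simulated data; the names `ToyXY`, `ToyA` and `ToyB` are placeholders.

```{r}
library(ggplot2)
library(gridExtra)

# Two throwaway plots of simulated data, for layout purposes only
set.seed(3)
ToyXY <- data.frame(x = rnorm(30), y = rnorm(30))
ToyA <- ggplot(ToyXY, aes(x, y)) + geom_point()
ToyB <- ggplot(ToyXY, aes(x)) + geom_histogram(bins = 10)

# Side by side, with the first panel twice as wide as the second
grid.arrange(ToyA, ToyB, ncol = 2, widths = c(2, 1))
```

The `widths` values are relative, so `c(2, 1)` and `c(4, 2)` should give the same layout.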
### viewport & newpage
What if we want to have graphs of different sizes? Or what if we want one figure to be inside another? We can make some even more advanced graphs using the `grid` package.
The `grid` package is included with base R, so there is nothing to install; just load it:
```{r}
library(grid)
```
Control layout as a custom grid
```{r, warning=F, message=F}
grid.newpage() # Open a new page on grid device
pushViewport(viewport(layout = grid.layout(3, 2))) # Create 3x2 grid layout
print(HistPlot, vp = viewport(layout.pos.row = 3, layout.pos.col = 1:2)) # Add fig in row 3 and across columns 1:2
print(BivPlot, vp = viewport(layout.pos.row = 1:2, layout.pos.col = 1)) # Add fig across rows 1:2 in column 1
print(BarPlot, vp = viewport(layout.pos.row = 1:2, layout.pos.col = 2))
```
Use viewport to add insets
```{r, warning=F, message=F}
HistPlot
pushViewport(viewport(layout = grid.layout(4, 4))) # Create 4x4 grid layout (number of cells, will determine size/location of graph)
print(BivPlot, vp = viewport(layout.pos.row = 1:2, layout.pos.col = 3:4))
```
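An inset can also be positioned with explicit coordinates rather than a grid layout. In the sketch below, the viewport's `x`, `y`, `width` and `height` are given as proportions of the page (0 to 1), and `just` sets which corner of the inset is anchored at (`x`, `y`). The data and object names (`ToyIn`, `MainPlot`, `InsetPlot`, `InsetVP`) are made up, and the particular sizes are arbitrary.

```{r}
library(ggplot2)
library(grid)

# Made-up data for a main plot and a small inset
set.seed(11)
ToyIn <- data.frame(x = rnorm(100), y = rnorm(100))
MainPlot  <- ggplot(ToyIn, aes(x, y)) + geom_point()
InsetPlot <- ggplot(ToyIn, aes(x)) + geom_histogram(bins = 15)

# Inset covers 40% x 40% of the page, anchored near the top-right corner
InsetVP <- viewport(x = 0.98, y = 0.98, width = 0.4, height = 0.4,
                    just = c("right", "top"))

grid.newpage()
print(MainPlot, newpage = FALSE)   # draw the main plot on the fresh page
print(InsetPlot, vp = InsetVP)     # draw the inset inside the viewport
```

Compared with the layout approach, this gives finer control over size and position, but the values usually need some trial and error.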
<br>
***
<br>
## 6. Reference
The comprehensive source for ggplot by Hadley Wickham:
http://link.springer.com/book/10.1007%2F978-0-387-98141-3
http://moderngraphics11.pbworks.com/f/ggplot2-Book09hWickham.pdf