---
title: "ggplot: beyond dotplots"
author: "Taavi Päll"
date: "24 9 2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Adding (more) variables by facet_wrap
Earlier, we used color, shape and alpha (transparency) to display additional subsets of data in a two-dimensional graph. Different colours allow visual comparison of the distributions of the groups. But there is an obvious limit to how much of such information can be fitted onto one graph before it gets too cluttered.
Besides reducing visual clutter and overplotting, small subplots are another way to bring out subsets of our data. A series of small subplots (small multiples) shares the same scales and axes, which makes comparisons easier and is considered a very efficient design. Fortunately, ggplot2 has an easy way to do this: the facet_wrap() and facet_grid() functions split up your dataset and generate multiple small plots arranged in an array.
facet_wrap() is typically used with one variable and wraps the panels into rows and columns, whereas facet_grid() lays the panels out in a grid defined by row and column variables.
> At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs.. answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution. Edward Tufte (Envisioning Information, p. 67).
```{r}
library(tidyverse)
```
Here, we plot each class of cars on a separate subplot and arrange the plots into 2 rows:
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap("class", nrow = 2)
```
To plot the combinations of two variables, we use facet_grid():
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(rows = vars(drv), cols = vars(cyl))
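```
facet_wrap() also accepts more than one variable. Here the panel labels are built with paste() and all panels are placed in a single row:
```{r}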
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ paste(drv, "wd") + paste(cyl, "cyl"), nrow = 1)
```
Note that the variables used for splitting up the data and arranging the facets row- and column-wise can be specified in facet_grid() either with vars(), as above, or by a formula: facet_grid(rows ~ columns).
If you want to omit rows or columns in facet_grid(), use `. ~ var` or `var ~ .`, respectively.
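As a minimal sketch of the two interfaces, the chunk below builds the same faceted plot once with a formula and once with vars(). The object names `base`, `p_formula` and `p_vars` are made up for this illustration.

```{r}
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Formula interface: rows on the left of ~, columns on the right
p_formula <- base + facet_grid(drv ~ cyl)

# vars() interface: the same layout spelled out by argument name
p_vars <- base + facet_grid(rows = vars(drv), cols = vars(cyl))

p_formula
p_vars
```

Both calls should assign every observation to the same panel, so which one to use is mostly a matter of taste.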
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter") +
facet_grid(. ~ cyl)
```
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl, scales = "free_y")
```
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl, scales = "free_x")
```
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl, scales = "free") # free scale for both: x and y axis
```
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap("drv", scales = "free_y")
```
## Exercises
1. What happens if you facet on a continuous variable?
```{r}
mpg
```
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ displ)
```
Each distinct value is treated as a separate category and gets its own facet, so you end up with one small panel per unique value of displ.
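One possible way around this, shown here only as a sketch, is to bin the continuous variable before faceting. The bin width of 1 is an arbitrary choice for illustration; cut_width() is a helper from ggplot2.

```{r}
library(ggplot2)

# cut_width() bins a continuous variable into intervals of the given width
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cut_width(displ, 1))
```

Now each panel covers a range of engine displacements instead of a single value.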
2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
```
3. What plots does the following code make? What does . do?
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
```
4. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn't facet_grid() have nrow and ncol arguments?
```{r}
?facet_wrap
```
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ cyl, nrow = 2, ncol = 3)
```
5. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
```
And the less recommended arrangement, with the variable that has more levels in the rows:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(cyl ~ drv)
```
## Geometric objects aka geoms
To change the geom in your plot, change the __geom function__ that you add to ggplot().
For instance, to create the already familiar scatterplot, use geom_point():
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
To create a line graph with a loess smooth line fitted to these dots, use geom_smooth():
```{r}
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
```
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
```
Every geom function in ggplot2 takes a mapping argument.
However, note that __not every aesthetic works with every geom.__
- You could set the shape of a point, but you couldn't set the "shape" of a line.
- On the other hand, you could set the linetype of a line.
We can tweak the above plot by mapping each type of drive (drv) to a different linetype and color.
```{r}
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color = drv)) +
scale_color_manual(values = viridisLite::viridis(3))
```
Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.
Four-wheel cars are mapped to a "solid" line, front-wheel cars to a "dashed" line, and rear-wheel cars to a "longdash" line.
For more linetypes and their numeric codes, please have a look at the Cookbook for R: http://www.cookbook-r.com/Graphs/Shapes_and_line_types/.
Currently, ggplot2 provides over 40 geoms:
```{r}
gg2 <- lsf.str("package:ggplot2")
gg2[grep("^geom", gg2)]
```
To learn more about any single geom, use help, like: ?geom_smooth.
Many geoms, like __geom_smooth(), use a single geometric object to display multiple rows of data__. For these geoms, __you can set the group aesthetic to a categorical variable to draw multiple objects__.
Note that the group aesthetic by itself does not add a legend or distinguishing features to the geoms.
This is handy, for example, for plotting all your bootstrapped linear model fits on one plot to visualize uncertainty, with all the lines sharing the same color and linetype.
```{r}
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
```
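The bootstrap idea mentioned above could be sketched roughly as follows. The number of replicates (`n_boot`), the `replicate` column and the grey line colour are choices made for this illustration only.

```{r}
library(ggplot2)
library(dplyr)

set.seed(1)   # make the resampling reproducible
n_boot <- 50  # hypothetical number of bootstrap replicates

# Resample the rows of mpg with replacement, once per replicate
boots <- bind_rows(lapply(seq_len(n_boot), function(i) {
  resampled <- mpg[sample(nrow(mpg), replace = TRUE), ]
  resampled$replicate <- i
  resampled
}))

# One linear fit per replicate; group keeps the fits separate without a legend
ggplot(data = boots, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(
    mapping = aes(group = replicate),
    method = "lm", formula = y ~ x, se = FALSE,
    colour = "grey40"
  )
```

The spread of the lines gives a visual impression of how uncertain the fitted slope and intercept are.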
Other aesthetic mappings (color, alpha etc.) similarly group your data for display, but by default they also add a legend to the plot. To hide the legend, set show.legend to FALSE:
```{r}
ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
  )
```
To display multiple geoms in the same plot, add multiple geom functions to ggplot():
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
```
You have probably noticed that if we keep specifying aesthetic mappings within each geom function, as we have done so far, we introduce some code duplication. This can easily be avoided by moving the aes() part from the geom function to ggplot():
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
```
Now, ggplot2 uses this mapping globally in all geoms.
If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This way it's possible to use different aesthetics in different layers (for example, if you wish to plot model fit over data points).
Here, we map color to the class of cars, whereas geom_smooth still plots only one line:
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
```
Importantly, you can use the same idea to specify different data for each layer:
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), aes(color = class), se = FALSE)
ggsave("example_plot.png")
```
Above, our smooth line displays just a subset of the mpg dataset, the subcompact cars. The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.
## Exercises
1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
```{r}
library(palmerpenguins)
ggplot(penguins) +
geom_histogram(aes(body_mass_g), binwidth = 100) +
facet_wrap(~species, scales = "free")
```
2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
```
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(method = MASS::rlm, se = FALSE)
```
3. What does show.legend = FALSE do? What happens if you remove it?
4. What does the se argument to geom_smooth() do?
5. Will these two graphs look different? Why/why not?
```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
```
## Plotting statistical transformations - bar graph tricks
Bar graphs are special among ggplot geoms. This is because by default they do some calculations with data before plotting. To get an idea, please have a look at the following bar graph, created by geom_bar() function.
The chart below displays the total number of diamonds in the __diamonds__ dataset, grouped by cut.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
Let's have a look at the diamonds dataset, containing the prices and other attributes of ~54000 diamonds.
```{r}
diamonds
```
Variable count is nowhere to be found... it's quite different from other plot types, like scatterplot, that plot raw values.
Other graphs, like bar charts, calculate new values to plot:
- bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
- smoothers fit a model to your data and then plot predictions from the model.
- boxplots compute a robust summary of the distribution and then display a specially formatted box.
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
You can learn which stat a geom uses by inspecting the default value for the stat argument in geom_ function.
For example, ?geom_bar shows that the default value for stat is "count".
![geom_bar](plots/stat_count.png)
You can use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():
```{r}
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
```
This works because every geom has a default stat and every stat has a default geom, meaning that you can use geoms without worrying about their underlying statistical transformations.
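The pairing can also be checked from the console. This is only a sketch: it peeks at the `stat` and `geom` fields of the layer objects, which is an implementation detail of ggplot2 rather than a documented interface, so the help pages remain the authoritative source.

```{r}
library(ggplot2)

# Every layer records the stat and geom it was built with
class(geom_bar()$stat)[1]    # default stat of geom_bar()
class(stat_count()$geom)[1]  # default geom of stat_count()
```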
There are three cases when you might want to specify stat explicitly:
1. You might want to override the default stat. For example, if you have already summarised your data into counts, means or similar, you need to change the default stat in geom_bar() to "identity".
Let's create a summarised dataset (don't worry about this code yet, we will get to it in the next classes):
```{r}
diamonds_summarised <- diamonds %>%
group_by(cut) %>%
summarise(N = n())
diamonds_summarised
```
Here we (re)create the diamond counts plot using the summary data. Note that here we also need to supply the y aesthetic!
```{r}
ggplot(data = diamonds_summarised) +
geom_bar(mapping = aes(x = cut, y = N), stat = "identity")
```
OR
```{r}
ggplot(data = diamonds_summarised) +
geom_col(mapping = aes(x = cut, y = N))
```
2. You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count:
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
```
To find the variables computed by the stat, look for the help section titled "computed variables".
There is also an optional `weight` aesthetic. What happens if we weight the cut classes by price?
Why do we need the `group = 1` argument? Compare with the same plot without it:
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
```
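The variables computed by a stat can also be inspected directly. As a minimal sketch, layer_data() from ggplot2 returns the data frame that a layer actually draws; the object names `p` and `computed` are made up for this example.

```{r}
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# layer_data() returns the data produced by the stat for the given layer
computed <- layer_data(p, 1)
computed[, c("x", "count", "prop")]
```

With `group = 1` all cuts belong to one group, so the values in the prop column add up to 1.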
3. You might want to draw greater attention to the statistical transformation in your code. In practice, this means plotting summary statistics such as median and min/max, or mean +/- SE.
Median and min/max:
```{r}
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
)
```
Mean and SE:
```{r}
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.data = mean_se
)
```
Other useful summary functions:
```{r}
library(ggdist)
ggplot(data = diamonds) +
stat_halfeye(
mapping = aes(x = cut, y = depth))
```
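If you want to plot mean +/- SD like this, you can use the mean_sdl() function. It is a ggplot2 wrapper around a function from the Hmisc package, so you need to have Hmisc installed. Note that by default it shows mean +/- 2 SD; pass `fun.args = list(mult = 1)` to stat_summary() to get one SD.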
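If installing Hmisc is not an option, a small helper of your own does the job too. The sketch below assumes a hypothetical function `mean_sd()` (not part of ggplot2) that returns the `y`, `ymin` and `ymax` columns that stat_summary() expects from `fun.data`.

```{r}
library(ggplot2)

# Hypothetical helper: mean with one standard deviation on either side
mean_sd <- function(x) {
  m <- mean(x)
  s <- sd(x)
  data.frame(y = m, ymin = m - s, ymax = m + s)
}

ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.data = mean_sd
  )
```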
## Position adjustments - how to get those bars side-by-side
There is more you need to know about bar charts. You can easily update the diamonds cut counts plot by additionally mapping cut to either color or fill (fill tends to be more useful):
```{r}
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut, fill = cut))
```
But what happens when we map fill to another variable in diamonds data, like clarity:
```{r}
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut, fill = clarity))
```
Wow, bars are automatically __stacked__, showing how the diamond clarity classes are distributed within each cut quality class.
If you want to get these stacked bars side-by-side, you need to change the __position adjustment__ argument, which is set to "stack" by default. There are three other options: "identity", "dodge" and "fill".
- position = "identity" will place each object exactly where it falls in the context of the graph. It's generally not useful with bar graphs, as all bars are drawn behind each other, and the plot can easily be mistaken for position = "stack":
```{r}
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut, fill = clarity), position = "identity")
```
Position "identity" is naturally the default in scatterplots.
- position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
```
- position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```
There is another position adjustment for scatterplots that helps mitigate overplotting: position = "jitter":
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```
"jitter" adds a small amount of random noise to each point, so that it is moved away from its original position. This way you can reveal data points with identical or very similar values that would otherwise be plotted on top of each other.
To learn more about a position adjustment, look up the help page associated with each adjustment: ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.
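Position adjustments can be given as a string or as a call to the corresponding position_*() function, which accepts further arguments. As a small sketch, the `seed` argument of position_jitter() fixes the random noise; the seed value 1 is arbitrary.

```{r}
library(ggplot2)

# seed fixes the random noise, so the plot looks the same on every run
p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(seed = 1)
  )
p
```

This is useful when a jittered plot has to be reproduced exactly, for example in a report.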
### Exercises
1. What is the problem with this plot? How could you improve it?
```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
```
2. What parameters to geom_jitter() control the amount of jittering?
3. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
## Coordinate systems - flip your plot
The default coordinate system of ggplot2 is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point.
There are a number of other coordinate systems that are occasionally helpful.
- coord_flip() switches the x and y axes. This is useful if you want horizontal boxplots. It's also very useful for long labels: it's hard to get them to fit without overlapping on the x-axis.
```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
```
Try to make this plot without flipping... x-axis labels are a mess!
Since ggplot2 3.3.0, *coord_flip()* is no longer necessary for horizontal bar plots: a categorical variable can be mapped directly to the y-axis.
Previously you had to do:
```{r}
ggplot(data = mpg, mapping = aes(x = manufacturer)) +
geom_bar() +
coord_flip()
```
Now (spot the difference):
```{r}
ggplot(data = mpg, mapping = aes(y = manufacturer)) +
geom_bar()
```
One can sort car manufacturers by the median (the default summary function) of their hwy values using the fct_reorder() function from the forcats package:
```{r}
ggplot(data = mpg, mapping = aes(y = fct_reorder(manufacturer, hwy))) +
geom_bar()
```
We can also reorder y-axis according to number of observations in data:
```{r}
ggplot(data = mpg, mapping = aes(y = fct_reorder(manufacturer, manufacturer, table))) +
geom_bar()
```
What happens?
Flip this plot that we created previously:
```{r}
ggplot(data = diamonds) +
stat_halfeye(
mapping = aes(x = cut, y = depth))
```
### Plotting geographic spatial data
- coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2.
```{r}
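# Install sp on first use; it provides the classes used by the GADM *_sp.rds files
if (!require("sp")) {
  install.packages("sp")
}
library(sp)
# map data was downloaded from http://www.gadm.org/country
est <- read_rds("data/gadm36_EST_0_sp.rds") # level 0: country outline
est <- read_rds("data/gadm36_EST_1_sp.rds") # level 1: counties; overwrites the level 0 object
# Default Cartesian coordinates
ggplot(est, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")
# Same map with the aspect ratio set by coord_quickmap()
ggplot(est, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()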
```
To be continued: add number of positive cases from last week as fill to each county.
### Exercises
1. What does labs() do? Read the documentation.
2. What’s the difference between coord_quickmap() and coord_map()?
## Grammar of graphics summary
Constructing ggplot graphs can be reduced to the following template. At minimum, you need data and one geom to produce a plot.
```{r, eval=FALSE}
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
```
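As one possible way of filling in the template, the sketch below combines pieces used earlier in this document. The choice of variables is arbitrary and only meant to show where each part goes.

```{r}
library(ggplot2)

p <- ggplot(data = diamonds) +       # <DATA>
  geom_bar(                          # <GEOM_FUNCTION>
    mapping = aes(x = cut, fill = clarity),
    stat = "count",                  # <STAT>, the default for geom_bar()
    position = "fill"                # <POSITION>
  ) +
  coord_flip() +                     # <COORDINATE_FUNCTION>
  facet_wrap(~ color)                # <FACET_FUNCTION>
p
```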