-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path5_datascience.rmd
383 lines (253 loc) · 13.1 KB
/
5_datascience.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
---
title: |-
R Crash Course \
Part 5 -- Data Science
author: "Rob Colautti"
---
# Setup
If you don't have it installed, then run `install.packages("dplyr")` -- it can take a while to download and install.
# 1. Introduction to Data Science
Data Science is a relatively new field of study that merges **computer science** and **statistics** to answer questions in other domains (e.g. business, medicine, biology, psychology). Data Science as a discipline has grown in popularity in response to the rapid rate of increase in data collection and publication.
Data Science often involves 'Big Data', which doesn't have a strict quantitative definition but will usually have one or more of the following characteristics:
1. High **Volume** -- large file sizes with lots of observations.
2. Wide **Variety** -- lots of different types
3. High **Velocity** -- accumulating at a high rate
4. Compromised **Veracity** -- variable quality that must be dealt otherwise downstream analyses will be compromised.
> What are some examples of 'big data' in Biology?
Medical records, remote sensing data (e.g. climate stations, satellite images), and 'omics data are good examples of 'big data' in biology.
In biology, it can be helpful to think of Data Science as a continuous life-cycle with multiple stages:
## Data Science Life-Cycle
1. **Hypothesize** -- Make initial observations of about the natural world, or insights from other data, that lead to testable hypotheses. Your core biology training is crucial here.
2. **Collect** -- This may involve taking measurements yourself, manually entering data that is not yet in electronic format, requesting data from authors of published studies, or importing data from online sources. Collecting data is a crucial step that is often done poorly. Some tips on this are provided in a [paper by Wu et al](https://www.biorxiv.org/content/early/2018/10/30/457051)
3. **Correct** -- Investigate the data for quality assurance, to identify and fix potential errors. Start to visualize the data to look for outliers or nonsensical relationships (or lack thereof).
4. **Explore** -- Try to understand the data, where they come from, and potential limitations on their use. Continue visualizing data; this may cause you to modify your hypotheses slightly.
5. **Model** -- Now that hypotheses are clearly defined, apply statistical tests of their validity.
6. **Report** -- Use visualizations along with the results of your statistical tests to summarize your findings.
7. **Repeat** -- Return to step 1.
In this tutorial, we focus mainly on coding in R for steps 2, 3, and 6. Step 5 requires a firm understanding of statistics. Step 4 is covered in the tutorials on [qplot](./2_qplot.html) and [ggplot](./3_ggplot.html). Step 1 is everything covered in a typical degree in the biological sciences.
Data collection and management are crucial steps in the Data Science Life-Cycle. Read the paper by [Wu et al](https://www.biorxiv.org/content/early/2018/10/30/457051). called *baRcodeR with PyTrackDat: Open-source labelling and tracking of biological samples for repeatable science.* Pay particular attention to the '*Data Standards*' section. The *baRcodeR* and *PyTrackDat* programs and their application to current projects may also be of interest.
## Data Science in R
The book [R for Data Science](http://shop.oreilly.com/product/0636920034407.do) by Hadley Wickham & Garrett Grolemund is an excellent resource using R in Data Science projects.
> TIP: In general, any book by Hadley Wickham that you come across is worth reading if you want to be proficient in R.
Here we'll be focusing on the `dplyr` packages from Wickham et al.
# 2. 2D Data Wrangling
The `dplyr` library in R has many useful features for importing and reorganizing your data for steps 2, 3 and 4 in the Data Science Life-Cycle outlined above. Don't forget to install the `dplyr` library and load it into memory.
```{r}
library(dplyr)
```
> Note: This error message just informs you that `dplyr` uses function or parameter names that are the same as other base or stats packages in R. These base/stats functions are 'masked' meaning that when you run one (e.g. `filter`) then R will run the `dplyr` version rather than the stats version.
We'll work with our FallopiaData.csv dataset, and remind ourselves of the structure of the data
```{r}
Fallo<-read.csv("FallopiaData.csv")
str(Fallo)
```
This file is an example of a 2-dimensional data set, which is common in biology. 2D datasets have the familiar row x column layout used by spreadsheet programs like Microsoft Excel or Google Sheets. There are some exceptions, but data in this format should typically follows 3 rules:
1. Each cell contains a single value
2. Each variable must have its own column
3. Each observation must have its own row
Making sure your data are arranged this way will usually make it much easier to work with.
## `filter()`
Let's subset observations based on value
```{r}
Pot1<-filter(Fallo,PotNum==1)
head(Pot1)
```
## `rename()`
It's possible to change the names of columns in your data. In base R you can use the `names()` function with the square bracket index `[]`:
```{r}
X<-Fallo
names(X)
names(X)[12]<-"Total_Biomass"
names(X)
```
```{r}
X<-rename(Fallo, Total_Biomass = Total)
names(X)
```
There is also a simple `dplyr` function to do this:
## `arrange()`
Use the `arrange()` function to sort the *rows* of your data based on the *columns* of your data. For example, let's re-arrange our FallopiaData.csv dataset based on Taxon (a string denoting the species of Fallopia used) and Total (a float denoting the total biomass in each pot).
```{r}
X<-arrange(Fallo, Taxon, Total)
head(X)
```
use the `desc()` function with `arrange()` to change to descending order
```{r}
X<-arrange(Fallo, Taxon, desc(Total))
head(X)
```
## `select()`
The `select()` function can be used to select a subset of columns (i.e. variables) from your data.
Suppose we only want to look at total biomass, but keep all the treatment columns:
```{r}
X<-select(Fallo, PotNum, Scenario, Nutrients, Taxon, Total)
head(X)
```
You can also use the colon `:` to select a range of columns:
```{r}
X<-select(Fallo, PotNum:Taxon, Total)
head(X)
```
Exclude columns with `-`
```{r,error=T}
X<-select(Fallo, -PotNum:Taxon, -Total)
```
> Oops, what generated that error? Take a careful look at the error message and see if you can figure it out.
The problem is we are using the range of columns between PotNum and Taxon, but in one case we are excluding and the other we are including. We need to keep both the same:
```{r}
X<-select(Fallo, -PotNum:-Taxon, Total)
head(X)
```
Or a bit more clear:
```{r}
X<-select(Fallo, -(PotNum:Taxon), Total)
head(X)
```
## `everything()`
Use the `everything()` function with `select()` to rearrange your columns without losing any:
```{r}
X<-select(Fallo, Taxon, Scenario, Nutrients, PotNum, Pct_Fallopia, everything())
head(X)
```
## `mutate()`
Suppose we want to make a new column (variable) to our data.frame object (dataset) that is the sum of biomass of Urtica and Geranium only. In base R you would use `$`:
```{r}
X<-Fallo
X$UrtSil<-X$Urtica+X$Silene
```
In the dplyr package you can use mutate
```{r}
X<-mutate(Fallo, UrtSil = Urtica + Silene)
head(X)
```
This is a lot more readable, especially when you have complicated equations or you want to add lots of new columns.
> What if you only wanted to retain the new columns and delete everything else? Try it.
Which functions did you use?
## `transmute()`
You can also use `transmute()` instead of `mutate()` + `select()`
```{r}
X<-transmute(Fallo, UrtSil = Urtica + Silene)
head(X)
```
## `summarize()` + `group_by()`
This can be useful for quickly summarizing your data, for example to find the mean or standard deviation based on a particular treatment or group.
```{r}
TrtGrp<-group_by(Fallo,Taxon,Scenario,Nutrients)
summarize(TrtGrp, Mean=mean(Total), SD=sd(Total))
```
## Weighted Mean
In our dataset, the **Taxon** column shows which of two species of *Fallopia* were used in the competition experiments. We might want to take the mean total biomass for each of the two *Fallopia* species:
```{r}
X<-group_by(Fallo,Taxon)
summarize(X, Mean=mean(Total), SD=sd(Total))
```
However, there are other factors in our experiment that may affect biomass. The *Nutrients* column tells us whether pots received high or low nutrients, and this also affects biomass:
```{r}
X<-group_by(Fallo,Nutrients)
summarize(X, Mean=mean(Total), SD=sd(Total))
```
Now imagine if our sampling design is 'unbalanced'. For example, maybe we had some plant mortality or lost some plants to a tornado. If one of the two species in the *Taxon* column had more high-nutrient pots, then it would have a higher mean. BUT, the higher mean is not an effect of the Taxon, but is simply due to the unbalanced nature of the design. We can simulate this effect by re-shuffling the species names:
```{r}
RFallo<-Fallo
set.seed(256)
RFallo$Taxon<-rbinom(nrow(RFallo),size=1,prob=0.7)
X<-group_by(RFallo,Taxon)
summarize(X, Mean=mean(Total))
```
To fix this problem, we may want to take a *weighted mean*:
```{r}
X1<-group_by(RFallo,Taxon,Scenario,Nutrients)
Y1<-summarize(X1,Mean=mean(Total))
X2<-group_by(Y1,Taxon,Scenario)
Y2<-summarize(X2,Mean=mean(Mean))
X3<-group_by(Y2,Taxon)
Y3<-summarize(X3, Mean=mean(Mean))
arrange(Y3,desc(Mean))
```
## `%>%` (pipe)
The combination `%>%` is called 'pipe' for short. Be careful though -- in Unix, 'pipe' is slightly different and uses the vertical line (often shift-backslash): `|`
The pipe is useful to combine operations without creating a whole bunch of new objects. This can save on memory use.
For example, we can re-write the weighted mean example using pipes:
```{r}
RFallo %>%
group_by(Taxon,Scenario,Nutrients) %>%
summarize(Mean=mean(Total)) %>%
group_by(Taxon,Scenario) %>%
summarize(Mean=mean(Mean)) %>%
group_by(Taxon) %>%
summarize(Mean=mean(Mean)) %>%
arrange(desc(Mean))
```
This also avoids potential for bugs in our program. Imagine if we mis-spelled 'Taxon' in our second line by accidentialy pressing 's' along with 'x'. Compare the output:
```{r,error=T}
X<-group_by(Fallo,Taxon,Scenario,Nutrients)
X<-group_by(X,Tasxon,Scenario)
X<-group_by(X,Taxon)
X<-summarize(X, Mean=mean(Total), SD=sd(Total))
arrange(X,desc(Mean))
```
```{r, error=T}
Fallo %>%
group_by(Taxon,Scenario,Nutrients) %>%
group_by(Tasxon,Scenario) %>%
group_by(Taxon) %>%
summarize(Mean=mean(Total), SD=sd(Total)) %>%
arrange(desc(Mean))
```
In both cases we get an error, but in one case we still calculate the means and sd of the two species.
> A bug that produces no output is much less dangerous than an error that gives an output. Why?
# 3. Missing Data
So far we have worked on a pristine data set that has already been edited for errors. More often datasets will contain missing values.
## `NA` and `na.rm()`
The R language uses a special object `NA` to denote missing data.
```{r}
Vec<-c(1,2,3,NA,5,6,7)
Vec
```
When a function is run on a vector or other object containing NA, the function will often return NA or give an error message:
```{r, error=T}
mean(Vec)
```
This is by design, because it is not always clear what NA means. Many functions in R include an na.rm parameter that is set to FALSE by default. Setting it to true tells the function to ignore the NA
```{r}
mean(Vec, na.rm=T)
```
## `NA` vs `0`
A common mistake students make is to put 0 for missing data. This can be a big problem when analyzing the data since the calculations are very different.
```{r}
Vec1<-c(1,2,3,NA,5,6,7)
mean(Vec1, na.rm=T)
Vec2<-c(1,2,3,0,5,6,7)
mean(Vec2, na.rm=T)
```
## `is.na()`
In large datasets you might want to check for missing values. Let's simulate this in our *FallopiaData.csv* dataset.
To set up a test dataset, randomly select 10 rows and replace the value for 'Total' with NA. The `sample` function is a
```{r}
X<-round(runif(10,min=1,max=nrow(Fallo)),0)
Fallo$Total[X]<-NA
Fallo$Total
```
Use `is.na()` to check for missing values:
```{r}
is.na(Fallo$Total)
```
Note that the output is a vector of True/False. Each cell corresponds to a value of 'Total' with TRUE indicating missing values. This is an example of a boolean variable, which has some handy properties in R.
First, we can use it as an index. For example, let's see which pots have missing 'Total' values:
```{r}
Missing<-is.na(Fallo$Total)
Fallo$PotNum[Missing]
```
Another handy trick to count missing values is:
```{r}
sum(is.na(Fallo$Total))
```
This takes advantage of the fact that the boolean TRUE/FALSE variable is equivalent to the binary 1/0 values.
# 4. Naughty Data
Naughty data contain the same information as a standard row x column (i.e. 2-dimensional) dataset but break the rules outlined above:
1. Each cell contains a single value
2. Each variable must have its own column
3. Each observation must have its own row
Examples of Naughty Data are shown in Figure 2A of the [Wu et al. manuscript](https://www.biorxiv.org/content/early/2018/10/30/457051). :
![](./Wu_Fig2.jpg)
Naughty data can be very time consuming to fix, but regular expressions can make this a bit easier (see the [Regex tutorial](https://colauttilab.github.io/RCrashCourse/4_regex.html)).