## Automate your code
Research projects that deal with data and code can be imagined as a pipeline.
Data come in at certain points and are then processed in several steps:
we clean the data, get an overview of them, create figures, do some
modelling (simulation, statistics, machine learning, etc.), and
in the end write some text (usually a paper).
::: captioned-image-container
![Research pipeline](images/pipeline-simple.jpg){fig-alt="A pipeline."}
:::
Of course, this is a simplified view of what really happens. Most research
projects are quite complex and it is really hard to keep track of everything.
What data should be used for which analysis? What code should be used for what?
::: captioned-image-container
![Research pipeline (more realistic)](images/pipeline.jpg){fig-alt="A complex pipeline with many pipes and forks."}
:::
Good organisation and version control help us tremendously in keeping track of all
the complexity. But what if the pipes of our pipeline stuck together nicely
and we did not have to execute everything manually? What if we could
automate stuff? Well, we can!
### `Make` for automation
```{r, echo=FALSE}
# this ensures that this is not rendered with jupyter :D
```
There are many automation tools out there that researchers use for their research
pipelines. Probably the oldest among them is `Make`. Despite its age, it is
still very functional, useful, and versatile.
#### A simple example
Let's say we work on a project and have the following folders and files:
```{sh, echo=FALSE}
tree make-example
```
See also [here](https://github.com/BERD-NFDI/BERD-reproducible-research-course/tree/main/make-example) for all files and folders in this example project.
In this project, we want to work with data from an experiment to compare yields
(as measured by dried weight of plants) obtained under a control and two
different treatment conditions (see also Dobson, A. J., 1983, An Introduction to
Statistical Modelling. London: Chapman and Hall).
Here is an example `Makefile` that an R user could use to create:
1. A preprocessed (clean) version of the plant growth data.
2. A figure (boxplot) of weight by group.
![boxplot_weight-group.png](make-example/boxplot_weight-group.png){width=70%}
```{.makefile filename="Makefile"}
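# Step 1: clean the raw data
data_clean/PlantGrowth_new.csv: data_raw/PlantGrowth.csv preprocess.R
	Rscript preprocess.R

# Step 2: draw the boxplot from the clean data
boxplot_weight-group.png: data_clean/PlantGrowth_new.csv overview.R
	Rscript overview.R

# Everything: building the boxplot also builds the clean data
all: boxplot_weight-group.png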
```
#### `Makefile` structure
How does this `Makefile` work? A rule in `Make` consists of three components:
1. **The target**: *What do I want to generate?*
2. **The dependencies**: *What files are needed to generate the target?*
3. **The code**: *What code needs to run to generate the target from the dependencies?*
```{.makefile filename="Schematic Makefile"}
target: dependency1 dependency2
	code to create target from dependencies
```
The structure is always the same:
1. The target is before the "`:`" in line 1.
2. The dependencies are after the "`:`" in line 1.
3. The code is in line 2 and indented by a tab.
Keeping the structure this way is crucial for the `Makefile` to run properly.
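A recipe line indented with spaces instead of a tab is a common reason for a `Makefile` not to run (GNU Make typically reports a "missing separator" error). The following is a minimal Bash sketch of a quick check. The file name `Makefile.demo` and the variable names are hypothetical and only used for this illustration; the demo file is deliberately broken.

```bash
#!/usr/bin/env bash
# Write a deliberately broken demo file into a throwaway directory:
# the recipe line starts with four spaces instead of a tab.
# (Makefile.demo is a hypothetical name used only for this illustration.)
workdir=$(mktemp -d)
printf 'target: dependency1\n    echo "built with spaces"\n' > "$workdir/Makefile.demo"

# List every line that starts with a space, together with its line number.
# Such lines are candidates for a missing tab, not proof of an error.
bad_lines=$(grep -n '^ ' "$workdir/Makefile.demo")
echo "$bad_lines"

# Clean up the throwaway directory.
rm -rf "$workdir"
```

To check your own project, point `grep -n '^ '` at your `Makefile`; if it prints nothing, no line starts with a space.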
#### Running `Make`
`Make` runs in the terminal/console. To generate your target, you run:
```{.bash}
make target
```
so for our example that would be
```{.bash}
make data_clean/PlantGrowth_new.csv
```
to generate the clean data or
```{.bash}
make boxplot_weight-group.png
```
to generate the boxplot.
You may have noticed that the `Makefile` for our project also contains a target
called `all`. We use that in longer `Makefiles` to indicate what needs to be done
to create all targets.
In our case, all targets are generated by generating `boxplot_weight-group.png`,
because `boxplot_weight-group.png` depends on our other target `data_clean/PlantGrowth_new.csv` and thus `Make` automatically knows to generate
`data_clean/PlantGrowth_new.csv` (if needed) when running
```{.bash}
make all
```
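How `Make` works out that the clean data must come first can be pictured as a walk through the dependencies. The following Bash sketch is a toy model of that idea, not how `Make` is actually implemented. The `deps` array and the `build_order` function are hypothetical names, and the sketch assumes Bash 4 or newer (for associative arrays) and file names without spaces.

```bash
#!/usr/bin/env bash
# Toy model of the example Makefile: each target maps to its dependencies.
declare -A deps=(
  [all]="boxplot_weight-group.png"
  [boxplot_weight-group.png]="data_clean/PlantGrowth_new.csv overview.R"
  [data_clean/PlantGrowth_new.csv]="data_raw/PlantGrowth.csv preprocess.R"
)

# build_order TARGET: print the targets in the order they would be handled.
# Dependencies are visited first, so they are printed before the target.
build_order() {
  local target=$1 dep
  # Unquoted on purpose: split the dependency string at spaces.
  for dep in ${deps[$target]:-}; do
    build_order "$dep"
  done
  # Only names that have a rule are targets; plain source files are skipped.
  if [[ -n ${deps[$target]+x} ]]; then
    echo "$target"
  fi
}

order=$(build_order all)
echo "$order"
# prints:
# data_clean/PlantGrowth_new.csv
# boxplot_weight-group.png
# all
```

The sketch only shows the ordering; it does not run any code and does not check whether a target is already up to date.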
What does "if needed" mean? `Make` runs the code for a target only if the target
does not exist yet or if one of its dependencies has changed since the target was
last generated. This is particularly useful if you have long-running computations
and still want to ensure that everything gets updated when needed.
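The decision of whether a target is out of date can be sketched in a few lines of Bash. This is a simplified illustration, assuming that file modification times are compared; `needs_rebuild` is a hypothetical helper, not a `Make` command, and the file names are made up for the demo.

```bash
#!/usr/bin/env bash
# needs_rebuild TARGET DEP... (hypothetical helper, not part of Make)
# Succeeds if TARGET is missing or if any DEP is newer than TARGET.
needs_rebuild() {
  local target=$1 dep
  shift
  [ -e "$target" ] || return 0
  for dep in "$@"; do
    if [ "$dep" -nt "$target" ]; then
      return 0
    fi
  done
  return 1
}

# Demo in a throwaway directory with explicit timestamps.
demo=$(mktemp -d)
touch -t 202401010900 "$demo/script.R"    # dependency, last edited 09:00
touch -t 202401011000 "$demo/result.csv"  # target, generated 10:00

if needs_rebuild "$demo/result.csv" "$demo/script.R"; then
  first="rebuild"
else
  first="up to date"
fi

touch -t 202401011100 "$demo/script.R"    # the script is edited at 11:00

if needs_rebuild "$demo/result.csv" "$demo/script.R"; then
  second="rebuild"
else
  second="up to date"
fi

echo "$first, then $second"
# prints: up to date, then rebuild

rm -rf "$demo"
```

In a real project you do not need such a helper: running `make -n all` asks `Make` to print the commands it would run without executing them.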
### Further reading
We just showed a very simple example in this chapter. `Make` can become much more
complex and versatile. Check out the following resources if you'd like to dive
deeper:
- [Make](https://the-turing-way.netlify.app/reproducible-research/make.html), The Turing Way
- [Automating data-analysis pipelines](https://stat545.com/automating-pipeline.html), STAT 545 for R users