---
title: "Pre-course setup"
subtitle: "Setup to be completed before the workshop"
pagetitle: "Pre-course setup | Tools for Reproducible Research"
description: ""
date: ""
format: html
---
All of the tutorials and the material in them depend on the GitHub
repository for the course. The first step of the setup is thus to download all
the files that you will need, which is done differently depending on which
operating system you have.
On the last day, in the _Putting the pieces together_ session we will give
examples of how we use these tools in our day-to-day work. During the course,
spend some time thinking about how these tools could be useful for you in your
own project(s). After you've gone through the tutorial you may feel that some of
the tools could make your life easier from the get-go, while others may take
some time to implement efficiently (and some you may never use again after the
course). Our idea with the _Putting the pieces together_ session is to have an
open discussion about where to go from here.
## Setup for Mac / Linux users
First, `cd` into a directory on your computer (or create one) where it makes
sense to download the course directory.
```bash
cd /path/to/your/directory
git clone https://github.com/NBISweden/workshop-reproducible-research.git
cd workshop-reproducible-research
```
::: {.callout-tip}
If you want to revisit the material from an older instance of this course,
you can do that using `git switch -d tags/<tag-name>`, e.g.
`git switch -d tags/course_1905`. To list all available tags, use `git tag`.
Run this command after you have `cd` into `workshop-reproducible-research`
as described above. If you do that, you probably also want to view the
same older version of this website. Until spring 2021, the website was
hosted at [ReadTheDocs](https://nbis-reproducible-research.readthedocs.io).
Locate the version box in the bottom right corner of the website and
select the corresponding version.
:::
## Setup for Windows users
Using a Windows computer for bioinformatic work has sadly not been ideal most of
the time, but large advances in recent years have made this quite feasible
through the _Windows Subsystem for Linux (WSL)_. This is the only setup for Windows
users that we allow for participants of this course, as all the material has
been created and tested to work on Unix-based systems.
There are two substantially different versions of the Linux subsystem, WSL1 and
WSL2. We strongly recommend using **WSL2**, which offers an essentially complete
Linux experience and better performance.
Using the Linux subsystem will give you access to a full command-line bash shell
and a Linux implementation on your Windows 10 or 11 PC. For the difference between
the Linux Bash Shell and the Windows PowerShell, see _e.g._ [this article](https://searchitoperations.techtarget.com/tip/On-Windows-PowerShell-vs-Bash-comparison-gets-interesting).
To install WSL2 on Windows 10 or 11, follow the instructions at _e.g._ one of these
resources:
- [Installing the Windows Subsystem and the Linux Bash](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
- [Installing and using Linux Bash on Windows](https://www.howtogeek.com/249966/how-to-install-and-use-the-linux-bash-shell-on-windows-10/)
- [Installing Linux Bash on Windows](https://itsfoss.com/install-bash-on-windows/)
::: {.callout-note}
If you run into error messages when trying to download files through the Linux
shell (_e.g._ `curl:(6) Could not resolve host`) then try adding the Google
name server to the internet configuration by running `sudo nano
/etc/resolv.conf` then add `nameserver 8.8.8.8` to the bottom of the file and
save it.
:::
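
If you prefer not to edit the file by hand, the same line can be appended from the command line. The sketch below is an illustration only: `add_nameserver` is a made-up helper, and it is demonstrated on a scratch file rather than on `/etc/resolv.conf`, which requires `sudo` to modify.

```bash
# Hypothetical helper: append "nameserver <address>" to a file unless it is already there
# $1: path to a resolv.conf-style file, $2: name server address
add_nameserver() {
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF "nameserver $2" "$1" || echo "nameserver $2" >> "$1"
}

demo="$(mktemp)"                  # stand-in for /etc/resolv.conf
add_nameserver "$demo" 8.8.8.8
add_nameserver "$demo" 8.8.8.8    # second call changes nothing
cat "$demo"
```

Checking for the line first means the command can be repeated without adding duplicates.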
::: {.callout-caution}
Whenever a setup instruction specifies Mac or Linux (_i.e._ only those two, with
no alternative for Windows), **please follow the Linux instructions.**
:::
Open a bash shell Linux terminal and clone the GitHub repository containing all
files you will need for completing the tutorials as follows. First, `cd` into
a directory on your computer (or create one) where it makes sense to download
the course directory.
::: {.callout-tip}
You can find the directory where the Linux distribution is storing all its files
by typing `explorer.exe .`. This will launch the Windows File Explorer showing
the current Linux directory. Alternatively, you can find the Windows C drive
from within the bash shell Linux terminal by navigating to `/mnt/c/`.
:::
```bash
cd /path/to/your/directory
git clone https://github.com/NBISweden/workshop-reproducible-research.git
cd workshop-reproducible-research
```
## Installing Git
Chances are that you already have git installed on your computer, but you should
have at least version `2.28` in order to follow the material in this course; you
can check by running _e.g._ `git --version`. If you don't have Git (or need to
update it), follow the instructions [here](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
If you're on a Mac you can also install it using [Homebrew](https://brew.sh/)
by simply running `brew install git`.
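
The version check can also be scripted. The following is a minimal sketch, not part of the course material: `version_at_least` is a helper name made up for this example, and it compares only the major and minor fields, which is enough for a `2.28` threshold.

```bash
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Only the first two dot-separated fields (major.minor) are compared.
version_at_least() {
    have_major=${1%%.*}; rest=${1#*.}; have_minor=${rest%%.*}
    want_major=${2%%.*}; rest=${2#*.}; want_minor=${rest%%.*}
    [ "$have_major" -gt "$want_major" ] && return 0
    [ "$have_major" -lt "$want_major" ] && return 1
    [ "$have_minor" -ge "$want_minor" ]
}

if command -v git >/dev/null 2>&1; then
    # `git --version` prints e.g. "git version 2.39.3"; keep the third word
    installed="$(git --version | awk '{print $3}')"
    if version_at_least "$installed" "2.28"; then
        echo "Git $installed is recent enough"
    else
        echo "Git $installed is older than 2.28, please update"
    fi
else
    echo "Git is not installed"
fi
```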
### Configure git
If it is the first time you use git on your computer, you may want to configure
it so that it is aware of your username and email. These should match those that
you have registered on GitHub. This will make it easier when you want to sync
local changes with your remote GitHub repository.
```bash
git config --global user.name "Mona Lisa"
git config --global user.email "mona.lisa@example.com"
```
::: {.callout-tip}
If you have several accounts (_e.g._ both a GitHub and Bitbucket account), and
thereby several different usernames, you can configure git on a per-repository
level. Change directory into the relevant local git repository and run `git
config user.name "Mona Lisa"`. This will set the default username for that
repository only.
:::
You will also need to configure the default branch name to be `main` instead of
`master`:
```bash
git config --global init.defaultBranch "main"
```
The short version of why you need to do this is that GitHub uses `main` as the
default branch while Git itself is still using `master`; please read the box
below for more information.
::: {.callout-note title="The default branch name"}
The default branch name for Git and many of the online resources for hosting Git
repositories has traditionally been `master`, which historically comes from the
"master/slave" repositories of
[BitKeeper](https://mail.gnome.org/archives/desktop-devel-list/2019-May/msg00066.html).
This has been heavily discussed and in 2020 the decision was made by many
([including GitHub](https://sfconservancy.org/news/2020/jun/23/gitbranchname/))
to start using `main` instead. Any repository created with GitHub uses this new
naming scheme since October of 2020, and Git itself is currently discussing
implementing a similar change. Git did, however, introduce the ability to set
the default branch name when using `git init` in [version
2.28](https://github.blog/2020-07-27-highlights-from-git-2-28/#introducing-init-defaultbranch),
instead of using a hard-coded `master`. We at NBIS want to be a part of this
change, so we have chosen to use `main` for this course.
:::
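
To see the setting take effect, you can initialise a throwaway repository and ask Git which branch it starts on. This is a minimal sketch, assuming Git is installed; the scratch directory is created only for this check and removed afterwards.

```bash
# Create an empty repository in a temporary directory
scratch="$(mktemp -d)"
git init --quiet "$scratch"

# HEAD points at the initial branch even before the first commit
branch="$(git -C "$scratch" symbolic-ref --short HEAD)"
echo "New repositories start on branch: $branch"

# Clean up the throwaway repository
rm -rf "$scratch"
```

If `init.defaultBranch` has been set as described above, this should report `main`.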
### GitHub setup
[GitHub](https://github.com) is one of several online hosting platforms for Git
repositories. We'll go through the details regarding how Git and GitHub are
connected in the course itself, so for now we'll stick to setting up your
account and credentials.
If you have not done so already, go to [github.com](https://github.com/join) and
create an account. You can also create an account on another online hosting
service for version control, _e.g._ [Bitbucket](https://bitbucket.org) or
[GitLab](https://about.gitlab.com/). The exercises in this course are written
with examples from GitHub (as that is the most popular platform with the most
extensive features), but the same thing can be done on alternative services,
although the exact menu structure and link placements differ.
Any upload to and from GitHub requires you to authenticate yourself. GitHub
used to allow authentication with your account and password, but this is no
longer the case - using SSH keys is required instead. Knowing exactly what these
are is not necessary to get them working, but we encourage you to read the box
below to learn more about them! GitHub has excellent, platform-specific
instructions both on how to [generate](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent)
and [add](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account)
SSH keys to your account, so please follow those instructions.
::: {.callout-note title="SSH keys and authentication"}
Using SSH (Secure Shell) for authentication basically entails setting up a
pair of keys: one private and one public. You keep the private key on your
local computer and give the public key to anywhere you want to be able to
connect to, _e.g._ GitHub. The public key can be used to encrypt messages that
_only_ the corresponding private key can decrypt. A simplified description of
how SSH authentication works goes like this:
1. The client (_i.e._ the local computer) sends the ID of the SSH key pair it
would like to use for authentication to the server (_e.g._ GitHub)
2. If that ID is found, the server generates a random number and encrypts this
with the public key and sends it back to the client
3. The client decrypts the random number with the private key and sends it
back to the server
Notice that the private key always remains on the client's side and is never
transferred over the connection; the ability to decrypt messages encrypted
with the public key is enough to ascertain the client's authenticity. This is
in contrast with using passwords, which are themselves sent across a
connection (albeit encrypted). It is also important to note that even though
the keys come in pairs it is impossible to derive the private key from the
public key. If you want to read more details about how SSH authentication works
you can check out [this website](https://www.digitalocean.com/community/tutorials/understanding-the-ssh-encryption-and-connection-process),
which has more in-depth information than we provide here.
:::
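
For orientation, generating a key pair looks roughly like the sketch below. It writes to a temporary directory with an empty passphrase so that it cannot overwrite any existing keys; the file name `id_ed25519_demo` and the email address are placeholders. For a real key, follow GitHub's instructions linked above.

```bash
# Scratch directory so that nothing in ~/.ssh is touched
key_dir="$(mktemp -d)"

# -t: key type, -C: comment (usually your email), -f: output file,
# -N "": empty passphrase (for this demo only), -q: quiet
ssh-keygen -t ed25519 -C "your_email@example.com" -f "$key_dir/id_ed25519_demo" -N "" -q

# The private key stays on your computer; the .pub file is what you give to GitHub
ls "$key_dir"
cat "$key_dir/id_ed25519_demo.pub"
```

Once a real public key has been added to your GitHub account, `ssh -T git@github.com` tests the connection.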
## Installing Conda
Conda is installed with a [Miniforge](https://github.com/conda-forge/miniforge)
installer specific for your operating system:
```bash
# Install Miniforge for 64-bit Mac
curl -L https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-x86_64.sh -O
bash Miniforge3-MacOSX-x86_64.sh
rm Miniforge3-MacOSX-x86_64.sh
```
```bash
# Install Miniforge for 64-bit Mac (Apple chip)
curl -L https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh -O
bash Miniforge3-MacOSX-arm64.sh
rm Miniforge3-MacOSX-arm64.sh
```
```bash
# Install Miniforge for 64-bit Linux
curl -L https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O
bash Miniforge3-Linux-x86_64.sh
rm Miniforge3-Linux-x86_64.sh
```
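
If you are unsure which of the three installers matches your machine, `uname` can tell you. This is a minimal sketch: `miniforge_installer` is a name made up for this example, and it only knows the three installers listed above.

```bash
# Hypothetical helper: map `uname -s` ($1) and `uname -m` ($2) to an installer file name
miniforge_installer() {
    case "$1-$2" in
        Darwin-x86_64) echo "Miniforge3-MacOSX-x86_64.sh" ;;
        Darwin-arm64)  echo "Miniforge3-MacOSX-arm64.sh" ;;
        Linux-x86_64)  echo "Miniforge3-Linux-x86_64.sh" ;;
        *) echo "No installer listed here for $1 $2" >&2; return 1 ;;
    esac
}

# Ask the current machine and print the matching installer, if there is one
installer="$(miniforge_installer "$(uname -s)" "$(uname -m)")" \
    && echo "Suggested installer: $installer"
```

On WSL2, `uname -s` reports `Linux`, so Windows users end up with the Linux installer.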
The installer will ask you questions during the installation:
- Do you accept the license terms? (Yes)
- Do you accept the installation path or do you want to choose a different one?
(Probably yes)
- Do you want the installer to initialize Miniforge (Yes)
Restart your shell so that the settings in `~/.bashrc` or `~/.bash_profile` can
take effect. You can verify that the installation worked by running:
```bash
conda --version
```
### If you already have Conda installed
If you have already installed Conda you can make sure you're using the latest
version by running `conda update -n base conda` and skip the installation
instructions above.
### Configuring Conda
As a last step we will set up the default channels (from where packages will be
searched for and downloaded if no channel is specified):
```bash
conda config --add channels bioconda
conda config --add channels conda-forge
```
And we will also set so-called 'strict' channel priority, which ensures higher
stability and better performance (you can read the details about this setting by
running `conda config --describe channel_priority`):
```bash
conda config --set channel_priority strict
```
::: {.callout-caution}
The Conda docs specify a couple of things to keep in mind when using Conda.
First of all, `conda` should be installed in the `base` environment and no other
packages should be installed into `base`. Furthermore, mixing of the
`conda-forge` and `defaults` channels should be avoided as the default Anaconda
channels are incompatible with `conda-forge`. Since we are installing from
`miniforge` we get the `conda-forge` defaults without having to do anything.
:::
### Conda on new Macs
If you have one of the newer Macs with Apple chips (the M-series) you may run
into some problems with certain Conda packages that have not yet been built for
the ARM64 architecture. The [Rosetta](https://support.apple.com/en-us/HT211861)
software allows ARM64 Macs to use software built for the old AMD64 architecture,
which means you can always fall back on creating AMD/Intel-based environments
and use them in conjunction with Rosetta. This can be done by specifying
`--subdir osx-64` when creating the environment, _e.g._:
```bash
conda env create -f <path-To-Environment.yml> --subdir osx-64
```
or
```bash
conda create -n myenv <packages...> --subdir osx-64
```
::: {.callout-important}
To make sure that subsequent installations into this environment also use the
`osx-64` architecture, activate the environment and then run:
```bash
conda config --env --set subdir osx-64
```
:::
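
Whether the extra flag is needed depends only on the operating system and CPU architecture, so it can be worked out from `uname`. This is a minimal sketch with a made-up helper name, `subdir_flag`; it prints the command rather than running it.

```bash
# Hypothetical helper: print the extra flag for Apple-silicon Macs, nothing otherwise
# $1: output of `uname -s`, $2: output of `uname -m`
subdir_flag() {
    if [ "$1" = "Darwin" ] && [ "$2" = "arm64" ]; then
        echo "--subdir osx-64"
    fi
}

flag="$(subdir_flag "$(uname -s)" "$(uname -m)")"
# Show the command that would be run; `myenv` and `environment.yml` are placeholders
echo "conda env create -f environment.yml -n myenv $flag"
```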
## Installing Snakemake
We will use Conda environments for the setup of this tutorial, but don't worry
if you don't understand exactly what everything does - you'll learn all the
details during the course. First make sure you're currently situated inside the
tutorials directory (`workshop-reproducible-research/tutorials`) and then create
and activate the Conda environment with the commands below:
::: {.callout-caution title="ARM64 users"}
Some of the packages in this environment are not available for the ARM64
architecture, so you'll have to add `--subdir osx-64` to the `conda env create`
command. See the [instructions above](#conda-on-new-macs) for more details.
:::
```bash
conda env create -f snakemake/environment.yml -n snakemake-env
conda activate snakemake-env
```
Check that Snakemake is installed correctly, for example by executing `snakemake
--help`. This should output a list of available Snakemake settings. If you get
`bash: snakemake: command not found` then you need to go back and ensure that
the Conda steps were successful. Once you've successfully completed the above
steps you can deactivate the environment using `conda deactivate` and continue
with the setup for the other tools.
## Installing Nextflow
The easiest way to install Nextflow is the official one, which is to just run the
following code:
```bash
curl -s https://get.nextflow.io | bash
```
This will give you the `nextflow` file in your current directory - move this file
to a directory in your `PATH`, _e.g._ `/usr/bin/`.
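
Moving the file can be sketched as follows. The target `$HOME/.local/bin` is an assumption chosen because it does not require `sudo`; any directory on your `PATH` works. `in_path` is a helper made up for this example.

```bash
# Hypothetical helper: succeeds if directory $1 is one of the entries in PATH
in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

target="$HOME/.local/bin"   # assumption: any directory you can write to works

# Only act if the downloaded launcher is in the current directory
if [ -f ./nextflow ]; then
    mkdir -p "$target"
    chmod +x ./nextflow
    mv ./nextflow "$target/"
fi

if in_path "$target"; then
    echo "$target is on your PATH"
else
    echo "Add this line to ~/.bashrc: export PATH=\"$target:\$PATH\""
fi
```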
If you're getting Java-related errors, you can either try to [update your Java
installation](https://www.nextflow.io/docs/latest/getstarted.html#requirements)
(Nextflow requires Java 11 or later) or install Nextflow using Conda. If you
want to use Conda, navigate to `workshop-reproducible-research/tutorials` and
create the environment:
```bash
conda env create -f nextflow/environment.yml -n nextflow-env
conda activate nextflow-env
```
::: {.callout-caution title="ARM64 users"}
Some of the packages in this environment are not available for the ARM64
architecture, so you'll have to follow the [instructions
above](#conda-on-new-macs).
:::
Check that Nextflow was installed correctly by running `nextflow -version`. If
you successfully installed Nextflow using Conda you can now deactivate the
environment using `conda deactivate` and continue with the other setups, as
needed.
## Installing Quarto
Installing Quarto is easiest by going to the [official
website](https://quarto.org/docs/get-started/) and downloading the
OS-appropriate package and following the installation instructions. You also
need to install a LaTeX distribution to be able to render Quarto documents to
PDF, which can be done using Quarto itself:
```bash
quarto install tinytex
```
While we're not installing Quarto _itself_ using Conda, we _will_ install some
software packages that are used in the Quarto tutorial using Conda: make sure
your working directory is in the tutorials directory (`workshop-reproducible-research/tutorials`)
and install the necessary packages defined in the `environment.yml`:
```bash
conda env create -f quarto/environment.yml -n quarto-env
```
## Installing Jupyter
Let's continue using Conda for installing software, since it's so convenient to
do so! Move into the tutorials directory (`workshop-reproducible-research/tutorials`),
create a Conda environment from the `jupyter/environment.yml` file and
activate it, like so:
```bash
conda env create -f jupyter/environment.yml -n jupyter-env
conda activate jupyter-env
```
We'll do one more thing before we're done with the Jupyter setup, which is to
install the Jupyter kernel for the environment we just created. First make sure
you have the `jupyter-env` environment activated, and then run:
```bash
python -m ipykernel install --user --name jupyter-env
```
Once you've successfully completed the above steps you can deactivate the
environment using `conda deactivate` and continue with the setup for the other
tools.
## Installing Docker
Installing Docker (specifically Docker Desktop) is quite straightforward on Mac,
Windows and Linux distributions. Note that Docker runs as root, which means that
you have to have `sudo` privileges on your computer in order to install or run
Docker. When you have finished installing Docker, regardless of which OS you are
on, please type `docker --version` to verify that the installation was
successful.
::: {.callout-note title="Docker for older versions of OSX/Windows"}
The latest version of Docker may not work if you have an old version of either
OSX or Windows. You can find older Docker versions that may be compatible for
you if you go to
[https://docs.docker.com/desktop/](https://docs.docker.com/desktop/) and click
"Previous versions" in the left side menu.
:::
### macOS
Go to [docker.com](https://docs.docker.com/docker-for-mac/install/#download-docker-for-mac)
and select the download option that is suitable for your computer's architecture
(_i.e._ if you have an Intel chip or a newer Apple silicon chip). This will download
a `dmg` file - click on it when it's done to start the installation. This will
open up a window where you can drag the Docker.app to Applications. Close the
window and click the Docker app from the Applications menu. Now you basically
just need to click "next" a couple of times and you should be good to go. You can find
the Docker icon in the menu bar in the upper right part of the screen.
### Linux
Go to the [linux-install](https://docs.docker.com/desktop/install/linux-install/)
section of the Docker documentation and make sure that your computer meets the
system requirements. There you can also find instructions for different Linux
distributions in the left sidebar under _Installation per Linux distro_.
### Windows
In order to run Docker on Windows your computer must support _Hardware
Virtualisation Technology_ and virtualisation must be enabled. This is typically
done in BIOS. Setting this is outside the scope of this tutorial, so we'll
simply go ahead as though it's enabled and hope that it works.
On Windows 10/11 we will install Docker for Windows, which is available at
[docker.com](https://docs.docker.com/docker-for-windows/install/#download-docker-for-windows).
Click the link _Download from Docker Hub_, and select _Get Docker_. Once the
download is complete, execute the file and follow the [instructions](https://docs.docker.com/docker-for-windows/install/#install-docker-desktop-on-windows).
You can now start Docker from the Start menu. You can search for it if you
cannot find it; the Docker whale icon should appear in the task bar.
You will probably need to enable integration with the Linux subsystem, if you
haven't done so during the installation of Docker Desktop. Right-click on the
Docker whale icon in the task bar and select _Settings_. Choose _Resources_ and
select _WSL integration_. Enable integration with the Linux subsystem and click
_Apply & Restart_; also restart the Linux subsystem.
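
Regardless of operating system, once everything on this page is installed you can check in one go which command-line tools are reachable. This is a minimal sketch: `report_tool` is a name made up for this example, and the list leaves out tools that live only inside a Conda environment (such as Snakemake), since those are found only while that environment is active.

```bash
# Hypothetical helper: print whether command $1 can be found on the current PATH
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Count how many of the expected tools are not reachable
missing=0
for tool in git conda nextflow quarto docker; do
    report_tool "$tool" || missing=$((missing + 1))
done
echo "$missing tool(s) not found on PATH"
```

If you installed Nextflow through Conda, activate `nextflow-env` before running the check.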