-
Notifications
You must be signed in to change notification settings - Fork 732
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing
16 changed files
with
272 additions
and
541 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,21 +23,16 @@ Here we create Google Cloud compute instance. You may skip this step if you run | |
the case study on a local computer with GPU. | ||
|
||
```bash | ||
gcloud compute instances create "deepvariant-fast-pipeline" \ | ||
gcloud compute instances create "deepvariant-casestudy" \ | ||
--scopes "compute-rw,storage-full,cloud-platform" \ | ||
--maintenance-policy "TERMINATE" \ | ||
--accelerator=type=nvidia-tesla-p4,count=1 \ | ||
--image-family "ubuntu-2204-lts" \ | ||
--image-project "ubuntu-os-cloud" \ | ||
--machine-type "n1-standard-16" \ | ||
--boot-disk-size "100" \ | ||
--zone "us-central1-a" | ||
``` | ||
|
||
You can then ssh into the machine by running: | ||
|
||
```bash | ||
gcloud compute ssh "deepvariant-fast-pipeline" --zone us-central1-a | ||
--zone "us-central1-a" \ | ||
--min-cpu-platform "Intel Skylake" | ||
``` | ||
|
||
## Install Nvidia drivers and Nvidia container toolkit (optional) | ||
|
@@ -52,7 +47,6 @@ For this case study we used the | |
that automates the CUDA and container tools kit installation. | ||
|
||
Please note that the script takes about 30 minutes to run. | ||
|
||
```bash | ||
wget https://raw.githubusercontent.com/google/deepvariant/refs/heads/r1.8.0/scripts/install_nvidia_docker.sh | ||
chmod +x install_nvidia_docker.sh | ||
|
@@ -64,7 +58,7 @@ chmod +x install_nvidia_docker.sh | |
### Get DeepVariant Docker image | ||
|
||
```bash | ||
BIN_VERSION="1.8.0" | ||
BIN_VERSION="1.8.0-rc0" | ||
sudo docker pull google/deepvariant:"${BIN_VERSION}-gpu" | ||
``` | ||
|
||
|
@@ -86,7 +80,7 @@ gcloud storage cp gs://deepvariant/case-study-testdata/GCA_000001405.15_GRCh38_n | |
|
||
```bash | ||
mkdir -p input | ||
gcloud storage cp gs://deepvariant/pacbio-case-study-testdata/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam* input/ | ||
gcloud storage cp gs://deepvariant/pacbio-case-study-testdata/HG003.pfda_challenge.grch38.phased.chr20.bam* input/ | ||
``` | ||
|
||
## Run DeepVariant pipeline on chromosome 20 alignments | ||
|
@@ -116,7 +110,7 @@ cat <<EOM >$FILE | |
--examples=/tmp/[email protected] | ||
--gvcf=/tmp/[email protected] | ||
--mode=calling | ||
--reads=/input/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam | ||
--reads=/input/HG003.pfda_challenge.grch38.phased.chr20.bam | ||
--ref=/reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | ||
--alt_aligned_pileup=diff_channels | ||
--max_reads_per_partition=600 | ||
|
@@ -162,7 +156,7 @@ cat <<EOM >$FILE | |
--outfile=/output/variants.chr20.vcf | ||
--gvcf_outfile=/output/variants.gvcf.chr20.vcf | ||
--small_model_cvo_records=/tmp/[email protected] | ||
--cpus=14 | ||
--cpus=0 | ||
EOM | ||
``` | ||
|
||
|
@@ -184,7 +178,7 @@ time sudo docker run \ | |
--shm_prefix dv \ | ||
--num_shards 14 \ | ||
--buffer_size 10485760 \ | ||
2>&1 | tee /tmp/fast_pipeline.docker.log | ||
2>&1 | tee /tmp/fast_pipeline.Docker_chr20.log | ||
``` | ||
|
||
* `-v` allows to map local directory inside docker container. | ||
|
@@ -217,9 +211,9 @@ variants.gvcf.chr20.vcf | |
With the same settings the pipeline takes approximately 10 minutes. | ||
|
||
``` | ||
real 8m15.252s | ||
user 0m0.007s | ||
sys 0m0.035s | ||
real 10m35.875s | ||
user 0m0.009s | ||
sys 0m0.034s | ||
``` | ||
|
||
## Benchmark output | ||
|
@@ -236,8 +230,6 @@ curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_G | |
|
||
``` | ||
HAPPY_VERSION=v0.3.12 | ||
time sudo docker run \ | ||
-v ${PWD}/output:/output \ | ||
-v ${PWD}/benchmark:/benchmark \ | ||
-v ${PWD}/reference:/reference \ | ||
|
@@ -254,10 +246,9 @@ time sudo docker run \ | |
``` | ||
|
||
``` | ||
Benchmarking Summary: | ||
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio | ||
INDEL ALL 10628 10543 85 22403 74 11375 40 29 0.992002 0.993290 0.507744 0.992646 NaN NaN 1.748961 2.138647 | ||
INDEL PASS 10628 10543 85 22403 74 11375 40 29 0.992002 0.993290 0.507744 0.992646 NaN NaN 1.748961 2.138647 | ||
SNP ALL 70166 70101 65 105602 71 35342 12 12 0.999074 0.998989 0.334672 0.999032 2.296566 1.713281 1.883951 1.503192 | ||
SNP PASS 70166 70101 65 105602 71 35342 12 12 0.999074 0.998989 0.334672 0.999032 2.296566 1.713281 1.883951 1.503192 | ||
INDEL ALL 10628 10560 68 22592 70 11520 39 30 0.993602 0.993678 0.509915 0.993640 NaN NaN 1.748961 2.324711 | ||
INDEL PASS 10628 10560 68 22592 70 11520 39 30 0.993602 0.993678 0.509915 0.993640 NaN NaN 1.748961 2.324711 | ||
SNP ALL 70166 70142 24 104271 23 34047 7 5 0.999658 0.999672 0.326524 0.999665 2.296566 1.74197 1.883951 1.849802 | ||
SNP PASS 70166 70142 24 104271 23 34047 7 5 0.999658 0.999672 0.326524 0.999665 2.296566 1.74197 1.883951 1.849802 | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,154 +1,128 @@ | ||
# DeepVariant with PacBio HiFi data | ||
# Using DeepVariant for small variant calling from PacBio HiFi reads | ||
|
||
#### Author: William Rowell <[email protected]> | ||
|
||
In this case study we describe applying DeepVariant to PacBio HiFi reads to call | ||
variants. We will call small variants from a publicly available whole genome | ||
HiFi dataset from PacBio. | ||
|
||
### Updated dataset in release 1.8.0 | ||
|
||
In release 1.8.0, we have updated the PacBio test data from HG003 Sequel-II to | ||
latest Revio with SPRQ chemistry data to showcase performance on the updated | ||
platform and chemistry. The full bam data is available [here](https://downloads.pacbcloud.com/public/revio/2024Q4/WGS/GIAB_trio/HG003/analysis/GRCh38.m84039_241002_000337_s3.hifi_reads.bc2020.bam). | ||
|
||
The dataset used in this case-study has following attributes: | ||
|
||
```bash | ||
Sample: HG003 | ||
Region: Chr20 | ||
Chemistry: REVIO SPRQ | ||
Coverage: 32x | ||
``` | ||
Starting in v1.4.0, PacBio calling uses one-step variant calling. If you're | ||
looking for documentation for the two-step process, please look at v1.3.0. | ||
|
||
## Prepare environment | ||
|
||
In this case-study, we will use [Docker](https://docs.docker.com/get-docker/) to | ||
run DeepVariant for variant calling and | ||
[hap.py](https://github.com/illumina/hap.py) for benchmarking. | ||
### Tools | ||
|
||
If you want to run on GPU machines, or use `Singularity` instead of `Docker`, | ||
please follow [Quick Start](deepvariant-quick-start.md) documentation. | ||
[Singularity](https://sylabs.io/docs/) will be used to run DeepVariant and | ||
[hap.py](https://github.com/illumina/hap.py), and we'll use | ||
[miniconda](https://docs.conda.io/en/latest/miniconda.html) and a conda | ||
environment to handle the other dependencies for the case study and samtools. | ||
|
||
### Create input and output directory structures and download inputs | ||
- singularity (must be installed by `root` user; outside of the scope of this | ||
case study) | ||
- samtools | ||
|
||
```bash | ||
BASE="${HOME}/pacbio-case-study" | ||
|
||
# Set up input and output directory data | ||
INPUT_DIR="${BASE}/input/data" | ||
OUTPUT_DIR="${BASE}/output" | ||
|
||
## Create local directory structure | ||
mkdir -p "${INPUT_DIR}" | ||
mkdir -p "${OUTPUT_DIR}" | ||
# add channels to conda configuration | ||
conda config --add channels defaults | ||
conda config --add channels bioconda | ||
conda config --add channels conda-forge | ||
|
||
# create the environment and install dependencies | ||
conda create -y -n deepvariant_env | ||
conda activate deepvariant_env | ||
conda install -y samtools==1.10 | ||
``` | ||
|
||
# Download reference to input directory | ||
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids | ||
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > ${INPUT_DIR}/GRCh38_no_alt_analysis_set.fasta | ||
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > ${INPUT_DIR}/GRCh38_no_alt_analysis_set.fasta.fai | ||
### Download Reference | ||
|
||
HTTPDIR=https://storage.googleapis.com/deepvariant/pacbio-case-study-testdata | ||
curl ${HTTPDIR}/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam > ${INPUT_DIR}/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam | ||
curl ${HTTPDIR}/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam.bai > ${INPUT_DIR}/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam.bai | ||
We will be using GRCh38 for this case study. | ||
|
||
# Set up input variables | ||
REF="GRCh38_no_alt_analysis_set.fasta" | ||
BAM="HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam" | ||
THREADS=$(nproc) | ||
REGION="chr20" | ||
```bash | ||
mkdir -p reference | ||
|
||
# Set up output variable | ||
OUTPUT_VCF="HG003_PACBIO_SPRQ_GRCh38.chr20.output.vcf.gz" | ||
OUTPUT_GVCF="HG003_PACBIO_SPRQ_GRCh38.chr20.output.g.vcf.gz" | ||
INTERMEDIATE_DIRECTORY="intermediate_results_dir" | ||
# download and decompress | ||
curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > reference/GRCh38_no_alt_analysis_set.fasta | ||
|
||
mkdir -p "${OUTPUT_DIR}/${INTERMEDIATE_DIRECTORY}" | ||
# index reference | ||
samtools faidx reference/GRCh38_no_alt_analysis_set.fasta | ||
``` | ||
|
||
## Run DeepVariant | ||
### Download Genome in a Bottle Benchmarks | ||
|
||
We will run DeepVariant from docker using the `run_deepvariant` script. | ||
We will benchmark our variant calls against v4.2.1 of the Genome in a Bottle | ||
small variant benchmarks for HG003. | ||
|
||
```bash | ||
BIN_VERSION="1.8.0" | ||
|
||
sudo docker run \ | ||
-v "${INPUT_DIR}":"${INPUT_DIR}" \ | ||
-v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \ | ||
google/deepvariant:"${BIN_VERSION}" \ | ||
/opt/deepvariant/bin/run_deepvariant \ | ||
--model_type PACBIO \ | ||
--ref "${INPUT_DIR}/${REF}" \ | ||
--reads "${INPUT_DIR}/${BAM}" \ | ||
--output_vcf "${OUTPUT_DIR}/${OUTPUT_VCF}" \ | ||
--output_gvcf "${OUTPUT_DIR}/${OUTPUT_GVCF}" \ | ||
--num_shards "${THREADS}" \ | ||
--regions "${REGION}" \ | ||
--intermediate_results_dir "${OUTPUT_DIR}/${INTERMEDIATE_DIRECTORY}" | ||
``` | ||
mkdir -p benchmark | ||
|
||
By specifying `--model_type PACBIO`, you'll be using a model that is best | ||
suited for PacBio data. | ||
|
||
NOTE: If you want to run each of the steps separately, add `--dry_run=true` to | ||
the command above to figure out what flags you need in each step. Based on the | ||
different model types, different flags are needed in the `make_examples` step. | ||
|
||
`--intermediate_results_dir` flag is optional. By specifying it, the | ||
intermediate outputs of `make_examples` and `call_variants` stages can be found | ||
in the directory. After the command, you can find these files in the directory: | ||
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38 | ||
|
||
``` | ||
call_variants_output.tfrecord.gz | ||
gvcf.tfrecord-?????-of-?????.gz | ||
make_examples.tfrecord-?????-of-?????.gz | ||
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed | ||
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | ||
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi | ||
``` | ||
|
||
## Benchmark HG003 chr20 output from DeepVariant | ||
### Download HG003 chr20 HiFi alignments | ||
|
||
We will use Genome-in-a-Bottle (GIAB) dataset to evaluate the performance of | ||
DeepVariant. | ||
We'll use HG003 chr20 HiFi reads publicly available from the [PrecisionFDA Truth v2 Challenge](https://precision.fda.gov/challenges/10). | ||
|
||
### Download Genome in a Bottle Benchmarks | ||
```bash | ||
mkdir -p input | ||
HTTPDIR=https://downloads.pacbcloud.com/public/dataset/HG003/deepvariant-case-study | ||
|
||
We will benchmark our variant calls against v4.2.1 of the Genome in a Bottle | ||
small variant benchmarks for HG003. | ||
curl ${HTTPDIR}/HG003.GRCh38.chr20.pFDA_truthv2.bam > input/HG003.GRCh38.chr20.pFDA_truthv2.bam | ||
curl ${HTTPDIR}/HG003.GRCh38.chr20.pFDA_truthv2.bam.bai > input/HG003.GRCh38.chr20.pFDA_truthv2.bam.bai | ||
``` | ||
|
||
## Run DeepVariant on chromosome 20 alignments | ||
|
||
```bash | ||
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38 | ||
ulimit -u 10000 # https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable/54746150#54746150 | ||
BIN_VERSION="1.7.0" | ||
mkdir -p deepvariant_output | ||
|
||
singularity exec --bind /usr/lib/locale/ \ | ||
docker://google/deepvariant:${BIN_VERSION} \ | ||
/opt/deepvariant/bin/run_deepvariant \ | ||
--model_type PACBIO \ | ||
--ref reference/GRCh38_no_alt_analysis_set.fasta \ | ||
--reads input/HG003.GRCh38.chr20.pFDA_truthv2.bam \ | ||
--output_vcf deepvariant_output/output.vcf.gz \ | ||
--num_shards $(nproc) \ | ||
--regions chr20 | ||
``` | ||
|
||
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed | ||
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | ||
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi | ||
NOTE: If you want to run each of the steps separately, add `--dry_run=true` | ||
to the command above to figure out what flags you need in each step. Based on | ||
the different model types, different flags are needed in the `make_examples` | ||
step. | ||
|
||
TRUTH_VCF="HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz" | ||
TRUTH_BED="HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed" | ||
``` | ||
## Benchmark output | ||
|
||
```bash | ||
sudo docker pull jmcdani20/hap.py:v0.3.12 | ||
|
||
sudo docker run \ | ||
-v "${INPUT_DIR}":"${INPUT_DIR}" \ | ||
-v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \ | ||
-v "${PWD}/happy:/happy" \ | ||
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \ | ||
"${INPUT_DIR}/${TRUTH_VCF}" \ | ||
"${OUTPUT_DIR}/${OUTPUT_VCF}" \ | ||
-f "${INPUT_DIR}/${TRUTH_BED}" \ | ||
-r "${INPUT_DIR}/${REF}" \ | ||
-o "${OUTPUT_DIR}/hg003.pacbio.chr20.happy.output" \ | ||
--engine=vcfeval \ | ||
--pass-only \ | ||
-l "${REGION}" | ||
mkdir -p happy | ||
|
||
singularity exec docker://jmcdani20/hap.py:v0.3.12 \ | ||
/opt/hap.py/bin/hap.py \ | ||
--threads $(nproc) \ | ||
-r reference/GRCh38_no_alt_analysis_set.fasta \ | ||
-f benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \ | ||
-o happy/giab-comparison.v4.2.first_pass \ | ||
--engine=vcfeval \ | ||
--pass-only \ | ||
-l chr20 \ | ||
benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \ | ||
deepvariant_output/output.vcf.gz | ||
``` | ||
|
||
Output: | ||
|
||
``` | ||
Benchmarking Summary: | ||
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio | ||
INDEL ALL 10628 10543 85 22403 74 11375 40 29 0.992002 0.993290 0.507744 0.992646 NaN NaN 1.748961 2.138647 | ||
INDEL PASS 10628 10543 85 22403 74 11375 40 29 0.992002 0.993290 0.507744 0.992646 NaN NaN 1.748961 2.138647 | ||
SNP ALL 70166 70101 65 105602 71 35342 12 12 0.999074 0.998989 0.334672 0.999032 2.296566 1.713281 1.883951 1.503192 | ||
SNP PASS 70166 70101 65 105602 71 35342 12 12 0.999074 0.998989 0.334672 0.999032 2.296566 1.713281 1.883951 1.503192 | ||
INDEL ALL 10628 10551 77 22590 69 11527 39 29 0.992755 0.993763 0.510270 0.993259 NaN NaN 1.748961 2.275319 | ||
INDEL PASS 10628 10551 77 22590 69 11527 39 29 0.992755 0.993763 0.510270 0.993259 NaN NaN 1.748961 2.275319 | ||
SNP ALL 70166 70141 25 98780 23 28559 5 11 0.999644 0.999672 0.289117 0.999658 2.296566 1.823452 1.883951 1.913585 | ||
SNP PASS 70166 70141 25 98780 23 28559 5 11 0.999644 0.999672 0.289117 0.999658 2.296566 1.823452 1.883951 1.913585 | ||
``` |
Oops, something went wrong.