Skip to content

Commit

Permalink
Internal update
Browse files Browse the repository at this point in the history
PiperOrigin-RevId: 702021003
  • Loading branch information
kishwarshafin authored and pichuan committed Dec 3, 2024
1 parent 08e1d54 commit 0551e97
Show file tree
Hide file tree
Showing 16 changed files with 272 additions and 541 deletions.
2 changes: 0 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -86,8 +86,6 @@ docker run \
--output_vcf=/output/YOUR_OUTPUT_VCF \
--output_gvcf=/output/YOUR_OUTPUT_GVCF \
--num_shards=$(nproc) \ **This will use all your cores to run make_examples. Feel free to change.**
--vcf_stats_report=true \ **Optional. Creates VCF statistics report in html file. Default is false.
--disable_small_model=true \ **Optional. Disables the small model from make_examples stage. Default is false.
--logging_dir=/output/logs \ **Optional. This saves the log output for each stage separately.
--haploid_contigs="chrX,chrY" \ **Optional. Heterozygous variants in these contigs will be re-genotyped as the most likely of reference or homozygous alternates. For a sample with karyotype XY, it should be set to "chrX,chrY" for GRCh38 and "X,Y" for GRCh37. For a sample with karyotype XX, this should not be used.
--par_regions_bed="/input/GRCh3X_par.bed" \ **Optional. If --haploid_contigs is set, then this can be used to provide PAR regions to be excluded from genotype adjustment. Download links to this files are available in this page.
Expand Down
5 changes: 2 additions & 3 deletions docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,16 +13,15 @@
* [DeepVariant hybrid (PacBio and Illumina) case study](deepvariant-hybrid-case-study.md)
* [DeepVariant Complete Genomics T7 case study](deepvariant-complete-t7-case-study.md)
* [DeepVariant Complete Genomics G400 case study](deepvariant-complete-g400-case-study.md)
* [RNA-seq Case Study](deepvariant-rnaseq-case-study.md) for Illumina RNA-seq.
* [RNA-seq Case Study](deepvariant-rnaseq-case-study.md) for Illumina
RNA-seq.
* [PacBio Iso-Seq/MAS-Seq Case Study](deepvariant-masseq-case-study.md).
* [Runtime and accuracy metrics for all DeepVariant models](metrics.md)
* [Best practices for multi-sample variant calling](trio-merge-case-study.md)
* [Using graph genomes: VG Giraffe + DeepVariant case study](deepvariant-vg-case-study.md)
* Pangenome-aware DeepVariant WGS:
[Mapped with BWA](pangenome-aware-wgs-bwa-case-study.md),
[Mapped with VG](pangenome-aware-wgs-vg-case-study.md)
* Pangenome-aware DeepVariant WES:
[Mapped with BWA](pangenome-aware-wes-bwa-case-study.md)

## Visualization and analysis

Expand Down
39 changes: 15 additions & 24 deletions docs/deepvariant-fast-pipeline-case-study.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,21 +23,16 @@ Here we create Google Cloud compute instance. You may skip this step if you run
the case study on a local computer with GPU.

```bash
gcloud compute instances create "deepvariant-fast-pipeline" \
gcloud compute instances create "deepvariant-casestudy" \
--scopes "compute-rw,storage-full,cloud-platform" \
--maintenance-policy "TERMINATE" \
--accelerator=type=nvidia-tesla-p4,count=1 \
--image-family "ubuntu-2204-lts" \
--image-project "ubuntu-os-cloud" \
--machine-type "n1-standard-16" \
--boot-disk-size "100" \
--zone "us-central1-a"
```

You can then ssh into the machine by running:

```bash
gcloud compute ssh "deepvariant-fast-pipeline" --zone us-central1-a
--zone "us-central1-a" \
--min-cpu-platform "Intel Skylake"
```

## Install Nvidia drivers and Nvidia container toolkit (optional)
Expand All @@ -52,7 +47,6 @@ For this case study we used the
that automates the CUDA and container tools kit installation.

Please note that the script takes about 30 minutes to run.

```bash
wget https://raw.githubusercontent.com/google/deepvariant/refs/heads/r1.8.0/scripts/install_nvidia_docker.sh
chmod +x install_nvidia_docker.sh
Expand All @@ -64,7 +58,7 @@ chmod +x install_nvidia_docker.sh
### Get DeepVariant Docker image

```bash
BIN_VERSION="1.8.0"
BIN_VERSION="1.8.0-rc0"
sudo docker pull google/deepvariant:"${BIN_VERSION}-gpu"
```

Expand All @@ -86,7 +80,7 @@ gcloud storage cp gs://deepvariant/case-study-testdata/GCA_000001405.15_GRCh38_n

```bash
mkdir -p input
gcloud storage cp gs://deepvariant/pacbio-case-study-testdata/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam* input/
gcloud storage cp gs://deepvariant/pacbio-case-study-testdata/HG003.pfda_challenge.grch38.phased.chr20.bam* input/
```

## Run DeepVariant pipeline on chromosome 20 alignments
Expand Down Expand Up @@ -116,7 +110,7 @@ cat <<EOM >$FILE
--examples=/tmp/[email protected]
--gvcf=/tmp/[email protected]
--mode=calling
--reads=/input/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam
--reads=/input/HG003.pfda_challenge.grch38.phased.chr20.bam
--ref=/reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
--alt_aligned_pileup=diff_channels
--max_reads_per_partition=600
Expand Down Expand Up @@ -162,7 +156,7 @@ cat <<EOM >$FILE
--outfile=/output/variants.chr20.vcf
--gvcf_outfile=/output/variants.gvcf.chr20.vcf
--small_model_cvo_records=/tmp/[email protected]
--cpus=14
--cpus=0
EOM
```

Expand All @@ -184,7 +178,7 @@ time sudo docker run \
--shm_prefix dv \
--num_shards 14 \
--buffer_size 10485760 \
2>&1 | tee /tmp/fast_pipeline.docker.log
2>&1 | tee /tmp/fast_pipeline.Docker_chr20.log
```

* `-v` allows to map local directory inside docker container.
Expand Down Expand Up @@ -217,9 +211,9 @@ variants.gvcf.chr20.vcf
With the same settings the pipeline takes approximately 10 minutes.

```
real 8m15.252s
user 0m0.007s
sys 0m0.035s
real 10m35.875s
user 0m0.009s
sys 0m0.034s
```

## Benchmark output
Expand All @@ -236,8 +230,6 @@ curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_G

```
HAPPY_VERSION=v0.3.12
time sudo docker run \
-v ${PWD}/output:/output \
-v ${PWD}/benchmark:/benchmark \
-v ${PWD}/reference:/reference \
Expand All @@ -254,10 +246,9 @@ time sudo docker run \
```

```
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10628 10543 85 22403 74 11375 40 29 0.992002 0.993290 0.507744 0.992646 NaN NaN 1.748961 2.138647
INDEL PASS 10628 10543 85 22403 74 11375 40 29 0.992002 0.993290 0.507744 0.992646 NaN NaN 1.748961 2.138647
SNP ALL 70166 70101 65 105602 71 35342 12 12 0.999074 0.998989 0.334672 0.999032 2.296566 1.713281 1.883951 1.503192
SNP PASS 70166 70101 65 105602 71 35342 12 12 0.999074 0.998989 0.334672 0.999032 2.296566 1.713281 1.883951 1.503192
INDEL ALL 10628 10560 68 22592 70 11520 39 30 0.993602 0.993678 0.509915 0.993640 NaN NaN 1.748961 2.324711
INDEL PASS 10628 10560 68 22592 70 11520 39 30 0.993602 0.993678 0.509915 0.993640 NaN NaN 1.748961 2.324711
SNP ALL 70166 70142 24 104271 23 34047 7 5 0.999658 0.999672 0.326524 0.999665 2.296566 1.74197 1.883951 1.849802
SNP PASS 70166 70142 24 104271 23 34047 7 5 0.999658 0.999672 0.326524 0.999665 2.296566 1.74197 1.883951 1.849802
```
196 changes: 85 additions & 111 deletions docs/deepvariant-pacbio-model-case-study.md
Original file line number Diff line number Diff line change
@@ -1,154 +1,128 @@
# DeepVariant with PacBio HiFi data
# Using DeepVariant for small variant calling from PacBio HiFi reads

#### Author: William Rowell <[email protected]>

In this case study we describe applying DeepVariant to PacBio HiFi reads to call
variants. We will call small variants from a publicly available whole genome
HiFi dataset from PacBio.

### Updated dataset in release 1.8.0

In release 1.8.0, we have updated the PacBio test data from HG003 Sequel-II to
latest Revio with SPRQ chemistry data to showcase performance on the updated
platform and chemistry. The full bam data is available [here](https://downloads.pacbcloud.com/public/revio/2024Q4/WGS/GIAB_trio/HG003/analysis/GRCh38.m84039_241002_000337_s3.hifi_reads.bc2020.bam).

The dataset used in this case-study has following attributes:

```bash
Sample: HG003
Region: Chr20
Chemistry: REVIO SPRQ
Coverage: 32x
```
Starting in v1.4.0, PacBio calling uses one-step variant calling. If you're
looking for documentation for the two-step process, please look at v1.3.0.

## Prepare environment

In this case-study, we will use [Docker](https://docs.docker.com/get-docker/) to
run DeepVariant for variant calling and
[hap.py](https://github.com/illumina/hap.py) for benchmarking.
### Tools

If you want to run on GPU machines, or use `Singularity` instead of `Docker`,
please follow [Quick Start](deepvariant-quick-start.md) documentation.
[Singularity](https://sylabs.io/docs/) will be used to run DeepVariant and
[hap.py](https://github.com/illumina/hap.py), and we'll use
[miniconda](https://docs.conda.io/en/latest/miniconda.html) and a conda
environment to handle the other dependencies for the case study and samtools.

### Create input and output directory structures and download inputs
- singularity (must be installed by `root` user; outside of the scope of this
case study)
- samtools

```bash
BASE="${HOME}/pacbio-case-study"

# Set up input and output directory data
INPUT_DIR="${BASE}/input/data"
OUTPUT_DIR="${BASE}/output"

## Create local directory structure
mkdir -p "${INPUT_DIR}"
mkdir -p "${OUTPUT_DIR}"
# add channels to conda configuration
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# create the environment and install dependencies
conda create -y -n deepvariant_env
conda activate deepvariant_env
conda install -y samtools==1.10
```

# Download reference to input directory
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > ${INPUT_DIR}/GRCh38_no_alt_analysis_set.fasta
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > ${INPUT_DIR}/GRCh38_no_alt_analysis_set.fasta.fai
### Download Reference

HTTPDIR=https://storage.googleapis.com/deepvariant/pacbio-case-study-testdata
curl ${HTTPDIR}/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam > ${INPUT_DIR}/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam
curl ${HTTPDIR}/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam.bai > ${INPUT_DIR}/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam.bai
We will be using GRCh38 for this case study.

# Set up input variables
REF="GRCh38_no_alt_analysis_set.fasta"
BAM="HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam"
THREADS=$(nproc)
REGION="chr20"
```bash
mkdir -p reference

# Set up output variable
OUTPUT_VCF="HG003_PACBIO_SPRQ_GRCh38.chr20.output.vcf.gz"
OUTPUT_GVCF="HG003_PACBIO_SPRQ_GRCh38.chr20.output.g.vcf.gz"
INTERMEDIATE_DIRECTORY="intermediate_results_dir"
# download and decompress
curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > reference/GRCh38_no_alt_analysis_set.fasta

mkdir -p "${OUTPUT_DIR}/${INTERMEDIATE_DIRECTORY}"
# index reference
samtools faidx reference/GRCh38_no_alt_analysis_set.fasta
```

## Run DeepVariant
### Download Genome in a Bottle Benchmarks

We will run DeepVariant from docker using the `run_deepvariant` script.
We will benchmark our variant calls against v4.2.1 of the Genome in a Bottle
small variant benchmarks for HG003.

```bash
BIN_VERSION="1.8.0"

sudo docker run \
-v "${INPUT_DIR}":"${INPUT_DIR}" \
-v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type PACBIO \
--ref "${INPUT_DIR}/${REF}" \
--reads "${INPUT_DIR}/${BAM}" \
--output_vcf "${OUTPUT_DIR}/${OUTPUT_VCF}" \
--output_gvcf "${OUTPUT_DIR}/${OUTPUT_GVCF}" \
--num_shards "${THREADS}" \
--regions "${REGION}" \
--intermediate_results_dir "${OUTPUT_DIR}/${INTERMEDIATE_DIRECTORY}"
```
mkdir -p benchmark

By specifying `--model_type PACBIO`, you'll be using a model that is best
suited for PacBio data.

NOTE: If you want to run each of the steps separately, add `--dry_run=true` to
the command above to figure out what flags you need in each step. Based on the
different model types, different flags are needed in the `make_examples` step.

`--intermediate_results_dir` flag is optional. By specifying it, the
intermediate outputs of `make_examples` and `call_variants` stages can be found
in the directory. After the command, you can find these files in the directory:
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38

```
call_variants_output.tfrecord.gz
gvcf.tfrecord-?????-of-?????.gz
make_examples.tfrecord-?????-of-?????.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
```

## Benchmark HG003 chr20 output from DeepVariant
### Download HG003 chr20 HiFi alignments

We will use Genome-in-a-Bottle (GIAB) dataset to evaluate the performance of
DeepVariant.
We'll use HG003 chr20 HiFi reads publicly available from the [PrecisionFDA Truth v2 Challenge](https://precision.fda.gov/challenges/10).

### Download Genome in a Bottle Benchmarks
```bash
mkdir -p input
HTTPDIR=https://downloads.pacbcloud.com/public/dataset/HG003/deepvariant-case-study

We will benchmark our variant calls against v4.2.1 of the Genome in a Bottle
small variant benchmarks for HG003.
curl ${HTTPDIR}/HG003.GRCh38.chr20.pFDA_truthv2.bam > input/HG003.GRCh38.chr20.pFDA_truthv2.bam
curl ${HTTPDIR}/HG003.GRCh38.chr20.pFDA_truthv2.bam.bai > input/HG003.GRCh38.chr20.pFDA_truthv2.bam.bai
```

## Run DeepVariant on chromosome 20 alignments

```bash
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38
ulimit -u 10000 # https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable/54746150#54746150
BIN_VERSION="1.7.0"
mkdir -p deepvariant_output

singularity exec --bind /usr/lib/locale/ \
docker://google/deepvariant:${BIN_VERSION} \
/opt/deepvariant/bin/run_deepvariant \
--model_type PACBIO \
--ref reference/GRCh38_no_alt_analysis_set.fasta \
--reads input/HG003.GRCh38.chr20.pFDA_truthv2.bam \
--output_vcf deepvariant_output/output.vcf.gz \
--num_shards $(nproc) \
--regions chr20
```

curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > ${INPUT_DIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
NOTE: If you want to run each of the steps separately, add `--dry_run=true`
to the command above to figure out what flags you need in each step. Based on
the different model types, different flags are needed in the `make_examples`
step.

TRUTH_VCF="HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
TRUTH_BED="HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed"
```
## Benchmark output

```bash
sudo docker pull jmcdani20/hap.py:v0.3.12

sudo docker run \
-v "${INPUT_DIR}":"${INPUT_DIR}" \
-v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \
-v "${PWD}/happy:/happy" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
"${INPUT_DIR}/${TRUTH_VCF}" \
"${OUTPUT_DIR}/${OUTPUT_VCF}" \
-f "${INPUT_DIR}/${TRUTH_BED}" \
-r "${INPUT_DIR}/${REF}" \
-o "${OUTPUT_DIR}/hg003.pacbio.chr20.happy.output" \
--engine=vcfeval \
--pass-only \
-l "${REGION}"
mkdir -p happy

singularity exec docker://jmcdani20/hap.py:v0.3.12 \
/opt/hap.py/bin/hap.py \
--threads $(nproc) \
-r reference/GRCh38_no_alt_analysis_set.fasta \
-f benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-o happy/giab-comparison.v4.2.first_pass \
--engine=vcfeval \
--pass-only \
-l chr20 \
benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
deepvariant_output/output.vcf.gz
```

Output:

```
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10628 10543 85 22403 74 11375 40 29 0.992002 0.993290 0.507744 0.992646 NaN NaN 1.748961 2.138647
INDEL PASS 10628 10543 85 22403 74 11375 40 29 0.992002 0.993290 0.507744 0.992646 NaN NaN 1.748961 2.138647
SNP ALL 70166 70101 65 105602 71 35342 12 12 0.999074 0.998989 0.334672 0.999032 2.296566 1.713281 1.883951 1.503192
SNP PASS 70166 70101 65 105602 71 35342 12 12 0.999074 0.998989 0.334672 0.999032 2.296566 1.713281 1.883951 1.503192
INDEL ALL 10628 10551 77 22590 69 11527 39 29 0.992755 0.993763 0.510270 0.993259 NaN NaN 1.748961 2.275319
INDEL PASS 10628 10551 77 22590 69 11527 39 29 0.992755 0.993763 0.510270 0.993259 NaN NaN 1.748961 2.275319
SNP ALL 70166 70141 25 98780 23 28559 5 11 0.999644 0.999672 0.289117 0.999658 2.296566 1.823452 1.883951 1.913585
SNP PASS 70166 70141 25 98780 23 28559 5 11 0.999644 0.999672 0.289117 0.999658 2.296566 1.823452 1.883951 1.913585
```
Loading

0 comments on commit 0551e97

Please sign in to comment.