This workflow is used by the Kids First (KF) Data Resource Center (DRC) to create consensus calls from outputs generated by our somatic variant callers.
This workflow takes the protected VCF outputs from the Kids First DRC Somatic Workflow and creates protected and public consensus VCF and MAF files. Benchmarking of our SNV callers and consensus methods can be found here. The general outline is as follows:
- Prep MNP Variants
  - Strelka2 outputs multi-nucleotide polymorphisms (MNPs) as consecutive single-nucleotide polymorphisms (SNPs)
  - In order to preserve MNPs, we gather MNP calls from the other caller inputs and search for evidence supporting these consecutive SNP calls as MNP candidates
  - Once found, the Strelka2 SNP calls supporting an MNP are converted to a single MNP call
  - This is done to preserve the predicted gene model as accurately as possible in our consensus calls
- Consensus merge
  - Calls are gathered from all four callers
  - By default, calls with support from 2+ callers OR calls that are marked as `HotSpotAllele` in the `INFO` field are retained
  - Retained calls then have their `MQ` and `MQ0` values calculated from the input tumor cram
  - `GT` fields are estimated as "majority rules," and when no majority exists, set as `0/1` by default
  - `AD`, `DP`, and `AF` are calculated as the average value between callers
  - `ADR`, `DPR`, and `AFR` fields are added as the range of values from the previous point, to give the observer a sense of confidence in the value
- VEP Annotate Consensus (see Kids First DRC Somatic Variant Annotation Workflow for details)
- Echtvar Annotation
  - Additional annotation is performed to augment VEP annotation
  - While VEP does have extensive gnomAD allele frequency annotation, it is limited to exome values. The gnomAD AF-only resource we add augments this with an additional `INFO/AF` field that provides WGS frequencies
- Soft filter variants
  - A soft filter is added based on criteria provided
  - By default, we perform soft filtering as outlined in the KFDRC Annotation Subworkflow
- VCF2MAF protected
  - Here, for convenience of analysis, we convert the resultant soft-filtered VCF (AKA "Protected VCF") into MAF format
- Hard filter VCF
  - The Protected VCF is hard filtered on `PASS` and `HotSpotAllele` for reasons outlined in the `Soft filter variants` step
  - This VCF is known as the "Public VCF"
- VCF2MAF public
- Rename outputs
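The consensus merge rules in the outline can be illustrated with a minimal Python sketch. This is not the workflow's actual implementation: the function name `consensus_call`, the per-caller dictionary layout, the treatment of `AD` as a single alt-allele depth, the reading of "majority rules" as a strict majority, and the `(min, max)` tuple used for the range fields are all assumptions made for illustration. `MQ` and `MQ0` are left out because they require reading the tumor cram.

```python
from collections import Counter


def consensus_call(caller_calls, hotspot=False, ncallers=2):
    """Merge one variant's per-caller calls into a consensus record.

    caller_calls maps a caller name to a dict with GT (str), AD (int,
    simplified here to the alt-allele depth), DP (int), and AF (float).
    Returns None when the call is not retained.
    """
    if not caller_calls:
        return None
    # Retain calls supported by ncallers+ callers, or flagged HotSpotAllele.
    if len(caller_calls) < ncallers and not hotspot:
        return None

    # "Majority rules" genotype; assumed here to mean a strict majority,
    # falling back to 0/1 when no genotype has one.
    gt_counts = Counter(call["GT"] for call in caller_calls.values())
    top_gt, top_count = gt_counts.most_common(1)[0]
    genotype = top_gt if top_count > len(caller_calls) / 2 else "0/1"

    record = {"GT": genotype, "callers": sorted(caller_calls)}
    for field in ("AD", "DP", "AF"):
        values = [call[field] for call in caller_calls.values()]
        record[field] = sum(values) / len(values)  # average between callers
        record[field + "R"] = (min(values), max(values))  # ADR / DPR / AFR
    return record


# Hypothetical per-caller values for a single variant.
example = {
    "strelka2": {"GT": "0/1", "AD": 10, "DP": 40, "AF": 0.25},
    "mutect2": {"GT": "0/1", "AD": 20, "DP": 40, "AF": 0.5},
}
print(consensus_call(example))
```

A call seen by only one caller returns `None` unless `hotspot=True` is passed, mirroring the "2+ callers OR `HotSpotAllele`" default.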
General workflow inputs. All file references can be obtained here:
- indexed_reference_fasta: Homo_sapiens_assembly38.fasta
- strelka2_vcf
- mutect2_vcf
- lancet_vcf
- vardict_vcf
- cram # Tumor cram recommended for MQ score calculation
- input_tumor_name
- input_normal_name
- output_basename
- tool_name: "consensus_somatic"
- ncallers: # Optional number of callers required for consensus, recommend `2`
- consensus_ram: `3`
- annotation_zip: gnomad.v3.1.1.custom.echtvar.zip # population stats VCF for public filtering
- vep_cache: homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gz
- gatk_filter_name: `[NORM_DP_LOW, GNOMAD_AF_HIGH]`
- gatk_filter_expression: `[vc.getGenotype('insert_norm_sample_id_here').getDP() <= 7, gnomad_3_1_1_AF != '.' && gnomad_3_1_1_AF > 0.001 && gnomad_3_1_1_FILTER=='PASS']`
- bcftools_public_filter: `FILTER="PASS"|INFO/HotSpotAllele=1`
- retain_info: "gnomad_3_1_1_AC,gnomad_3_1_1_AN,gnomad_3_1_1_AF,gnomad_3_1_1_nhomalt,gnomad_3_1_1_AC_popmax,gnomad_3_1_1_AN_popmax,gnomad_3_1_1_AF_popmax,gnomad_3_1_1_nhomalt_popmax,gnomad_3_1_1_AC_controls_and_biobanks,gnomad_3_1_1_AN_controls_and_biobanks,gnomad_3_1_1_AF_controls_and_biobanks,gnomad_3_1_1_AF_non_cancer,gnomad_3_1_1_primate_ai_score,gnomad_3_1_1_splice_ai_consequence,gnomad_3_1_1_AF_non_cancer_afr,gnomad_3_1_1_AF_non_cancer_ami,gnomad_3_1_1_AF_non_cancer_asj,gnomad_3_1_1_AF_non_cancer_eas,gnomad_3_1_1_AF_non_cancer_fin,gnomad_3_1_1_AF_non_cancer_mid,gnomad_3_1_1_AF_non_cancer_nfe,gnomad_3_1_1_AF_non_cancer_oth,gnomad_3_1_1_AF_non_cancer_raw,gnomad_3_1_1_AF_non_cancer_sas,gnomad_3_1_1_AF_non_cancer_amr,gnomad_3_1_1_AF_non_cancer_popmax,gnomad_3_1_1_AF_non_cancer_all_popmax,gnomad_3_1_1_FILTER,MQ,MQ0,CAL,HotSpotAllele"
- retain_fmt: # csv string with FORMAT fields that you want to keep
- retain_ann: "HGVSg"
- maf_center: "."
- `custom_enst`: `kf_isoform_override.tsv`. As of VEP 104, several genes have had their canonical transcripts redefined. While the VCF will have all possible isoforms, this affects MAF file output and may result in representative protein changes that defy historical expectations
- `annotated_protected_outputs`: Array of files containing MAF format of PASS hits, `PASS` VCF with annotation pipeline soft `FILTER`-added values, and VCF index
- `annotated_public_outputs`: Same as above, except MAF and VCF have had entries with soft `FILTER` values removed
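To make the relationship between the protected and public outputs concrete, here is a minimal Python sketch of the default soft and hard filter logic, based on the `gatk_filter_expression` and `bcftools_public_filter` defaults listed in the inputs. The workflow itself applies these through GATK and bcftools; the function names `soft_filters` and `keep_in_public`, and the representation of a record's `INFO` as a plain dictionary, are assumptions made for illustration only.

```python
def soft_filters(info, normal_dp, min_normal_dp=7, max_gnomad_af=0.001):
    """Return the soft FILTER labels a record would receive by default.

    info is a dict of INFO values for one record (hypothetical layout);
    normal_dp is the DP of the normal sample's genotype.
    """
    labels = []
    # NORM_DP_LOW: normal sample depth at or below the threshold.
    if normal_dp <= min_normal_dp:
        labels.append("NORM_DP_LOW")
    # GNOMAD_AF_HIGH: gnomAD AF present, above threshold, and PASS in gnomAD.
    af = info.get("gnomad_3_1_1_AF", ".")
    if (
        af != "."
        and float(af) > max_gnomad_af
        and info.get("gnomad_3_1_1_FILTER") == "PASS"
    ):
        labels.append("GNOMAD_AF_HIGH")
    return labels or ["PASS"]


def keep_in_public(filters, info):
    """Mirror FILTER="PASS"|INFO/HotSpotAllele=1 for the Public VCF."""
    return filters == ["PASS"] or info.get("HotSpotAllele") == 1


# Hypothetical record: common in gnomAD, but flagged as a hotspot allele.
info = {
    "gnomad_3_1_1_AF": "0.01",
    "gnomad_3_1_1_FILTER": "PASS",
    "HotSpotAllele": 1,
}
filters = soft_filters(info, normal_dp=30)
print(filters, keep_in_public(filters, info))
```

Under these assumptions, a soft-filtered record stays in the protected outputs with its `FILTER` labels and is dropped from the public outputs unless it carries `HotSpotAllele`.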