Kids First DRC Consensus Calling Workflow

This workflow is used by the Kids First (KF) Data Resource Center (DRC) to create consensus calls from outputs generated by our somatic variant callers.

This workflow takes the protected VCF outputs from the Kids First DRC Somatic Workflow and creates protected and public consensus VCF and MAF files. Benchmarking of our SNV callers and consensus methods can be found here. The general outline is as follows:

  1. Prep MNP Variants
    • Strelka2 outputs multi-nucleotide polymorphisms (MNPs) as consecutive single-nucleotide polymorphisms
    • In order to preserve MNPs, we gather MNP calls from the other caller inputs and search for evidence supporting these consecutive SNP calls as MNP candidates
    • Once found, the Strelka2 SNP calls supporting an MNP are converted to a single MNP call
    • This is done to preserve the predicted gene model as accurately as possible in our consensus calls
  2. Consensus merge
    • Calls are gathered from all four callers
    • By default, calls with support from 2+ callers OR calls that are marked as HotSpotAllele in the INFO field are retained
    • Retained calls then have their MQ and MQ0 values calculated from the input tumor cram
    • GT fields are estimated as "majority rules," and when no majority exists, set as 0/1 by default
    • AD, DP, and AF are calculated as the average value across callers
    • ADR, DPR, and AFR fields are added as the range of the AD, DP, and AF values across callers, to give the observer a sense of confidence in the averaged value
  3. VEP Annotate Consensus (see Kids First DRC Somatic Variant Annotation Workflow for details)
  4. Echtvar Annotation
    • Additional annotation is performed to augment the VEP annotation
    • While VEP does have extensive gnomAD allele frequency annotation, it is limited to exome values. The gnomAD AF-only resource we use adds WGS frequencies as an additional INFO/AF field
  5. Soft filter variants
  6. VCF2MAF protected
    • For convenience of analysis, we convert the resulting soft-filtered VCF (AKA the "Protected VCF") into MAF format
  7. Hard filter VCF
    • The Protected VCF is hard filtered on PASS and HotSpotAllele for reasons outlined in the Soft filter variants step
    • This VCF is known as the "Public VCF"
  8. VCF2MAF public
  9. Rename outputs
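
The consensus merge rules in step 2 can be sketched in a few lines of Python. This is an illustration only, not the workflow's actual code: the function names are hypothetical, "majority" is assumed to mean more than half of the reporting callers, and the range is assumed to be a (min, max) pair.

```python
from collections import Counter
from statistics import mean


def keep_call(n_callers, hotspot, min_callers=2):
    """Retain a call with enough caller support OR a HotSpotAllele flag."""
    return n_callers >= min_callers or hotspot


def consensus_genotype(genotypes, default="0/1"):
    """Majority-rules GT. Assumption: a majority is more than half of callers."""
    top_gt, top_n = Counter(genotypes).most_common(1)[0]
    return top_gt if top_n > len(genotypes) / 2 else default


def summarize(values):
    """Average and range of a per-caller field such as DP or AF.

    Assumption: the range is reported as a (min, max) pair.
    """
    return mean(values), (min(values), max(values))
```

For example, `consensus_genotype(["1/1", "0/1"])` has no majority and falls back to `0/1`, while `summarize` applied to per-caller depths yields the values that would populate DP and DPR under these assumptions.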

Workflow Description and KF Recommended Inputs

General workflow inputs (all file references can be obtained here):

  • indexed_reference_fasta: Homo_sapiens_assembly38.fasta
  • strelka2_vcf
  • mutect2_vcf
  • lancet_vcf
  • vardict_vcf
  • cram #Tumor cram recommended for MQ score calculation
  • input_tumor_name
  • input_normal_name
  • output_basename
  • tool_name: "consensus_somatic"
  • ncallers: # Optional number of callers required for consensus, recommend 2
  • consensus_ram: 3
  • annotation_zip: gnomad.v3.1.1.custom.echtvar.zip # population stats VCF for public filtering
  • vep_cache: homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gz
  • gatk_filter_name: [NORM_DP_LOW, GNOMAD_AF_HIGH]
  • gatk_filter_expression: [ vc.getGenotype('insert_norm_sample_id_here').getDP() <= 7,gnomad_3_1_1_AF != '.' && gnomad_3_1_1_AF > 0.001 && gnomad_3_1_1_FILTER=='PASS']
  • bcftools_public_filter: FILTER="PASS"|INFO/HotSpotAllele=1
  • retain_info: "gnomad_3_1_1_AC,gnomad_3_1_1_AN,gnomad_3_1_1_AF,gnomad_3_1_1_nhomalt,gnomad_3_1_1_AC_popmax,gnomad_3_1_1_AN_popmax,gnomad_3_1_1_AF_popmax,gnomad_3_1_1_nhomalt_popmax,gnomad_3_1_1_AC_controls_and_biobanks,gnomad_3_1_1_AN_controls_and_biobanks,gnomad_3_1_1_AF_controls_and_biobanks,gnomad_3_1_1_AF_non_cancer,gnomad_3_1_1_primate_ai_score,gnomad_3_1_1_splice_ai_consequence,gnomad_3_1_1_AF_non_cancer_afr,gnomad_3_1_1_AF_non_cancer_ami,gnomad_3_1_1_AF_non_cancer_asj,gnomad_3_1_1_AF_non_cancer_eas,gnomad_3_1_1_AF_non_cancer_fin,gnomad_3_1_1_AF_non_cancer_mid,gnomad_3_1_1_AF_non_cancer_nfe,gnomad_3_1_1_AF_non_cancer_oth,gnomad_3_1_1_AF_non_cancer_raw,gnomad_3_1_1_AF_non_cancer_sas,gnomad_3_1_1_AF_non_cancer_amr,gnomad_3_1_1_AF_non_cancer_popmax,gnomad_3_1_1_AF_non_cancer_all_popmax,gnomad_3_1_1_FILTER,MQ,MQ0,CAL,HotSpotAllele"
  • retain_fmt: # csv string with FORMAT fields that you want to keep
  • retain_ann: "HGVSg"
  • maf_center: "."
  • custom_enst: kf_isoform_override.tsv. As of VEP 104, several genes have had their canonical transcripts redefined. While the VCF will have all possible isoforms, this affects MAF file output and may result in representative protein changes that defy historical expectations
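
To make the recommended `gatk_filter_name`, `gatk_filter_expression`, and `bcftools_public_filter` values concrete, here is a minimal Python sketch of the logic they describe. It is not how GATK or bcftools evaluate these expressions. The function and parameter names are hypothetical, and a missing gnomAD AF (`.` in the VCF) is represented as `None`.

```python
def soft_filters(norm_dp, gnomad_af, gnomad_filter, dp_max=7, af_max=0.001):
    """Return the soft FILTER names a record would receive (empty means PASS)."""
    flags = []
    # NORM_DP_LOW: normal sample depth at or below the threshold
    if norm_dp <= dp_max:
        flags.append("NORM_DP_LOW")
    # GNOMAD_AF_HIGH: AF present, above threshold, and the gnomAD record passed
    if gnomad_af is not None and gnomad_af > af_max and gnomad_filter == "PASS":
        flags.append("GNOMAD_AF_HIGH")
    return flags


def in_public_vcf(flags, hotspot):
    """Mirror FILTER="PASS"|INFO/HotSpotAllele=1: keep PASS or hotspot records."""
    return not flags or hotspot
```

Under this sketch, a record with normal depth 30 and a passing gnomAD AF of 0.01 is flagged `GNOMAD_AF_HIGH`, so it stays in the Protected VCF but is dropped from the Public VCF unless it is a hotspot allele.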

Workflow outputs

  • annotated_protected_outputs: Array of files containing MAF format of PASS hits, PASS VCF with annotation pipeline soft FILTER-added values, and VCF index
  • annotated_public_outputs: Same as above, except MAF and VCF have had entries with soft FILTER values removed