
Illumina ReadMe

MikeWLloyd edited this page May 28, 2024 · 4 revisions

MMRSVD Germline Structural Variant (SV): Illumina Short-Read Data Documentation

SV Analysis Pipeline: Illumina Short-Read Data

(--workflow germline_sv, --data_type illumina)

For input sample:

•   Fastp read quality and adapter trimming
•   Get Read Group Information
•   BWA-MEM Alignment
•   Samtools SortSam and GATK Mark Duplicates
•   Collect Alignment Summary Metrics
•   Smoove SV calling and Bcftools reheader variant sorting
•   Manta SV calling and Bcftools reheader variant sorting
•   Delly SV calling and Bcftools reheader variant sorting
•   Delly CNV calling and Bcftools reheader variant sorting
•   GATK Haplotype Calling
•   VEP annotation of Haplotype Called GVCF
•   VEP annotation of sorted CNVs
•   Duphold annotation with Bam, SVs and SNPs/INDELs
•   Survivor Merge of Duphold annotated VCFs
•   Collect Survivor merged VCFs Summary Metrics
•   Survivor merged VCFs to Table and Survivor to BEDs
•   Bedtools intersect of Survivor BEDs
•   Survivor annotation VCF with Exons
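
The steps above imply an execution order. As a minimal sketch (not the pipeline's own code, which is written in Nextflow), the dependencies can be written as a Python mapping and ordered with the standard-library `graphlib` module (Python 3.9+). Step names follow the flowchart on this page. Two simplifications are assumptions made here for illustration: intermediate outputs such as "Genomic BAM" are collapsed into direct step-to-step edges, and the edge from the Delly SV call into its Duphold annotation is inferred from the step list, since the flowchart draws that edge only for Manta and Smoove.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it waits for.
# Illustrative only: edges are read off the flowchart, with outputs collapsed.
DEPENDS_ON = {
    "FASTP": set(),
    "BWA_MEM": {"FASTP"},
    "SAMTOOLS_SORTSAM": {"BWA_MEM"},
    "GATK_MARKDUPLICATES": {"SAMTOOLS_SORTSAM"},
    "SAMTOOLS_STATS": {"GATK_MARKDUPLICATES"},
    "SMOOVE_SV_CALL": {"GATK_MARKDUPLICATES"},
    "MANTA_SV_CALL": {"GATK_MARKDUPLICATES"},
    "DELLY_SV_CALL": {"GATK_MARKDUPLICATES"},
    "DELLY_CNV_CALL": {"GATK_MARKDUPLICATES"},
    "GATK_HAPLOTYPE_CALL": {"GATK_MARKDUPLICATES"},
    "VEP_ANNOTATION_GVCF": {"GATK_HAPLOTYPE_CALL"},
    "VEP_ANNOTATION_CNVS": {"DELLY_CNV_CALL"},
    # Assumed edge: DELLY_SV_CALL -> DUPHOLD_ANNOTATION_DELLY (see lead-in).
    "DUPHOLD_ANNOTATION_DELLY": {"DELLY_SV_CALL", "VEP_ANNOTATION_GVCF"},
    "DUPHOLD_ANNOTATION_MANTA": {"MANTA_SV_CALL", "VEP_ANNOTATION_GVCF"},
    "DUPHOLD_ANNOTATION_LUMPY": {"SMOOVE_SV_CALL", "VEP_ANNOTATION_GVCF"},
    "SURVIVOR_MERGE": {
        "DUPHOLD_ANNOTATION_DELLY",
        "DUPHOLD_ANNOTATION_MANTA",
        "DUPHOLD_ANNOTATION_LUMPY",
    },
    "SURVIVOR_SUMMARY": {"SURVIVOR_MERGE"},
    "SURVIVOR_VCF_TO_TABLE": {"SURVIVOR_MERGE"},
    "SURVIVOR_TO_BED": {"SURVIVOR_SUMMARY", "SURVIVOR_VCF_TO_TABLE"},
    "SURVIVOR_BED_INTERSECT": {"SURVIVOR_TO_BED"},
    "SURVIVOR_ANNOTATION": {
        "SURVIVOR_TO_BED",
        "SURVIVOR_BED_INTERSECT",
        "SURVIVOR_SUMMARY",
        "SURVIVOR_VCF_TO_TABLE",
    },
    "SURVIVOR_ANNOTATION_WITH_EXONS": {"SURVIVOR_MERGE", "SURVIVOR_BED_INTERSECT"},
}


def execution_order(depends_on):
    """Return one valid ordering in which every step follows its prerequisites."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
for position, step in enumerate(order, start=1):
    print(f"{position:2d}. {step}")
```

Steps that share the same prerequisites (for example the three SV callers) have no fixed order relative to each other, so a scheduler is free to run them in parallel.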

Illumina Flowchart

flowchart TD
    p00([ILLUMINA READS\nFASTQ])
    p01[FASTP]
    p02[BWA_MEM]
    m01[ALIGNED_BAM]
    p03[SAMTOOLS_SORTSAM]
    p04[GATK_MARKDUPLICATES]
    p05[SAMTOOLS_STATS]
    p06[SMOOVE_SV_CALL & \nBCFTOOLS_REHEADER_SORT]
    p07[MANTA_SV_CALL & \nBCFTOOLS_REHEADER_SORT]
    p08[DELLY_SV_CALL & \nBCFTOOLS_REHEADER_SORT]
    p09[DELLY_CNV_CALL & \nBCFTOOLS_REHEADER_SORT]
    p10[GATK_HAPLOTYPE_CALL]
    p11[VEP_ANNOTATION\nGVCF]
    p12[VEP_ANNOTATION\nCNVS]
    p13[DUPHOLD_ANNOTATION_DELLY]
    p14[DUPHOLD_ANNOTATION_MANTA]
    p15[DUPHOLD_ANNOTATION_LUMPY]
    p16[SURVIVOR_MERGE]
    p17[SURVIVOR_SUMMARY]
    p18[SURVIVOR_VCF_TO_TABLE]
    p19[SURVIVOR_TO_BED]
    p20[SURVIVOR_BED_INTERSECT]
    p21[SURVIVOR_ANNOTATION]
    p22[SURVIVOR_ANNOTATION_WITH_EXONS]
    p00 --> p01
    p01 --> p02
    p02 --> p03
    p03 --> p04
    m01 -..-> |If Pre-Aligned Bam Provided| p04
    o1 --> p05
    o1 --> p06
    o1 --> p07
    o1 --> p08
    o1 --> p09
    p09 --> p12
    o1 --> p10
    o2 --> p11
    p11 --> p13
    o9 --> p14
    o2 --> p14
    p11 --> p14
    o10 --> p15
    p10 --> o2
    o2 --> p15
    p11 --> p15
    p13 --> p16
    p14 --> p16
    p15 --> p16
    p16 --> o4
    o4 --> p17
    o4 --> p18
    p17 --> p19
    p18 --> p19
    p19 --> p20
    p20 --> o11
    p19 --> p21
    o11 --> p21
    p17 --> p21
    p18 --> p21
    o4 --> p22
    o11 --> p22
    o1([Genomic BAM]):::output
    o2([Raw Variant Calls]):::output
    o3([Alignment Stats]):::output
    o4([Merged VCF]):::output
    o5([Annotated SV Calls]):::output
    o6([SV Joined Results]):::output
    o7([DELLY SV Calls]):::output
    o8([Annotated CNVs]):::output
    o9([MANTA SV Calls]):::output
    o10([SMOOVE SV Calls]):::output
    o11([Intersect BEDS]):::output
    p04 --> o1
    p05 --> o3
    p21 --> o6
    p22 --> o5
    p08 --> o7
    p12 --> o8
    p07 --> o9
    p06 --> o10
    classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

Parameters for MMRSVD Germline SV Pipeline (illumina)

  • --sampleID

    • Default: <STRING>
    • Comment: The sample ID for the input data (required).
  • --pubdir

    • Default: /<PATH>
    • Comment: The directory that the saved outputs will be stored.
  • --organize_by

    • Default: sample
    • Comment: How to organize the output folder structure. Options: sample or analysis.
  • --cacheDir

    • Default: '/projects/omics_share/meta/containers'
    • Comment: The directory that contains cached Singularity containers. JAX users should not change this parameter.
  • -w

    • Default: /<PATH>
    • Comment: The working directory used for all intermediate files and Nextflow processes. This directory can become quite large, so it should be a location on /fastscratch or another directory with ample storage.
  • --data_type

    • Default: null
    • Comment: Options: illumina, pacbio, or ont.
  • --read_type

    • Default: PE
    • Comment: Type of reads: paired-end (PE) or single-end (SE). Options: PE or SE.
  • --csv_input

    • Default: null
    • Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. The fastq_2 column is optional and used only for PE data. FASTQ files can be either absolute paths to local files or URLs to remote files. If remote URLs are provided, --download_data can be specified.
  • --fastq1

    • Default: null
    • Comment: The path to a single FASTQ file, or one of a pair of FASTQs for paired-end data.
  • --fastq2

    • Default: null
    • Comment: The path to the second of a pair of FASTQs for paired-end data.
  • --bam

    • Default: null
    • Comment: The path to a BAM input file if alignment has already been performed outside this pipeline.
  • --ref_fa

    • Default: /<PATH>
    • Comment: The path to the reference genome in FASTA format.
  • --bwa_index

    • Default: /<PATH>
    • Comment: Optional parameter to specify BWA indices for alignment. If not provided, the pipeline will generate these indices.
  • --genome_build

    • Default: GRCm38
    • Comment: Mouse specific. Options: GRCm38 or GRCm39. Parameter that controls reference data used for alignment and annotation.
  • --exclude_regions

    • Default: '/ref_data/ucsc_mm10_gap_chr_sorted.bed'
    • Comment: BED file that lists the coordinates of centromeres and telomeres to exclude as alignment targets. Note: default path refers to a location within the containers quay.io/jaxcompsci/lumpy-ref_data:0.3.1--refv0.2.0 and quay.io/jaxcompsci/delly-ref_data:1.1.6--refv0.2.0, which require this file.
  • --sv_ins_ref

    • Default: '/ref_data/variants_freeze5_sv_INS_mm39_to_mm10_sorted.bed.gz'
    • Comment: BED file that lists previously identified insertion SVs. Note: default path refers to a location within the container quay.io/jaxcompsci/bedtools-sv_refs:2.30.0--refv0.2.0, which requires this file.
  • --sv_del_ref

    • Default: '/ref_data/variants_freeze5_sv_DEL_mm39_to_mm10_sorted.bed.gz'
    • Comment: BED file that lists previously identified deletion SVs. Note: default path refers to a location within the container quay.io/jaxcompsci/bedtools-sv_refs:2.30.0--refv0.2.0, which requires this file.
  • --sv_inv_ref

    • Default: '/ref_data/variants_freeze5_sv_INV_mm39_to_mm10_sorted.bed.gz'
    • Comment: BED file that lists previously identified inversion SVs. Note: default path refers to a location within the container quay.io/jaxcompsci/bedtools-sv_refs:2.30.0--refv0.2.0, which requires this file.
  • --reg_ref

    • Default: '/ref_data/mus_musculus.GRCm38.Regulatory_Build.regulatory_features.20180516.gff.gz'
    • Comment: BED file that lists regulatory features. Note: default path refers to a location within the container quay.io/jaxcompsci/bedtools-sv_refs:2.30.0--refv0.2.0, which requires this file.
  • --genes_bed

    • Default: '/ref_data/Mus_musculus.GRCm38.102.gene_symbol.bed'
    • Comment: BED file that lists gene symbol IDs and coordinates. Note: default path refers to a location within the container quay.io/jaxcompsci/bedtools-sv_refs:2.30.0--refv0.2.0, which requires this file.
  • --exons_bed

    • Default: '/ref_data/Mus_musculus.GRCm38.102.exons.bed'
    • Comment: BED file that lists exons and coordinates. Note: default path refers to a location within the container quay.io/jaxcompsci/bedtools-sv_refs:2.30.0--refv0.2.0, which requires this file.
  • --quality_phred

    • Default: 30
    • Comment: Phred quality score threshold.
  • --unqualified_perc

    • Default: 30
    • Comment: Maximum percentage of unqualified bases a read may contain and still pass.
  • --surv_dist

    • Default: 1000
    • Comment: Maximum distance between breakpoints for merging SVs.
  • --surv_supp

    • Default: 1
    • Comment: The number of callers (out of 4) required to support an SV.
  • --surv_type

    • Default: 1
    • Comment: Boolean (0/1) that requires SVs to be the same type for merging.
  • --surv_strand

    • Default: 1
    • Comment: Boolean (0/1) that requires SVs to be on the same strand for merging.
  • --surv_min

    • Default: 30
    • Comment: Minimum length (bp) to output SVs.
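
As an illustration of the `--csv_input` manifest and of how the parameters above translate into command-line arguments, here is a minimal Python sketch. The sample name, lane labels, and file paths are hypothetical placeholders, and the helper names `manifest_text` and `pipeline_args` are introduced here for illustration only. The launcher command itself (entry script, profile) is not described in this section, so the sketch stops at the argument list.

```python
import csv
import io

# Header required by --csv_input, as stated in the parameter description.
MANIFEST_HEADER = ["sampleID", "lane", "fastq_1", "fastq_2"]


def manifest_text(rows):
    """Return CSV manifest text; fastq_2 is left empty when absent (SE data)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MANIFEST_HEADER, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({key: row.get(key, "") for key in MANIFEST_HEADER})
    return buffer.getvalue()


def pipeline_args(params):
    """Flatten a parameter dict into ['--name', 'value', ...], skipping None."""
    args = []
    for name, value in params.items():
        if value is None:  # unset parameters keep their pipeline defaults
            continue
        args += [f"--{name}", str(value)]
    return args


# Hypothetical paired-end sample sequenced on two lanes.
rows = [
    {
        "sampleID": "sample_A",
        "lane": "lane1",
        "fastq_1": "/path/to/sample_A_lane1_R1.fastq.gz",
        "fastq_2": "/path/to/sample_A_lane1_R2.fastq.gz",
    },
    {
        "sampleID": "sample_A",
        "lane": "lane2",
        "fastq_1": "/path/to/sample_A_lane2_R1.fastq.gz",
        "fastq_2": "/path/to/sample_A_lane2_R2.fastq.gz",
    },
]

params = {
    "workflow": "germline_sv",
    "data_type": "illumina",
    "genome_build": "GRCm38",
    "read_type": "PE",
    "csv_input": "manifest.csv",  # hypothetical file name for the manifest
    "pubdir": "/path/to/pubdir",
    "bam": None,  # not used when FASTQ input is given
}

print(manifest_text(rows), end="")
print(" ".join(pipeline_args(params)))
```

Keeping unset parameters as `None` and dropping them from the argument list leaves the pipeline's documented defaults in effect instead of overriding them with empty values.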

Pipeline Default Outputs

| Naming Convention | Description |
| --- | --- |
| germline_sv_report.html | Nextflow autogenerated report |
| trace/trace.txt | Nextflow trace of processes |
| ${sampleID}/${sampleID}_ILLUMINA_DLM_struct_var.vcf | VCF output combining merged Delly, Lumpy, and Manta calls annotated for overlap with exonic regions |
| ${sampleID}/${sampleID}_survivor_joined_results.csv | Table of SVs annotated with overlaps of previously identified SVs (beck), genes, exons, regulatory regions |
| ${sampleID}/stats/${sampleID}_fastp_report.html | Filtering and trimming report from fastp |
| ${sampleID}/alignments/${sampleID}.md.bam | Analysis-ready alignment of reads |
| ${sampleID}/alignments/${sampleID}.md.bai | Index for analysis-ready alignment of reads |
| ${sampleID}/alignments/${sampleID}.md.metrics | GATK MarkDuplicates log |
| ${sampleID}/alignments/${sampleID}.insert_size.txt | Inferred read insert size |
| ${sampleID}/unmerged_calls/${sampleID}_dellySort.vcf | SV calls from Delly |
| ${sampleID}/unmerged_calls/${sampleID}_lumpySort.vcf | SV calls from Lumpy |
| ${sampleID}/unmerged_calls/${sampleID}_mantaSort.vcf | SV calls from Manta |
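
The naming convention above can be expanded for a concrete sample to check that a run produced everything it should. The following is a minimal sketch, assuming the listed paths are relative to `--pubdir` and that `--organize_by sample` (the default) is in effect. The function names and the example sample ID are hypothetical.

```python
from pathlib import Path

# Templates copied from the output listing; {s} stands for ${sampleID}.
RELATIVE_OUTPUTS = [
    "germline_sv_report.html",
    "trace/trace.txt",
    "{s}/{s}_ILLUMINA_DLM_struct_var.vcf",
    "{s}/{s}_survivor_joined_results.csv",
    "{s}/stats/{s}_fastp_report.html",
    "{s}/alignments/{s}.md.bam",
    "{s}/alignments/{s}.md.bai",
    "{s}/alignments/{s}.md.metrics",
    "{s}/alignments/{s}.insert_size.txt",
    "{s}/unmerged_calls/{s}_dellySort.vcf",
    "{s}/unmerged_calls/{s}_lumpySort.vcf",
    "{s}/unmerged_calls/{s}_mantaSort.vcf",
]


def expected_outputs(pubdir, sample_id):
    """Expand every template for one sample (assumes paths sit under pubdir)."""
    return [Path(pubdir) / template.format(s=sample_id) for template in RELATIVE_OUTPUTS]


def missing_outputs(pubdir, sample_id):
    """Return the expected output paths that do not exist on disk."""
    return [path for path in expected_outputs(pubdir, sample_id) if not path.exists()]


# Hypothetical publish directory and sample ID.
for path in expected_outputs("/path/to/pubdir", "sample_A"):
    print(path)
```

Calling `missing_outputs(pubdir, sample_id)` after a run returns an empty list when every default output is present.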