The data is hosted on DropBox, you can download it with the following wget
command:
wget https://www.dropbox.com/s/pwjj61dzojg2ajy/T2T_data.tar.gz?dl=1 -O T2T_data.tar.gz
Unpack the data:
tar -xzf T2T_data.tar.gz
There are two directories in T2T_data
:
alignments
: contains chr21 alignments for two samples on GRCh38 and CHM13v2references
: contains fastas, gffs, and corresponding indicies for GRCh38 and CHM13v2
CHM13v2.chr21.fasta
: Full genome fasta downloaded from https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.4. Then, runsamtools faidx <file> chr21
to isolate chr21.GRCh38.chr21.fasta
: Full genome fasta downloaded from https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40. Then, runsamtools faidx <file> chr21
to isolate chr21.CHM13v2.chr21.gencode.v35.gff3.gz
. Full GFF downloaded from https://github.com/marbl/CHM13#downloads (the Sorted GFF file underAssembly v2.0
. Then, runtabix -p gff <file>
andtabix <file> chr21
to isolate chr21.GRCh38.chr21.gencode.v35.gff3.gz
. Full GFF downloaded from https://www.gencodegenes.org/human/ under Comprehensive gene annotation. Then, runtabix -p gff <file>
andtabix <file> chr21
to isolate chr21.- The CHM13v2 cram files were downloaded from the T2T AnVil workspace here. Navigate to the "Data" tab at the top, and then the "Participant" data table via the sidebar on the left. The crams and cram indicies are in the
cram
andcram_index
columns, respectively. Then, runsamtools view -ShC -T <ref.fasta> <cram> chr21
to isolate chr21. WARNING: The CRAM files are quite large (upwards of 10GB) and require an AnVilpayline to download. - The GRCh38 cram files were downloaded from SRA here. You'll have to get the SRA accession for each sample, and navigate to the appropriate directory in SRA. For example, sample HG00116 has accession:
ERR3240132
; the path to the cram file is https://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3240132/. Then, runsamtools view -ShC -T <ref.fasta> <cram> chr21
to isolate chr21.
Start a screen
session to run CHM13v2 variant calling:
screen -S freebayes_CHM13v2
Within this screen session, run freebayes on the CHM13v2 alignments. We'll use time
to keep track of how long variant calling took:
time freebayes -f T2T_data/references/CHM13v2.chr21.fasta --genotype-qualities T2T_data/alignments/*.CHM13v2.chr21.cram > CHM13v2.chr21.vcf
Start another screen
session to run GRCh38 variant calling:
screen -S freebayes_GRCh38
Within this screen session, run freebayes on the GRCh38 alignments (using time
as before):
time freebayes -f T2T_data/references/GRCh38.chr21.fasta --genotype-qualities T2T_data/alignments/*.GRCh38.chr21.cram > GRCh38.chr21.vcf
Once freebayes
is done running, we shouldn't need these screen
sessions anymore.
Filter CHM13v2 vcf to variants with 99% probability:
vcffilter -f "QUAL > 20" CHM13v2.chr21.vcf > CHM13v2.chr21.filtered.vcf
Filter GRCh38 vcf to variants with 99% probability:
vcffilter -f "QUAL > 20" GRCh38.chr21.vcf > GRCh38.chr21.filtered.vcf
Zip the filtered CHM13v2 vcf:
bgzip CHM13v2.chr21.filtered.vcf
tabix -p vcf CHM13v2.chr21.filtered.vcf.gz
Zip the filtered GRCh38 vcf:
bgzip GRCh38.chr21.filtered.vcf
tabix -p vcf GRCh38.chr21.filtered.vcf.gz
Get alignment statistics for CHM13v2 bams:
samtools stats -r T2T_data/references/CHM13v2.chr21.fasta -@ 4 T2T_data/alignments/HG00116.CHM13v2.chr21.cram > HG00116.CHM13v2.chr21.cramstats.txt
samtools stats -r T2T_data/references/CHM13v2.chr21.fasta -@ 4 T2T_data/alignments/HG01509.CHM13v2.chr21.cram > HG01509.CHM13v2.chr21.cramstats.txt
Get alignment statistics for GRCh38 bams:
samtools stats -r T2T_data/references/GRCh38.chr21.fasta -@ 4 T2T_data/alignments/HG00116.GRCh38.chr21.cram > HG00116.GRCh38.chr21.cramstats.txt
samtools stats -r T2T_data/references/GRCh38.chr21.fasta -@ 4 T2T_data/alignments/HG01509.GRCh38.chr21.cram > HG01509.GRCh38.chr21.cramstats.txt
Get variant calling statistics for CHM13v2 vcf:
bcftools stats -s- CHM13v2.chr21.filtered.vcf.gz > CHM13v2.chr21.filtered.stats.txt
Get variant calling statistics for GRCh38 vcf:
bcftools stats -s- GRCh38.chr21.filtered.vcf.gz > GRCh38.chr21.filtered.stats.txt
Plot variant counts, as well as proper pairing, and mismatch rate stats for alignment, as shown in the analyze_stats.ipynb
notebook.
First, either scp
the crams to the local, or just wget
them straight to the local.
Using grep
determine the coordinates of KCNE1 from the Gencode gff files:
zgrep "KCNE1" CHM13v2.chr21.gencode.v35.gff3.gz | head -n 1 | cut -f 1,4,5
zgrep "KCNE1" GRCh38.chr21.gencode.v35.gff3.gz | head -n 1 | cut -f 1,4,5