README_vcf_info_annotation.20141104
The initially released phase3 VCF files contain variant sites, per-sample genotypes and some basic information, such
as the known dbSNP rs number if there is one, AN, AC and the global allele frequency (AF). Additional supporting evidence
and annotations for variants were added in subsequent updates. Below is a brief description of these annotations.
Annotations added to the VCFs in the main release directory (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502):
1. Total depth in all samples
Read depth for each sample at each variant site was calculated from sample-level BAM files using the multicov function in
BedTools (http://bedtools.readthedocs.org/en/latest/content/tools/multicov.html). The total depth across all samples in
the VCF file is reported in the INFO field for each site.
For complex events such as deletions and insertions, the depth is calculated for the base immediately 5' of the event.
Paths of BAM files used in this calculation are summarised in
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/alignment_indices/20130502.low_coverage.alignment.index
Below is an example command line used for calculating read depth for one sample:
BEDTools/bin/multiBamCov -bams HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam -bed ALL.autosomes.phase3_shapeit2_mvncall_integrated_v4.20130502.sites.vcf.gz
This value is represented in the INFO column using the DP INFO tag.
For chrY, the DP information for each site was calculated differently during the variant calling process.
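As a rough illustration of the summation described above, the Python sketch below adds up per-sample depths for each site.
It assumes, purely for illustration, that each sample's depths are available as tab-separated lines of chromosome, position
and depth. That layout, the function name total_depth and the example values are hypothetical and are not the pipeline
used for the release.

```python
from collections import defaultdict


def total_depth(per_sample_lines):
    """Sum depth over samples for each (chrom, pos) site.

    per_sample_lines: one iterable of 'chrom<TAB>pos<TAB>depth' strings
    per sample (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(int)
    for lines in per_sample_lines:
        for line in lines:
            chrom, pos, depth = line.rstrip("\n").split("\t")
            totals[(chrom, int(pos))] += int(depth)
    return dict(totals)


# Made-up depths for two samples at two sites.
sample_a = ["1\t10177\t5", "1\t10235\t3"]
sample_b = ["1\t10177\t2", "1\t10235\t0"]
dp = total_depth([sample_a, sample_b])
```

Each value in dp corresponds to what would be written as DP for that site.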
2. Allele frequency by continental super population
The 2504 samples in the phase3 release come from 26 populations, which can be categorised into five super-populations
by continent (listed below). In addition to the global AF, we added the AF for each super-population to the INFO field.
East Asian EAS
South Asian SAS
African AFR
European EUR
American AMR
These allele frequencies were calculated by counting the AC and AN for all the individuals from a particular super-population and dividing
AC by AN to obtain the AF. The INFO tags that represent these AFs are EAS_AF, EUR_AF, AFR_AF, AMR_AF and SAS_AF.
The super-population assignment for each sample can be found in integrated_call_samples_v3.20130502.ALL.panel.
AFs for multi-allelic variants are reported for each allele independently, separated by ",".
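The calculation above can be sketched in Python as follows. This is a minimal example under stated assumptions: genotypes
are given as strings such as "0|1", the sample-to-super-population mapping has already been read from the panel file into
a dictionary, and the sample names S1 to S3 and the function name super_population_af are hypothetical.

```python
from collections import defaultdict


def super_population_af(genotypes, sample_to_superpop, n_alt):
    """Return {super_population: [AF for each ALT allele]} for one site.

    genotypes: {sample: genotype string such as "0|1"}.
    sample_to_superpop: {sample: super-population code such as "EUR"}.
    n_alt: number of ALT alleles at the site.
    """
    ac = defaultdict(lambda: [0] * n_alt)  # allele count per ALT allele
    an = defaultdict(int)                  # number of called alleles
    for sample, gt in genotypes.items():
        pop = sample_to_superpop[sample]
        for allele in gt.replace("/", "|").split("|"):
            if allele == ".":              # skip missing calls
                continue
            an[pop] += 1
            index = int(allele)
            if index > 0:                  # 0 is the reference allele
                ac[pop][index - 1] += 1
    return {pop: [count / an[pop] for count in ac[pop]] for pop in an}


# Hypothetical samples at a site with two ALT alleles.
genotypes = {"S1": "0|1", "S2": "1|2", "S3": "0|0"}
panel = {"S1": "EUR", "S2": "EUR", "S3": "AFR"}
af = super_population_af(genotypes, panel, n_alt=2)

# Format as INFO entries, one comma-separated value per ALT allele.
info = ";".join(
    f"{pop}_AF=" + ",".join(f"{x:g}" for x in af[pop]) for pop in sorted(af)
)
```

For a multi-allelic site each ALT allele gets its own frequency, which is why the formatted values are joined with ",".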
3. Ancestral Allele (AA)
Ancestral sequences are inferred from Ensembl multiple alignments using Ortheus. Ortheus is a probabilistic method for the inference of ancestor (a.k.a. tree) alignments.
The main contribution of Ortheus is the use of a phylogenetic model incorporating gaps to infer insertion and deletion events. Ancestral sequences are predicted for each node
of the phylogenetic tree that relates the sequences.
The AA for chrY variant sites was derived using a separate process from that used for the autosomes. Please see README_phase3_chrY_calls_20141104 for details.
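For readers who want to pull these annotations out of a VCF record, the Python sketch below splits an INFO string into
key/value pairs and reads the DP, EAS_AF and AA tags. The INFO string is made up for illustration, and the helper name
parse_info is hypothetical. The AA value is kept as a raw string because its exact encoding is not described in this README.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Illustrative INFO string, not copied from a real record.
info = "AC=2;AF=0.5;AN=4;DP=120;EAS_AF=0.25,0.1;AA=C"
fields = parse_info(info)

total_dp = int(fields["DP"])                                # total depth
eas_af = [float(x) for x in fields["EAS_AF"].split(",")]    # one AF per ALT allele
ancestral = fields["AA"]                                    # raw AA value
```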