Case study 1: UK Biobank, setting up metaCCA analysis

Current location: project + working/data/

Plink (on epi-franklin): $plink = /data/ny19205/software/plink/plink

Selecting traits (IEU-GWAS db)

See scripts/select_traits

To select traits I used the reference table study-table-13-02-19_UKBonly.tsv from UKB data (biobank_traits_parser.R).

UKB traits:

* BMI: 19953 
* Waist cirmunverence: 9405
* Whole body fat mass: 19393
* Whole body fat-free mass: 13354
* Body fat percentage: 8909
* Systolic blood pressure (automated): 20175
* Diastolic blood pressure (automated): 7992

UK biobank SNP list (the same for all traits): 9,837,128

Preparing the input data

The data for selected IEU UKB studies is available in VCF format here: /mnt/storage/private/mrcieu/research/scratch/IGD/data/public/ (BC4)

UKB IEU data was parsed using bash script parse_gwas_vcf.sh, which can be used on other datasets available in VCF. The script was run on BC4 in /mnt/storage/scratch/ny19205/, then clean tsvs were moved to project space to data/S_XY_matrices/ukb and data/S_XY_matrices/ukb.

Reference data

Create a list of SNPs in UKB

cd snps_list/

less UKB-b-13354_subset.tsv | cut -f3,4,5 | sed 's/\t/_/g' | sort | uniq > ukb_common_snps.txt

# save as range	
cat ukb_common_snps.txt | sed 's/_/\t/g'| awk '{print $1"\t"$2"\t"$2"\t"$3}' > ukb_common_snps_chrpos.txt
wc -l ukb_common_snps_chrpos.txt
	9837128
	
cat ukb_common_snps.txt | cut -f3 -d"_"| sort| uniq > ukb_common_snps_rsid.txt
wc -l ukb_common_snps_rsid.txt
	9834852

Subset reference data to the GWAS list of SNPs

cd ../genotype_matrix_2/
plink --bfile ../1000GPdata/data_maf0.01_rs_ref --extract ../snp_lists/ukb_common_snps_rsid.txt --exclude ../snp_lists/sticky_snps.txt --out data_overlap --make-bed --keep-allele-order --snps-only
wc -l data_overlap.bim
	7,565,603

what this does:

subset only to position in UKB
drop sticky SNPs (they create NaN in LD calculation)
drop SNPs that have probelms with alleles
restrict flipping of alleles
keep only SNPs (not indels)

Creating genotype-phenotype matrices (S_XY) for studies

To join the within-study trait data used script 1_prepapre_data_XY.Rmd which saves the data in geno-pheno_matrix_UKB.RData.

Creating genotype matrix (S_XX) from reference data

Pruning data

plink --bfile data_overlap  --indep-pairwise 50 5 0.01 --out data_overlap
plink --bfile data_overlap --extract data_overlap.prune.in --make-bed --out data_pruned  --keep-allele-order
wc -l data_pruned.bim
	139,735

Annotate SNPs with gene names

cat <(echo -e "CHR\\tSNP\\tBP\\tREF\\tALT") <(cut -f1,2,4,5,6 data_pruned.bim) > data_pruned_snps.txt
plink --bfile --annotate data_pruned_snps.txt ranges=../snp_lists/glist-hg19 --out data_pruned
	60418 out of 139735 annotated

Subset data to the annotated genes

# first select SNPs that got annotated with gene names
less data_pruned.annot| tr ' ' \\t |  grep -v "\." > annotated_genes.txt
	60361 # SNPs
	
# output SNP list
less annotated_genes.txt | grep -v "^CHR"| cut -f2 > annotated_SNPs_list.txt	
	
# count SNPs per gene
less annotated_genes.txt | cut -f6 | sort | uniq -c | sort -k1n > annotated_SNPs_per_gene.txt
	13531 # genes

13,531 genes -- 60,361 SNPs -- 1-227? SNPs per gene -- on average 4.47? SNPs per gene

Subset data to SNPs to include, and split into chromosomes

plink --bfile data_overlap  --extract annotated_SNPs_list.txt  --out data_overlap_subset --keep-allele-order --make-bed

mkdir data_overlap_subset_by_chr
for chr in {1..22}; do
$plink --bfile data_overlap_subset --chr $chr --out data_overlap_subset_by_chr/data_overlap_chr${chr} --keep-allele-order --make-bed; 
done
 
# view SNPs by chr
wc -l data_overlap_subset_by_chr/*bim | sort -k1n

Create LD matrix from SNPs that belong to genes

Creates an LD matrix of r or r2 values from 502 European samples from 1000 Genomes phase 3 data.

mkdir ld_matrix_by_chr2/
for chr in {1..22}; do 
$plink --bfile data_overlap_subset_by_chr/data_overlap_chr${chr} --r2 square --out ld_matrix_by_chr/ld_matrix_chr${chr} --keep-allele-order;
 done
 
mkdir ld_matrix_by_chr/
for chr in {1..22}; do 
$plink --bfile data_overlap_subset_by_chr/data_overlap_chr${chr} --r square --out ld_matrix_by_chr/ld_matrix_chr${chr} --keep-allele-order;
 done

The 22 ld_matrix_chrN.ld files (10-379 MB) are the LD matrces that contain squared Pearson corrlation coefficients between the selected SNPs in the chromosome. Each matrix will be used as input to the metaCCA method as genotype-genotype matrix.

Tidying up LD matrix

Add row and column names (SNPs) to each chr LD matrix in R, to make it suitable to be used as S_XX in metaCCA.

See script: 1_prepapre_data_XX_per_chr.Rmd

The output is S_XX_matrices/LDmatrix_chrN.Rdata files that can be read in by the main metaCCA script.

Running metaCCA

Currently simple testing is done in 2_analyse_data_testing.Rmd .

Both

Univariate SNPs - multiple phenotypes analysis
Multivariate SNPs - multiple phenotypes analysis

are scripted in 3_run_metaCCA_analysis.R to be run in parallel process on BC3.

Submission script on BC3 (or here) in /newhome/ny19205/metaCCA_big_jobs: submit_metaCCA_jobs.sh

Submit one chr job to one node (16 cores):

qsub -l walltime=05:00:00 submit_metaCCA_jobs.sh -F "22"

Post-metaCCA results processing

Multivariate SNPs (gene-based analysis)

The output is 22 files metaCCA_multisnp_genes_chrN.tsv

Merge then and tidy up like this:

cat metaCCA_multisnp_genes_chr* | gsed  -z 's/,\n/,/g' | gsed  -z 's/\n"/"/g'  |  sort -k2 -r | uniq > multivar_metaCCA_merged_results.tsv

Not much intresting after that at the moment.

Univariate SNPs (SNPs-based analysis)

4_review_results_gwascat.Rmd - merge results with GWAS catalog and annotate by groups

In the middle of this R script can get LD proxies using Python script python_LDproxies/get_LD_proxies.py to match more things in GWAS catalog

5_visualise.Rmd - boxplots, dotplots, barplots

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

analysis_ukb.md

analysis_ukb.md

Case study 1: UK Biobank, setting up metaCCA analysis

Selecting traits (IEU-GWAS db)

Preparing the input data

Reference data

Creating genotype-phenotype matrices (S_XY) for studies

Creating genotype matrix (S_XX) from reference data

Pruning data

Annotate SNPs with gene names

Subset data to the annotated genes

Subset data to SNPs to include, and split into chromosomes

Create LD matrix from SNPs that belong to genes

Tidying up LD matrix

Running metaCCA

Post-metaCCA results processing

Multivariate SNPs (gene-based analysis)

Univariate SNPs (SNPs-based analysis)

GWAS catalog annotation plots

Files

analysis_ukb.md

Latest commit

History

analysis_ukb.md

File metadata and controls

Case study 1: UK Biobank, setting up metaCCA analysis

Selecting traits (IEU-GWAS db)

Preparing the input data

Reference data

Creating genotype-phenotype matrices (S_XY) for studies

Creating genotype matrix (S_XX) from reference data

Pruning data

Annotate SNPs with gene names

Subset data to the annotated genes

Subset data to SNPs to include, and split into chromosomes

Create LD matrix from SNPs that belong to genes

Tidying up LD matrix

Running metaCCA

Post-metaCCA results processing

Multivariate SNPs (gene-based analysis)

Univariate SNPs (SNPs-based analysis)

GWAS catalog annotation plots