Applying metaCCA to IEU-GWAS database

Project folder on epi-franklin:/projects/XremovedX/.

Project background

A dominant approach to genetic association studies is to perform univariate tests between genotype-phenotype pairs, where each SNP is examined independently for association with the given phenotype.

However, analysing the results of multiple GWAS together in a joint analysis could not only increase statistical power but also reveal complex associations that are only detectable when several variants or traits are tested jointly.

The IEU-GWAS database holds a large amount of data (28K traits) that can be investigated in this way, including many traits that could have novel associations and correlations. We are particularly interested in finding genetic correlations between phenotypes that may collectively contribute to a group of diseases. In other words, we are interested in finding genes with pleiotropic effects. Such genes are not well characterised but appear to be abundant, as the findings of many GWAS overlap. The term pleiotropy describes a scenario in which the same locus (SNP/gene) affects multiple traits. There are two main types:

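horizontal pleiotropy (the locus affects several traits through separate pathways): may lead to a better understanding of biological processes that are shared between traits
vertical pleiotropy (the locus affects one trait, which in turn affects another): can inform on causality for intervention strategies for disease prevention.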

metaCCA

Paper, R package vignette

Brief introduction

metaCCA can be used to systematically identify potentially pleiotropic genes from GWAS summary statistics by combining correlation signals across multiple traits.

Uses GWAS summary statistics (𝛽 and std.err)
Can combine single or multiple studies in one analysis
Can use a multivariate representation of both genotype and phenotype
Is based on CCA (canonical correlation analysis)
Returns the maximised (leading) canonical correlation coefficient R1

metaCCA provides two types of multivariate association analysis:

Single-SNP–multi-trait analysis: 1 SNP → N traits

One genetic variant tested for an association with a set of phenotypic variables

Multi-SNP–multi-trait analysis: M SNPs [e.g. a gene] → N traits

A set of genetic variants tested for an association with a set of phenotypic variables.

The method

metaCCA operates on three pieces of the full data covariance matrix:

S_XX of genotype-genotype correlations
S_XY of univariate genotype-phenotype association results
S_YY of phenotype-phenotype correlations.
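
To make the role of the three blocks concrete, here is a minimal sketch of the single-SNP case. It assumes the SNP is standardised, so S_XX reduces to the scalar 1 and the leading canonical correlation becomes r1 = sqrt(s_xy' S_YY^-1 s_xy). The function names and the toy numbers are illustrative only; this is not the metaCCA implementation, which also handles the multi-SNP case and additional corrections.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("S_YY is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def single_snp_canonical_correlation(s_xy, s_yy):
    """Leading canonical correlation between one SNP and a set of traits.

    Assumes a single standardised SNP (S_XX = 1), so
    r1^2 = s_xy' * inverse(S_YY) * s_xy.
    """
    w = solve_linear(s_yy, s_xy)
    r_squared = sum(a * b for a, b in zip(s_xy, w))
    return math.sqrt(max(r_squared, 0.0))


# Toy example: one SNP, two correlated traits (made-up values).
s_xy = [0.10, 0.20]
s_yy = [[1.0, 0.5],
        [0.5, 1.0]]
r1 = single_snp_canonical_correlation(s_xy, s_yy)
```

In this toy example the first trait adds nothing beyond the second once their correlation is taken into account, so r1 equals the larger univariate correlation.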

S_XX is estimated from a reference database that matches the study population, e.g. the 1000 Genomes Project. S_YY is estimated from S_XY.
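
Both estimates can be pictured as column-wise Pearson correlations, as in the sketch below. It assumes S_XX is the correlation between SNP columns of a reference genotype matrix (individuals in rows) and S_YY is the correlation between trait columns of S_XY (SNPs in rows). The data are toy values and the helper names are hypothetical; the R package has its own routine for the S_YY step, which this sketch does not reproduce.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def column_correlation_matrix(rows):
    """Correlation matrix between the columns of a row-major table."""
    n_cols = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n_cols)]
    return [
        [1.0 if i == j else pearson(cols[i], cols[j]) for j in range(n_cols)]
        for i in range(n_cols)
    ]


# S_XX: toy reference genotypes, rows = individuals, columns = SNPs (0/1/2).
reference_genotypes = [
    [0, 0, 2],
    [1, 1, 1],
    [2, 2, 0],
    [1, 0, 1],
]
s_xx = column_correlation_matrix(reference_genotypes)

# S_YY: toy S_XY, rows = SNPs, columns = traits (made-up effect sizes).
s_xy_toy = [
    [0.10, 0.08],
    [-0.05, -0.02],
    [0.02, 0.03],
    [-0.07, -0.09],
]
s_yy = column_correlation_matrix(s_xy_toy)
```

With only four SNPs the S_YY estimate is very noisy; in practice it would be computed from a much larger set of variants.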

Workflow

The analysis consists of several stages:

1. Traits/data selection
2. Input data processing/cleaning (both GWAS and reference)
3. Input matrix generation (S_XY)
4. Reference matrix generation (S_XX) *NB there may be some overlap/dependence between 3 and 4
5. Run metaCCA script (submit on BC3)
6. Output processing
7. Output annotation with GWAS catalog
8. Visualisation
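
The dependence between the S_XY and S_XX stages comes from both matrices having to cover the same SNPs in the same order. The sketch below illustrates one way to build S_XY rows under two assumptions that are not taken from this repo: effects are standardised with the common approximation 𝛽 / (std.err × sqrt(N)), and the input layout (`gwas_by_trait`, `reference_snps`, `sample_sizes`) is hypothetical. Allele alignment between the GWAS and the reference panel is deliberately left out.

```python
import math


def standardise_beta(beta, se, n):
    """Approximate standardised effect: beta / (se * sqrt(n)).

    This is an assumed approximation, not necessarily what the repo scripts do.
    """
    return beta / (se * math.sqrt(n))


def build_s_xy_rows(gwas_by_trait, reference_snps, sample_sizes):
    """Build S_XY rows for SNPs present in every trait and in the reference.

    gwas_by_trait:  {trait: {snp_id: (beta, se)}}   (hypothetical layout)
    reference_snps: list of SNP ids in reference-panel order
    sample_sizes:   {trait: N}
    """
    traits = sorted(gwas_by_trait)
    shared = set(reference_snps)
    for trait in traits:
        shared &= set(gwas_by_trait[trait])
    # Keep the reference order so S_XY rows line up with S_XX rows/columns.
    ordered = [snp for snp in reference_snps if snp in shared]
    rows = {}
    for snp in ordered:
        rows[snp] = [
            standardise_beta(*gwas_by_trait[trait][snp], sample_sizes[trait])
            for trait in traits
        ]
    return traits, ordered, rows


# Toy input (made-up SNP ids and values).
gwas_by_trait = {
    "height": {"rs1": (0.02, 0.01), "rs2": (0.05, 0.01), "rs3": (0.01, 0.01)},
    "bmi": {"rs2": (-0.03, 0.01), "rs3": (0.04, 0.02)},
}
reference_snps = ["rs3", "rs2", "rs9"]
sample_sizes = {"height": 10000, "bmi": 10000}

traits, ordered, rows = build_s_xy_rows(gwas_by_trait, reference_snps, sample_sizes)
```

Here only rs3 and rs2 survive, in reference order, and the trait columns are sorted alphabetically (bmi, height).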

Case studies

While exploring metaCCA, I ran several case studies to investigate its properties. Each case study is described in a separate README.

UK Biobank only (easiest working case)
UK Biobank + GIANT
UK Biobank (IEU) + UK Biobank (Neale Lab)

Scripts in this repo

Workflow-related

├── main_workflow
│   ├── select_traits/biobank_traits_parser.Rmd
│   ├── parse_gwas_vcf.sh *OR*
│   ├── parse_gwas_vcf_snakemake/
│   ├── 0_standardise_nealelab_data.Rmd
│   ├── 1_prepare_data_XX_by_chr.Rmd
│   ├── 2_prepare_data_XY.Rmd
│   ├── 3_run_metaCCA_analysis.R
│   ├── 3_runmetaCCA_testing_manually.Rmd
│   ├── 4_review_results_gwascat.Rmd
│   ├── 5_visualise.Rmd
│   └── python_LDproxies

Exploratory scripts

├── exploratory_analysis
│   ├── compare_effect_size.Rmd
│   ├── compare_r_and_r2_results.Rmd
│   ├── compare_results_UKBvsUKBGIANT.Rmd
│   └── manhattan_plot.Rmd

Data folders

The data folders are outside this repo, but their structure is as follows:

├── 1000GPdata			# raw reference data
├── annotation			# gene annotation files
├── genotype_matrix_1	# intermediate files from case study 1
├── genotype_matrix_2	# intermediate files from case study 2
├── genotype_matrix_3	# intermediate files from case study 3
├── gwas_catalog		# GWAS catalog raw and subset and annotation files
├── README.txt
├── results				# weekly file storage
├── snp_lists			# intermediate and common-to-all files
├── S_XX_matrices		# per chr LD matrices for each case study
└── S_XY_matrices		# XY matrices for each case study + standardised tsv (from rawVCF)

External data

Reference 1000GP data is from https://github.com/MRCIEU/gwasvcftools:

1kg European reference panel for LD data_maf0.01_rs_ref.tgz: 9,003,401

Located here: data/1000GPdata/data_maf0.01_rs_ref

Gene annotation glist-hg19 reference available here: https://www.cog-genomics.org/plink/1.9/resources (~24K genes on autosomes)
GWAS catalog download is from here: https://www.ebi.ac.uk/gwas/docs/file-downloads (v1.0.2)
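
For the multi-SNP–multi-trait analysis, SNPs have to be grouped by gene using the gene annotation. The sketch below assumes the gene list is whitespace-separated with four columns (chromosome, start, end, gene name) and that SNP positions use the same genome build. The gene names, SNP ids, and the `flank` parameter are hypothetical and only illustrate the idea.

```python
def parse_gene_list(lines):
    """Parse lines of the assumed form 'chrom start end gene_name'."""
    genes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        genes.append((parts[0], int(parts[1]), int(parts[2]), parts[3]))
    return genes


def snps_per_gene(genes, snps, flank=0):
    """Map each gene to the SNPs that fall inside it (optionally +/- flank bp).

    snps: {snp_id: (chrom, position)}
    Genes without any SNP are omitted; a SNP may belong to overlapping genes.
    """
    result = {}
    for chrom, start, end, name in genes:
        hits = [
            snp_id
            for snp_id, (snp_chrom, pos) in snps.items()
            if snp_chrom == chrom and start - flank <= pos <= end + flank
        ]
        if hits:
            result[name] = sorted(hits)
    return result


# Toy annotation and SNP positions (made-up names and coordinates).
gene_lines = [
    "1 100 200 GENEA",
    "1 150 400 GENEB",
    "2 100 200 GENEC",
]
snp_positions = {
    "rs1": ("1", 120),
    "rs2": ("1", 180),
    "rs3": ("1", 390),
    "rs4": ("2", 500),
}

genes = parse_gene_list(gene_lines)
gene_to_snps = snps_per_gene(genes, snp_positions)
```

The linear scan is fine for a toy example; with ~24K genes and millions of SNPs a sorted or interval-based lookup would be a more sensible choice.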

Data viz

Showing some examples of plots I've made over the course of the project


GWAS catalog annotation plots

UKB VS GIANT explorations
