SVsig is a method for classifying rearrangements as passengers or drivers in cohorts of whole-genome sequences from cancer patients. The distribution of rearrangements in the cancer genome is shaped by both the mechanisms of their formation and the fitness advantages they confer on the cell. Applying SVsig revealed significant predictors of the distribution of rearrangements across the genome and identified known and novel rearrangements that recurred more often than expected given these predictions. For a more detailed description, see https://doi.org/10.1101/2023.10.13.561748.
SVsig requires MATLAB, which can be obtained from MathWorks. This version has primarily been tested using MATLAB R2020a on macOS Sonoma (14.5).
Additionally, install the following MATLAB toolboxes:
- Statistics and Machine Learning Toolbox
- Optimization Toolbox
Clone this repo into the directory in which you wish to run SVsig.
SVsig takes as input a .csv file with 15 columns. An example is in this repo under data/merged_1.6.1.csv. Your file must match the column names exactly.
- seqnames, start, strand, altchr, altpos, altstrand: genomic coordinates of both rearrangement breakpoints.
- Note that chromosome names must be integers: chrX and chrY are encoded as 23 and 24, respectively.
- dcc_project_code: histology or tissue type information.
- sv_id: ID for individual rearrangement
- sid: Patient ID for the rearrangement
- donor_unique_id: Patient ID for the rearrangement
- weights: Value from 0 to 1 representing the weight given to an individual connection within an entire rearrangement event.
Optional columns: if this information is not available, set the column values to an arbitrary value. This will not affect the ability to run the model.
- topo: rearrangement topology information
- topo_n: number of rearrangements involved in topology.
- mech: DNA damage repair mechanism predicted to generate the rearrangement.
- homseq: number of base pairs of microhomology at the breakpoint junctions.
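Because the column names must match exactly, it can help to check a file before handing it to MATLAB. The following is a minimal Python sketch, not part of SVsig itself. It assumes that the optional columns must still be present in the file (filled with arbitrary values), that column order does not matter, and that chromosome names may arrive with a chr prefix. The helper names check_header and normalize_chromosome and the sample row are hypothetical.

```python
import csv
import io

# Column names as listed above. Assumption: optional columns must still be present.
REQUIRED_COLUMNS = [
    "seqnames", "start", "strand", "altchr", "altpos", "altstrand",
    "dcc_project_code", "sv_id", "sid", "donor_unique_id", "weights",
]
OPTIONAL_COLUMNS = ["topo", "topo_n", "mech", "homseq"]


def check_header(header):
    """Return (missing, unexpected) column names for a CSV header row."""
    expected = set(REQUIRED_COLUMNS + OPTIONAL_COLUMNS)
    found = set(header)
    return sorted(expected - found), sorted(found - expected)


def normalize_chromosome(name):
    """Map X to 23 and Y to 24 (with or without a chr prefix); others to int."""
    label = str(name).strip()
    if label.lower().startswith("chr"):
        label = label[3:]
    special = {"X": 23, "Y": 24}
    if label.upper() in special:
        return special[label.upper()]
    return int(label)  # raises ValueError for non-integer names


# Hypothetical in-memory file; replace with open("your_file.csv", newline="").
sample = io.StringIO(
    "seqnames,start,strand,altchr,altpos,altstrand,dcc_project_code,"
    "sv_id,sid,donor_unique_id,weights,topo,topo_n,mech,homseq\n"
    "chrX,1000,+,chr7,5000,-,TYPE-A,sv1,s1,d1,1,NA,0,NA,0\n"
)
reader = csv.DictReader(sample)
missing, unexpected = check_header(reader.fieldnames)
rows = list(reader)
chroms = [
    (normalize_chromosome(r["seqnames"]), normalize_chromosome(r["altchr"]))
    for r in rows
]
print("missing:", missing, "unexpected:", unexpected, "chromosomes:", chroms)
```

An empty missing list and an empty unexpected list indicate that the header carries exactly the 15 expected names.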
SVsig-2D treats each rearrangement as occurring independently of the others.
- Open bin/run2DModel.m
- Set the paths to the working directory, the rearrangements file you wish to analyze, and the output destination file
- Ensure that the complex, weights, and model_exist parameters are false
- Run run2DModel.m
SVsig-2Dc accounts for novel connections that arise from neighboring rearrangements.
- To first identify neighboring rearrangements, run JaBbA to obtain a juxtapositions file.
- Open bin/run2DModel.m
- Set the paths to the working directory, rearrangements file you wish to analyze, and output destination file
- Set the weights and complex parameters to true.
- Run run2DModel.m
- After line 44 in mix_model_param.m, run mix_model_alpha.R
- Continue running mix_model_param.m until completion
- model_exist: Boolean to skip model training and use a pre-determined background model. If True, add path to background model in line 23 (complex model) or 25 (simple model) of runSVSig.m.
- len_filter: Only considers rearrangements above this length for calculating significance. Default is 1Mb.
- bks_cluster: Set to 1.
- FDR_THRESHOLD: FDR threshold for determining significance.
- output_file: path to output file
- complex: Boolean to run SVSig-2Dc (complex model).
- num_breakpoints_per_bin: Average number of breakpoints within a bin. Determines bin boundaries so that each tile has approximately this number of breakpoints. Currently not used.
- bin_length: Length of the bins used to divide the genome. The suggested range is 500 kb to 2 Mb. Note that the number of calculations scales quadratically as bin_length decreases.
- weights: Weight given to each individual connection, ranges from 0-1. Weight=1 for the simple model. For the complex model, weights are obtained from the juxtapositions file after running JaBbA
- genome_build: 'hg19' or 'hg_38'.
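To make the quadratic scaling of bin_length concrete, here is a rough Python sketch that counts bins and bin pairs for a few bin lengths. It is illustrative only: the genome length is an approximate value assumed here, the function tile_count is hypothetical, and SVsig's actual tiling (for example, per-chromosome bin boundaries) may give different counts.

```python
import math

# Assumption: approximate human genome length in base pairs (not from the source).
APPROX_GENOME_LENGTH = 3.1e9


def tile_count(bin_length, genome_length=APPROX_GENOME_LENGTH):
    """Return (number of bins, number of unordered bin pairs) for a bin length."""
    n_bins = math.ceil(genome_length / bin_length)
    # Unordered pairs of bins, including each bin paired with itself.
    n_tiles = n_bins * (n_bins + 1) // 2
    return n_bins, n_tiles


for bin_length in (2e6, 1e6, 5e5):
    n_bins, n_tiles = tile_count(bin_length)
    print(f"bin_length={bin_length:.0e}: {n_bins} bins, {n_tiles} tiles")
```

Under these assumptions, halving bin_length doubles the number of bins and roughly quadruples the number of tiles to evaluate.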
SVsig-2D and SVsig-2Dc output a file containing significantly recurrent events. Each unique event is denoted by a cluster number. The genomic coordinates, subtype, and ID information for each rearrangement in a cluster are displayed. In addition, the following columns are present:
- cluster_num: Cluster number each connection belongs to.
- pval: Significance for the rearrangement event.
- num_hits: Number of unique samples containing the rearrangement.
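As an illustration of how num_hits relates to the rows of a cluster, the Python sketch below counts unique samples per cluster_num. It is not SVsig's own code: the rows are made up, the function hits_per_cluster is hypothetical, and using sid (rather than donor_unique_id) as the sample identifier is an assumption.

```python
from collections import defaultdict

# Hypothetical rows; only the column names come from the descriptions above.
rows = [
    {"cluster_num": 1, "sid": "s1"},
    {"cluster_num": 1, "sid": "s2"},
    {"cluster_num": 1, "sid": "s2"},  # second rearrangement from the same sample
    {"cluster_num": 2, "sid": "s3"},
]


def hits_per_cluster(rows, sample_column="sid"):
    """Count unique samples per cluster (assumed meaning of num_hits)."""
    samples = defaultdict(set)
    for row in rows:
        samples[row["cluster_num"]].add(row[sample_column])
    return {cluster: len(ids) for cluster, ids in samples.items()}


hits = hits_per_cluster(rows)
print(hits)
```

A sample that contributes several rearrangements to the same cluster is counted once, which is why cluster 1 above has three rows but two hits.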
To ensure that SVsig is installed and running properly, we will run the file data/TUTORIAL_rearrangements.csv, which contains a random sampling of 100,000 rearrangements from the dataset in the manuscript. Change the following parameters (use the defaults for the remaining parameters) and run SVsig-2D:
- bin_length: 1e6
- FDR_THRESHOLD: 0.01
Runtime was measured to be around 7 minutes on a standard laptop with 16GB RAM. The expected output file is shown at results/TUTORIAL_hitsalljunctions_fdr0.01_1e6bins.txt.
To recreate the results in the manuscript from SVsig-2D, use the data/merged_1.6.1.csv file, which includes the full set of nearly 300,000 rearrangements from the PCAWG cohort. Additionally, use the following parameters:
- bin_length: 5e5
- FDR_THRESHOLD: 0.1
Common issues with running SVsig often involve the number of rearrangements in your dataset. SVsig requires a large number of rearrangements because they become sparse once distributed across the genome-wide adjacency matrix. Additionally, at least one rearrangement must exist on every chromosome. Ideally, your dataset contains at least 100,000 rearrangements, although we have run SVsig on data containing only 50,000 rearrangements. For smaller datasets, we recommend increasing the bin_length parameter and increasing the FDR threshold.
Another option for smaller datasets is to generate and load the background model using the PCAWG rearrangements (provided in this repo). Rearrangements in the new dataset that occur at a higher frequency than the PCAWG background rate can then be detected.
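The dataset requirements described above can be checked before starting a run. The Python sketch below is a hedged example rather than part of SVsig: the function dataset_warnings is hypothetical, the chromosome set 1 to 24 follows the integer encoding used for the input file, and treating a breakpoint at either end of a rearrangement as covering a chromosome is an assumption.

```python
def dataset_warnings(pairs, min_rearrangements=50_000, chromosomes=range(1, 25)):
    """Return warnings for a list of (seqnames, altchr) integer chromosome pairs."""
    pairs = list(pairs)
    seen = set()
    for chrom_a, chrom_b in pairs:
        # Assumption: a breakpoint at either end counts toward that chromosome.
        seen.add(chrom_a)
        seen.add(chrom_b)

    warnings = []
    if len(pairs) < min_rearrangements:
        warnings.append(
            f"only {len(pairs)} rearrangements; consider a larger bin_length, "
            "a higher FDR threshold, or the PCAWG background model"
        )
    empty = [c for c in chromosomes if c not in seen]
    if empty:
        warnings.append(f"no rearrangements on chromosomes: {empty}")
    return warnings


# Hypothetical toy input: three rearrangements touching chromosomes 1, 2, and 23.
toy = [(1, 1), (1, 2), (2, 23)]
for message in dataset_warnings(toy):
    print(message)
```

The default of 50,000 mirrors the smallest dataset size mentioned above; it is a soft guideline, not a limit enforced by SVsig.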
Author: Shu Zhang, [email protected], [email protected]
Contact: Rameen Beroukhim, [email protected]
License: GNU AGPL3, Copyright (C) 2023 Dana-Farber Cancer Institute
Please cite: Zhang S, Kumar KH, Shapira O, et al. Detecting significantly recurrent genomic connections from simple and complex rearrangements in the cancer genome. bioRxiv (2023). https://doi.org/10.1101/2023.10.13.561748