Pipeline to predict HiC matrices from ATAC-seq fastq files using maxATAC and C.Origami
If you are working on Ultraviolet (aka BigPurple), you need to set up an environment, which means installing snakemake and mamba (so a conda within a conda).
module load anaconda3/gpu
conda create -n snakemake -c conda-forge -c bioconda mamba snakemake
To configure conda to work on Ultraviolet, you can edit the ultraviolet.yaml that comes with this repo to your liking and copy it to your .config directory as shown.
mkdir -p ~/.config/snakemake/ultraviolet/
cp ultraviolet.yaml ~/.config/snakemake/ultraviolet/config.yaml
To run the pipeline you need to:
- create a samplesheet with information about your samples
- set up a directory with the data needed to run C.Origami
- edit the config/config.yaml file to specify the paths to the relevant directories.
The samplesheet should list the sample names, the replicate names and the paths to the fastq files.
See the config/sample_meta.csv that comes with this directory for an example.
Alternatively, you can specify a column called Run with the SRR IDs of the samples, and the pipeline should automatically download them for you (NOT TESTED).
The C.Origami directory should look like this:
<corigami_base>/
├── data
│   ├── <genome>
│   │   ├── centrotelo.bed
│   │   └── dna_sequence
│   │       ├── chr10.fa.gz
│   │       ├── chr11.fa.gz
│   │       ├── ...
│   │       ├── chrX.fa.gz
│   │       └── chrY.fa.gz
│   └── <genome>_tiles.bed
└── model_weights
    └── <corigami_model>.ckpt
Where corigami_base, genome and corigami_model are specified in config/config.yaml.
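Before launching the pipeline, it can help to confirm that the directory matches the layout above. The following is a minimal sketch, not part of the repo: the function names `expected_paths` and `missing_paths` are hypothetical, and it assumes that `corigami_model` is given without the `.ckpt` extension, as the tree suggests (a name that already ends in `.ckpt` is accepted as well).

```python
from pathlib import Path


def expected_paths(corigami_base, genome, corigami_model):
    """Return the paths the layout above implies (hypothetical helper)."""
    base = Path(corigami_base)
    # Assumption: corigami_model is the checkpoint name without ".ckpt".
    ckpt_name = (
        corigami_model
        if corigami_model.endswith(".ckpt")
        else f"{corigami_model}.ckpt"
    )
    return [
        base / "data" / genome / "centrotelo.bed",
        base / "data" / genome / "dna_sequence",
        base / "data" / f"{genome}_tiles.bed",
        base / "model_weights" / ckpt_name,
    ]


def missing_paths(corigami_base, genome, corigami_model):
    """Return the expected paths that do not exist on disk."""
    return [
        path
        for path in expected_paths(corigami_base, genome, corigami_model)
        if not path.exists()
    ]
```

Calling `missing_paths` with the values from config/config.yaml returns an empty list when every expected file and directory is present. The sketch does not check the individual chromosome files under dna_sequence.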
The <corigami_base>/data directory can be:
- Downloaded from here (you will need to untar it)
- If you work from within Ultraviolet, symlinked from here:
/gpfs/data/tsirigoslab/home/jt3545/hic_prediction/C.Origami-release/corigami_data/
To get the model weights you can either:
- Train your own model and save the checkpoint. Ask Javier Rodriguez Hernaez for details.
- Download a pretrained hg38 model checkpoint created by Javier from here.
If you are on Ultraviolet, symlink or copy the following path:
/gpfs/home/rodrij92/PROJECTS/SHARE/epoch=53-step=64260.ckpt
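The symlinking described above could be scripted roughly as follows. This is a sketch under assumptions rather than something shipped with the repo: `link_corigami_inputs` is a hypothetical helper, and it assumes the shared corigami_data directory maps directly onto `<corigami_base>/data` and that the checkpoint keeps its original file name under `model_weights`.

```python
from pathlib import Path


def link_corigami_inputs(corigami_base, data_src, ckpt_src):
    """Symlink a shared data directory and checkpoint into corigami_base.

    Hypothetical helper: existing files or links are left untouched.
    """
    base = Path(corigami_base)
    weights_dir = base / "model_weights"
    weights_dir.mkdir(parents=True, exist_ok=True)

    # <corigami_base>/data -> shared data directory
    data_link = base / "data"
    if not (data_link.is_symlink() or data_link.exists()):
        data_link.symlink_to(Path(data_src), target_is_directory=True)

    # <corigami_base>/model_weights/<checkpoint file> -> shared checkpoint
    ckpt_link = weights_dir / Path(ckpt_src).name
    if not (ckpt_link.is_symlink() or ckpt_link.exists()):
        ckpt_link.symlink_to(Path(ckpt_src))

    return data_link, ckpt_link
```

On Ultraviolet, `data_src` and `ckpt_src` would be the two /gpfs paths listed above, and the checkpoint name set in config/config.yaml has to match the linked file.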
The main parameters you may need to specify are:
- genome: either hg38 or mm10
- sample_meta: path to samplesheet
- corigami_base: directory with C.Origami data
- corigami_model: name of checkpoint file under <corigami_base>/model_weights
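As a quick sanity check on the four parameters listed above, something like the following could be run on the configuration once it has been loaded into a dictionary. It is only a sketch: `config_problems` is a hypothetical function, loading the YAML file is left out, and the pipeline may accept or require other keys not covered here.

```python
# Keys and genome values taken from the parameter list above.
REQUIRED_KEYS = ("genome", "sample_meta", "corigami_base", "corigami_model")
SUPPORTED_GENOMES = ("hg38", "mm10")


def config_problems(cfg):
    """Return a list of human-readable problems found in a config dict."""
    problems = [f"missing or empty key: {key}" for key in REQUIRED_KEYS if not cfg.get(key)]
    genome = cfg.get("genome")
    if genome and genome not in SUPPORTED_GENOMES:
        problems.append(f"unsupported genome: {genome}")
    return problems
```

An empty list means the four parameters are present and the genome is one of the two supported values. It says nothing about whether the paths themselves exist.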
If you have specified everything correctly, you can launch the pipeline by executing the following commands on Ultraviolet:
conda activate snakemake # activate environment you created in Ultraviolet if you don't have snakemake
snakemake --profile ultraviolet
The repo has the script workflow/scripts/predict_translocation.py (still under development) that is not part of the pipeline, but can be used to predict the result of simple translocations. Simple translocations are defined as a merger of 2 chromosomes at a specific position (no indels or substitutions involved). Below is a schematic of all the simple translocations:
These translocations are defined in a VCF file like this:
#CHROM POS ID REF ALT QUAL FILTER INFO
2 321681 bnd_W G G]17:198982] 6 PASS .
2 321682 bnd_V T ]13:123456]T 6 PASS .
13 123456 bnd_U C C[2:321682[ 6 PASS .
13 123457 bnd_X A [17:198983[A 6 PASS .
17 198982 bnd_Y A A]2:321681] 6 PASS .
17 198983 bnd_Z C [13:123457[C 6 PASS .
These specific translocations can be given as arguments to predict_translocation.py as follows (enclose them in single quotes to stop bash from interpreting the symbols):
chr2:321681]chr17:198982]
]chr13:123456]chr2:321682
chr13:123456[chr2:321682[
[chr17:198983[chr13:123457
chr17:198982]chr2:321681]
[chr13:123457[chr17:198983
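Comparing the VCF rows with the argument strings above, the mapping appears to be: replace the reference base in ALT with `chr<CHROM>:<POS>` and add the `chr` prefix to the mate position, keeping the brackets as they are. The sketch below implements that inferred rule. `bnd_to_argument` is a hypothetical helper, not the parser used by predict_translocation.py, and it only handles the four single-breakend ALT shapes shown in the example.

```python
import re

# The four breakend ALT shapes: t[p[  t]p]  ]p]t  [p[t
_BND_ALT = re.compile(
    r"^(?P<lead>[ACGTNacgtn]*)"
    r"(?P<open>[\[\]])(?P<chrom>[^:\[\]]+):(?P<pos>\d+)(?P<close>[\[\]])"
    r"(?P<trail>[ACGTNacgtn]*)$"
)


def _with_chr(name):
    """Add the 'chr' prefix unless it is already there."""
    name = str(name)
    return name if name.startswith("chr") else f"chr{name}"


def bnd_to_argument(chrom, pos, alt):
    """Convert one VCF breakend row (CHROM, POS, ALT) to an argument string."""
    match = _BND_ALT.match(alt)
    if match is None or match["open"] != match["close"]:
        raise ValueError(f"not a simple breakend ALT: {alt!r}")
    # Exactly one side of the brackets must carry the reference base(s).
    if bool(match["lead"]) == bool(match["trail"]):
        raise ValueError(f"ambiguous breakend ALT: {alt!r}")
    bracket = match["open"]
    local = f"{_with_chr(chrom)}:{pos}"
    mate = f"{bracket}{_with_chr(match['chrom'])}:{match['pos']}{bracket}"
    # Keep the local position on the same side as the base in ALT.
    return local + mate if match["lead"] else mate + local
```

For the first row of the example VCF, `bnd_to_argument("2", 321681, "G]17:198982]")` gives `chr2:321681]chr17:198982]`, which matches the first argument listed above.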