Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression

This repository contains data, code, and figures generated for the manuscript:

Laura Luebbert, Delaney K Sullivan, Maria Carilli, Kristján Eldjárn Hjörleifsson, Alexander Viloria Winnett, Tara Chari, Lior Pachter (2023). [Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression](https://www.biorxiv.org/content/10.1101/2023.12.11.571168). bioRxiv 2023.12.11.571168; doi: https://doi.org/10.1101/2023.12.11.571168

The preprint is available on bioRxiv: https://www.biorxiv.org/content/10.1101/2023.12.11.571168

💡 General tutorials with example data can be found on the kallisto bustools website:

When interpreting the presence of RdRP-like sequences / virus IDs, keep in mind that there will likely be many RdRP-like sequences introduced by contamination of laboratory reagents. A (non-comprehensive) list of virus IDs observed in blank sequencing data is available here.
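One way to act on this caveat is to check detected virus IDs against the IDs seen in blank sequencing data before interpreting them. The sketch below is a minimal illustration, not code from this repository: the function name `split_by_blank_list` and all example IDs are hypothetical, and it assumes virus IDs are plain strings.

```python
def split_by_blank_list(detected_ids, blank_ids):
    """Split detected virus IDs into those absent from and present in a blank list.

    Both arguments are iterables of ID strings. Order of detected_ids is preserved.
    """
    blank = set(blank_ids)
    kept = [v for v in detected_ids if v not in blank]
    flagged = [v for v in detected_ids if v in blank]
    return kept, flagged


# Hypothetical IDs, for illustration only
detected = ["u10", "u202", "u3050"]
blanks = ["u202", "u999"]

kept, flagged = split_by_blank_list(detected, blanks)
print("kept:", kept)
print("flagged as possible reagent contaminants:", flagged)
```

Because the blank list is non-comprehensive, an ID that is kept by this check is not thereby confirmed as a genuine signal; the check only flags known contaminant candidates.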

The Notebooks folder contains the code for all analyses in the preprint, from pre-processing of the raw data through final figure generation. Each notebook links directly to Google Colaboratory, where it can be executed.

Large datasets are stored on Caltech Data and can be accessed under the DOIs 10.22002/krqmp-5hy81 and 10.22002/k7xqw-88d74.

Click here to view the interactive Krona plot showing all viruses expressed above the QC threshold in macaque cells that passed quality control, broken down by animal, timepoint, taxonomy, and fraction of positive cells occupied by each virus. Code to reproduce the Krona plot

The precomputed_refs folder contains precomputed reference indices for detecting viral RNA in sequencing data by alignment to the optimized PalmDB, with the human (or mouse) genome and transcriptome masked.

A description of kallisto, bustools, and kb-python including tutorials for their use can be found here: https://www.biorxiv.org/content/10.1101/2023.11.21.568164



# 1. Install kb-python (optional: install gget to fetch the host genome and transcriptome)
pip install kb-python gget

# 2. Download optimized PalmDB reference files
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt

# 3. Create reference index (+ optional masking of the host, here human, genome using the D-list)
# Single-thread runtime: 1.5 h; Max RAM: 4.4 GB; Size of generated index: 593 MB
# Without D-list: Single-thread runtime: 3.5 min; Max RAM: 3.9 GB; Size of generated index: 592 MB
kb ref \
    --aa \
    -k 55 \
    --d-list $(gget ref --ftp -w dna homo_sapiens) \
    -i index.idx --workflow custom \
    palmdb_rdrp_seqs.fa
    
# 4. Align sequencing reads
# Single-thread runtime: 1.5 min / 1 million sequences; Max RAM: 2.1 GB
kb count \
    --aa \
    -k 55 \
    -i index.idx -g palmdb_clustered_t2g.txt \
    --parity single \
    -x default \
    $USER_DATA.fastq.gz
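
The single-thread figure quoted for step 4 (1.5 min per 1 million sequences) can be turned into a rough runtime estimate for a dataset of a given size. The sketch below assumes runtime scales linearly with the number of reads, which the figure suggests but this README does not state; the function name is hypothetical.

```python
# Single-thread alignment rate quoted in step 4 of the commands above
MINUTES_PER_MILLION_READS = 1.5


def estimate_alignment_minutes(n_reads):
    """Linearly extrapolate single-thread alignment time (minutes) from the quoted rate."""
    if n_reads < 0:
        raise ValueError("n_reads must be non-negative")
    return n_reads / 1_000_000 * MINUTES_PER_MILLION_READS


# Example: a hypothetical dataset of 200 million reads
minutes = estimate_alignment_minutes(200_000_000)
print(f"Estimated single-thread runtime: {minutes:.0f} min ({minutes / 60:.1f} h)")
```

Actual runtimes depend on hardware, read length, and thread count, so treat the result as an order-of-magnitude guide only.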
