pybustools

This is a python interface to the bustools-format for single cell RNAseq usually produced using kallisto-bustools.

Installation

Note the required gmpy2 library, which itself relies on some c-libraries. These can be installed (on ubuntu) via apt-get install libgmp3-dev libgmp-dev libmpfr-dev libmpc-dev

git clone https://github.com/redst4r/pybustools
pip install -e pybustools

Usage

from pybustools.pybustools import Bus

B = Bus(folder='/path/to/bus/output', bus_name='sorted.bus')

# iterate the records one by one; it yields namedtuples
for record in B.iterate_bus():
    record.CB, record.UMI, record.EC, record.COUNT, record.FLAG  # str, str, int int, int

# aggregate the output of entire cells (WARNING: bus-file must be SORTED)
for cb, umi_list  in B.iterate_cells():
    """
    umi_list contains all UMIs of that cell:
          CB                    UMI         EC  Count  FLAG
    ('AAACCCAAGAACCGCA', [('GCATTAAATCAG', 822, 1,     0),
                          ('GCATTAAATCCG', 822, 1,     0)])
    """

    for umi, ec, count, flag in umi_list:
        pass

Some tricks for speed

The bus-format encodes sequences (CB/UMI) as ints (from the docs: ... the barcode GCCA corresponds to the bit code 10010100 or the integer 148). Decoding these into sequences is currently a performance bottleneck (done via gmpy2.digits() atm, which is already a pretty fast C-implementation). If you don't care about actual sequence, you can turn of the decoding step and iterating over the bus records will return ints for CB/UMI:

B = Bus(folder='/path/to/bus/output', decode_seq=False)
for record in B.iterate_bus():
    print(record.CB, record.UMI)  # these will be python ints now instead of str

Notes

Multiple bus records for the same CB/UMI

Ideally there's only as single bus record for each molecule (CB/UMI). However, occasionally, one finds multiple bus records with the same CB/UMI but different ECs.

Here's how that can happen:

A molecule (a specific mRNA) gets amplified. Some cDNAs of the molecule are longer, others shorter (PCR terminated early)
Sequencing these different cDNAs from the same molecule yields different reads.
These reads can map to different ECs.

exons:  -----Exon1-----|----Exon2------
mRNA:   --------------------------------
cDNA1:  -------------                <-  EC: {E1}
cDNA2:  ------------------------     <-  EC: {E1,E2}

Presumably, it can also map to a completely different region, due to sequencing/PCR errors!?

Speed

It's generally pretty fast when turning off the sequence decoding. We can process approximately 2 million bus entries per second (standard hardware). Slight gains might be achived by tuning the buffersize: (y-axis is sec/1e7 entries)

TODO

parsing the binary format is slow; the bottleneck is busio._decode_int_to_ACGT() and gmpy2.digits() in there. This converts the int encoding into cDNA sequence

Name		Name	Last commit message	Last commit date
Latest commit History 124 Commits
pybustools		pybustools
speed		speed
src		src
testing		testing
.gitignore		.gitignore
.travis.yml		.travis.yml
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
setup.py.old		setup.py.old

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pybustools

Installation

Usage

Some tricks for speed

Notes

Multiple bus records for the same CB/UMI

Speed

TODO

About

Releases

Packages

Languages

License

redst4r/pybustools

Folders and files

Latest commit

History

Repository files navigation

pybustools

Installation

Usage

Some tricks for speed

Notes

Multiple bus records for the same CB/UMI

Speed

TODO

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages