Dropulation - assignments.tsv.gz only contains 20% of filtered barcodes #29

Thapeachydude · 2023-12-21T09:10:36Z

Hi,

I'm working with 10x data from multiple pools with ≈ 10 donors each, which I would like to demultiplex using WES/WGS reference data. Following your recommendations and your guide, I ran Dropulation, but unfortunately most of the barcodes seem to be lost during the AssignCellsToSamples step.
Specifically, the filtered barcodes 10x output contains ≈ 60k barcodes, but the assignments.tsv.gz file only has 13k.
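A quick way to quantify this gap is to compare the two barcode sets directly. The following is a minimal sketch, not part of Dropulation or Demuxafy: the function names are hypothetical, and the name of the barcode column in assignments.tsv.gz is an assumption that has to be passed in by the caller. It is also worth checking that both files use the same barcode format (for example a trailing `-1`), since a mismatch would look like missing barcodes.

```python
import csv
import gzip


def read_barcodes(path, column=None):
    """Return the set of barcodes in a plain-text or gzipped file.

    With column=None the file is treated as one barcode per line
    (e.g. a 10x barcodes.tsv.gz). Otherwise it is parsed as a
    tab-separated table with a header, and `column` is read.
    The column name is an assumption; check the file header first.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        # Skip blank lines and '#' comment lines, if any are present.
        lines = (ln for ln in handle if ln.strip() and not ln.startswith("#"))
        if column is None:
            return {ln.strip() for ln in lines}
        return {row[column] for row in csv.DictReader(lines, delimiter="\t")}


def barcode_overlap(filtered, assigned):
    """Summarise how many filtered barcodes also appear in the assignments."""
    shared = filtered & assigned
    return {
        "filtered": len(filtered),
        "assigned": len(assigned),
        "shared": len(shared),
        "missing": len(filtered - assigned),
        "fraction_assigned": len(shared) / len(filtered) if filtered else 0.0,
    }


if __name__ == "__main__":
    # Toy barcodes for illustration only; real input would come from read_barcodes().
    filtered = {"AAAC-1", "AAAG-1", "AAAT-1", "AACA-1", "AACC-1"}
    assigned = {"AAAC-1"}
    print(barcode_overlap(filtered, assigned))
```

Writing the `missing` barcodes to a file and inspecting their UMI counts can then show whether the lost barcodes are mostly low-depth droplets.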

Happy about any feedback : )

Best,
M

drneavin · 2023-12-21T20:25:52Z

Hi @Thapeachydude ,

Thanks for reaching out! First, I'd like to get a bit more background on this experiment. That seems like a lot of barcodes in the original 10x file. Can you tell me which 10x platform was used for capture? We typically aim for 20k on the normal platform and 40k on the newer high-throughput platform. If more were captured, it might result in more ambient RNA. Does the knee plot in the output look as expected for low amounts of ambient RNA?

Thapeachydude · 2023-12-21T20:41:15Z

Hi, the experiment was done on a nuclei prep using the 5-prime HT kit. I've found that with nuclei the cell count tends to be a bit higher than what one aims for (either due to some ambient contamination, or because they're just harder to count accurately due to their size; we use automated counting). Naturally, this will result in a higher doublet rate. But our donors tend to separate very clearly transcriptionally, making the identification of doublets possible even at a high doublet rate. This particular pool is one in a series we've recently done. The reads-in-droplet percentage varies a bit, ≈ 60-90% (most are around 80%). If some barcodes were lost during the process I would understand, but 13k is a bit too low, suggesting something is not going as intended.

(btw. I've tried using souporcell in the known genotypes mode, for some pools this works very well, for others the identified clusters don't really match the reference profiles).

drneavin · 2023-12-21T21:00:01Z

Hi @Thapeachydude

Thanks for the additional information. Yes, that all makes sense. We also applied dropulation to some HT experiments recently and found a similar pattern, with very few being called as singlets. I think it's worth getting @jamesnemesh's opinion here since he's the developer of Dropulation and provided me with the script for calling singlets and doublets. It may be that the thresholds have to be altered slightly for HT data or that there are some assumptions that are not met with that many cells. @jamesnemesh any thoughts or recommendations to test?

We also know that the performance of most of the methods decreases with higher ambient RNA. The decrease seems to be smaller for souporcell and vireo in pools with fewer than 10 donors. Since you've already run souporcell, could you check the ambient percent it estimated?

Thapeachydude · 2023-12-22T00:03:44Z

Hi, in the troublesome pools it's ≈ 45%, unfortunately. I guess that makes sense :/
The ones that demultiplexed reasonably well with souporcell were at ≈ 20%, which is only slightly above what I seem to get from single-cell data (never checked before, but seems to be around 15%). Curiously, removing the big blob of ambient droplets/doublets I see in a UMAP prior to running souporcell actually decreased its performance.

drneavin · 2023-12-22T00:26:08Z

We simulated up to 25% additional ambient RNA in Demuxafy (currently under review; the bioRxiv preprint hasn't been updated with the most recent changes), so we didn't get that high in the simulations. But on average I have noticed that nuclei data result in higher ambient RNA. Our single-cell datasets are usually ~5-15% depending on the experiment and design.

That is interesting about removing the cells for souporcell. I'm wondering if assumptions for the model for estimating ambient RNA are violated when you remove those cells because you've removed some data that could be important for continuity of the model.

You may want to try vireo as an alternative to pair with souporcell, since it is slightly more robust to ambient RNA than the other methods, but I still think you may end up with many unassigned cells. I would also recommend adding the --callAmbientRNAs flag when running vireo, since this will estimate the ambient RNA in each droplet. It's still listed as beta testing, but I've found it to be relatively consistent with souporcell and CellBender ambient estimates and a helpful metric for QC.

jamesnemesh · 2023-12-22T02:42:06Z

I can't speak to @drneavin's code, but the two donor assignment programs will emit a cell barcode in the output for every cell in the input, as long as there's at least one transcribed SNP. You do need something like 100 or so transcribed SNP UMI observations to have decent performance. If the original AssignCellsToSamples output is available, I'd take a look at that. Doublet detection will have similar issues with needing around 100 or so transcribed SNPs, and may call cells with insufficient data as doublets more often.
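As a rough illustration of the "100 or so" rule of thumb above, the sketch below splits per-cell rows into those with enough SNP UMI observations and those without. It is only a sketch under stated assumptions: the function name is hypothetical, the column name `num_umis` is an assumption about the AssignCellsToSamples output that should be checked against the actual file header, and the threshold of 100 is the approximate figure quoted in this comment rather than a hard cutoff.

```python
def split_by_snp_umis(rows, umi_column="num_umis", min_umis=100):
    """Partition per-cell rows by the number of transcribed SNP UMI observations.

    `rows` is any iterable of dict-like records (e.g. from csv.DictReader).
    `umi_column` is an ASSUMED column name; adjust it to match the real header.
    Returns (sufficient, insufficient) lists of rows.
    """
    sufficient, insufficient = [], []
    for row in rows:
        # Values read from a TSV arrive as strings, so convert before comparing.
        target = sufficient if int(row[umi_column]) >= min_umis else insufficient
        target.append(row)
    return sufficient, insufficient


if __name__ == "__main__":
    # Toy rows for illustration only.
    demo = [
        {"cell": "AAAC-1", "num_umis": "250"},
        {"cell": "AAAG-1", "num_umis": "40"},
        {"cell": "AAAT-1", "num_umis": "100"},
    ]
    ok, low = split_by_snp_umis(demo)
    print(len(ok), "cells with enough SNP UMIs;", len(low), "below threshold")
```

If most of the barcodes missing from the final assignments fall in the low group, that would point to insufficient data per barcode rather than a problem in the pipeline itself.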

When we run these programs, we tend to focus on the cell barcodes we think are actually cells (or nuclei) in the experiment. We have a separate cell selection process that's slightly more useful than a knee plot: we use a combination of CellBender and visualization of the UMIs (log10) vs %intronic. CellBender emits a probability that each cell is an empty or non-empty droplet, and the non-empty droplets become the superset from which we select cells. In the following plot, the (retained) on the X axis refers to the cell barcode library size after CellBender remove-background has been applied.

We use the AssignCellsToSamples CELL_BC_FILE argument to focus analysis on those 7613 nuclei in the upper right hand corner of the plot. Donor assignment is going to be pretty awful on the rest of the cell barcodes and will likely assign more of them as doublets, since they are empty droplets that capture a mix of many donors. If you included all cell barcodes in the plot, you'd likely infer the wrong doublet rate.
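The selection described above could be sketched roughly as follows. This is not the actual selection code used for the plot: the function names, the input column names (`barcode`, `cell_probability`, `num_umis`, `frac_intronic`) and all three thresholds are hypothetical placeholders that would need to be chosen per experiment from the UMIs vs %intronic plot. The one-barcode-per-line output format is likewise an assumption about what CELL_BC_FILE expects and should be checked against the Dropulation documentation.

```python
import math


def select_cell_barcodes(rows, min_cell_prob=0.5, min_log10_umis=3.0,
                         min_frac_intronic=0.5):
    """Keep barcodes that look like real nuclei rather than empty droplets.

    `rows` is an iterable of dict-like records with ASSUMED keys:
    'barcode', 'cell_probability' (e.g. from CellBender), 'num_umis'
    (library size after background removal) and 'frac_intronic' (0-1).
    All thresholds are placeholders, not recommended values.
    """
    selected = []
    for row in rows:
        umis = float(row["num_umis"])
        if umis <= 0:
            continue  # log10 is undefined; such barcodes carry no data anyway
        if (float(row["cell_probability"]) >= min_cell_prob
                and math.log10(umis) >= min_log10_umis
                and float(row["frac_intronic"]) >= min_frac_intronic):
            selected.append(row["barcode"])
    return selected


def write_barcode_file(barcodes, path):
    """Write one barcode per line (assumed format for CELL_BC_FILE)."""
    with open(path, "w") as handle:
        for barcode in barcodes:
            handle.write(f"{barcode}\n")


if __name__ == "__main__":
    # Toy rows for illustration only.
    demo = [
        {"barcode": "AAAC-1", "cell_probability": 0.99, "num_umis": 5000, "frac_intronic": 0.70},
        {"barcode": "AAAG-1", "cell_probability": 0.99, "num_umis": 200, "frac_intronic": 0.70},
        {"barcode": "AAAT-1", "cell_probability": 0.10, "num_umis": 8000, "frac_intronic": 0.20},
    ]
    print(select_cell_barcodes(demo))
```

The resulting file would then be passed via CELL_BC_FILE so that donor assignment and doublet detection only see the selected nuclei.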

I haven't seen any experiments where 60K true cells/nuclei works very well; the doublet rate would be very high at that loading.

jamesnemesh · 2024-03-26T14:26:52Z

@drneavin This might be the best place to put this:

We've released both a much fuller set of documentation and an R library that generates a number of useful QC plots and evaluates the donor assignment and doublet detection outputs.

For people having issues, it may be very helpful to look at the docs to see if the programs are running correctly, and run the QC plots to have a common starting point for discussion about issues.
