Dropulation - assignments.tsv.gz only contains 20% of filtered barcodes #29

Open
Thapeachydude opened this issue Dec 21, 2023 · 7 comments


@Thapeachydude

Hi,

I'm working with 10x data from multiple pools of ≈ 10 donors each, which I would like to demultiplex using WES/WGS reference data. Following your recommendations and your guide, I ran Dropulation, but unfortunately most of the barcodes seem to be lost during the AssignCellsToSamples step.
Specifically, the filtered barcodes 10x output contains ≈ 60k barcodes, but the assignments.tsv.gz file only has 13k.
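
One way to quantify the gap is to compare the two barcode sets directly. This is a minimal sketch, not part of Dropulation or Demuxafy. It assumes the assignments table is tab-separated, may start with `#` comment lines, and has a barcode column named `cell` (all assumptions about the file layout). It also strips a trailing `-1` style suffix, since a suffix mismatch between the two files could make barcodes look missing when they are not.

```python
import csv
import gzip


def strip_suffix(barcode):
    """Drop a trailing '-<digits>' suffix such as '-1', if present."""
    head, sep, tail = barcode.rpartition("-")
    return head if sep and tail.isdigit() else barcode


def read_barcodes(path):
    """Read one barcode per line from a gzipped file (barcodes.tsv.gz style)."""
    with gzip.open(path, "rt") as handle:
        return {strip_suffix(line.strip()) for line in handle if line.strip()}


def read_assigned(path, column="cell"):
    """Read the barcode column from a gzipped, tab-separated assignments table.

    The column name 'cell' and the '#' comment prefix are assumptions.
    """
    with gzip.open(path, "rt") as handle:
        rows = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(rows, delimiter="\t")
        return {strip_suffix(row[column]) for row in reader}


def coverage(filtered, assigned):
    """Return (n_shared, n_missing, fraction of filtered barcodes assigned)."""
    shared = filtered & assigned
    missing = filtered - assigned
    fraction = len(shared) / len(filtered) if filtered else 0.0
    return len(shared), len(missing), fraction
```

Calling `coverage(read_barcodes(...), read_assigned(...))` on the two files (paths omitted here, since they depend on the run) gives the shared count, the missing count, and the assigned fraction. With the numbers reported above, 13k of ≈ 60k is a fraction of roughly 0.22.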

Happy about any feedback : )

Best,
M

@drneavin
Owner

Hi @Thapeachydude ,

Thanks for reaching out! First, I'd like to get a bit more background on this experiment. It seems like a lot of barcodes in the original 10x file, can you tell me what 10x platform was used for capture? We typically aim for 20k on the normal platform and 40k on the newer high-throughput platform. If more were captured, it might result in more ambient RNA. Does the knee plot in the output look as expected for low amounts of ambient RNA?

@Thapeachydude
Author

Hi, the experiment was done on a nuclei prep using the 5-prime HT kit. I've found that with nuclei the cell count tends to be a bit higher than what one aims for (either due to some ambient contamination, or because they're just harder to count accurately due to their size; we use automated counting). Naturally, this will result in a higher doublet rate. But our donors tend to separate very clearly transcriptionally, making the identification of doublets possible even at a high doublet rate. This particular pool is one in a series we've recently done. The reads-in-droplet percentage varies a bit, ≈ 60-90% (most are around 80%). If some barcodes were lost during the process I would understand, but 13k is a bit too low, suggesting something is not going as intended.

(Btw, I've tried using souporcell in the known genotypes mode; for some pools this works very well, for others the identified clusters don't really match the reference profiles.)

@drneavin
Owner

Hi @Thapeachydude

Thanks for the additional information. Yes, that all makes sense. We also applied Dropulation to some HT experiments recently and found a similar pattern, with very few droplets being called as singlets. I think it's worth getting @jamesnemesh's opinion here since he's the developer of Dropulation and provided me with the script for calling singlets and doublets. It may be that the thresholds have to be altered slightly for HT data or that some assumptions are not met with that many cells. @jamesnemesh any thoughts or recommendations to test?

We also know that the performance of most of the methods decreases with higher ambient RNA. The decrease seems to be smaller for souporcell and vireo in pools with fewer than 10 donors. Since you've already run souporcell, could you check the ambient percent it estimated?

@Thapeachydude
Author

Thapeachydude commented Dec 22, 2023

Hi, in the troublesome pools it's ≈ 45%, unfortunately. I guess that makes sense :/
The ones that demultiplexed reasonably well with souporcell were at ≈ 20%, which is only slightly above what I seem to get from single-cell data (never checked before, but it seems to be around 15%). Curiously, removing the big blob of ambient droplets/doublets I see in a UMAP prior to running souporcell actually decreased its performance.

@drneavin
Owner

We simulated up to 25% additional ambient RNA in Demuxafy (currently under review; the bioRxiv preprint hasn't been updated with the most recent changes), so we didn't get that high in the simulations. But on average I have noticed that nuclei data result in higher ambient RNA. Our single-cell datasets are usually ~5-15% depending on the experiment and design.

That is interesting about removing the cells for souporcell. I'm wondering if assumptions for the model for estimating ambient RNA are violated when you remove those cells because you've removed some data that could be important for continuity of the model.

You may want to try vireo as an alternative to pair with souporcell, since it is slightly more robust to ambient RNA than the other methods, but I still think you may end up with many unassigned cells. I would also recommend adding the --callAmbientRNAs flag when running vireo, since this will estimate the ambient RNA in each droplet. It's still listed as beta testing, but I've found it to be relatively consistent with souporcell and CellBender ambient estimates and a helpful metric for QC.

@jamesnemesh

jamesnemesh commented Dec 22, 2023

I can't speak to @drneavin's code, but the two donor assignment programs will emit a cell barcode in the output for every cell in the input, as long as there's at least one transcribed SNP. You do need something like 100 or so transcribed SNP UMI observations to have decent performance. If the original AssignCellsToSamples output is available, I'd take a look at that. Doublet detection will have similar issues, needing around 100 or so transcribed SNPs, and may more often call cells with insufficient data as doublets.
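
To see how many barcodes fall below that rough threshold, the assignment rows can be split by their SNP UMI count. This is a minimal sketch, assuming the rows have already been parsed into dictionaries and that the count lives in a column named `num_umis` (a hypothetical name; check the header of the actual output). The cutoff of 100 is the ballpark figure mentioned above, not a hard limit.

```python
def split_by_snp_umis(rows, min_umis=100, umi_column="num_umis"):
    """Partition assignment rows by whether they have enough SNP UMI observations.

    rows: iterable of dicts, one per cell barcode.
    umi_column: assumed name of the column holding the SNP UMI count.
    Returns (enough, too_few) as two lists of rows.
    """
    enough, too_few = [], []
    for row in rows:
        target = enough if int(row[umi_column]) >= min_umis else too_few
        target.append(row)
    return enough, too_few


# Toy rows for illustration only; real rows would come from the assignment output.
rows = [
    {"cell": "AAAC", "num_umis": "250"},
    {"cell": "GGGT", "num_umis": "12"},
]
enough, too_few = split_by_snp_umis(rows)
print(len(enough), "barcodes with enough data,", len(too_few), "with too little")
```

A large `too_few` group would suggest that many of the input barcodes simply carry too little genotype information to be assigned with confidence.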

When we run these programs, we tend to focus on the cell barcodes we think are actually cells (or nuclei) in the experiment. We have a separate cell selection process that's slightly more useful than a knee plot: we use a combination of CellBender and visualization of the UMIs (log10) vs %intronic. CellBender emits a probability that each cell barcode is an empty or non-empty droplet, and the non-empty droplets become the superset from which we select cells. In the following plot, the (retained) on the X axis refers to the cell barcode library size after CellBender remove-background has been applied.

[Image not preserved: plot of UMIs (log10, retained) on the X axis vs. %intronic]

We use the AssignCellsToSamples CELL_BC_FILE argument to focus analysis on those 7613 nuclei in the upper right hand corner of the plot. Donor assignment is going to be pretty awful on the rest of the cell barcodes and will likely assign more of them as doublets, since they are empty droplets that capture a mix of many donors. If you included all cell barcodes in the plot, you'd likely infer the wrong doublet rate.

I haven't seen any experiments where 60K true cells/nuclei works very well; the doublet rate would be very high at that loading.
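
The selection described above can be sketched roughly as follows. This is not the actual selection code, only an illustration under stated assumptions: each droplet is a dictionary with hypothetical keys `barcode`, `cell_probability` (the CellBender non-empty probability), `umis` (library size after background removal) and `pct_intronic`. The three thresholds are placeholders that would need to be chosen by looking at the plot for each experiment. The output is written as one barcode per line, which is assumed to be the format CELL_BC_FILE expects.

```python
import math


def select_cell_barcodes(droplets, min_cell_prob=0.5,
                         min_log10_umis=3.0, min_pct_intronic=50.0):
    """Keep barcodes that look like real nuclei.

    All three thresholds are illustrative placeholders, not recommended values.
    """
    selected = []
    for droplet in droplets:
        if droplet["cell_probability"] < min_cell_prob:
            continue  # CellBender considers this droplet likely empty
        if droplet["umis"] <= 0:
            continue  # log10 is undefined for empty libraries
        if math.log10(droplet["umis"]) < min_log10_umis:
            continue  # library too small
        if droplet["pct_intronic"] < min_pct_intronic:
            continue  # low intronic fraction, less likely to be a nucleus
        selected.append(droplet["barcode"])
    return selected


def write_barcode_file(barcodes, path):
    """Write one barcode per line (assumed CELL_BC_FILE format)."""
    with open(path, "w") as handle:
        for barcode in barcodes:
            handle.write(barcode + "\n")
```

The resulting file would then be passed to AssignCellsToSamples through its CELL_BC_FILE argument, so that donor assignment only runs on the selected barcodes.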
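
As a rough illustration of why loading matters, the sketch below combines two simple calculations. The first is a linear rule-of-thumb multiplet estimate; its `rate_per_1000_cells` parameter is an assumption the caller must supply (for example from the kit's user guide), and nothing here is taken from Dropulation. The second is the fraction of doublets that contain two different donors, assuming the two cells in a doublet are drawn independently according to the donor proportions in the pool. Only those cross-donor doublets can be detected from genotypes.

```python
def expected_doublet_rate(n_cells, rate_per_1000_cells):
    """Linear rule-of-thumb multiplet rate, capped at 1.0.

    rate_per_1000_cells is a caller-supplied assumption, not a measured value.
    """
    return min(1.0, n_cells / 1000 * rate_per_1000_cells)


def cross_donor_fraction(donor_proportions):
    """Fraction of doublets made of two different donors.

    Assumes both cells are drawn independently; proportions are normalised,
    so cell counts per donor can be passed as well.
    """
    total = sum(donor_proportions)
    return 1.0 - sum((p / total) ** 2 for p in donor_proportions)


# Ten equally represented donors: 1 - 10 * (1/10)**2 = 0.9
print(round(cross_donor_fraction([1] * 10), 3))
```

For ten equally represented donors about 90% of doublets involve two donors, so the remaining same-donor doublets would go undetected by genotype-based methods regardless of the tool used.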

@jamesnemesh

@drneavin This might be the best place to put this:

We've released both a much fuller set of documentation and an R library that generates a number of useful QC plots and evaluates the donor assignment and doublet detection outputs.

For people having issues, it may be very helpful to look at the docs to see if the programs are running correctly, and run the QC plots to have a common starting point for discussion about issues.
