-
Notifications
You must be signed in to change notification settings - Fork 19
DoubletDecon Vignette with Dexmulet Data
Thank you for you interest in using DoubletDecon to remove doublets in your single-cell RNA-sequencing dataset. In this vignette I am going to show you how to prepare data, select appropriate parameters, and run DoubletDecon using a real data set with biologically verified doublets. While this vignette accurately portrays the steps I would take to run the software and analyze the results, I do not go into detail on all of the parameters and limitations you should consider when running this on your data. Before starting, please read the original paper DePasquale et al., 2019 and the protocol DePasquale et al., 2020, both of which include this information. Through this vignette I will show you the following things:
- Pre-Processing Steps
- Running DoubletDecon (including parameter selection)
- Analyzing the Results
We will be using real data for this vignette, as it best expemlifies the messiness and variablility inherient to single-cell data. The specific data I chose is the PBMC data from the Demuxlet paper. We are choosing this because 1) it is one of the primary datasets we use in the original Cell Reports paper to demostrate the accuracy of DoubletDecon, 2) it is similar in size and quality to the datasets we recieve from our collaborators — all of which undergo doublet removal, and 3) it has a large enough number of biologically related clusters to properly demonstrate the selection process for the cluster merging parameter.
The protocol for DoubletDecon includes a section titled 'Before You Begin' that describes the 5 steps and choices you should consider when preparing your data for input into DoubletDecon. These steps include:
- Data Collection
- Read Alignment
- Filtering
- Population Detection
- Dataset Integration
As described above, these data were not generated in out laboratory, but instead is a publicly available dataset that has become one of the "ground truth" datasets for testing doublet detection algorithms. You can find a description of how these data were collected HERE. Briefly, "A dataset containing 14,619 cells single human PBMCs isolated using a 10x Chromium instrument (10x Genomics) and sequenced using Illumina HiSeq2500 Rapid Mode with approximately 25,000 reads per cell was acquired from GEO (GSE96583) (Kang et al., 2018). The authors report an estimated doublet rate of 10.9% based on their demultiplexing procedure and note that it is consistent with the expected rate (10x technology and number of loaded cells). Following the observation that this reported rate likely estimates only the heterotypic rate (McGinnis et al. 2019 and Wolock et al. 2019), we also employ an 8/7 multiplicative adjustment to arrive at an estimated overall doublet rate of 12.5%. Additionally, cell classifications of doublets and singlets were provided by the original authors, derived using Demuxlet (GSE96583). The classification for the barcodes in the dataset is as follows: 13,030 singlets, 1,565 doublets (10.7%) and 24 ambiguous." (DePasquale et al., 2019)
We used the read counts provided by the authors for this dataset, which is best described in Kang et al., 2018 mentioned above. For reference, our protocol states that the users should align their reads according to the manufacturer’s recommendations for all single-cell sequencing technologies.
In our protocol we recommend filtering genes based on the manufacturer’s recommendations. In general, we recommend at least 500 genes expressed per cell in most cases, though DoubletDecon has been shown to perform well with as low as 150 genes per cell barcode. In the case of this particular dataset we chose 150 genes per cell barcode, because we found that using a cutoff as low as 300 genes dramatically reduced the number of cells that met this filtering criterion. When in doubt, follow the manufacturer's recommendations but make sure to perform a sanity check to make sure you arent tossing out all of your cells! Even with this lowered threshold we ended up with 6,525 cells (down from ~14,000), resulting in 1,426 biologically validated doublets and a 21.9% doublet rate.
We used both Seurat 2 (latest version at the time of publication) and ICGS for population detection, but opted to follow up with the results from ICGS for the tutorial simply for ease of use. All graphs and files generated with ICGS, such as heatmaps and UMAPS, are also standard output of the Seurat pipeline.
Because DoubletDecon is dependent on your knowledge of the cells present in the dataset, cell population should also be performed using marker genes for each cell cluster. An example of this is shown below using the ICGS+AltAnalzye software with gene set analysis with the 'Human Immune Cells' gene sets.
Shown is the final heatmap with gene set labels on the left side:
It may also be helpful, particularly in cases where population identification is difficult, to generate a UMAP or other dimensionality plot to help understand the relationships and similarities between the clusters in the dataset.
As you can see in the heatmap, some cell populations have very similar expression patterns, to the point of being nearly indistinguishable. This is also visible on the UMAP where the first 5 or so clusters appear to overlap in the upper left corner. For best performance, we recommend adjusting the resolution of clustering in the population detection software you will use prior to running DoubletDecon, so that all clusters are transcriptionally distinct. For the purposes of this tutorial we purposefully left the clustering in this non-ideal state so we can show you how to correct this later on with the cluster merging parameter in DoubletDecon.
Dataset integration was not performed on this dataset. However, if you wish to combine multiple datasets, please run doublet removal software prior to integration. For more information, please check out the paper for our integration method, cellHarmony, which also includes information on alternative approaches.
The protocol for DoubletDecon next outlines the 5 required steps for running DoubletDecon, which include:
- Installing DoubletDecon
- Formatting input files for use with DoubletDecon
- Determining the cluster merging parameter (ρ')
- Selecting additional parameters
- Running DoubletDecon
It also includes 2 optional steps users can perform to assess the confidence of their doublet predictions through multiple runs or in combination with other doublet removal methods. These are not included in the vignette at this time.
DoubletDecon can be installed within R or RStudio using the following commands:
install.packages("devtools") # If not already installed
devtools::install_github('EDePasquale/DoubletDecon')
To access the user interface version of the software, first download the DoubletDecon GitHub repository. After unzipping the directory, you will find the 'DoubletDecon.app' that you can click on to start the application. This will first open R, and soon after will load the user interface in your default browser. While the user interface version of DoubletDecon is easy to use, it is also currently available only for Mac users. Therefore, I will be using the command line version of the software for this vignette to ensure that it is most helpful to as many users as possible.
Note: If you are using the user interface version of DoubletDecon, you may notice that Apple gives you a security warning concerning opening an application from an unknown developer. We are aware of this issue, which is only affecting the latest OS versions, and are undergoing the process of application signing (done!) and notarization (in progress). We thank you for your patience and understanding!
DoubletDecon was designed to work natively with ICGS-derived population detection, so no processing is required to put the files into the correct format for use with DoubletDecon. If you are using ICGS you will find the following folder structure in your project directory:
The files you need are noted with the arrows and associated parameter names.
Note: If you are instead using Seurat as your population detection software, you will first need to convert files into DoubletDecon format. To do this, we have created the Improved_Seurat_Pre_Process() function, that greatly simplifies the process. As this process is not relevant to this particular vignette, we will direct you to the protocol, however we intend to create a second vignette for Seurat users in the future.
Your input files should have the following layout and headers when viewing in Excel:
Expression (rawDataFile):
Groups (groupsFile):
Full Expression (fullDataFile):
This is the most critical parameter for DoubletDecon and the subject of the most user questions. This cluster merging parameter ρ', also called rho-prime, is designed to correct overly granular clustering that I described in the Pre-Processing section. The end goal of this parameter is to merge clusters with highly similar gene expression patterns so simulated doublets are not made between two near-identical cells (which would result in an abundance of false positive doublet predictions). It is important to realize that when two clusters are merged, it will no longer be possible to detect doublets between them, so merge carefully.
In this dataset I first went back to the heatmap I generated during preprocessing. I noticed that clusters 1-7 shared a common gene expression pattern, in which each cluster shows some minor differences while still sharing the same overall larger expression signature. If we leave these cells as 7 separate clusters, we will create simulated doublets between clusters 1 and 2, for example, that will look too similar to real cells in cluster 3 and lead to false positives. I also look at the gene enrichment results and see that all of these cells are monocytes or monocyte progenitors, further leading me to merge these clusters:
Real World: Choices like this depend on the research questions being asked of the data. When I use hematopoietic data, I am often looking for signs of multi-potential priming, where I expect to see cells that express genes related to multiple different lineages. Even if doublets exist between two monocyte populations, I am okay leaving them in as they do not affect the research question I am asking. In all data analysis we can ease the difficulty of making decisions about parameters by keeping our end goal in mind!
I also noticed that clusters 9 and 10 are both T-cells with similar expression patterns, though the expression is "stronger" in cluster 10. Again, without merging I would likely lose most of the real cells in these clusters:
Finally, I look at the UMAP for these clusters and see cells from clusters 1-7 on top of each other in a blob, further confirmation that they should be merged, as synthetic (and real) doublets created between these clusters will also reside in the blob:
Now that I have decided what clusters should be merged I tweak the cluster merging parameter until the cluster merging heatmap best shows the similarities between these clusters. In this dataset that means a value of 1.2, though I also tested the surrounding values to see if there was a better fit:
With the cluster merging parameter out of the way, we can now talk about the rest of the parameters. For the most part I use the default parameters, as I find those give the best results. Many of these used to be "baked in" to the software and unchangeable, but since some people may want to tweak them for their data they were introduced as parameters.
I used the following, which is copied directly out of the log file for the associated DoubletDecon run:
filename: DX_100_1.2_4_3.results
location: /data/salomonis2/LabFiles/Erica-data/ICGS2_DX_and_CH/6719_DX_150_PC/
removeCC: FALSE
species: hsa
rhop: 1.2
write: FALSE
PMF: TRUE
useFull: FALSE
heatmap: FALSE
centroids: TRUE
num_doubs: 100
only50: FALSE
min_uniq: 4
This run used all of the default parameters, with the exception of write=FALSE (saving space!), though other parameter combinations were tested for the paper to assess accuracy of the software when the parameters where shifted. For most cases, the defaults will suffice. More information of the parameters and what they mean can be found in the documentation and the protocol.
Once you've gotten to this point, running DoubletDecon is simple! You can use the code below for the final results of this dataset:
location="/Users/DePasquale/Documents/DoubletDecon/Demuxlet/"
results=Main_Doublet_Decon(rawDataFile = paste0(location, "Demuxlet_example_expression"), groupsFile = paste0(location, "Demuxlet_example_groups"), filename="Demuxlet_example", location=location, fullDataFile=NULL, removeCC=FALSE, species="hsa", rhop=1.2, write=FALSE, PMF=TRUE, useFull=FALSE, heatmap=FALSE, centroids=TRUE, num_doubs=100, only50=FALSE, min_uniq=4, nCores=-1)
Note: the 'nCores' parameter was added later and was not included in the above log file.
Now that you have the results of DoubletDecon you likely want to know what to make of them. The most important file for you will be the "Final_doublets_groups" file, which will tell you which cells were removed as doublets. Look for an output with this prefix if write=TRUE, or access it directly within the 'results' R object using the following code:
results[["Final_doublets_groups"]]
This file, when viewed in Excel, will have a column of cell names first followed by two columns of grouping information — both numerical and expanded descriptions of the simulated doublet cluster each cell was most similar to:
Three additional files — "Final_nondoublets_groups", "Final_doublets_exp", and "Final_nondoublets_exp" — will also be helpful when trying to use a "cleaned" (a.k.a. doublet free) expression file for use in other bioinformatics pipelines. "Final_nondoublets_groups" is the same as "Final_doublets_groups", but only contains cells that were NOT predicted as doublets. The "Final_doublets_exp" and "Final_nondoublets_exp" are ICGS formatted expression files for the doublets and singlets, respectively:
To use this expression file with non-ICGS tools, simply delete the 'column_clusters-flat' row and the 'row_clusters-flat' column. At this time DoubletDecon only gives "cleaned" versions of the processed expression file (rawDataFile parameter) and not the full expression file (fullDataFile parameter), however, the "Final_nondoublets_groups" file will allow you to accomplish this using an alternative method.
Now that you have the results, it is important to perform another sanity check. In this case, we know that we are expecting to see 21.9% doublets because this dataset has biologically-validated doublet calls. Our results are in line with that prediction, so it is easy to see that DoubletDecon did not overcall doublets in this case. But most datasets will not have this information, and if it did you likely wouldn't be using doublet prediction software! Be aware of how the sequencing was performed, particularly if a lane is "overloaded", which may lead to a high percentage of doublets. The best resource for understanding how to determine the approximate percentage of doublets in your dataset is in the DoubletFinder paper, which is another doublet detection method. Their tool takes as a parameter the expected doublet percentage and they therefore dedicate time to showing the user how to figure this out themselves.
In the end we want to make sure that DoubletDecon isn't calling 50% of your dataset as doublets when only 10% is expected. If this happens, using the only50=TRUE parameter leads to more conservative doublet estimates. Additionally, the optional steps described in the DoubletDecon protocol will help reduce your results to only high confidence doublets through either multiple runs of DoubletDecon or in combination with other doublet detection tools.
On a final note, because I am often asked the question, the UMAP figures seen in the original paper and in the protocol (Figure 2) showing predicted doublets and non-doublets denoted by color can be generated using the user interface version of DoubletDecon, the 'umap' package in R, or the umap function in Seurat with the group labels changed to "0" and "1", for singlet and doublet, respectively.
Thank you for using this vignette! If you have any questions, please do not hesitate to contact me through GitHub or at me emails: [email protected] or [email protected].