Update to use test.genes= in new trainSingleR() call.

SingleR-inc · Sep 7, 2024 · 3b17536
1 parent 2b50a8d
commit 3b17536

Showing 2 changed files with 6 additions and 14 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: SingleRBook
 Title: The Book of SingleR
-Version: 1.15.0
-Date: 2023-11-29
+Version: 1.15.1
+Date: 2024-09-06
 Authors@R: person('Aaron', 'Lun', role = c('aut', 'cre'), email="[email protected]")
 Description: 
     Comprehensive guide to using the SingleR Bioconductor package

diff --git a/inst/book/advanced.Rmd b/inst/book/advanced.Rmd
@@ -37,20 +37,18 @@ sce <- TENxPBMCData("pbmc3k")
 counts(sce) <- as(counts(sce), "dgCMatrix")
 ```
 
-We use the `trainSingleR()` function to do all the necessary calculations 
-that are independent of the test dataset.
-(Almost; see comments below about `common`.)
+We use the `trainSingleR()` function to do all the necessary calculations that are independent of the test dataset.
 This yields a list of various components that contains all identified marker genes
 and precomputed rank indices to be used in the score calculation.
 We can also turn on aggregation with `aggr.ref=TRUE` (Section \@ref(pseudo-bulk-aggregation))
 to further reduce computational work.
+Note that we need the identities of the genes in the test dataset (hence, `test.genes=`) to ensure that our chosen markers will actually be present in the test.
 
 ```{r}
-common <- intersect(rownames(sce), rownames(dice))
-
 library(SingleR)
 set.seed(2000)
-trained <- trainSingleR(dice[common,], labels=dice$label.fine, aggr.ref=TRUE)
+trained <- trainSingleR(dice, labels=dice$label.fine, 
+    test.genes=rownames(sce), aggr.ref=TRUE)
 ```
 
 We then use the `trained` object to annotate our dataset of interest through the `classifySingleR()` function.
@@ -73,12 +71,6 @@ identical(pred$labels, direct$labels)
 stopifnot(identical(pred$labels, direct$labels))
 ```
 
-The big caveat is that the universe of genes in the test dataset must be a superset of that the reference.
-This is the reason behind the intersection to `common` genes and the subsequent subsetting of `dice`.
-Practical use of preconstructed indices is best combined with some prior information about the gene-level annotation;
-for example, we might know that we always use a particular version of the Ensembl gene models,
-so we would filter out any genes in the reference dataset that are not in our test datasets.
-
 ## Parallelization
 
 Parallelization is an obvious approach to increasing annotation throughput.
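
Note on the `test.genes=` change in the second hunk: the toy sketch below illustrates why marker selection needs to know which genes the test dataset contains. It is only an illustration in base R under stated assumptions, not SingleR's actual implementation. The matrix `ref`, the gene names, and the helper `pick_markers()` are all hypothetical.

```r
# Toy reference: 6 genes x 4 samples with two labels (hypothetical data).
ref <- matrix(
    c(9, 8, 1, 1,   # geneA: high in label "x"
      1, 1, 9, 8,   # geneB: high in label "y"
      5, 5, 5, 5,   # geneC: uninformative
      7, 8, 2, 1,   # geneD: high in label "x"
      2, 1, 7, 9,   # geneE: high in label "y"
      4, 4, 4, 4),  # geneF: uninformative
    nrow = 6, byrow = TRUE,
    dimnames = list(paste0("gene", LETTERS[1:6]), NULL)
)
labels <- c("x", "x", "y", "y")

# Genes in the (hypothetical) test dataset; geneA and geneB are absent.
test.genes <- c("geneC", "geneD", "geneE", "geneF", "geneG")

# Hypothetical helper: pick one marker per label by difference in means.
# This stands in for real marker detection and is NOT what SingleR does.
pick_markers <- function(mat, labels) {
    mean.x <- rowMeans(mat[, labels == "x", drop = FALSE])
    mean.y <- rowMeans(mat[, labels == "y", drop = FALSE])
    delta <- mean.x - mean.y
    c(x = names(which.max(delta)), y = names(which.min(delta)))
}

# Markers chosen without knowledge of the test genes.
markers.all <- pick_markers(ref, labels)

# Markers chosen after restricting the reference to genes shared with the test.
usable <- intersect(rownames(ref), test.genes)
markers.restricted <- pick_markers(ref[usable, , drop = FALSE], labels)

# Check which marker sets are fully available in the test dataset.
all(markers.all %in% test.genes)
all(markers.restricted %in% test.genes)
```

In this toy setup, the unrestricted markers include genes that the test dataset lacks, while the restricted markers are all present. This is the same guarantee that the added sentence in `advanced.Rmd` attributes to `test.genes=`.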