-
Notifications
You must be signed in to change notification settings - Fork 11
/
Copy pathLRASms_HMP.Rmd
449 lines (382 loc) · 20.7 KB
/
LRASms_HMP.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
---
title: "DADA2 + PacBio: HMP Mock Community"
author: "Benjamin J Callahan"
output: html_document
---
## Data
The mock community being sequenced is the BEI Resources Microbial Mock Community B (Staggered, Low Concentration) community: https://www.beiresources.org/Catalog/otherProducts/HM-783D.aspx. This mock community (also known as the HMP mock community) contains the following bacterial in the following table, adapted from Table 1 in the BEI document, with alternative strain names and 16S copy numbers added based on the NCBI entry.
Organism | NCBI Reference Sequence | AKA | 16S copies
---------|------------------------------------
Acinetobacter baumannii, strain 5377 | NC_009085 | ATCC 17978 | NA (NCBI Seq Removed)
Actinomyces odontolyticus, strain 1A.21| NZ_AAYI02000000 | ATCC 17982 | 2
Bacillus cereus, strain NRS 248 | NC_003909 | ATCC 10987 | 12
Bacteroides vulgatus, strain ATCC® 8482 | NC_009614 | ATCC 8482 | 7
Clostridium beijerinckii, strain NCIMB 8052 | NC_009617 | NCIMB 8052 | 14
Deinococcus radiodurans, strain R1 (smooth) | NC_001263, NC_001264 | Unknown | NA+0
Enterococcus faecalis, strain OG1RF | NC_017316 | Unknown | 4
Escherichia coli, strain K12, substrain MG1655 | NC_000913 | Unknown | NA
Helicobacter pylori, strain 26695 | NC_000915 | Unknown | NA
Lactobacillus gasseri, strain 63 AM | NC_008530 | ATCC 33323 | 6
Listeria monocytogenes, strain EGDe | NC_003210 | Unknown | NA
Neisseria meningitidis, strain MC58 | NC_003112 | Unknown | NA
Propionibacterium acnes, strain KPA171202 | NC_006085 | Unknown | 3
Pseudomonas aeruginosa, strain PAO1-LAC | NC_002516 | Unknown | NA
Rhodobacter sphaeroides, strain ATH 2.4.1 | NC_007493, NC_007494 | ATCC BAA-808 | NA+2
Staphylococcus aureus, strain TCH1516 | NC_010079 | USA300_TCH151 | 5
Staphylococcus epidermidis, FDA strain PCI 1200 | NC_004461 | ATCC 12228 | NA
Streptococcus agalactiae, strain 2603 V/R | NC_004116 | Unknown | NA
Streptococcus mutans, strain UA159 | NC_004350 | Unknown | NA
Streptococcus pneumoniae, strain TIGR4 | NC_003028 | ATCC BAA-334 | 4
*16S copy number is shown for genomes by the NCBI using method "Best-placed reference protein set; GeneMarkS+"*
> Acknowledgment for publications should read “The following reagent was obtained through BEI Resources, NIAID, NIH as part of the Human Microbiome Project: Genomic DNA from Microbial Mock Community B (Staggered, Low Concentration), v5.2L, for 16S rRNA Gene Sequencing, HM-783D.”
This sequencing data was generated by PacBio using their in-house 16S amplification and sequencing protocol: https://www.pacb.com/wp-content/uploads/Unsupported-Full-Length-16S-Amplification-SMRTbell-LibraryPreparation-and-Sequencing.pdf *(everying is the same as the processing for the Zymo mock community)*
The sequencing was performed on a Sequel, with the S/P1-C1.2 sequencing chemistry. Standard PacBio bioinformatics were performed, and consensus sequences were extracting using CCS 3.1.1 with default parameters, `minPasses=3` and `minPredictedAccuracy=0.999`:
`ccs --pbi --force --logLevel=DEBUG --numThreads=16 --minSnr=3.75 --minReadScore=0.65 --maxLength=7000 --minLength=10 --minPasses=3 --minZScore=-5 --maxDropFraction=0.34 --minPredictedAccuracy=0.999 subreads.bam ccs.bam`
## Setup
Load libraries, setup paths, prepare environment:
```{r init, warning=FALSE, message=FALSE}
library(dada2);packageVersion("dada2")
library(Biostrings); packageVersion("Biostrings")
library(ggplot2); packageVersion("ggplot2")
library(reshape2); packageVersion("reshape2")
library(gridExtra); packageVersion("gridExtra")
path <- "~/Desktop/LRAS/Data/BEI" # CHANGE ME to location of the fastq file
path.out <- "Figures/"
path.rds <- "RDS/"
fn <- file.path(path, "SO_BEI-stagLo-16S_CCS-99.9.fq.gz")
F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
rc <- dada2:::rc
theme_set(theme_bw())
```
## Remove Primers and Filter
Remove primers and orient reads:
```{r primers}
nop <- file.path(path, "noprimers", basename(fn))
prim <- removePrimers(fn, nop, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, verbose=TRUE)
```
Very little loss there even though primer indels aren't allowed currently.
Inspect length distribution.
```{r}
hist(nchar(getSequences(nop)), 100)
```
Very strong peak ~1450. Perhaps some secondary peake nearby, should carry through further sequencing.
Filter:
```{r}
filt <- file.path(path, "noprimers", "filtered", basename(fn))
track <- filterAndTrim(nop, filt, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
```
## Run DADA2
Dereplicate:
```{r}
drp <- derepFastq(filt, verbose=TRUE)
```
Learn errors:
```{r}
err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun) # seconds
```
Looks good.
Inspect errors:
```{r}
plotErrors(err)
```
Denoise:
```{r}
dd <- dada(drp, err=err, BAND_SIZE=32, multithread=TRUE)
dd
```
Read tracking:
```{r}
cbind(ccs=prim[,1], primers=prim[,2], filtered=track[,2], denoised=sum(dd$denoised))
```
Inspect clustering data.frame
```{r}
sq <- dd$sequence
dd$clustering[,-1]
```
Big jump from pval 10-126 to below the threshold at `OMEGA_A=10^40`.
Assign taxonomy:
```{r}
tax <- assignTaxonomy(dd, "~/tax/silva_nr_v128_train_set.fa.gz", multithread=TRUE)
tax[,"Genus"] <- gsub("Escherichia/Shigella", "Escherichia", tax[,"Genus"]) # Reformat to be compatible with other data sources
tax[,"Genus"] <- gsub("Clostridium_sensu_stricto_1", "Clostridium", tax[,"Genus"]) # Reformat to be compatible with other data sources
unname(tax)
```
Plot abundances:
```{r}
plot(dd$denoised)
plot(dd$denoised, log="y")
```
Broad range of abundances over 3 orders of magnitude (10-10000).
Check chiemras:
```{r}
bim <- isBimeraDenovo(dd, minFoldParentOverAbundance=3.5, multithread=TRUE)
# Higher MFPOA to avoid flagging intra-genomic variants
table(bim)
```
No chimeras!
```{r}
table(tax[,"Genus"])
```
Save processed objects for future analysis.
```{r}
saveRDS(dd, file.path(path.rds, "HMP_dd.rds"))
saveRDS(tax, file.path(path.rds, "HMP_tax_Silva128.rds"))
```
Reload processed data objects (can run code below from here in absence of input sequence data):
```{r readRDS}
dd <- readRDS(file.path(path.rds, "HMP_dd.rds"))
tax <- readRDS(file.path(path.rds, "HMP_tax_Silva128.rds"))
```
## Reference check againt nt
BLAST these 51 sequences against nt (https://blast.ncbi.nlm.nih.gov/Blast.cgi):
```{r}
## dada2:::pfasta(dd, id=tax[,6])
uniquesToFasta(dd, file.path(path.rds, "LRAS_HMP_ASVs.fa"), ids=paste0(tax[,"Genus"], ";size=", dd$denoised, ";"))
is.ec <- tax[,"Genus"] %in% "Escherichia"
uniquesToFasta(dd$clustering[is.ec,], file.path(path.rds, "HMP_Ecoli.fa"), ids=paste0("Ecoli", seq(sum(is.ec)), "_HMP"))
## BLAST against nt excluding uncultured/environmental sequences (checkbox)
```
Results of BLAST search on July 19, 2018 are recorded.
1. Exact match to Strep. mutans strains: NG8, UA159-FR, UA159, NN2025 DNA, ChDC YM216, ChDC YM97, ChDC YM64, ChDC YM220
2. Exact match to Staph. epidermis ATCC 12228
3. Exact match to Strep. mutans strains: UA159, ChDC YM89
4. Exact match to many (~40) Staph. epidermis strains.
5. Exact match to many (~20) Rhodobacter sphaeroides strains.
6. Exact match to Staph. epidermis strains: FDAARGOS, RP62A
7. *Two mismatches (consecutive) to Staph. epidermidis strains: FDAARGOS_161, 1457, PM221* Query: 1235/6 AA, Refs: GG
8. Exact match to Staph. epidermis strains: RP62A, ATCC 12228
9. Exact match to many (>100) E. coli strains.
10. Exact match to many (~50) Strep. agalactiae strains.
11. Exact match to many (>100) E. coli strains.
12. Exact match to many (~90) E. coli strains.
13. Exact match to many (>100) E. coli strains.
14. Exact match to many (~80) E. coli strains.
15. Exact match to many (~50) E. coli strains.
16. Exact match to Bacillus cereus strains: M3, LWYT0010, LWYT0301, FRI-35, PPB, ATCC 10987
17. Exact match to many (>100) Staph. aureus strains.
18. Exact match to many (99) Staph. aureus strains.
19. Exact match to Helicobacter pylori strains: dRdM2addM2, 26695-dR, 26695-dRdM1dM2, 26695-dRdM2, dRdM1
20. Exact match to many (~15) Staph. aureus strains.
21. Exact match to Lactobacillus gasseri strains: ATCC 33323, Floraactive 18911-2, ATCC 33323, ATCC 33323, ATCC 33323
22. Exact match to many (~70) Staph. aureus strains.
23. Exact match to many (~60) Staph. aureus strains.
24. Exact match to Neisseria meningitidis strains: FDAARGOS_210, LNP21362, H44/76, MC58
25. Exact match to Clostridium beijerinckii strains: DSM 6423, ATCC 35702, N7, NCIMB 8052
26. Exact match to many (>100) Pseudomonas aeruginosa strains.
27. Exact match to Clostridium beijerinckii strains: BAS/B3/I/124, MF28, NCIMB 14988, ATCC 35702, NCIMB 8052
28. Exact match to many (~95) Acinetobacter baumannii strains.
29. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
30. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
31. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
32. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
33. Exact match to Clostridium sp. strain: MF28 (unnamed, but 1-off from C. beijerinckii, 2-ff from 35702/8052)
34. *One mismatch to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052, JCM 8023*: Query 955T, Refs C
35. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
36. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
37. *One mismatch to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052, NCIMB 14988*: Query 231C, Ref T OR Query 1209T, Refs C
38. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
39. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052, NRRL B-598
40. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
41. Exact match to many (>100) Listeria monocytogenes strains.
42. Exact match to many (~30) Listeria monocytogenes strains.
43. Exact match to many (~15) Propionibacterium/Cutibacterium acnes strains. (P.acnes -> C.acnes name change)
44. Exact match to many (~50) Enterococcus faecalis strains.
45. Exact match to Bacteroides vulgatus ATCC 8482
46. Exact match to Bacillus cereus ATCC 10987
47. Exact match to Bacillus cereus ATCC 10987 (one mismatch in web BLAST due to not handling ambiguous nt code S in reference)
48. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052, JCM 8023
49. Exact match to many (~70) Listeria monocytogenes strains.
50. Exact match to Pseudomonas aeruginosa strains: ATCC 15692, PAO1OR, PAO1H2O, PAO1-VE13, PAO1-VE2, PAO581, c7447m
51. Exact match to many (~30) Listeria monocytogenes strains.
Store these results in R format:
```{r}
refhit.nt <- ifelse(seq_along(sq) %in% c(7, 34, 37), FALSE, TRUE)
which(!refhit.nt)
```
48/51 of these are exact matches to sequenced and cultured isolates, over their entire ~1.5Kb length!
*Actinomyces odontolyticus*, *Deinococcus radiodurans* and *Streptococcus pneumoniae* were expected members of this mock community that were not detected in the data.
Let's deconvolute a couple strains by assigning species based on BLAST results to genera with multiple species (*Staphylococcus* and *Streptococcus*):
```{r}
tax[c(1,3),"Genus"] <- "Streptococcus mutans"
tax[10,"Genus"] <- "Streptococcus agalactiae"
tax[c(2,4,6,7,8),"Genus"] <- "Staphylococcus epidermis"
tax[c(17,18,20,22,23),"Genus"] <- "Staphylococcus aureus"
```
## Reference check against 16S references
BEI does not provide reference sequences for these strains. However, Robert Edgar has generated a set of references sequences as best he can based on publicy deposited genomes, and we will use that here:
```{r}
fn16S <- file.path("Docs/BEI_Reference_Edgar.fa")
refs <- getSequences(fn16S)
unique(sapply(strsplit(names(refs), "\\s"), function(x) paste(x[2], x[3])))
```
Twenty-two species, whereas only twenty expected in this HMP mock. The extra taxa are *Methanobrevibacter smithii* and *Porphyromonas gingivalis*, otherwise everything matches expectations.
Name the reference sequences something useful and count the number of unique (I think) variants in each:
```{r}
ref.16S <- getSequences(fn16S)
genus.spc <- sapply(strsplit(names(ref.16S), "\\s"), function(x) paste(x[2], x[3], sep="_"))
variant <- sapply(seq_along(genus.spc), function(i) sum(genus.spc[1:i] == genus.spc[[i]]))
names(ref.16S) <- paste(genus.spc, variant, sep="_")
nunq.ref <- table(genus.spc)
nunq.ref
table(tax[,"Genus"])
```
Now compare to DADA2 inferred sequences to the reference sequences, allowing no mismatches (but with `fixed=FALSE` so as to match ambiguous nucleotides in the references):
```{r}
sq <- dd$sequence
refmat.16S <- sapply(DNAStringSet(sq), vcountPattern, subject=DNAStringSet(ref.16S), fixed=FALSE)
rownames(refmat.16S) <- names(ref.16S)
colnames(refmat.16S) <- paste0("Seq", seq_along(sq))
```
Inspect:
```{r}
refhit.16S <- colSums(refmat.16S)>0
which(!refhit.16S)
table(tax[,"Genus"], Reference.Match=refhit.16S)
```
So 43/51 exactly match against this 16S reference based on the described strains. The 8 that don't hit include 3/5 *S. epidermis* sequence variants, 1/5 *S. aureus* variants, 1/1 *H. pylori* variant, and 3/15 *C. bejerenckii* variants. They overlap with 1/5 *S. epidermis* variants, and 2/15 *C. bejerenckii* variants that also were not matched against nt.
## Intra-genomic Stoichiometry
Now we'll investigate the "stoichiometry" of the sequence variants from each bacterial strain, i.e. the copy number of each 16S sequence variants relative to the genomic copy number, all inferred from the dada2 processed data and the rRNA gene copy numbers given for each strain in the Zymo reference.
These 16S copy numbers are based on (1) the NCBI reference sequence annotation if present, (2) an estimate from https://rrndb.umms.med.umich.edu/search/, and (3) inspection of the number of 16S rRNA genes in the best BLAST hits of each strain's ASVs. If copy number was ambiguous, the best BLAST hits were chosen to break the tie, as likely to most closely capture the actual strain being interrogated here.
```{r}
ncopy <- c("Acinetobacter" = 6,
"Actinomyces"=2,
"Bacillus" = 12,
"Bacteroides" = 7,
"Clostridium" = 14,
"Deinococcus" = 3,
"Enterococcus" = 4,
"Escherichia" = 7,
"Helicobacter" = 2,
"Lactobacillus" = 6,
"Listeria" = 6,
"Neisseria" = 4,
"Propionibacterium" = 3,
"Pseudomonas" = 4,
"Rhodobacter" = 4,
"Staphylococcus aureus" = 6,
"Staphylococcus epidermis" = 6,
"Streptococcus agalactiae" = 7,
"Streptococcus mutans" = 6,
"Streptococcus pneumoniae" = 4)
```
First calculate the genomic abundances, by summing over all sequence variants for each genus and dividing by the 16S copy number:
```{r}
gens <- sort(unique(tax[,"Genus"])) # Just the ones that were detected
abund.ome <- sapply(gens, function(gen) {
is.gen <- grepl(gen, tax[,"Genus"])
sum(dd$denoised[is.gen])/ncopy[gen] ###!
### sum(dd$clustering$n0[is.gen])/ncopy[gen] ###! Alternative w/o error-correction
})
names(abund.ome) <- gens
dfgen <- data.frame(Genus=gens, Abundance=abund.ome, stringsAsFactors = FALSE)
p.abundome <- ggplot(data=dfgen, aes(x=Genus, y=Abundance)) +
theme(axis.text.x=element_blank(), axis.ticks.x = element_blank()) +
ylab("Genome Abundance") + xlab("Strains")
#p.abundome + geom_col(width=0.4, aes(fill=Genus)) # scale_fill_manual(values=genusPalette) +
p.abundome+ geom_point(aes(color=Genus), shape="diamond", size=3) +
scale_y_log10(limits=c(0.3, NA), breaks=c(1,10,100,1000))
#ggsave(file.path(path.out, "HMP_GenomeAbundances.pdf"), p.abundome.nice,
# width=6, height=4, units="in", useDingbats=FALSE)
```
Now we make a similar plot for the abundance of each sequence variant:
```{r}
dfasv <- data.frame(Genus=tax[,"Genus"],
Abundance=dd$denoised, ###!
### Abundance=dd$clustering$n0, ###! Alternative w/o error-correction
stringsAsFactors = FALSE)
rownames(dfasv) <- NULL
p.abundasv <- ggplot(data=dfasv, aes(x=Genus, y=Abundance)) +
geom_point(aes(color=Genus), shape="x", size=4) +
ylim(c(0, NA)) +
theme(axis.text.x=element_blank(), axis.ticks.x = element_blank()) +
ylab("ASV Abundance")
p.abundasv
p.abundasv + ylim(0, 500)
summary(dfasv$Abundance)
summary(dfasv$Abundance)/sum(dfasv$Abundance)
```
Yep, broad range of abundances over 3 orders of magnitude at the genome and sequence-variant level. Concentrations of DNA from each strain, as assyaed by Qubit, are in the BEI Certificat of Analysis doc:
```{r}
qubit <- c("Acinetobacter" = 0.010,
"Actinomyces"=0.001,
"Bacillus" = 0.045,
"Bacteroides" = 0.001,
"Clostridium" = 0.044,
"Deinococcus" = 0.001,
"Enterococcus" = 0.001,
"Escherichia" = 0.680,
"Helicobacter" = 0.009,
"Lactobacillus" = 0.003,
"Listeria" = 0.005,
"Neisseria" = 0.006,
"Propionibacterium" = 0.009,
"Pseudomonas" = 0.161,
"Rhodobacter" = 1.413,
"Staphylococcus aureus" = 0.059,
"Staphylococcus epidermis" = 0.513,
"Streptococcus agalactiae" = 0.032,
"Streptococcus mutans" = 0.417,
"Streptococcus pneumoniae" = 0.001)
```
Note that the dropouts (*S. pneumoniae*, *Deinococcus* and *Actinomyces*) all had the lowest Qubit measurements (0.001), which may be a minimum assigned value rather than a true value anyway (e.g. ND). These were too rare to detect.
```{r}
plot(qubit[gens], abund.ome[gens], log="xy")
```
Decent correspondence.
And now we make the "stoichiometry" figure, i.e the abundance of each ASV scaled to the genomic abundance:
```{r}
dfasv$ScaledAbundance <- dfasv$Abundance/abund.ome[dfasv$Genus]
# Number the ASVs in each strain/genus
dfasv$Variant <- sapply(seq(nrow(dfasv)), function(i) sum(dfasv$Genus[1:i] == dfasv$Genus[[i]], na.rm=TRUE))
p.stoich <- ggplot(data=dfasv, aes(x=Variant, y=ScaledAbundance, fill=Genus, width=0.5)) + geom_col() +
facet_wrap(~Genus, nrow=3) +
scale_y_continuous(breaks=seq(0, round(max(dfasv$ScaledAbundance))), minor_breaks=NULL) +
theme(panel.grid.major.y=element_line(color="black", size=0.25)) +
theme(panel.grid.major.x=element_blank(), panel.grid.minor.x=element_blank()) +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
xlab("Full-length 16S Sequence Variants") +
ylab("Abundance (per-genome)")
p.stoich
ggsave(file.path(path.out, "HMP_Stoich.pdf"), width=12, height=7, units="in")
```
Now let's put the genome abundances together with the integers in a simpler plot:
```{r}
p.abundome.nice <- p.abundome + geom_point(aes(color=Genus), shape="diamond", size=2.8) +
scale_y_log10(limits=c(0.3, NA), breaks=c(1,10,100,1000)) +
theme(text = element_text(size=9)) +
theme(legend.text=element_text(size=7.5), legend.key.height = unit(3.8, "mm"), legend.title=element_blank())
p.stoich.nice <- ggplot(data=dfasv, aes(x=Genus, y=ScaledAbundance, color=Genus, width=0.5)) +
geom_point(shape="x", size=3.5) +
scale_y_continuous(breaks=seq(0, round(max(dfasv$ScaledAbundance))), minor_breaks=NULL) +
theme(text = element_text(size=9)) +
theme(panel.grid.major.y=element_line(color="grey60", size=0.2)) +
theme(panel.grid.major.x=element_blank(), panel.grid.minor.x=element_blank()) +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
xlab("Full-length 16S Sequence Variants") +
ylab("ASV Abundance (per-genome)")
p.stoich.nice
p.nice <- arrangeGrob(p.abundome.nice, p.stoich.nice + guides(color=FALSE), nrow=1, widths=c(0.6, 0.4))
plot(p.nice)
ggsave(file.path(path.out, "HMP_AbundanceAndStoich.pdf"), p.nice,
width=178, height=69, units="mm", useDingbats=FALSE)
```
*B. cereus* a little off, and *Listeria* a little off. Others all look very good (although many are just one variant, so trivial integral-ness).
Some is due to low numbers as well, e.g *Listeria* seems largely down to counting noise:
```{r}
is.gen <- tax[,"Genus"] == "Listeria"
foo <- data.frame(denoised=dd$denoised[is.gen], n0=dd$clustering$n0[is.gen], exact=drp$uniques[dd$sequence[is.gen]])
rownames(foo) <- NULL
foo
```
Diff between 1/2 increases a lot as reads are corrected. Might just be noise though, not huge numbers here.
However *Bacillus* is difficult to explain by counting noise. Could the rrn copy number be wrong? All three variants exactly match previously observed sequences so these ASVs being uncorrected sequencing errors seems unlikely.
For completeness, let's calculate the residual error rate if those low-frequency *B. cereus* ASVs are errors:
```{r residual}
is.bc <- tax[,"Genus"] %in% "Bacillus"
sq.bc <- dd$sequence[is.bc]
# Candidate errors are the 2nd and 3rd B cereus ASVs
tot.err <- sum(nwhamming(sq.bc[[1]], sq.bc[2:3])*dd$denoised[sq.bc[2:3]])
tot.nt <- sum(dd$denoised * nchar(dd$sequence))
tot.err/tot.nt
```