Explanation of choice to model Nextstrain clades #266

Open: wants to merge 7 commits into base: main

11 changes: 11 additions & 0 deletions README.md
@@ -93,6 +93,17 @@ We will not solicit estimates for the US as a whole, in part because evaluating

Each week the hub designates up to nine Nextstrain clades: those with the highest reported prevalence, provided that prevalence is at least 1% across the US in any of the three complete [USA/CDC epidemiological weeks](https://ndc.services.cdc.gov/wp-content/uploads/MMWR_Week_overview.pdf) (a.k.a. MMWR weeks) preceding the Wednesday submission date. Any clades with prevalence of less than 1% are grouped into an “other” category for which predictions of combined prevalence are also collected. No more than 10 clades (including “other”) are selected in a given week. For details on the workflow that generates this list each week, see the [clade list section](#clade-list) below.
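
As a rough illustration of the selection rule described above, here is a minimal Python sketch. It is not the hub's actual workflow: the function name `select_clades`, the input layout (one dict of clade counts per week), and the reading of "in any of the three weeks" as "the maximum weekly prevalence" are all assumptions, and the counts in the usage note are invented.

```python
from collections import defaultdict


def select_clades(weekly_counts, threshold=0.01, max_clades=9):
    """Pick up to `max_clades` clades whose peak weekly prevalence >= threshold.

    `weekly_counts` is a list of dicts, one per complete epidemiological week,
    each mapping a clade name to a national sequence count (hypothetical layout).
    """
    peak = defaultdict(float)
    for counts in weekly_counts:
        total = sum(counts.values())
        if total == 0:
            continue  # a week with no sequences contributes nothing
        for clade, n in counts.items():
            peak[clade] = max(peak[clade], n / total)

    # Keep clades that reach the threshold in at least one week,
    # ordered by peak prevalence (ties broken alphabetically).
    eligible = sorted(
        (c for c, p in peak.items() if p >= threshold),
        key=lambda c: (-peak[c], c),
    )
    # Assumption: "other" is always appended to collect everything else.
    return eligible[:max_clades] + ["other"]
```

With invented counts such as `[{"24A": 700, "24B": 250, "22F": 45, "21L": 5}, {"24A": 600, "24B": 350, "22F": 42, "21L": 8}, {"24A": 500, "24B": 450, "22F": 45, "21L": 5}]`, the clade `21L` never reaches 1% in any week and so would be folded into "other".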

#### Why use Nextstrain clades?

To designate a list of modeling targets weekly, it is ideal to have a system that does not require human intervention.
Further, this system must consistently produce a reasonable number of modeling units to permit the submission of [samples](#probabilistic-forecast-evaluation) of the frequency of every modeling unit in each state for [six weeks](#prediction-horizon).
[Nextstrain clades](https://nextstrain.org/blog/2021-01-06-updated-SARS-CoV-2-clade-naming) satisfy both requirements readily.
They intentionally follow larger-scale trends in SARS-CoV-2 evolution, and a suitably sized selection of relevant clades can be obtained with a simple frequency cutoff.
The somewhat more commonly encountered [Pango lineages](https://www.nature.com/articles/s41564-020-0770-5) describe variation at a much finer scale, and selecting a suitable number of relevant modeling units from them is non-trivial.
Further, as both systems are inherently phylogenetic, there is [sufficient](https://raw.githubusercontent.com/nextstrain/ncov-clades-schema/master/clades.svg) correspondence [between them](https://next.nextstrain.org/nextclade/sars-cov-2) that it is possible to model Nextstrain clades but discuss results in terms of Pango lineages.
Member

Maybe worth adding on parenthetically at the end of this sentence something like ", even if there is not always a perfect one-to-one alignment between the clade and lineage assignments for a single sequence."

Author

I like this in principle, but I'm having some trouble seeing how to be handwavy enough here not to open up a(t least one) rather large can of worms requiring potentially a fair bit more text to explain.

In a perfect world there would be a many-to-one mapping such that every Pango lineage corresponds to exactly one Nextstrain clade (not a one-to-one mapping). But that quickly leads to the cans of worms that are nestedness of naming (which is why the perfect world is many-to-one) and "lineage assignments aren't data" (which is why we don't live in a perfect world), and if we're not careful the pain that is the ARG.

Author

If we perfectly knew the evolutionary relationships between all the sequences (and, following Pango's approach, starting new trees for every recombination event), whether or not we get a clean many : one mapping of Pango lineage : Nextstrain clade comes down to whether every node which defines a more specific label in one system does so in the other.

(For example, if the node that carves 24B out of 24A is also the node that carves JN.1.11.1 from JN.1.11, things line up cleanly, despite the fact that 24A corresponds to JN.1. All the rest of JN.1 names will map to 24A, while names in JN.1.11.1 will map to 24B. Otherwise there will be slop. If the node that carves 24B out of 24A is the parent of (or any other node basal to) the node that carves JN.1.11.1 from JN.1.11, then JN.1.11 corresponds to both 24A and 24B. If it's the child (or other tip-ward descendant node), JN.1.11.1 (rather than JN.1.11) will correspond to both 24B and 24A.)
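
A toy sketch of the parenthetical case above, under the idealized assumption that the tree is known exactly. The tree shape, the node names (`root`, `a`, `b`, `c`, `d`, `t0`..`t4`), and both helper functions are hypothetical and chosen only for illustration; the labels are placed on made-up nodes and do not reflect the real phylogeny.

```python
def nearest_label(node, parent, carve):
    """Walk toward the root and return the first label carved at an ancestor."""
    while node is not None:
        if node in carve:
            return carve[node]
        node = parent.get(node)
    raise ValueError("no labeled ancestor found")


def lineage_to_clades(tips, parent, pango_carve, clade_carve):
    """Map each Pango lineage to the set of Nextstrain clades its tips fall in."""
    mapping = {}
    for tip in tips:
        lineage = nearest_label(tip, parent, pango_carve)
        clade = nearest_label(tip, parent, clade_carve)
        mapping.setdefault(lineage, set()).add(clade)
    return mapping


# Toy tree: root -> a -> b -> c -> d, with one sampled tip hanging off each node.
parent = {"a": "root", "b": "a", "c": "b", "d": "c",
          "t0": "root", "t1": "a", "t2": "b", "t3": "c", "t4": "d"}
tips = ["t0", "t1", "t2", "t3", "t4"]
pango = {"root": "JN.1", "a": "JN.1.11", "c": "JN.1.11.1"}

# 24B carved at the same node as JN.1.11.1: clean many-to-one mapping.
aligned = lineage_to_clades(tips, parent, pango, {"root": "24A", "c": "24B"})
# 24B carved at a node basal to JN.1.11.1: JN.1.11 spans 24A and 24B.
basal = lineage_to_clades(tips, parent, pango, {"root": "24A", "b": "24B"})
# 24B carved at a tip-ward descendant: JN.1.11.1 spans 24A and 24B.
tipward = lineage_to_clades(tips, parent, pango, {"root": "24A", "d": "24B"})
```

A lineage whose set contains more than one clade is exactly the "slop" described above.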

But in reality, there is circularity and lots of conditional inference-dependent inference. We don't actually observe the relationships, we estimate them. And we don't observe labels, we assign (infer, really) them, conditioned on both some estimate of the relationships (a phylogeny) and some inference of the labels (the Pango lineages and Nextstrain clades already assigned to the sequences in the tree). That's a lot of additional room for mismatch.

Author

Perhaps something like one of these? (Which add load-bearing phrases to accompany the heavy lifting done by "sufficient correspondence.")

  1. Further, as both systems are inherently phylogenetic, there is sufficient correspondence between them that it is broadly possible to model Nextstrain clades but discuss results in terms of Pango lineages (even if, at the level of the assignment of individual sequences, the correspondence appears somewhat murkier).
  2. Further, as both systems are inherently phylogenetic, there is sufficient correspondence between them that it is broadly possible to model Nextstrain clades but discuss results in terms of Pango lineages (even if the mappings are not perfectly clean when looking at sequence-level assignments of clades and lineages).

I suggested just adding "typically possible", which I think captures it. The point about single-sequence-level classification is helpful context, but I'd suggest relegating it to a footnote.

Author

I like "typically possible" and I like that a footnote affords a bit more space for clarity in this aside. I have done so.

For example, Nextstrain clade 24A corresponds to Pango lineage JN.1, 22F to XBB, and 21L to BA.2.
For these reasons, we have chosen to target Nextstrain clades as modeling units.
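
To show how clade-level results might be discussed in Pango terms, here is a minimal sketch that uses only the three correspondences named above. The dictionary name `CLADE_TO_PANGO` and the helper `describe` are hypothetical, and the table is deliberately incomplete; a real mapping would need to come from Nextstrain's own clade definitions.

```python
# Only the three correspondences mentioned in the text; not a complete table.
CLADE_TO_PANGO = {"24A": "JN.1", "22F": "XBB", "21L": "BA.2"}


def describe(clade):
    """Return the clade name, annotated with its Pango lineage when one is known."""
    lineage = CLADE_TO_PANGO.get(clade)
    return f"{clade} ({lineage})" if lineage else clade
```

For instance, `describe("24A")` yields `"24A (JN.1)"`, while an unmapped label such as `"other"` is returned unchanged.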

### Prediction horizon

Genomic sequences tend to be reported weeks after they are collected, so recent data is subject to substantial backfill. For this reason, the hub collects "nowcasts" (predictions for data relevant to times prior to the current time, but not yet observed) and some "forecasts" (predictions for future observations). Counting the Wednesday submission date as a prediction horizon of zero, we collect daily-level predictions from 31 days into the past (horizon -31, the Sunday that starts the epidemic week four weeks prior to the Wednesday submission date) through 10 days into the future (horizon 10, the Saturday that ends the epidemic week after the Wednesday submission). Overall, six weeks (42 days) of predicted values are solicited each week.
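
The horizon arithmetic above can be checked with a short Python sketch. The function name `horizon_dates` is an assumption made for illustration, not part of the hub's tooling.

```python
from datetime import date, timedelta


def horizon_dates(submission_date):
    """Return the 42 target dates for a Wednesday submission (horizons -31..10)."""
    if submission_date.weekday() != 2:  # Monday is 0, so Wednesday is 2
        raise ValueError("submission date must be a Wednesday")
    return [submission_date + timedelta(days=h) for h in range(-31, 11)]


# Example with an arbitrary Wednesday.
targets = horizon_dates(date(2024, 10, 9))
```

Here `targets` starts on a Sunday (2024-09-08), ends on a Saturday (2024-10-19), and contains 42 dates, which matches six complete Sunday-to-Saturday epidemic weeks.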