Explanation of choice to model Nextstrain clades #266

Open
wants to merge 7 commits into
base: main

Conversation

afmagee42

This PR attempts to summarize, briefly, the reasons we chose Nextstrain clades as the modeling unit, rather than Pango lineages.

The topic of aggregating lineages can spiral quickly, so I tried to err on the side of brevity there. If we need more, more can be said.

Also happy for this to be moved around (it has several forward-references, so it may belong later).

@afmagee42
Author

cc @nickreich, @dylanhmorris

Member

@nickreich nickreich left a comment

Thanks for putting this together! A few minor suggestions included here.

README.md Outdated
[Nextstrain clades](https://nextstrain.org/blog/2021-01-06-updated-SARS-CoV-2-clade-naming) satisfy both requirements readily.
They intentionally follow larger-scale trends in SARS-CoV-2 evolution, and a suitably-sized selection of relevant clades can be obtained with a simple frequency cutoff.
The somewhat more commonly-encountered [Pango lineages](https://www.nature.com/articles/s41564-020-0770-5) describe variation at a much finer scale, and the selection of a suitable number of relevant modeling units is non-trivial.
Further, as both systems are inherently phylogenetic, there is [sufficient](https://raw.githubusercontent.com/nextstrain/ncov-clades-schema/master/clades.svg) correspondence [between them](https://next.nextstrain.org/nextclade/sars-cov-2) that it is possible to model Nextstrain clades but discuss results in terms of Pango lineages.
Member

Maybe worth adding a parenthetical at the end of this sentence, something like ", even if there is not always a perfect one-to-one alignment between the clade and lineage assignments for a single sequence."

Author

I like this in principle, but I'm having some trouble seeing how to be handwavy enough here not to open up a(t least one) rather large can of worms requiring potentially a fair bit more text to explain.

In a perfect world there would be a many-to-one mapping such that every Pango lineage corresponds to exactly one Nextstrain clade (not a one-to-one mapping). But that quickly leads to the cans of worms that are nestedness of naming (which is why the perfect world is many-to-one) and "lineage assignments aren't data" (which is why we don't live in a perfect world), and if we're not careful the pain that is the ARG.

Author

If we perfectly knew the evolutionary relationships between all the sequences (and, following Pango's approach, starting new trees for every recombination event), whether or not we get a clean many : one mapping of Pango lineage : Nextstrain clade comes down to whether every node which defines a more specific label in one system does so in the other.

(For example, if the node that carves 24B out of 24A is also the node that carves JN.1.11.1 from JN.1.11, things line up cleanly, despite the fact that 24A corresponds to JN.1. All the rest of JN.1 names will map to 24A, while names in JN.1.11.1 will map to 24B. Otherwise there will be slop. If the node that carves 24B out of 24A is the parent of (or any other node basal to) the node that carves JN.1.11.1 from JN.1.11, then JN.1.11 corresponds to both 24A and 24B. If it's the child (or other tip-ward descendant node), JN.1.11.1 (rather than JN.1.11) will correspond to both 24B and 24A.)

But in reality, there is circularity and a lot of inference that is conditional on other inference. We don't actually observe the relationships, we estimate them. And we don't observe labels, we assign (infer, really) them, conditioned on both some estimate of the relationships (a phylogeny) and some inference of the labels (the Pango lineages and Nextstrain clades already assigned to the sequences in the tree). That's a lot of additional room for mismatch.
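
A minimal sketch of how one might look for this kind of slop in practice, assuming a list of per-sequence (Pango lineage, Nextstrain clade) pairs is available. The function names and the toy assignments are hypothetical and only mirror the JN.1.11 example above; they are not real data.

```python
from collections import Counter, defaultdict


def lineage_to_clade_counts(assignments):
    """Tally, for each Pango lineage, how many sequences fall in each Nextstrain clade."""
    counts = defaultdict(Counter)
    for lineage, clade in assignments:
        counts[lineage][clade] += 1
    return counts


def ambiguous_lineages(assignments):
    """Return the lineages whose sequences are split across more than one clade."""
    counts = lineage_to_clade_counts(assignments)
    return {lineage: dict(c) for lineage, c in counts.items() if len(c) > 1}


# Hypothetical per-sequence assignments (made up for illustration).
toy = [
    ("JN.1", "24A"),
    ("JN.1.4", "24A"),
    ("JN.1.11", "24A"),
    ("JN.1.11", "24B"),  # one lineage straddling two clades
    ("JN.1.11.1", "24B"),
]

print(ambiguous_lineages(toy))  # {'JN.1.11': {'24A': 1, '24B': 1}}
```

An empty result would mean the observed assignments are consistent with a clean many-to-one mapping; it would not show that the underlying labels were inferred without error.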

Author

Perhaps something like one of these? (Which add load-bearing phrases to accompany the heavy lifting done by "sufficient correspondence.")

  1. Further, as both systems are inherently phylogenetic, there is sufficient correspondence between them that it is broadly possible to model Nextstrain clades but discuss results in terms of Pango lineages (even if, at the level of the assignment of individual sequences, the correspondence appears somewhat murkier).
  2. Further, as both systems are inherently phylogenetic, there is sufficient correspondence between them that it is broadly possible to model Nextstrain clades but discuss results in terms of Pango lineages (even if the mappings are not perfectly clean when looking at sequence-level assignments of clades and lineages).

I suggested just adding "typically possible", which I think captures it. The point about single-sequence-level classification is helpful context, but I'd suggest relegating it to a footnote.

Author

I like "typically possible" and I like that a footnote affords a bit more space for clarity in this aside. I have done so.


@dylanhmorris dylanhmorris left a comment

Thanks, @afmagee42! A few questions and a few wording suggestions.

Comment on lines +98 to +99
To designate, weekly, a list of variants to model, it is ideal to have a system that does not require human intervention.
Further, this system must consistently produce a reasonable number of variants to permit the submission of [samples](#probabilistic-forecast-evaluation) of the frequency of every variant in each state for [six weeks](#prediction-horizon).

  1. I'd lead with the requirement and then add the nice-to-have.
  2. Possibly worth saying what constitutes a "reasonable number" and why?

Suggested change
- To designate, weekly, a list of variants to model, it is ideal to have a system that does not require human intervention.
- Further, this system must consistently produce a reasonable number of variants to permit the submission of [samples](#probabilistic-forecast-evaluation) of the frequency of every variant in each state for [six weeks](#prediction-horizon).
+ The Hub must define a list of variants to model each week. This system should consistently produce a reasonable number of distinct variants as modeling targets. Ideally, it should be algorithmic and operate without human intervention. An algorithmic approach makes the choice of targets more transparent and simplifies Hub administration.

Author

Fair enough re: ordering, and I do like the explanation of why algorithmic is good.

Author

Maybe, accounting for this and the size bit, we get this?

Suggested change
- To designate, weekly, a list of variants to model, it is ideal to have a system that does not require human intervention.
- Further, this system must consistently produce a reasonable number of variants to permit the submission of [samples](#probabilistic-forecast-evaluation) of the frequency of every variant in each state for [six weeks](#prediction-horizon).
+ The Hub must define a list of variants to model each week.
+ This system should consistently produce a reasonable number of distinct variants as modeling targets.
+ (Experimentation showed this to be approximately 10 variants or fewer in order to accommodate a sufficient number of [samples](#probabilistic-forecast-evaluation) of the frequency of every variant in each state for [six weeks](#prediction-horizon).)
+ Ideally, it should be algorithmic and operate without human intervention.
+ An algorithmic approach makes the choice of targets more transparent and simplifies Hub administration.
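
For concreteness, a rough sketch of what an algorithmic, frequency-cutoff selection could look like. This is not the Hub's actual procedure: the function name, the 1% cutoff, the cap of nine named clades plus a catch-all "other" category, and the toy counts are all assumptions chosen only to land at "approximately 10 variants or fewer".

```python
def select_clades(clade_counts, min_freq=0.01, max_clades=9, other="other"):
    """Pick clades at or above a frequency cutoff, capped in number, plus a catch-all.

    clade_counts maps a clade name to the number of sequences assigned to it.
    The cutoff, cap, and catch-all label are illustrative assumptions.
    """
    total = sum(clade_counts.values())
    if total == 0:
        return [other]
    # Rank by count (descending), breaking ties by name so the result is deterministic.
    ranked = sorted(clade_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [clade for clade, n in ranked if n / total >= min_freq][:max_clades]
    return kept + [other]


# Hypothetical sequence counts per clade (made up for illustration).
toy_counts = {"24A": 500, "24B": 300, "23I": 150, "23A": 45, "22F": 5}

print(select_clades(toy_counts))  # ['24A', '24B', '23I', '23A', 'other']
```

Because the rule depends only on the counts and two fixed parameters, rerunning it each week needs no human intervention, and the cap keeps the number of modeling targets bounded however many clades are circulating.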

Further, this system must consistently produce a reasonable number of variants to permit the submission of [samples](#probabilistic-forecast-evaluation) of the frequency of every variant in each state for [six weeks](#prediction-horizon).
[Nextstrain clades](https://nextstrain.org/blog/2021-01-06-updated-SARS-CoV-2-clade-naming) satisfy both requirements readily.
They intentionally follow larger-scale trends in SARS-CoV-2 evolution, and a suitably-sized selection of relevant clades can be obtained with a simple frequency cutoff.
The somewhat more commonly-encountered [Pango lineages](https://www.nature.com/articles/s41564-020-0770-5) describe variation at a much finer scale, and the selection of a suitable number of relevant variants is non-trivial.

Perhaps give an example here?

Suggested change
- The somewhat more commonly-encountered [Pango lineages](https://www.nature.com/articles/s41564-020-0770-5) describe variation at a much finer scale, and the selection of a suitable number of relevant variants is non-trivial.
+ [Pango lineages](https://www.nature.com/articles/s41564-020-0770-5) are another approach to naming SARS-CoV-2 variants. By design, they describe both coarse-scale and fine-scale virus evolution. Selecting an appropriate set of non-overlapping Pango lineages is less straightforward, as it requires first determining an appropriate scale of variation for modeling.

Author

I know I'm doing a lot of hand-waving here, but as mentioned above, we don't want to require an entire explainer of lineage aggregation just to make this point.

To that end, I'm also:

  • Iffy on diving into the lower-and-higher-order nature of naming schemes. Nested names are messy, but both Nextstrain clades and Pango lineages do it.
  • Not sold on "non-overlapping" because when you read lists of aggregated Pango lineages, or our list of clades to model, implicit paraphyly is pretty much guaranteed (and it is implicit).
