Explanation of choice to model Nextstrain clades #266

afmagee42 · 2025-01-10T22:01:02Z

This PR attempts to summarize, briefly, the reasons we chose Nextstrain clades as the modeling unit, rather than Pango lineages.

The topic of aggregating lineages can spiral quickly, so I tried to err on the side of brevity there. If we need more, more can be said.

Also happy for this to be moved around (it has several forward-references, so it may belong later).

afmagee42 · 2025-01-10T22:01:41Z

cc @nickreich, @dylanhmorris

nickreich

Thanks for putting this together! A few minor suggestions included here.

nickreich · 2025-01-10T22:43:12Z

README.md

+[Nextstrain clades](https://nextstrain.org/blog/2021-01-06-updated-SARS-CoV-2-clade-naming) satisfy both requirements readily.
+They intentionally follow larger-scale trends in SARS-CoV-2 evolution, and a suitably-sized selection of relevant clades can be obtained with a simple frequency cutoff.
+The somewhat more commonly-encountered [Pango lineages](https://www.nature.com/articles/s41564-020-0770-5) describe variation at a much finer scale, and the selection of a suitable number of relevant modeling units is non-trivial.
+Further, as both systems are inherently phylogenetic, there is [sufficient](https://raw.githubusercontent.com/nextstrain/ncov-clades-schema/master/clades.svg) correspondence [between them](https://next.nextstrain.org/nextclade/sars-cov-2) that it is possible to model Nextstrain clades but discuss results in terms of Pango lineages.


Maybe worth adding on parenthetically at the end of this sentence something like ", even if there is not always a perfect one-to-one alignment between the clade and lineage assignments for a single sequence."

I like this in principle, but I'm having trouble seeing how to be hand-wavy enough here without opening (at least one) rather large can of worms that could take a fair bit more text to explain.

In a perfect world there would be a many-to-one mapping, not a one-to-one mapping: every Pango lineage would correspond to exactly one Nextstrain clade. But that quickly opens two cans of worms: the nestedness of naming (which is why the perfect world is many-to-one) and "lineage assignments aren't data" (which is why we don't live in a perfect world). If we're not careful, it also leads to the pain that is the ARG (ancestral recombination graph).

Suppose we knew the evolutionary relationships among all the sequences perfectly (and, following Pango's approach, started a new tree for every recombination event). Whether we then get a clean many-to-one mapping from Pango lineages to Nextstrain clades comes down to whether every node that defines a more specific label in one system does so in the other.

(For example, if the node that carves 24B out of 24A is also the node that carves JN.1.11.1 out of JN.1.11, things line up cleanly, even though 24A as a whole corresponds to JN.1: names within JN.1.11.1 map to 24B, and all other JN.1 names map to 24A. Otherwise there will be slop. If the node that carves 24B out of 24A is the parent of (or any other node basal to) the node that carves JN.1.11.1 out of JN.1.11, then JN.1.11 corresponds to both 24A and 24B. If it is instead a child (or any other tip-ward descendant) of that node, then JN.1.11.1, rather than JN.1.11, corresponds to both 24A and 24B.)
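To make the "slop" in that example concrete, here is a minimal sketch that flags Pango lineages whose sequences land in more than one Nextstrain clade. It assumes a hypothetical input of per-sequence (Pango lineage, Nextstrain clade) pairs; the function names and the toy data are illustrative only and do not come from either system's tooling.

```python
from collections import defaultdict


def clades_per_lineage(assignments):
    """Map each Pango lineage to the set of Nextstrain clades observed for it.

    `assignments` is an iterable of (pango_lineage, nextstrain_clade) pairs,
    one per sequence. This input format is an assumption for illustration.
    """
    seen = defaultdict(set)
    for lineage, clade in assignments:
        seen[lineage].add(clade)
    return dict(seen)


def ambiguous_lineages(assignments):
    """Return the lineages that map to more than one clade (the "slop")."""
    return {
        lineage: clades
        for lineage, clades in clades_per_lineage(assignments).items()
        if len(clades) > 1
    }


# Toy data: the node defining 24B sits basal to the node defining JN.1.11.1,
# so some JN.1.11 sequences fall in 24A and others in 24B.
toy = [
    ("JN.1", "24A"),
    ("JN.1.11", "24A"),
    ("JN.1.11", "24B"),
    ("JN.1.11.1", "24B"),
]
slop = ambiguous_lineages(toy)
```

An empty result means the observed assignments are consistent with a clean many-to-one mapping; it does not show that the underlying trees agree, only that no mismatch appears in the sequences at hand.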

But in reality there is circularity and a lot of conditional, inference-dependent inference. We don't observe the relationships; we estimate them. And we don't observe labels; we assign (really, infer) them, conditional on both an estimate of the relationships (a phylogeny) and an inference of the labels (the Pango lineages and Nextstrain clades already assigned to the sequences in the tree). That leaves a lot of additional room for mismatch.

Perhaps something like one of these? (Which add load-bearing phrases to accompany the heavy lifting done by "sufficient correspondence.")

Further, as both systems are inherently phylogenetic, there is sufficient correspondence between them that it is broadly possible to model Nextstrain clades but discuss results in terms of Pango lineages (even if, at the level of the assignment of individual sequences, the correspondence appears somewhat murkier).

Further, as both systems are inherently phylogenetic, there is sufficient correspondence between them that it is broadly possible to model Nextstrain clades but discuss results in terms of Pango lineages (even if the mappings are not perfectly clean when looking at sequence-level assignments of clades and lineages).

I suggested just adding "typically possible", which I think captures it. The point about single-sequence-level classification is helpful context, but I'd suggest relegating it to a footnote.

I like "typically possible" and I like that a footnote affords a bit more space for clarity in this aside. I have done so.

dylanhmorris

Thanks, @afmagee42! A few questions and a few wording suggestions.

dylanhmorris · 2025-01-13T22:51:03Z

README.md

+To designate, weekly, a list of variants to model, it is ideal to have a system that does not require human intervention.
+Further, this system must consistently produce a reasonable number of variants to permit the submission of [samples](#probabilistic-forecast-evaluation) of the frequency of every variant in each state for [six weeks](#prediction-horizon).


I'd lead with the requirement and then add the nice-to-have.

Possibly worth saying what constitutes a "reasonable number" and why?

Suggested change

-To designate, weekly, a list of variants to model, it is ideal to have a system that does not require human intervention.
-Further, this system must consistently produce a reasonable number of variants to permit the submission of [samples](#probabilistic-forecast-evaluation) of the frequency of every variant in each state for [six weeks](#prediction-horizon).
+The Hub must define a list of variants to model each week. This system should consistently produce a reasonable number of distinct variants as modeling targets. Ideally, it should be algorithmic and operate without human intervention. An algorithmic approach makes the choice of targets more transparent and simplifies Hub administration.

Fair enough re: ordering, and I do like the explanation of why algorithmic is good.

Maybe, accounting for this and the size bit, we get this?

Suggested change

-To designate, weekly, a list of variants to model, it is ideal to have a system that does not require human intervention.
-Further, this system must consistently produce a reasonable number of variants to permit the submission of [samples](#probabilistic-forecast-evaluation) of the frequency of every variant in each state for [six weeks](#prediction-horizon).
+The Hub must define a list of variants to model each week.
+This system should consistently produce a reasonable number of distinct variants as modeling targets.
+(Experimentation showed this to be approximately 10 variants or fewer in order to accommodate a sufficient number of [samples](#probabilistic-forecast-evaluation) of the frequency of every variant in each state for [six weeks](#prediction-horizon).)
+Ideally, it should be algorithmic and operate without human intervention.
+An algorithmic approach makes the choice of targets more transparent and simplifies Hub administration.
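As a rough illustration of what "algorithmic" could mean here, the following is a minimal sketch of a frequency cutoff combined with a cap on the number of targets. The function name, the 1% threshold, the `"other"` catch-all label, the choice to count that catch-all toward the cap, and the toy counts are all assumptions made for illustration; this is not the Hub's actual selection code.

```python
def select_clades(counts, min_freq=0.01, max_clades=10):
    """Pick clades to model from recent sequence counts.

    counts: dict mapping clade name -> number of recent sequences.
    min_freq: hypothetical frequency threshold (assumed value).
    max_clades: cap on modeling targets (the thread mentions roughly 10).
    Returns the selected clade names, most frequent first, then "other".
    """
    total = sum(counts.values())
    if total == 0:
        return ["other"]
    # Keep clades at or above the cutoff, ordered by descending count
    # (ties broken by name so the result is deterministic).
    kept = sorted(
        (clade for clade, n in counts.items() if n / total >= min_freq),
        key=lambda clade: (-counts[clade], clade),
    )
    # Reserve one slot for the catch-all category (an assumption).
    return kept[: max_clades - 1] + ["other"]


toy_counts = {"24A": 500, "24B": 300, "24C": 195, "23I": 5}
targets = select_clades(toy_counts)
```

Because the rule depends only on counts and two fixed parameters, anyone can rerun it and get the same list, which is the transparency benefit the suggested text describes.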

dylanhmorris · 2025-01-13T23:04:10Z

README.md

+Further, this system must consistently produce a reasonable number of variants to permit the submission of [samples](#probabilistic-forecast-evaluation) of the frequency of every variant in each state for [six weeks](#prediction-horizon).
+[Nextstrain clades](https://nextstrain.org/blog/2021-01-06-updated-SARS-CoV-2-clade-naming) satisfy both requirements readily.
+They intentionally follow larger-scale trends in SARS-CoV-2 evolution, and a suitably-sized selection of relevant clades can be obtained with a simple frequency cutoff.
+The somewhat more commonly-encountered [Pango lineages](https://www.nature.com/articles/s41564-020-0770-5) describe variation at a much finer scale, and the selection of a suitable number of relevant variants is non-trivial.


Perhaps give an example here?

Suggested change

-The somewhat more commonly-encountered [Pango lineages](https://www.nature.com/articles/s41564-020-0770-5) describe variation at a much finer scale, and the selection of a suitable number of relevant variants is non-trivial.
+[Pango lineages](https://www.nature.com/articles/s41564-020-0770-5) are another approach to naming SARS-CoV-2 variants. By design, they describe both coarse-scale and fine-scale virus evolution. Selecting an appropriate set of non-overlapping Pango lineages is less straightforward, as it requires first determining an appropriate scale of variation for modeling.

I know I'm doing a lot of hand-waving here, but as mentioned above, we don't want to require an entire explainer of lineage aggregation just to make this point.

To that end, I'm also:

- Iffy on diving into the lower-and-higher-order nature of naming schemes. Nested names are messy, but both Nextstrain clades and Pango lineages do it.
- Not sold on "non-overlapping" because when you read lists of aggregated Pango lineages, or our list of clades to model, implicit paraphyly is pretty much guaranteed (and it is implicit).

Co-authored-by: Dylan H. Morris <[email protected]>

why clades

77ead57

nickreich requested changes Jan 10, 2025

Cleaning up per review

742ce1d

dylanhmorris reviewed Jan 13, 2025

afmagee42 and others added 5 commits January 14, 2025 06:03

Update README.md

2b3b555

Co-authored-by: Dylan H. Morris <[email protected]>

Update README.md

51eb2fc

Co-authored-by: Dylan H. Morris <[email protected]>

don't need redundant last sentence any more

417a8be

Context in footnote

42c403c

apply modified suggestion

0d1fb0b


