Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

can node labels be parsed as support values for tree rooting purposes? #53

Closed
jhcepas opened this issue May 17, 2016 · 25 comments
Closed

Comments

@jhcepas
Copy link

jhcepas commented May 17, 2016

Hi @jeetsukumaran and @mtholder!
I have recently expanded the analysis by Czech and Stamatakis regarding the incorrect mapping of support values after tree rooting in tree viewers, and found that multiple toolkits (perhaps also dendropy) might be affected.

@lczech, @stamatak and I are now working on reporting/confirming the bug in more frameworks and tree viewers.

Current tests are here:
https://github.com/jhcepas/test_branch_support_after_tree_rerooting

regarding dendropy, I don't know why the rooted topology returned looks as unrooted in the image. Is that normal? In any case, it seems that the support values are not mapped correctly (i.e. the support value of 20 does not correspond to the same partition as in the original tree).

Could you take a look and confirm that this is not a problem in our tests?
tx!

@jeetsukumaran
Copy link
Owner

jeetsukumaran commented May 17, 2016

Jaime (and @mtholder, @lczech, @stamatak)

(1) As documented, DendroPy assumes unrooted unless noted otherwise.
You will need a rooting token ("[&R]") or otherwise inform DendroPy at
parse time that the tree is rooted:

tree = Tree.get(
         data=tree_str,
         schema="newick",
         rooting="force-rooted",
         )
outgroup_node = tree.find_node_with_taxon_label("X")
tree.to_outgroup_position(outgroup_node, update_bipartitions=False)
print(tree.as_string(schema='newick').strip())

(2) Note that in example you give, in no way is DendroPy doing ANY
support mapping at all. It is simply retaining the labels associated
with the node that they were originally associated with. Calling this
"incorrect" is, in fact, incorrect. To DendroPy, these are just labels,
and, moreover, labels associated with nodes and NOT bipartitions. When
rerooting trees, this association is preserved simply with no attempt at
remapping. There should be no expectation of anything different.

(2) The labels-on-nodes <=> support for bipartitions is fragile
relationship, especially when using Newick format. The assumed
equivalency fallacy between the two is the core of the issue. I think
that the issue here is not so much tree viewers, but tree viewer USERS.
Users need to be aware a node label is a node label, nothing more.
Rerooting the tree shuffles the nodes around, and they take their label
with them.

(3) DendroPy has a very rich and rigorous infrastructure for dealing
with bipartitions and support for bipartitions on trees. All this is
bypassed when labels on nodes are used. To DendroPy, a label on an node
is just a label on a node. The fact that it has become a convention to
sometimes express support values there is not just an artifact, but an
idiosyncratic artifact of the Newick format.

(4) To do this operation correctly in DendroPy, after the tree has been
parsed, (a) the bipartition hashes need to be calcualted; (b) the node
labels need to be interpreted and assigned to the bipartitions; (c) the
tree re-rooted; (d) the support values recalculated and expressed as
node value based the bipartition supports. There is no ready-to-go
function in the library to the do the above currently in DendroPy, as
their has not been any demand for it (all support value
mapping/manipulation is done on raw data through SumTrees and the
underlying TreeArray class). However, this function will be trivial to
write, and I can sketch a simple version up if you want.

(6) This will still, however, result in null/None value at the original
old root, if the original program did not assign a support label there,
or an incorrect value if the original program assigned a support value
of 1.0. There is no way to get the support value for this original root
without re-scoring the support.

(7) The reason for the above is what I think is a pervasive fallacy
(IMHO) with the way people think about rooting. Most people think that
re-rooting tree is a simple visual operation. In fact, it is a
model-changing operation. Just like rooting an unrooted tree is a change
of the tree model. In both cases by use of outside information rather
than anything in the data or by recourse to statistical procedure.
Sometimes the support values as scored under the original tree model can
be carried across without modification. Because of this, and the
ease/seamlessness that tree viewers do this, folks do not think
re-rooting/rooting is a statistically-meaningful operation, considering
it mostly aesthetic. I disagree.

(8) Of course, there is no reason that tree viewers need not up their
game and change to conform to user's expectations: i.e., consider the
node labels as mapping to a bipartition rather than a node, as I
described when outlining how to do this operation correctly with DendroPy .

-- jeet

On 5/17/16 10:44 AM, Jaime Huerta-Cepas wrote:

Hi @jeetsukumaran https://github.com/jeetsukumaran and @mtholder
https://github.com/mtholder!
I have recently expanded the analysis by Czech and Stamatakis regarding
the incorrect mapping of support values after tree rooting
http://biorxiv.org/content/early/2015/12/25/035360 in tree viewers,
and found that multiple toolkits (perhaps also dendropy) might also be
affected.

@lczech https://github.com/lczech, @stamatak
https://github.com/stamatak and I are now working on
reporting/confirming the bug in more frameworks and tree viewers.

Current tests are here:
https://github.com/jhcepas/test_branch_support_after_tree_rerooting

regarding dendropy, I don't know why the rooted topology returned looks
as unrooted in the image. Is that normal? In any case, it seems that the
support values are not mapped correctly (i.e. the support value of 20
does not correspond to the same partition as in the original tree).

Could you take a look and confirm that this is not a problem in our tests?
tx!


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#53


Jeet Sukumaran

[email protected]

Blog/Personal Pages:
http://jeetworks.org/
GitHub Repositories:
http://github.com/jeetsukumaran
Photographs (as stream):
http://www.flickr.com/photos/jeetsukumaran/
Photographs (by galleries):

http://www.flickr.com/photos/jeetsukumaran/sets/

@mtholder
Copy link
Collaborator

Hi, I agree that, by default, text after a close parentheses should be treated as a text label of the node (as the newick standard says). But it would be nice to have an easy way to push a node property into an edge data struct in dendropy. (and then display the edge property on in the ascii tree). there may be an easy way, I'm not sure...

@jhcepas
Copy link
Author

jhcepas commented May 17, 2016

@jeetsukumaran @mtholder, I agree on every point.
I had the same problem in ETE and I included an option to parse those numbers as real support values, so they are not treated as labels and are correctly mapped to branches when re-rooting. I found it so pervasive that is the default method for reading newick strings in ETE.

@jeetsukumaran
Copy link
Owner

jeetsukumaran commented May 17, 2016

@mtholder, WRT to node vs edge labels. One of the big bugbearboo-boos with Newick. There is no shortage of ways if you are willing to come up and use with YAINFH (Yet Another Idiosyncratic Newick Format Hack). Anything from a symbol, such as a hash mark to indicate the label should be associated with the subtending edge rather than node, or something using comment metadata.

Or, serialize using a format that explicitly supports distinct expression of edge vs. node labels, such as NeXML, etc.

@lczech
Copy link

lczech commented May 17, 2016

Wow, lot of interesting discussion already going on here, nice. It seems all of us had their problems with the Newick format... Let's try to fix this!

According to Phylip (which I consider the "standard" definition), there is no such thing as a "node label" in Newick, so all of this is a convention anyway. Problematically, every tool adheres to their own interpretation. Thus, I disagree that it is the users responsibility to keep track of the different conventions being used by all the tools they use. Instead, the tools should offer proper support for the data.

That implies the following. If for certain data (e.g., labels in Newick) there is no way to unambiguously distinguish their meaning (i.e., node vs branch label), the tool should not simply assign a (preferred) interpretation. It should instead ask the user what semantics he or she actually expects. The newer versions of Dendroscope (since 3.5.0) already do this nicely. This is the Principle of Least Astonishment.

Furthermore, if a particular tool internally only offers one of the interpretations (i.e., the internal data structures just use node labels, but do not have branch labels), the tool should still tell the user what is happening. This way, errors are prevented, while implicitly educating users about the issue.

By the way, I like "YAINFH", @jeetsukumaran ;-)

@jeetsukumaran
Copy link
Owner

@jhcepas and @lczech

The luxury of a restricted set of use cases allows this approach: i.e. a default interpretation of those labels. However, DendroPy is a library rather than a tool. It has a very broad range of use cases. And those poor, bloated, used-and-abused, overdetermined Newick node-and-edge-simultaneously labels get used for a LOT of things apart from support values by lots of different people: internal node taxa, fossil calibration times, population sizes, trait states, etc. etc. Assuming one of these as a default is going to Astonish a whole lot of people. Even "support values" in the labels often are more complex than just numeric: e.g., many people have support values as established under multiple different approaches concatenated together in the labels. So, we default to treating the label as just a label, and provide the infrastructure to interpret these as needed. You are right that this needs to be clearly documented, though. And we do that! Typically in multiple places (e.g., the tutorials as well as library API documentation). The problem is, as I am sure you are aware, the documentation is rarely visited by users! Each of whom assumes that their use case is the only use case that makes sense and wants to library to work naturally (by default) based on their needs, perspectives, expectations, without taking into account the broader range of use cases that may impose very different natural defaults.

@jhcepas jhcepas changed the title Incorrect branch support values after phylogenetic tree rooting can node labels be parsed as support values for tree rooting purposes? May 17, 2016
@lczech
Copy link

lczech commented May 17, 2016

Hm, I don't fully understand what you want to say.

You state

Assuming one of these as a default is going to Astonish a whole lot of people.

which is exactly my point: don't assume - ask the user!

Further, you say

So, we default to treating the label as just a label, and provide the infrastructure to interpret these as needed.

This is IMO the best way to handle this. But, if I understood correctly, you still assume them to be node labels, right? That's the crux. I think, this also should be up to the user to decide - particularly because they often do not read the manual!

That means, when reading data in Newick, there should be a way to distinguish between the interpretations. AFAIK, ETE does this by offering a flag paramter in the reading function, right @jhcepas?! Another way would be to offer multiple reading functions, or (in languages like C++, where you have strong types), read into a data structure that has the desired kind of member variables.

This way, the user will automatically stumble upon the issue, because function parameters etc are parts of the manual that one needs to read in order to use the function.

So, in summary, I think we are on the same page. Our views just seem to differ on how to incorporate the "user choice" into software. What do you think?

Also, as a side comment, this topic would be more appropriately discussed in some bioinformatics forum, because it affects more then just DendroPy. This is not meant to be critique on your software, but a more general discussion. Hope you don't mind discussing this here.

@jhcepas
Copy link
Author

jhcepas commented May 17, 2016

@lczech yeap, we use a flag when loading trees, but by no means has to be the default in other toolkits.

I have changed the title of this issue, as I think we are discussing different things. The question here would be if/how dendropy can parse node labels as branch support values, as this is very common in some scenarios. I think I understood from @jeetsukumaran that there is, indeed, a way. (?)

@jeetsukumaran
Copy link
Owner

jeetsukumaran commented May 17, 2016

You state

Assuming one of these as a default is going to Astonish a whole lot
of people.

which is exactly my point: don't assume - ask the user!

I was referencing ETE interpreting the values as support values as default.

Also note that, as DendroPy is a programming library, not a tool,
there is no real way to ask the user. The user has to understand the API
sufficiently well to use the correct functions/arguments to get what
they want. And, as Python is Turing-complete, they can get anything they
want ... ;)

Further, you say

So, we default to treating the label as just a label, and provide

the infrastructure to interpret these as needed.

This is IMO the best way to handle this. But, if I understood correctly,
you still assume them to be node labels, right? That's the crux. I
think, this also should be up to the user to decide.

In a sense, the user can "decide" this, by simply writing:

for nd in tree:
     nd.edge.label = nd.label

But you are saying that we should offer the option of the above
association being available to the user at parse time of NEWICK.

Fair enough: it's easy to do. But it still needs a default, and
defaulting to assume the label is associated with the node is convention
that goes back a long way and is fairly pervasive.

Note that there is a difference between assuming a label is associated
with a node or edge vs. a label should be interpreted as a support
value, population size, or whatever.

The first is a syntactic assumption, and, if properly documented or at
least consistently enforced, can be be considered a format specification
whether. The role and responsibility for "reading" this lies with the
parser. The second is a semantic assumption. The role and
responsibility for this, as far as DendroPy is concerned, lies with the
programmer. I recognize that for more end-user oriented tools like
others being discussed here, the tool itself may take up the responsibility.

-- jeet

On 5/17/16 1:05 PM, Lucas Czech wrote:

Hm, I don't fully understand what you want to say.

You state

Assuming one of these as a default is going to Astonish a whole lot
of people.

which is exactly my point: don't assume - ask the user!

Further, you say

So, we default to treating the label as just a label, and provide
the infrastructure to interpret these as needed.

This is IMO the best way to handle this. But, if I understood correctly,
you still assume them to be node labels, right? That's the crux. I
think, this also should be up to the user to decide - particularly
because they often do not read the manual!

That means, when reading data in Newick, there should be a way to
distinguish between the interpretations. AFAIK, ETE does this by
offering a flag paramter in the reading function, right @jhcepas
https://github.com/jhcepas?! Another way would be to offer multiple
reading functions, or (in languages like C++, where you have strong
types), read into a data structure that has the desired kind of member
variables.

This way, the user will automatically stumble upon the issue, because
function parameters etc are parts of the manual that one needs to read
in order to use the function.

So, in summary, I think we are on the same page. Our views just seem to
differ on how to incorporate the "user choice" into software. What do
you think?

Also, as a side comment, this topic would be more appropriately
discussed in some bioinformatics forum, because it affects more then
just DendroPy. This is not meant to be critique on your software, but a
more general discussion. Hope you don't mind discussing this here.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#53 (comment)


Jeet Sukumaran

[email protected]

Blog/Personal Pages:
http://jeetworks.org/
GitHub Repositories:
http://github.com/jeetsukumaran
Photographs (as stream):
http://www.flickr.com/photos/jeetsukumaran/
Photographs (by galleries):

http://www.flickr.com/photos/jeetsukumaran/sets/

@jeetsukumaran
Copy link
Owner

jeetsukumaran commented May 17, 2016

there is no such thing as a "node
label" in Newick,

Ah yes, but there IS in NEXUS:

http://sysbio.oxfordjournals.org/content/46/4/590.full.pdf+html

Here, note how the label is explicitly used to label the NODE, not the
edge.

THIS is the basis of DendroPy's syntactic (not semantic) assumption
that the label is associated with the node rather than the edge.

You are right that it would be convenient to have an option when parsing
NEWICK to allow for the label association to be selected by the user.
However, given the above, I think it is a sensible default to have the
label be associated with the node, whether or not this astonishes the user.

-- jeet

On 5/17/16 12:27 PM, Lucas Czech wrote:

According to Phylip
http://evolution.genetics.washington.edu/phylip/newicktree.html (which
I consider the "standard" definition), so all of this is a convention anyway.


Jeet Sukumaran

[email protected]

Blog/Personal Pages:
http://jeetworks.org/
GitHub Repositories:
http://github.com/jeetsukumaran
Photographs (as stream):
http://www.flickr.com/photos/jeetsukumaran/
Photographs (by galleries):

http://www.flickr.com/photos/jeetsukumaran/sets/

@jeetsukumaran
Copy link
Owner

@jhcepas I see the test data. Perhaps it would be useful to have an "expected output" tree string as well?

I am also curious as to what sort of value you reasonably expect to see in the old (original) root.

@jeetsukumaran
Copy link
Owner

@jhcepas

As the following code should make clear, IMHO, relying on node labels to store support values is a fragile, dangerous, and, in the long run, unsustainable hope. Once the tree gets re-rooted (i.e., you change the model of the tree from the model that was originally scored and to which, strictly speaking, is the only model to which the support values can be applied without additional data), you get "new" internal nodes for which no information is available.

Am I missing anything?

#! /usr/bin/env python

import dendropy

def get_tree():
    tree_str = "((C:1.0,D:1.0)10:0.001,(A:1.0, (B:1.0,X:1.0)30:0.01)20:0.1,E:1.0)0:0.0;"
    tree = dendropy.Tree.get(
            data=tree_str,
            schema="newick",
            rooting="force-rooted",
            )
    return tree

def naive_method():
    tree = get_tree()
    outgroup_node = tree.find_node_with_taxon_label("X")
    new_root = outgroup_node.parent_node
    tree.reseed_at(new_root)
    # print(tree.as_string("newick").strip())
    return tree

def support_aware_method():
    tree = get_tree()
    benc = tree.encode_bipartitions()
    support_values = {}
    for nd in tree:
        support_values[nd.bipartition] = float(nd.label) if nd.label is not None else 1.0

    outgroup_node = tree.find_node_with_taxon_label("X")
    new_root = outgroup_node.parent_node
    tree.reseed_at(new_root)

    tree.encode_bipartitions()
    for nd in tree:
        nd.label = support_values.get(nd.bipartition, "Not_specified_in_original_tree")
    return tree

print("\n[Original Tree:]\n    ")
print(get_tree().as_string("newick").strip())
print("\n[Naive Tree:]\n    ")
print(naive_method().as_string("newick").strip())
print("\n[Support-Aware Tree:]\n")
print(support_aware_method().as_string("newick").strip())

Results in:

[Original Tree:]

[&R] ((C:1.0,D:1.0)10:0.001,(A:1.0,(B:1.0,X:1.0)30:0.01)20:0.1,E:1.0)0:0.0;

[Naive Tree:]

[&R] (B:1.0,X:1.0,(A:1.0,((C:1.0,D:1.0)10:0.001,E:1.0)0:0.1)20:0.01)30:0.0;

[Support-Aware Tree:]

[&R] (B:1.0,X:1.0,(A:1.0,((C:1.0,D:1.0)10.0:0.001,E:1.0)'Not_specified_in_original_tree':0.1)'Not_specified_in_original_tree':0.01):0.0;

@jhcepas
Copy link
Author

jhcepas commented May 17, 2016

for the ETE API (also not a tool), users can use a flag when loading node labels (the text string close to the perenthesis) as 1) node.support or 2) node.name.

After loading, node.support is always treated as the statistical branch support, so it is correctly re-mapped when rooting. For the "new" node without support value created after rooting, ETE uses the default support value, which is, 1. That's also not good and I should find a way to allow for missing support values.

The rest of node attributes (node.name and anything else loaded using the extended newick format) are used as in dendropy, simply as node labels. IMO, also not totally good, as many labels are actually branch-labels that should be cleared or moved with branches when the tree is rearranged. Or at least warn the user to avoid misinterpretations. For instance, internal taxids, population sizes, etc should be probably cleared, as they might loose their meaning after re-rooting. But that would require a proper newick standard, or using NExML or similar.

@jhcepas
Copy link
Author

jhcepas commented May 17, 2016

at this point I am even wondering if the distinction between nodes and branches makes any sense :)

@jeetsukumaran
Copy link
Owner

at this point I am even wondering if the distinction between nodes and branches makes any sense :)

In the sense that each node has unique edge subtending it and only it ... no.

But in the sense that each edge ALSO represents a split/bipartition, the value of which may change during tree restructuring even if the particular node that it subtends does not ... YES!

@mtholder
Copy link
Collaborator

A gist: https://gist.github.com/mtholder/8c5a4079d04a0129c652e431575add01
tweaks @jeetsukumaran's example. The code change is to allow the label to be treated as a support statement for the unrooted bipartition.

I also changed the branch lengths and support values to make them easier to trace on the tree (not just 0s and 1s). I get:

[Original Tree:]

[&R] ((C:1.3,D:4.0)34:0.034,(A:1.1,(B:1.2,X:1.6)26:0.026)12:0.0126,E:1.5);

[Labels-attached-to-nodes Tree:]

[&R] (B:1.2,X:1.6,(A:1.1,((C:1.3,D:4.0)34:0.034,E:1.5):0.0126)12:0.026)26;

[Labels-attached-to-unrooted-bipartition-method Tree:]

[&U] (B:1.2,X:1.6,(A:1.1,((C:1.3,D:4.0)34.0:0.034,E:1.5)12.0:0.0126)26.0:0.026)1.0;

[labels-attached-to-rooted-bipartition-method Tree:]

[&R] (B:1.2,X:1.6,(A:1.1,((C:1.3,D:4.0)34.0:0.034,E:1.5)'Not_specified_in_original_tree':0.0126)'Not_specified_in_original_tree':0.026)1.0;

where Labels-attached-to-unrooted-bipartition-method is how users should move labels around if they represent split support for an unrooted tree.

@lczech
Copy link

lczech commented May 18, 2016

Thanks for the clarifications!

The distinction between syntactic and semantic interpretation is indeed important. So far, I have (carelessly) used the latter, because it implies the former. But you are right that this should be clearly separated. Sorry for the confusion.

Also, defaulting to the syntactic interpretation that the data belongs to the nodes seems reasonable. Remark: by "asking the user" in case of a library, I meant something like offer a flag or the like.

As for the question whether the distinction between nodes and edges makes sense (i.e., is important at all), I'd say yes for the same reason as @jeetsukumaran: bipartitions - but also for edge-related data. See the referenced issue #54:

But really it is a waste of an edge object, because all the information associated with the "edge" can just be dumped onto the node, as there is a strict one-to-one correspondence between edges and the subtended nodes

That very much depends on your use case and your internal tree structure. I don't know how DendroPy stores trees, but I guess they are always rooted. Then, there is a one-to-one correspondence. However, and this was one of the points of our preprint, this is not the case for unrooted trees. For those, it is more natural to have actual edge data.

Also, speaking from a software modelling point of view, I always found it simpler to separate node and edge data. It makes usage clearer. (Maybe this is biased because the internal structure in my library http://genesis-lib.org/ is always unrooted, so that I need that distinction anyway.)

So, to finally come to the (current) issue title: "Can node labels be parsed as support values for tree rooting purposes?". You probably don't want to rewrite all the node/edge data handling in DendroPy (and shouldn't). But what do you think could be done to avoid problems related to the interpretation of those data?

@jeetsukumaran
Copy link
Owner

jeetsukumaran commented May 18, 2016

That very much depends on your use case and your internal tree
structure. I don't know how DendroPy stores trees, but I guess they are
always rooted. Then, there is a one-to-one correspondence. However, and
this was one of the points of our preprint, this is not the case for
unrooted trees. For those, it is more natural to have actual edge data.

(1) DendroPy handles both rooted and unrooted trees. When dealing with
formats such as NEWICK and NEXUS, unless it is is explicitly told that
the tree is rooted, it reads it as unrooted by default. This causes a
lot of confusion with sloppy bioinformaticians who drop the "[&R]", do
not read the documentation to use the "rooting='force-rooted'", and
assume/need rooted trees. Unfortunately, not only has this behavior been
the default since 1.0 and would be too late to change for legacy
reasons, but it is sooooooo clearly documented in multiple places
including the tutorial:

http://dendropy.org/primer/treemanips.html#rooting-derooting-and-rerooting

that the astonishment is, the sole responsibility of the astonished.

(2) You are absolutely right for unrooted trees, edges are important.
But here you do NOT have the case where there is a clear, unambiguous
one-to-one correspondence between edges and nodes. Here edges represent
connections between bipartitions. This situation, then, is the second of
the two perspectives that I describe, and, as I note, edges are
absolutely meaningful.

Also, speaking from a software modelling point of view, I always found
it simpler to separate node and edge data. It makes usage clearer.
(Maybe this is biased because the internal structure in my library
http://genesis-lib.org/ is always unrooted, so that I need that
distinction anyway.)

Yes, but my point is in the case of one-to-one correspondence between
edges and nodes, the distinction between edges and nodes can be
abstracted away in the implementation with no detrimental effect or loss of information in the domain.

So, to finally come to the (current) issue title: "Can node labels be
parsed as support values for tree rooting purposes?". You probably don't
want to rewrite all the node/edge data handling in DendroPy (and
shouldn't). But what do you think could be done to avoid problems
related to the interpretation of those data?

The answer is "yes", in a number of different ways, though not in the
one specific way you suggest above. And there is no need to rewrite all
the node/edge data handling to do so. DendroPy already does natively.

(1) The ideal solution, but alas, very unrealistic one, is simply to
move away from NEWICK. To DendroPy NEWICK is not the end-all-and-be-all
of tree formats. It is, in fact, not only Just Another Tree Format, but
a particularly sloppy and troublesome one. Unfortunately, it, or its
superset NEXUS, is so widespread in phylogenetics to the point of being
practically the only tree format in this domain despite more compelling
alternatives.

There better formats out there, but the push to use this is not going to
come from users. It is going to have to come from software developers
who recognize the shortcomings and care enough to demand something
better. Tools and software should continue to support NEWICK/NEXUS, of
course, but as a secondary "import-only" format, with the primary format
being something else. BEAST does this with their data.

Of course, if the choice of primary format is fragmented, then this is
doomed because NEWICK/NEXUS will end up being the lingua franca format
for various different programs that all have a different "better" format
of their choice. And this, I think, is almost certainly going to be the
case. So, this solution has very depressing non-fate in the near future
unless things change radically :( . (And, this change, again, HAS to
come from software developers by common agreement amongst themselves;
users really don't care and will user whatever works and is most
convenient for them).

(2) DendroPy also supports a very rich system of metadata annotations.
This, too, can be used to communicate the information association and
even interpretation. The limitations here are that with NEWICK/NEXML,
there is no real standard in how the metadata should be expressed: there
are numerous conventions, some more popular than others, and new ones
coming up all the time. DendroPy currently supports comment-based
metadata [support=1.0,color=red,population_size=301] in various
flavors (e.g., NHX, BEAST) as well as a few others. The problem here is
the expression of information is limited, and there is no standard set
of fields. DendroPy uses "support" in its SumTrees, but that is a
personal choice.

(3) Of course, as I note previously, it is incredibly trivial for a user
to reassign node labels to edges:

     for nd in t: nd.edge.label = nd.label

Assuming a competent programmer, the above is no real hardship, but it
is inefficient and irritating. So I think I agree that a simple flag at
parse time, e.g. assign_internal_labels_to_edges=True will help a
lot. I will implement this today itself.

(4) Providing built-in support for reading the node labels as SUPPORT
values specifically is easily done. But I think this is a very narrow
viewpoint. Those labels can be anything, of which support values are
just ONE interpretation. Internal node taxon labels are also a very
common use case. As are trait states. I myself frequently use them to
express geographical distribution. The list goes on and on and on. I see
no reason to elevate "support" to a special status requiring special
code to handle and not any of the others. I am not sure everyone will
agree. In fact, I know they don't. I get a lot of requests from users
whose particular use case and needs so dominate their perspective, that
they do not understand why DendroPy's defaults (such as unrooted trees,
or not automatically reading internal labels as taxa) are what they are
and want them changed. Given the wide variety of semantic
interpretation of those labels, and given how easy DendroPy makes it for
users to manipulate and operate on those labels once they have been
parsed, I think that special-casing treatment of support is not needed,
especially given the following ...

(5) DendroPy currently, lives much more upstream in a typical
bioinformatics/evoinformatics pipeline than the other tools discussed
here. So while it easily can, as noted above, CONSUME trees with support
values, it is more often used to PRODUCE them. In fact, one might argue
that this latter is by far the most popular usage of DendroPy, through
its bundled application/tool SumTrees. DendroPy has a very, very, very,
very extensive and rich infrastructure for calculating, summarizing,
manipulating, and/or querying bipartitions and support of bipartitions
on trees. And it gets these values from the raw data, not any labels.
From this perspective, the "labels" on nodes are just a clumsy way of
expressing the final result. And while DendroPy supports different ways
of re-reading these results correctly (again, as shown above), this is
not a common use case simply because most people at this point (i.e.,
after the support values have been calculated and expressed as node/edge
labels) want to visualize trees. It is not as common for someone to want
to carry out calculations on the processed support values expressed as
labels (as opposed to calculating the support values from the raw data
and then carrying out further calculations on them), and, in those rare
cases, the few extra lines of code to correctly interpret/use those
labels as bipartition support values, as shown above, is required on the
part of the programmer.

-- jeet

On 5/18/16 8:38 AM, Lucas Czech wrote:

Thanks for the clarifications!

The distinction between syntactic and semantic interpretation is indeed
important. So far, I have (carelessly) used the latter, because it
implies the former. But you are right that this should be clearly
separated. Sorry for the confusion.

Also, defaulting to the syntactic interpretation that the data belongs
to the nodes seems reasonable. Remark: by "asking the user" in case of a
library, I meant something like offer a flag or the like.

As for the question whether the distinction between nodes and edges
makes sense (i.e., is important at all), I'd say yes for the same reason
as @jeetsukumaran https://github.com/jeetsukumaran: bipartitions - but
also for edge-related data. See the referenced issue #54
#54:

But really it is a waste of an edge object, because all the
information associated with the "edge" can just be dumped onto the
node, as there is a strict one-to-one correspondence between edges
and the subtended nodes

That very much depends on your use case and your internal tree
structure. I don't know how DendroPy stores trees, but I guess they are
always rooted. Then, there is a one-to-one correspondence. However, and
this was one of the points of our preprint, this is not the case for
unrooted trees. For those, it is more natural to have actual edge data.

Also, speaking from a software modelling point of view, I always found
it simpler to separate node and edge data. It makes usage clearer.
(Maybe this is biased because the internal structure in my library
http://genesis-lib.org/ is always unrooted, so that I need that
distinction anyway.)

So, to finally come to the (current) issue title: "Can node labels be
parsed as support values for tree rooting purposes?". You probably don't
want to rewrite all the node/edge data handling in DendroPy (and
shouldn't). But what do you think could be done to avoid problems
related to the interpretation of those data?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#53 (comment)


Jeet Sukumaran

[email protected]

Blog/Personal Pages:
http://jeetworks.org/
GitHub Repositories:
http://github.com/jeetsukumaran
Photographs (as stream):
http://www.flickr.com/photos/jeetsukumaran/
Photographs (by galleries):

http://www.flickr.com/photos/jeetsukumaran/sets/

@jhcepas
Copy link
Author

jhcepas commented May 18, 2016

Assuming a competent programmer, the above is no real hardship, but it
is inefficient and irritating. So I think I agree that a simple flag at
parse time, e.g. assign_internal_labels_to_edges=True will help a
lot. I will implement this today itself.

Thanks @jeetsukumaran - IMHO that would be very useful for people analyzing trees returned by Phyml, Fasttree, Raxml, IQTree and many other, which tend to produce bootstrap/support values as node labels (or plain comments) by default.

@lczech
Copy link

lczech commented May 19, 2016

Thanks @jeetsukumaran for the detailed answer - I think we are getting to a solution here.

To (1): Unfortunately, no one reads manuals... So we can only hope to solve this by explicit APIs.
To (1, the second one): Getting rid of those old, incomplete formats like Newick would be nice, I fully agree! But that is not the scope of our current project.
To (2-4): Of course, support values are only an example of values that belong the branches. The point that @jhcepas and I were aiming at was actually about the distinction between node and edge data, not between possible usage of those. Sorry if that was not clear enough so far.

So, here is my summary of this issue, with a possible solution:

When working with phylogenetic trees, users often want to annotate metadata. Often, this data belongs to either the nodes or the branches of the tree. Independently of the interpretation of this data (which is up to the programmer), there is a problem with many file formats (particularly Newick) when storing such data. As we (unfortunately) cannot get rid of those formats easily, software still needs to support them - even if it is just for import. And particularly this import is a problem, because of the missing syntactic information whether certain metadata belongs to the nodes or edges of the tree. Thus, in order to aid the user, software should offer a way to distinguish between those two cases (again, independently of how the data is used in the end). For stand-alone applications, a user dialogue when reading such a file is a good way to do this. For libraries, the API should make it very clear where the metadata is going to end up (because no one reads the manual except for looking at function signatures). For example, by offering a flag in the reading function, or by offering different versions of the reading function.

My suggestion for DendroPy is thus, to incorporate such a flag (with default to store data in nodes, so that old code doesn't break).

Do you agree?

@jeetsukumaran
Copy link
Owner

jeetsukumaran commented May 20, 2016

To (1): Unfortunately, no one reads manuals... So we can only hope to solve this by explicit APIs.

Imagine this was a molecular sequencing protocol. And a biologist running it complained that they got no DNA because they were using it without reading the manual. Would this be acceptable? I do not see why programmers, biologist or otherwise, are absolved of the responsibility of not reading the manual.

The API is explicit.

It is explicitly described in the manual.

To (1, the second one): Getting rid of those old, incomplete formats like Newick would be nice, I fully agree! But that is not the scope of our current project.
To (2-4): Of course, support values are only an example of values that belong the branches. The point that @jhcepas and I were aiming at was actually about the distinction between node and edge data, not between possible usage of those. Sorry if that was not clear enough so far.
When working with phylogenetic trees, users often want to annotate metadata. Often, this data belongs to either the nodes or the branches of the tree. Independently of the interpretation of this data (which is up to the programmer), there is a problem with many file formats (particularly Newick) when storing such data. As we (unfortunately) cannot get rid of those formats easily, software still needs to support them - even if it is just for import. And particularly this import is a problem, because of the missing syntactic information whether certain metadata belongs to the nodes or edges of the tree. Thus, in order to aid the user, software should offer a way to distinguish between those two cases (again, independently of how the data is used in the end). For stand-alone applications, a user dialogue when reading such a file is a good way to do this. For libraries, the API should make it very clear where the metadata is going to end up (because no one reads the manual except for looking at function signatures). For example, by offering a flag in the reading function, or by offering different versions of the reading function.

(1) There is a problem only with the NEWICK (and, by extension, NEXUS) format.

(2) The real solution is to move away from the NEWICK format.

(3) You discount this solution, and instead decide to invest a lot of effort and worry in seeing how this format can be salvaged. Analogy: keeping financial accounts on wet tissue paper as a medium is fraught with dangers. A better solution is to use dry paper.

(4) The solution that you suggest is fragile and only applies in one restricted use case: i.e., where the metadata is ONLY applicable to the edges. What happens if there is a mix of metadata, e.g. some, like support values, applying to edges, but others, like trait states or internal taxon labels, applying to nodes? NEWCIK simply cannot handle that without a lot of contrivance.

(5) I also note that the problem you describe is a very small use case in the broad range of use cases that DendroPy deals with: i.e., rooting an unrooted tree with support expressed as internal labels. And it really is NOT a "problem" as such for DendroPy (in the sense that the correct results cannot be obtained). As I describe above, most people using DendroPy to calculate support simply (a) use SumTrees or directly using the underlying classes/methods/functions to calculate support from the raw data; or (b) are programmers who are aware of the limitations of the NEWICK format, and can easily switch the metadata association as shown above. Please see the code that I posted for switching the labels to branches or, even better, the code/results that @mtholder posted, which shows correct support values being recovered even when support values are read off node labels.

(6) That last point is worth repeating: DendroPy handles the use case of "rooting a tree with support expressed as labels" correctly, as long as the programer understands the limitations of the NEWICK format and how trees work. It is indeed good to raise awareness of the problem, but I do not agree that we should bend over backwards to change library code make up for programmers who do not due diligence in understanding the API, do not understand NEWICK, or do not understand how trees work.

(7) The second to last point, which I raised before but seems to have been missed, is also worth repeating: "rooting a tree with support expressed as labels" is not a primary use case from DendroPy's perspective. I can see how it is important or maybe even a dominant use case from a tree visualization tool perspective, or even a library that is tree visualization centric. But really, really, really: most people use DendroPy to generate trees for visualization tools to consume, and the support values are calculated/analyzed from the raw data.

My suggestion for DendroPy is thus, to incorporate such a flag (with default to store data in nodes, so that old code doesn't break).
Do you agree?

Yes, I do agree. Thank you for the suggestion. I will do so. I was planning to do so yesterday, but did not have the time. Hopefully I will do so today this weekend. I think it is a useful addition that will make things a bit more efficient, a bit more convenient and a bit less less irritating in some use cases (by avoiding the need for the programmer to reassign the values themselves).

@jeetsukumaran
Copy link
Owner

jeetsukumaran commented May 20, 2016

Also, @lczech and @jhcepas, could I suggest that you update the code for the DendroPy tests that you link to so as to use the code that @mtholder posted, which, does, indeed, give the correct results?

@lczech
Copy link

lczech commented May 24, 2016

The API is explicit.

Again, I am not criticizing DendroPy here. Just saying that in my opinion it is generally better for a library (any, not just DendroPy) to e.g. name functions and their parameters (flags) so that it is clear what they do. And of course, users are not absolved of the responsibility to read the manual. They just usually don't... Unfortunate as it is, we need to face this.

to (1-2): Agreed.

to (3): I discounted it because it is not easily done. That would need to come from the whole community, not just one library or one author. We could start some initiative in the future ;-)

to (4): There still seems to be confusion. If there is a mix of metadata, then of course, the programmer (user of the library) needs to do some more work. A simple flag ("write metadata to nodes or edges") is then no sufficient solution, but still more flexible then always storing it at nodes. However, if there is really such a case where some of the numbers in inner parts of the tree are intended for nodes and some others for edges, then there needs to be some other mechanism to distinguish that, thus Newick would be a really bad choice in the first place. So, users who would actually want to do that would have a really weird use case and would need to deal with this on their own anyway. In most cases however, the metadata is intended for either nodes or edges, so a flag should do.

to (5-7): Agreed, rerooting is a specific use case. However, we only used it as a practical example to show that the association of edges to nodes is fragile whenever the convention "edges are associated with the nodes at their distal end (the end away from the root)" is used. This convention is most prominent in Newick, but I have also seen it in actual tree data structures, so it might also be an internal invariant of programs and libraries. And as such, it might break in cases like rerooting, if not properperly treated (again, not criticizing DendroPy here - just a general caveat). Our paper was intended to raise awareness of this general issue. Maybe we should have stressed that more.

Yes, I do agree. Thank you for the suggestion. I will do so. I was planning to do so yesterday, but did not have the time. Hopefully I will do so today this weekend. I think it is a useful addition that will make things a bit more efficient, a bit more convenient and a bit less less irritating in some use cases (by avoiding the need for the programmer to reassign the values themselves).

Nice. That was our intention all along - and not to criticize DendroPy or its handling of tree metadata. Glad that we found consensus ;-)

@jhcepas
Copy link
Author

jhcepas commented Jun 9, 2016

I have updated the tests. They include now the rooted_bipartition_method.
https://github.com/jhcepas/test_branch_support_after_tree_rerooting#test_dendropypy
Let me know if this eventually becomes a flag in dendropy newick parser.
thanks!

@jeetsukumaran
Copy link
Owner

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants