-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
can node labels be parsed as support values for tree rooting purposes? #53
Comments
Jaime (and @mtholder, @lczech, @stamatak) (1) As documented, DendroPy assumes unrooted unless noted otherwise.
(2) Note that in example you give, in no way is DendroPy doing ANY (2) The labels-on-nodes <=> support for bipartitions is fragile (3) DendroPy has a very rich and rigorous infrastructure for dealing (4) To do this operation correctly in DendroPy, after the tree has been (6) This will still, however, result in null/None value at the original (7) The reason for the above is what I think is a pervasive fallacy (8) Of course, there is no reason that tree viewers need not up their -- jeet On 5/17/16 10:44 AM, Jaime Huerta-Cepas wrote:
Jeet Sukumaran[email protected]Blog/Personal Pages: http://www.flickr.com/photos/jeetsukumaran/sets/ |
Hi, I agree that, by default, text after a close parentheses should be treated as a text label of the node (as the newick standard says). But it would be nice to have an easy way to push a node property into an edge data struct in dendropy. (and then display the edge property on in the ascii tree). there may be an easy way, I'm not sure... |
@jeetsukumaran @mtholder, I agree on every point. |
@mtholder, WRT to node vs edge labels. One of the big bugbearboo-boos with Newick. There is no shortage of ways if you are willing to come up and use with YAINFH (Yet Another Idiosyncratic Newick Format Hack). Anything from a symbol, such as a hash mark to indicate the label should be associated with the subtending edge rather than node, or something using comment metadata. Or, serialize using a format that explicitly supports distinct expression of edge vs. node labels, such as NeXML, etc. |
Wow, lot of interesting discussion already going on here, nice. It seems all of us had their problems with the Newick format... Let's try to fix this! According to Phylip (which I consider the "standard" definition), there is no such thing as a "node label" in Newick, so all of this is a convention anyway. Problematically, every tool adheres to their own interpretation. Thus, I disagree that it is the users responsibility to keep track of the different conventions being used by all the tools they use. Instead, the tools should offer proper support for the data. That implies the following. If for certain data (e.g., labels in Newick) there is no way to unambiguously distinguish their meaning (i.e., node vs branch label), the tool should not simply assign a (preferred) interpretation. It should instead ask the user what semantics he or she actually expects. The newer versions of Dendroscope (since 3.5.0) already do this nicely. This is the Principle of Least Astonishment. Furthermore, if a particular tool internally only offers one of the interpretations (i.e., the internal data structures just use node labels, but do not have branch labels), the tool should still tell the user what is happening. This way, errors are prevented, while implicitly educating users about the issue. By the way, I like "YAINFH", @jeetsukumaran ;-) |
The luxury of a restricted set of use cases allows this approach: i.e. a default interpretation of those labels. However, DendroPy is a library rather than a tool. It has a very broad range of use cases. And those poor, bloated, used-and-abused, overdetermined Newick node-and-edge-simultaneously labels get used for a LOT of things apart from support values by lots of different people: internal node taxa, fossil calibration times, population sizes, trait states, etc. etc. Assuming one of these as a default is going to Astonish a whole lot of people. Even "support values" in the labels often are more complex than just numeric: e.g., many people have support values as established under multiple different approaches concatenated together in the labels. So, we default to treating the label as just a label, and provide the infrastructure to interpret these as needed. You are right that this needs to be clearly documented, though. And we do that! Typically in multiple places (e.g., the tutorials as well as library API documentation). The problem is, as I am sure you are aware, the documentation is rarely visited by users! Each of whom assumes that their use case is the only use case that makes sense and wants to library to work naturally (by default) based on their needs, perspectives, expectations, without taking into account the broader range of use cases that may impose very different natural defaults. |
Hm, I don't fully understand what you want to say. You state
which is exactly my point: don't assume - ask the user! Further, you say
This is IMO the best way to handle this. But, if I understood correctly, you still assume them to be node labels, right? That's the crux. I think, this also should be up to the user to decide - particularly because they often do not read the manual! That means, when reading data in Newick, there should be a way to distinguish between the interpretations. AFAIK, ETE does this by offering a flag paramter in the reading function, right @jhcepas?! Another way would be to offer multiple reading functions, or (in languages like C++, where you have strong types), read into a data structure that has the desired kind of member variables. This way, the user will automatically stumble upon the issue, because function parameters etc are parts of the manual that one needs to read in order to use the function. So, in summary, I think we are on the same page. Our views just seem to differ on how to incorporate the "user choice" into software. What do you think? Also, as a side comment, this topic would be more appropriately discussed in some bioinformatics forum, because it affects more then just DendroPy. This is not meant to be critique on your software, but a more general discussion. Hope you don't mind discussing this here. |
@lczech yeap, we use a flag when loading trees, but by no means has to be the default in other toolkits. I have changed the title of this issue, as I think we are discussing different things. The question here would be if/how dendropy can parse node labels as branch support values, as this is very common in some scenarios. I think I understood from @jeetsukumaran that there is, indeed, a way. (?) |
I was referencing ETE interpreting the values as support values as default. Also note that, as DendroPy is a programming library, not a tool,
In a sense, the user can "decide" this, by simply writing:
But you are saying that we should offer the option of the above Fair enough: it's easy to do. But it still needs a default, and Note that there is a difference between assuming a label is associated The first is a syntactic assumption, and, if properly documented or at -- jeet On 5/17/16 1:05 PM, Lucas Czech wrote:
Jeet Sukumaran[email protected]Blog/Personal Pages: http://www.flickr.com/photos/jeetsukumaran/sets/ |
Ah yes, but there IS in NEXUS:
Here, note how the label is explicitly used to label the NODE, not the THIS is the basis of DendroPy's syntactic (not semantic) assumption You are right that it would be convenient to have an option when parsing -- jeet On 5/17/16 12:27 PM, Lucas Czech wrote:
Jeet Sukumaran[email protected]Blog/Personal Pages: http://www.flickr.com/photos/jeetsukumaran/sets/ |
@jhcepas I see the test data. Perhaps it would be useful to have an "expected output" tree string as well? I am also curious as to what sort of value you reasonably expect to see in the old (original) root. |
As the following code should make clear, IMHO, relying on node labels to store support values is a fragile, dangerous, and, in the long run, unsustainable hope. Once the tree gets re-rooted (i.e., you change the model of the tree from the model that was originally scored and to which, strictly speaking, is the only model to which the support values can be applied without additional data), you get "new" internal nodes for which no information is available. Am I missing anything?
Results in:
|
for the ETE API (also not a tool), users can use a flag when loading node labels (the text string close to the perenthesis) as 1) node.support or 2) node.name. After loading, node.support is always treated as the statistical branch support, so it is correctly re-mapped when rooting. For the "new" node without support value created after rooting, ETE uses the default support value, which is, 1. That's also not good and I should find a way to allow for missing support values. The rest of node attributes (node.name and anything else loaded using the extended newick format) are used as in dendropy, simply as node labels. IMO, also not totally good, as many labels are actually branch-labels that should be cleared or moved with branches when the tree is rearranged. Or at least warn the user to avoid misinterpretations. For instance, internal taxids, population sizes, etc should be probably cleared, as they might loose their meaning after re-rooting. But that would require a proper newick standard, or using NExML or similar. |
at this point I am even wondering if the distinction between nodes and branches makes any sense :) |
In the sense that each node has unique edge subtending it and only it ... no. But in the sense that each edge ALSO represents a split/bipartition, the value of which may change during tree restructuring even if the particular node that it subtends does not ... YES! |
A gist: https://gist.github.com/mtholder/8c5a4079d04a0129c652e431575add01 I also changed the branch lengths and support values to make them easier to trace on the tree (not just 0s and 1s). I get:
where Labels-attached-to-unrooted-bipartition-method is how users should move labels around if they represent split support for an unrooted tree. |
Thanks for the clarifications! The distinction between syntactic and semantic interpretation is indeed important. So far, I have (carelessly) used the latter, because it implies the former. But you are right that this should be clearly separated. Sorry for the confusion. Also, defaulting to the syntactic interpretation that the data belongs to the nodes seems reasonable. Remark: by "asking the user" in case of a library, I meant something like offer a flag or the like. As for the question whether the distinction between nodes and edges makes sense (i.e., is important at all), I'd say yes for the same reason as @jeetsukumaran: bipartitions - but also for edge-related data. See the referenced issue #54:
That very much depends on your use case and your internal tree structure. I don't know how DendroPy stores trees, but I guess they are always rooted. Then, there is a one-to-one correspondence. However, and this was one of the points of our preprint, this is not the case for unrooted trees. For those, it is more natural to have actual edge data. Also, speaking from a software modelling point of view, I always found it simpler to separate node and edge data. It makes usage clearer. (Maybe this is biased because the internal structure in my library http://genesis-lib.org/ is always unrooted, so that I need that distinction anyway.) So, to finally come to the (current) issue title: "Can node labels be parsed as support values for tree rooting purposes?". You probably don't want to rewrite all the node/edge data handling in DendroPy (and shouldn't). But what do you think could be done to avoid problems related to the interpretation of those data? |
(1) DendroPy handles both rooted and unrooted trees. When dealing with http://dendropy.org/primer/treemanips.html#rooting-derooting-and-rerooting that the astonishment is, the sole responsibility of the astonished. (2) You are absolutely right for unrooted trees, edges are important.
Yes, but my point is in the case of one-to-one correspondence between
The answer is "yes", in a number of different ways, though not in the (1) The ideal solution, but alas, very unrealistic one, is simply to There better formats out there, but the push to use this is not going to Of course, if the choice of primary format is fragmented, then this is (2) DendroPy also supports a very rich system of metadata annotations. (3) Of course, as I note previously, it is incredibly trivial for a user
Assuming a competent programmer, the above is no real hardship, but it (4) Providing built-in support for reading the node labels as SUPPORT (5) DendroPy currently, lives much more upstream in a typical -- jeet On 5/18/16 8:38 AM, Lucas Czech wrote:
Jeet Sukumaran[email protected]Blog/Personal Pages: http://www.flickr.com/photos/jeetsukumaran/sets/ |
Thanks @jeetsukumaran - IMHO that would be very useful for people analyzing trees returned by Phyml, Fasttree, Raxml, IQTree and many other, which tend to produce bootstrap/support values as node labels (or plain comments) by default. |
Thanks @jeetsukumaran for the detailed answer - I think we are getting to a solution here. To (1): Unfortunately, no one reads manuals... So we can only hope to solve this by explicit APIs. So, here is my summary of this issue, with a possible solution: When working with phylogenetic trees, users often want to annotate metadata. Often, this data belongs to either the nodes or the branches of the tree. Independently of the interpretation of this data (which is up to the programmer), there is a problem with many file formats (particularly Newick) when storing such data. As we (unfortunately) cannot get rid of those formats easily, software still needs to support them - even if it is just for import. And particularly this import is a problem, because of the missing syntactic information whether certain metadata belongs to the nodes or edges of the tree. Thus, in order to aid the user, software should offer a way to distinguish between those two cases (again, independently of how the data is used in the end). For stand-alone applications, a user dialogue when reading such a file is a good way to do this. For libraries, the API should make it very clear where the metadata is going to end up (because no one reads the manual except for looking at function signatures). For example, by offering a flag in the reading function, or by offering different versions of the reading function. My suggestion for DendroPy is thus, to incorporate such a flag (with default to store data in nodes, so that old code doesn't break). Do you agree? |
Imagine this was a molecular sequencing protocol. And a biologist running it complained that they got no DNA because they were using it without reading the manual. Would this be acceptable? I do not see why programmers, biologist or otherwise, are absolved of the responsibility of not reading the manual. The API is explicit. It is explicitly described in the manual.
(1) There is a problem only with the NEWICK (and, by extension, NEXUS) format. (2) The real solution is to move away from the NEWICK format. (3) You discount this solution, and instead decide to invest a lot of effort and worry in seeing how this format can be salvaged. Analogy: keeping financial accounts on wet tissue paper as a medium is fraught with dangers. A better solution is to use dry paper. (4) The solution that you suggest is fragile and only applies in one restricted use case: i.e., where the metadata is ONLY applicable to the edges. What happens if there is a mix of metadata, e.g. some, like support values, applying to edges, but others, like trait states or internal taxon labels, applying to nodes? NEWCIK simply cannot handle that without a lot of contrivance. (5) I also note that the problem you describe is a very small use case in the broad range of use cases that DendroPy deals with: i.e., rooting an unrooted tree with support expressed as internal labels. And it really is NOT a "problem" as such for DendroPy (in the sense that the correct results cannot be obtained). As I describe above, most people using DendroPy to calculate support simply (a) use SumTrees or directly using the underlying classes/methods/functions to calculate support from the raw data; or (b) are programmers who are aware of the limitations of the NEWICK format, and can easily switch the metadata association as shown above. Please see the code that I posted for switching the labels to branches or, even better, the code/results that @mtholder posted, which shows correct support values being recovered even when support values are read off node labels. (6) That last point is worth repeating: DendroPy handles the use case of "rooting a tree with support expressed as labels" correctly, as long as the programer understands the limitations of the NEWICK format and how trees work. It is indeed good to raise awareness of the problem, but I do not agree that we should bend over backwards to change library code make up for programmers who do not due diligence in understanding the API, do not understand NEWICK, or do not understand how trees work. (7) The second to last point, which I raised before but seems to have been missed, is also worth repeating: "rooting a tree with support expressed as labels" is not a primary use case from DendroPy's perspective. I can see how it is important or maybe even a dominant use case from a tree visualization tool perspective, or even a library that is tree visualization centric. But really, really, really: most people use DendroPy to generate trees for visualization tools to consume, and the support values are calculated/analyzed from the raw data.
Yes, I do agree. Thank you for the suggestion. I will do so. I was planning to do so yesterday, but did not have the time. Hopefully I will do so |
Again, I am not criticizing DendroPy here. Just saying that in my opinion it is generally better for a library (any, not just DendroPy) to e.g. name functions and their parameters (flags) so that it is clear what they do. And of course, users are not absolved of the responsibility to read the manual. They just usually don't... Unfortunate as it is, we need to face this. to (1-2): Agreed. to (3): I discounted it because it is not easily done. That would need to come from the whole community, not just one library or one author. We could start some initiative in the future ;-) to (4): There still seems to be confusion. If there is a mix of metadata, then of course, the programmer (user of the library) needs to do some more work. A simple flag ("write metadata to nodes or edges") is then no sufficient solution, but still more flexible then always storing it at nodes. However, if there is really such a case where some of the numbers in inner parts of the tree are intended for nodes and some others for edges, then there needs to be some other mechanism to distinguish that, thus Newick would be a really bad choice in the first place. So, users who would actually want to do that would have a really weird use case and would need to deal with this on their own anyway. In most cases however, the metadata is intended for either nodes or edges, so a flag should do. to (5-7): Agreed, rerooting is a specific use case. However, we only used it as a practical example to show that the association of edges to nodes is fragile whenever the convention "edges are associated with the nodes at their distal end (the end away from the root)" is used. This convention is most prominent in Newick, but I have also seen it in actual tree data structures, so it might also be an internal invariant of programs and libraries. And as such, it might break in cases like rerooting, if not properperly treated (again, not criticizing DendroPy here - just a general caveat). Our paper was intended to raise awareness of this general issue. Maybe we should have stressed that more.
Nice. That was our intention all along - and not to criticize DendroPy or its handling of tree metadata. Glad that we found consensus ;-) |
I have updated the tests. They include now the |
Implemented in |
Hi @jeetsukumaran and @mtholder!
I have recently expanded the analysis by Czech and Stamatakis regarding the incorrect mapping of support values after tree rooting in tree viewers, and found that multiple toolkits (perhaps also dendropy) might be affected.
@lczech, @stamatak and I are now working on reporting/confirming the bug in more frameworks and tree viewers.
Current tests are here:
https://github.com/jhcepas/test_branch_support_after_tree_rerooting
regarding dendropy, I don't know why the rooted topology returned looks as unrooted in the image. Is that normal? In any case, it seems that the support values are not mapped correctly (i.e. the support value of 20 does not correspond to the same partition as in the original tree).
Could you take a look and confirm that this is not a problem in our tests?
tx!
The text was updated successfully, but these errors were encountered: