-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds a method to use a high performance implementation of UniFrac #143
Conversation
This implementation is suggested for large datasets.
Thanks, @ElDeveloper! One additional point, we include support for Generalized UniFrac and Variance Adjusted UniFrac. We also support meta UniFrac but it assumes variadic arguments support (qiime2/qiime2#234); it may be nice to put in a near term work around though. |
Take a look at some of the other QIIME 2 repos for examples of adding test data files -- we have a custom
Citation support is currently limited to free text, so you can add whatever citations you want to the plugin using the |
Thanks for the guidance Jai, I'll put these in place.
…On (Sep-20-17|15:57), Jai Ram Rideout wrote:
> Tests, it would make more sense to add a few files to the repo, rather than having to create those on the fly, but I am not sure if there's a preferred way to do this.
Take a look at some of the other QIIME 2 repos for examples of adding test data files -- we have a custom `unittest.TestCase` subclass that provides a helper for locating test data. I think q2-types is a good example -- let us know if you run into issues. Please try to keep the files small.
> Citation information, it would be nice to include more information about the specific papers that should be cited for each metric. We previously had this when this was its own plugin, see here.
Citation support is currently limited to free text, so you can add whatever citations you want to the plugin using the `citation_text` field (I don't recommend putting citation info in the `description` field like you have in your linked example).
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#143 (comment)
|
@jairideout, I've added the citation text and tests. 👍 This should be contingent on @wasade publishing a conda package, but is otherwise good to review. |
@ElDeveloper, would it be possible to expose |
62a5602
to
e4aa0ef
Compare
0127155
to
5f06c76
Compare
Tests are passing 🎉 |
64dc728
to
80091a3
Compare
Looks like it worked! We've got new environment files to test with 🎉! |
That's excellent!!
On Sep 23, 2017 11:49, "Evan Bolyen" <[email protected]> wrote:
Looks like it worked
<qiime2/environment-files@2e173e3>!
We've got new environment files to test with 🎉!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#143 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAc8sg2QRWYO67iYFhbdcmJF_37LwJcYks5slVKIgaJpZM4PdZF2>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review, this is generally looking really good though! I'm going to pull this down and play with it a bit, but here's a few comments in the meanwhile.
q2_diversity/_beta/_method.py
Outdated
@@ -63,6 +69,27 @@ def beta_phylogenetic(table: biom.Table, phylogeny: skbio.TreeNode, | |||
return results | |||
|
|||
|
|||
def beta_phylogenetic_hpc(table: BIOMV210Format, phylogeny: NewickFormat, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How you guys feel about something like: beta_phylogenetic_alt
, then it may not be too much of a surprise if/when this gets merged into beta_phylogenetic
.
We could also do something with unifrac
like unifrac_variants
or something, but given you are trying to roughly match the signature, it seems like an -alt
suffix might work pretty well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion, changed to alt!
q2_diversity/plugin_setup.py
Outdated
'https://doi.org/10.1186/1471-2105-12-118.' | ||
'Generalized UniFrac: ' | ||
'Chen et al. 2012 Bioinformatics; DOI: ' | ||
'10.1093/bioinformatics/bts342') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have a convention here, but some newline characters might work well to separate these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't see any changes with the new lines I added. ¯_(ツ)_/¯
q2_diversity/plugin_setup.py
Outdated
'weighted_unnormalized_unifrac', | ||
'weighted_normalized_unifrac', | ||
'generalized_unifrac']), | ||
'n_jobs': Int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be Int % Range(1, None)
? What happens with negatives or zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Negative numbers are not supported, so this is great!
q2_diversity/plugin_setup.py
Outdated
'generalized_unifrac']), | ||
'n_jobs': Int, | ||
'variance_adjusted': Bool, | ||
'alpha': Float, 'bypass_tips': Bool}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a newline so we can find bypass_tips
in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what this means, but done.
q2_diversity/plugin_setup.py
Outdated
'metric': 'The beta diversity metric to be computed.', | ||
'n_jobs': 'The number of workers to use.', | ||
'variance_adjusted': ('Perform variance adjustment based on Chang et ' | ||
'al. BMC Bioinformatics 2011'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to briefly describe what that means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added an explanation, let me know if it makes sense, also CCing @wasade to keep him in the loop.
q2_diversity/plugin_setup.py
Outdated
'correspond to approximately 50% of the vertices in ' | ||
'the tree, which is how the compute time reduction is' | ||
' achieved. The use of this parameter is analogous ' | ||
'in concept to switching from 99% OTUs to 97% OTUs.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just say:
In a bifurcating tree, the tips make up about 50% of the nodes in a tree. By ignoring them, specificity can be traded for reduced compute time.
Or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's much more concise, thanks.
q2_diversity/plugin_setup.py
Outdated
'generalized_unifrac']), | ||
'n_jobs': Int, | ||
'variance_adjusted': Bool, | ||
'alpha': Float, 'bypass_tips': Bool}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Float
-> Float % Range(0, 1, inclusive_end=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, I had previously tried Float % Range(0, 1)
but that wasn't working, I'm glad this works.
q2_diversity/plugin_setup.py
Outdated
"2012), Variance Adjusted UniFrac (Chang et al. 2011), " | ||
"as well as Weighted normalized and unnormalized UniFrac " | ||
"(Lozupone et al. 2007), unweighted UniFrac " | ||
"(Lozupone et al. 2005), and metagenomic UniFrac " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think metagenomic UniFrac is exposed at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
q2_diversity/_beta/_method.py
Outdated
elif metric == 'weighted_normalized_unifrac': | ||
f = unifrac.weighted_normalized | ||
elif metric == 'generalized_unifrac': | ||
f = partial(unifrac.generalized, alpha=alpha) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but a dictionary lookup might be cleaner (that's usually what we do).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It made things cleaner, though we needed a bit of special casing though 👍
q2_diversity/_beta/_method.py
Outdated
def beta_phylogenetic_hpc(table: BIOMV210Format, phylogeny: NewickFormat, | ||
metric: str, n_jobs: int=1, | ||
variance_adjusted: bool=False, | ||
alpha=1.0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong recommendation, but it might be nice to error if alpha
is passed when the metric isn't generalized UniFrac. That would require you set the default to something like None
though, which definitely obfuscates things a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed your suggestion here.
@@ -63,6 +76,31 @@ def beta_phylogenetic(table: biom.Table, phylogeny: skbio.TreeNode, | |||
return results | |||
|
|||
|
|||
def beta_phylogenetic_alt(table: BIOMV210Format, phylogeny: NewickFormat, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to politely catch and err if the table
has < 4 samples? That's an edge case we haven't handled yet behind the scenes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure that this is the reason why we've observed failures, for all the unit tests we compute distances on a 2x2 sample matrix, and those tests pass.
) | ||
|
||
plugin.methods.register_function( | ||
function=q2_diversity.beta_phylogenetic, | ||
inputs={'table': FeatureTable[Frequency], | ||
'phylogeny': Phylogeny[Rooted]}, | ||
parameters={'metric': Str % Choices(beta.phylogenetic_metrics()), | ||
'n_jobs': Int}, | ||
'n_jobs': Int % Range(1, None)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is minor, and ultimately whatever fits better into the surrounding terminology is preferred, but it is more correct to describe this as threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jobs is probably better for now, since it sounds like this is actually CPU threads, and not a process thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's process threads (i.e., pthread
) that we bind to cores to reduce thread hopping and to leverage NUMA's default local allocation policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I wrote this, I knew it was misleading, but carried on since it matches beta_phylogenetic. Would probably be nice to have a standard designation for "workers" all throughout QIIME.
q2_diversity/plugin_setup.py
Outdated
'n_jobs': 'The number of workers to use.', | ||
'variance_adjusted': ('Perform variance adjustment based on Chang et ' | ||
'al. BMC Bioinformatics 2011. Weights distances ' | ||
'by the number of tips evaluated for each ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It weights based on the proportion of the relative abundance represented between the samples at a given node under evaluation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
q2_diversity/plugin_setup.py
Outdated
' proportions are dichotomized.'), | ||
'bypass_tips': ('In a bifurcating tree, the tips make up about 50% of ' | ||
'the nodes in a tree. By ignoring them, specificity ' | ||
'can be traded for reduced compute time.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps add: "time. This has the effect of collapsing the phylogeny, and is analogous in concept to moving from 99% to 97% OTUs"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two more comments
q2_diversity/_beta/_method.py
Outdated
|
||
if alpha is not None and metric != 'generalized_unifrac': | ||
raise ValueError('The alpha parameter is only allowed when the choice' | ||
' of metric is generalized_unifrac') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "opposite" of this causes unifrac
to raise an error, but the user is too far removed to be able to pinpoint the cause.
So we should add a branch to either set alpha=1
when alpha == None
or to explain that alpha
needs to be set. Either should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch!!! 👏
f = metrics[metric] | ||
|
||
# unifrac processes tables and trees should be filenames | ||
return f(str(table), str(phylogeny), threads=n_jobs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when n_jobs
is greater than logical cores, we get a segfault (when we're lucky). We should just guard that value to be at or less than cores present on the host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
psutil
to the rescure 💪
ci/recipe/meta.yaml
Outdated
@@ -32,6 +32,7 @@ requirements: | |||
- ipywidgets | |||
- scikit-bio | |||
- scikit-learn | |||
- psutilt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra t
Thanks for fixing @ebolyen! |
Woooo!!! Thank you guys!
On Sep 26, 2017 3:39 PM, "Evan Bolyen" <[email protected]> wrote:
Merged #143 <#143>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#143 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAc8shUpc72VISL-B-_gPmRj1HTELnY9ks5smX0JgaJpZM4PdZF2>
.
|
This PR adds the necessary code to use the implementation of UniFrac in https://github.com/biocore/unifrac . This method is recommended for large datasets.
As a comparison, for a table with ~120 samples and a tree with 20K tips running time is: 2 minutes and 27 seconds with the current implementation, and 2 seconds with the new implementation. @wasade has been benchmarking this more thoroughly but, thought this served as a nice and small comparison.
As discussed at the developer's meeting, this method does not replace beta_phylogenetic, as it is still fairly new, instead it is provided as an alternative (beta_phylogenetic_hpc) that users can test out.
Currently, there's a few things missing in this PR: