--func_annot #27

Open
nick-youngblut opened this issue Feb 11, 2021 · 6 comments

@nick-youngblut

nick-youngblut commented Feb 11, 2021

The wiki page on profiling shows the output as:

        sample01 sample04 sample05 sample08
g00001      1       0        1        0
g00002      0       1        1        1
g00003      0       0        0        1
g00003      1       1        1        1

...but I get UniRef90 IDs for each pangenome instead of g[0-9]{5} (panphlan 3.1).

Which version of UniRef90 are the IDs from? I tried using map_eggnog_uniref90.txt.gz from the HUMAnN3 utility mapping file collection (UniRef 201901), and <5% of my panphlan output UniRef IDs overlap with any IDs in the mapping file, suggesting that the panphlan UniRef IDs are from a different (older?) version of UniRef.
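
An overlap check of this kind can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the exact commands used: it assumes the profile is a tab-separated matrix with a header row and the gene-family ID in the first column, and it treats every tab-separated token in the mapping file that starts with `UniRef90_` as an ID. The function names, the example IDs, and the file paths in the comments are hypothetical.

```python
import io


def matrix_row_ids(handle):
    """Return the set of row IDs (first column) of a tab-separated matrix.

    Assumes the first line is a header of sample names.
    """
    next(handle, None)  # skip the header line
    ids = set()
    for line in handle:
        if line.strip():
            ids.add(line.split("\t", 1)[0].strip())
    return ids


def mapping_uniref_ids(handle, prefix="UniRef90_"):
    """Collect every tab-separated token that starts with `prefix`."""
    ids = set()
    for line in handle:
        for token in line.rstrip("\n").split("\t"):
            token = token.strip()
            if token.startswith(prefix):
                ids.add(token)
    return ids


def overlap_fraction(query_ids, reference_ids):
    """Fraction of query IDs that also occur in the reference set."""
    if not query_ids:
        return 0.0
    return len(query_ids & reference_ids) / len(query_ids)


# Tiny in-memory stand-ins for the real files (made-up IDs).
profile = io.StringIO(
    "\tsample01\tsample04\n"
    "UniRef90_AAA\t1\t0\n"
    "UniRef90_BBB\t0\t1\n"
)
mapping = io.StringIO("COG0001\tUniRef90_AAA\tUniRef90_CCC\n")

fraction = overlap_fraction(matrix_row_ids(profile), mapping_uniref_ids(mapping))
print(f"{fraction:.0%} of profile IDs found in the mapping file")

# With real files (hypothetical paths), open the handles instead, e.g.:
#   import gzip
#   with open("profile.tsv") as p, gzip.open("map_eggnog_uniref90.txt.gz", "rt") as m:
#       print(overlap_fraction(matrix_row_ids(p), mapping_uniref_ids(m)))
```

Collecting IDs into sets first keeps the comparison independent of row order and of how many IDs share a line in the mapping file.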

I didn't see anything in the wiki about which (biobakery) files are actually available to use with --func_annot. Can I use the HUMAnN3 utility mapping files?

@leonarDubois
Member

Hello Nick!

The wiki page with g[0-9]{5} only aims to give an example of what the output looks like. Sorry if it's confusing.

Both PanPhlAn 3 and HUMAnN 3 should use the same UniRef90 collection, but HUMAnN covers everything, while the PanPhlAn annotation files provided with each pangenome are species-specific and often contain uncharacterized (or poorly characterized) proteins (details can be found in this preprint).

The aim of --func_annot is simply to add an extra column to the output presence/absence matrix from a user-provided mapping file. It can be the species annotation file provided with the downloaded pangenome or a custom user file.

Hope this helps.
Btw, I advise you to raise this kind of concern on the bioBakery help forum.

@nick-youngblut
Author

> The wiki page with g[0-9]{5} only aims to give an example of what the output looks like. Sorry if it's confusing.

Thanks for the clarification.

> Both PanPhlAn 3 and HUMAnN 3 should use the same UniRef90 collection

That is UniRef 2019-01, correct?

> covers everything, while the PanPhlAn annotation files provided with each pangenome are species-specific and often contain uncharacterized (or poorly characterized) proteins

So then the UniRef IDs in the PanPhlAn output should be a subset of all UniRef IDs. This doesn't really explain the low percentage of IDs that map to the HUMAnN3 mapping files.

> The aim of --func_annot is simply to add an extra column to the output presence/absence matrix from a user-provided mapping file. It can be the species annotation file provided with the downloaded pangenome or a custom user file.

That's good to know. What is the format?

> Btw, I advise you to raise this kind of concern on the bioBakery help forum.

You are right. This is more of a usage/docs question than a bug/issue. Do you also want bug reports on the bioBakery help forum?

@leonarDubois
Member

Yes, both used ChocoPhlAn (our internal pipeline), based on UniRef 2019-01.

It is indeed strange that only a low percentage of the PanPhlAn UniRef90 IDs map. I'll check that when I find the time. Have you tried mapping UniRef50 instead?

--func_annot should be the path to a TSV file mapping UniRef90 IDs to whatever you want. If several columns are available, you can specify the one you want with the --field argument. Basically, it should look like the provided annotation file, panphlan_[species_name]_annot.tsv.
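
To illustrate the idea, here is a rough sketch of how such a mapping file could be read and joined onto the matrix rows. It is not PanPhlAn's actual code: the function names, the way the column index is counted, the absence of a header line in the mapping file, and the made-up IDs and annotations are all assumptions for this example.

```python
import csv
import io


def load_annotation(handle, field=1):
    """Map each ID (column 0) to the value in column `field` of a TSV.

    `field` is the column index with the ID column counted as 0. This is an
    assumption of the sketch; check how --field is interpreted by PanPhlAn.
    """
    annotation = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) > field:  # skips blank or too-short lines
            annotation[row[0]] = row[field]
    return annotation


def annotate_matrix(rows, annotation, missing="NA"):
    """Append the annotation value to each matrix row (a list of strings)."""
    return [row + [annotation.get(row[0], missing)] for row in rows]


# Made-up mapping file: ID, then two annotation columns.
mapping_tsv = io.StringIO(
    "UniRef90_AAA\tlabel_1\thypothetical protein\n"
    "UniRef90_BBB\tlabel_2\ttransporter\n"
)
# Made-up presence/absence rows: ID followed by one value per sample.
matrix = [["UniRef90_AAA", "1", "0"], ["UniRef90_ZZZ", "0", "1"]]

annotated = annotate_matrix(matrix, load_annotation(mapping_tsv, field=2))
for row in annotated:
    print("\t".join(row))
```

IDs that are absent from the mapping file get a placeholder value here, which makes unmapped gene families easy to count afterwards.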

Yes, it would be best to keep bug reports and code-related issues on GitHub and usage/general discussions on the forum, where more people interact and check in. On top of that, the forum will be more convenient when a question concerns several tools at the same time.

@nick-youngblut
Author

> It is indeed strange that only a low percentage of the PanPhlAn UniRef90 IDs map. I'll check that when I find the time. Have you tried mapping UniRef50 instead?

Any updates on this? Were you able to reproduce the low UniRef90 ID mapping rate?

@nick-youngblut
Author

> Have you tried mapping UniRef50 instead?

Where are the docs on using UniRef50 instead of UniRef90? In the wiki, I only see info on using UniRef90.

@leonarDubois
Member

Hello Nick,

Sorry, I've been busy with other projects over the past month and I haven't checked that yet.
I'll let you know when I have some news.
