Support for GTDB taxonomy? #36
Comments
MICOM doesn't really set any requirements for the taxonomy, but you are right that the taxonomy of your data usually needs to match the taxonomy of the model database. I also thought about providing the model databases with different taxonomies but haven't found a good way to map NCBI taxon IDs to GTDB ones. If you know of a way to do so, that would be great. Otherwise, we would have to get all the original genomes from the database and classify them, which would be pretty involved because it is not straightforward to get the genomes for the AGORA models, for instance.
You could use or build on a simple script that I wrote to map the NCBI taxonomy to the GTDB taxonomy: ncbi-gtdb_map.py. It simply uses the metadata provided by the GTDB, which includes NCBI and GTDB taxonomies for each genome. If you need to map at the taxid level, some of the other scripts in that repo might be useful.
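For anyone reading along, the metadata-based idea can be sketched roughly as follows. This is not the ncbi-gtdb_map.py script itself, just a minimal illustration of the approach. It assumes the metadata table is tab-separated and has lineage columns named `ncbi_taxonomy` and `gtdb_taxonomy` (check the header of the release you download), and the taxa in `TOY_METADATA` are made up.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up stand-in for a GTDB metadata table (tab-separated).
# The column names `ncbi_taxonomy` and `gtdb_taxonomy` are assumptions.
TOY_METADATA = "\n".join([
    "accession\tncbi_taxonomy\tgtdb_taxonomy",
    "G1\td__Bacteria;g__GenusA;s__GenusA alpha\td__Bacteria;g__GenusA;s__GenusA alpha",
    "G2\td__Bacteria;g__GenusA;s__GenusA beta\td__Bacteria;g__GenusA_X;s__GenusA_X beta",
    "G3\td__Bacteria;g__GenusB;s__GenusB gamma\td__Bacteria;g__GenusC;s__GenusC gamma",
    "G4\td__Bacteria;g__;s__\td__Bacteria;g__GenusD;s__GenusD delta",
])


def rank_name(lineage, prefix):
    """Return the name at one rank of a 'd__...;p__...;g__...' lineage string."""
    for part in lineage.split(";"):
        part = part.strip()
        if part.startswith(prefix):
            return part[len(prefix):]
    return ""


def ncbi_to_gtdb(handle, prefix="g__"):
    """Count, for each NCBI name at the given rank, the GTDB names its genomes carry."""
    mapping = defaultdict(Counter)
    for row in csv.DictReader(handle, delimiter="\t"):
        ncbi = rank_name(row["ncbi_taxonomy"], prefix)
        gtdb = rank_name(row["gtdb_taxonomy"], prefix)
        if ncbi and gtdb:  # skip genomes with no name at this rank
            mapping[ncbi][gtdb] += 1
    return dict(mapping)


genus_map = ncbi_to_gtdb(io.StringIO(TOY_METADATA))
for ncbi, gtdb_counts in sorted(genus_map.items()):
    print(ncbi, "->", dict(gtdb_counts))
```

With a real metadata file you would pass an open file handle instead of the `io.StringIO` wrapper. Keeping the counts per target makes it easy to see which NCBI names split across several GTDB names.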
Oh cool, will try with that one.
It's a bit embarrassing that it took so long, because I lumped this in with the general revamp of DB construction. But you can now find GTDB databases at https://zenodo.org/record/7739096 . For now I removed taxa where a single species maps to several species/genera in GTDB, but I'm open to better suggestions.
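A rough sketch of what such a filter could look like, for illustration only. It is not the code used to build the Zenodo databases; the function name `unambiguous` and the example names are hypothetical.

```python
from collections import defaultdict


def unambiguous(pairs):
    """Keep only source names that map to exactly one target name.

    `pairs` is an iterable of (ncbi_name, gtdb_name) tuples, one per genome.
    """
    targets = defaultdict(set)
    for ncbi_name, gtdb_name in pairs:
        targets[ncbi_name].add(gtdb_name)
    return {
        ncbi_name: next(iter(names))
        for ncbi_name, names in targets.items()
        if len(names) == 1
    }


# Made-up example: "Species B" splits into two GTDB names and is dropped.
pairs = [
    ("Species A", "Species A"),
    ("Species A", "Species A"),
    ("Species B", "Species B1"),
    ("Species B", "Species B2"),
]
clean = unambiguous(pairs)
print(clean)
```

One softer alternative would be a majority vote over the genomes of each source taxon, at the cost of occasionally assigning a model to the wrong GTDB name.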
Hi @cdiener, just to confirm, the agora201_gtdb207_genus_1.qza file is a genus level aggregation of the agora2 (7000+ strain) model database using GTDB nomenclature?
Yes, that is correct, with the caveat mentioned above that I had to remove taxa that did not cleanly map to GTDB. The release page has links to the manifests of all included genera.
Hi @cdiener, I downloaded the raw sequence (WMGS) data from the micom paper GitHub and ran classification using MetaPhlAn4. I then considered two separate specific cases:
using the
This seems to indicate that the caveat you mentioned is quite strong, because not many bacterial models pass the filter and keep their GTDB names. What do you think is the best way to proceed?
Hi @PathogeNish, hmm, there could be a bunch of things going on. Can you share the MetaPhlAn output table? Also, did you filter unclassified genera before you calculated the coverage? I suspect uSGBs can probably not be matched well. Another possibility is a GTDB version mismatch. Some phyla got renamed recently, so if you match in strict mode that could be an issue.
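To make the coverage question concrete, here is a minimal sketch of computing the abundance fraction matched by the model database with and without unclassified genera. Everything in it is an assumption for illustration: the helper names, the `g__` prefix handling, and the rule that placeholder genera start with `GGB` (adjust to whatever your classifier actually emits).

```python
def strip_prefix(name):
    """Drop a leading 'g__' rank prefix if present."""
    name = name.strip()
    return name[3:] if name.startswith("g__") else name


def is_unclassified(genus):
    # Assumption: empty names, 'unclassified', and 'GGB...' placeholders
    # count as unclassified genera.
    return genus == "" or genus.lower() == "unclassified" or genus.startswith("GGB")


def model_coverage(abundances, model_genera, drop_unclassified=True):
    """Fraction of (optionally classified-only) abundance with a matching model genus."""
    models = {strip_prefix(g) for g in model_genera}
    total = 0.0
    matched = 0.0
    for name, abundance in abundances.items():
        genus = strip_prefix(name)
        if drop_unclassified and is_unclassified(genus):
            continue
        total += abundance
        if genus in models:
            matched += abundance
    return matched / total if total else 0.0


# Made-up sample: one matched genus, one placeholder bin, one unmatched genus.
sample = {"g__GenusA": 0.5, "g__GGB123": 0.25, "g__GenusZ": 0.25}
db_genera = {"GenusA", "GenusB"}
print(model_coverage(sample, db_genera))
print(model_coverage(sample, db_genera, drop_unclassified=False))
```

Comparing the two numbers shows how much of an apparently low coverage comes from unclassified bins rather than from taxa lost in the NCBI to GTDB mapping.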
Is your feature related to a problem? Please describe it.
The Genome Taxonomy Database (GTDB) is comprehensive (especially the new v202 release) and more robust than the NCBI microbial taxonomy, especially given that the GTDB taxonomy is based entirely on genome phylogenetic relatedness.
Although the MICOM docs are vague about the taxonomy that one must use, it appears that the NCBI taxonomy is required.
Describe the solution you would like.
Provide direct support for the GTDB taxonomy.