add_taxonomy_from_gtdb-tk.py - help! #23

Open
aksha19n opened this issue May 16, 2024 · 6 comments
Comments

@aksha19n

I am trying to run this script but it keeps returning with this
"The genomes.instruction file has been updated with 0 genome(s) taxonomy indications, using '.fasta' extension"
Could you please tell me if there is anything that I can do to fix it ?

@Matteopaluh
Owner

Hello!

To reply properly I'd need a little more information, such as:

  • Which exact command line did you use?
  • What is the format of your input file(s)? That is, did you use a compatible version of GTDB-Tk and/or the GTDB database?

Best,
Matteo

@aksha19n
Author

Hi Matteo,

I installed KEMET on a UNIX system through conda and ran the script add_taxonomy_from_gtdb-tk.py.
I ran my genomes through the classify microbes with GTDB-Tk-v2 3.2 workflow available on KBase. I then used its output files to run the gtdb to ncbi majority vote script, which gave me a .tsv file containing the ID, the GTDB classification and the NCBI classification. Before running the add taxonomy script, I made sure that the sample/ID names are the same in the .tsv file and in the genomes.instruction file.

Hope this helps

Thank you!

@Matteopaluh
Owner

Matteopaluh commented May 16, 2024

Thanks for the extra details!

I've only tested the script with input obtained from the gtdb-tk command line, so a difference could arise from that.
The same goes for the gtdb-to-ncbi script, which depends on a specific version of the GTDB database. So far the add_taxonomy_from_gtdb-tk.py script has worked with the 2022 "GTDB R07-RS207" release and the 2022 NCBI taxonomy.

I can't rule out that major changes in taxonomy have happened since then (I remember some changes regarding Firmicutes to Bacillota, maybe?). This would require fixing the correspondence from NCBI to KEGG BRITE taxonomy.

Otherwise, my suspicion falls on the file extension of your genome/MAG files (.fasta, .fa or .fna), since the script in question requires it and it is specified through the -f argument when running it.

Best regards,
Matteo
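The extension hypothesis can be tested outside the script by comparing the identifiers on both sides once the extension is removed. The following is a minimal sketch and not part of KEMET: `strip_extension` and `unmatched_ids` are hypothetical helpers, and it assumes the genome names have already been read from genomes.instruction and from the .tsv file into plain Python lists.

```python
def strip_extension(name, extension=".fasta"):
    """Return name without the trailing extension, if it is present."""
    name = name.strip()
    if extension and name.endswith(extension):
        return name[: -len(extension)]
    return name


def unmatched_ids(instruction_names, gtdb_names, extension=".fasta"):
    """Return the instruction-file names with no counterpart in the GTDB table.

    Both sides are compared after removing `extension` (hypothetical helper,
    not the logic used by add_taxonomy_from_gtdb-tk.py).
    """
    gtdb_ids = {strip_extension(n, extension) for n in gtdb_names}
    return [n for n in instruction_names
            if strip_extension(n, extension) not in gtdb_ids]


if __name__ == "__main__":
    # Made-up names, for illustration only.
    instruction = ["bin_1.fna", "bin_2.fna"]
    gtdb = ["bin_1", "bin_2"]
    print(unmatched_ids(instruction, gtdb, extension=".fasta"))
    print(unmatched_ids(instruction, gtdb, extension=".fna"))
```

If every name comes back as unmatched with one extension and none with another, the extension given on the command line is the likely culprit. Note that the message quoted in the first post reports the '.fasta' extension.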

@aksha19n
Author

Hi Matteo,

Thank you!
The file extensions and names match in the genomes.instruction file and the output file from GTDB. I downloaded the metadata files for r207, ran the gtdb to ncbi script, and used its output file to run add_taxonomy, and it worked. However, when I ran kemet.py I ran into this error:
  File "kemet.py", line 781, in taxonomy_filter
    for line in v[i_start+1:]:
UnboundLocalError: local variable 'i_start' referenced before assignment

Could you kindly guide me with this error?

@Matteopaluh
Owner

> Hi Matteo,
> Thank you! The file extensions and names match in the genomes.instruction file and the output file from GTDB. I downloaded the metadata files for r207 and ran the gtdb to ncbi script and used the output file from that to run the add_taxonomy and it worked.

Nice to know! Could you describe precisely what you did?
This could serve as a temporary fix until I modify a few things 🙃

I've now seen that KEGG BRITE was updated to reflect the changes in the NCBI taxonomy, as expected, so it will take a couple of checks to bring the add_taxonomy script up to date.

> However, when i ran the kemet.py code i ran into an error
> File "kemet.py", line 781, in taxonomy_filter
> for line in v[i_start+1:]:
> UnboundLocalError: local variable 'i_start' referenced before assignment
>
> Could you kindly guide me with this error?

Do you have the KEGG BRITE file br08601.keg in your working folder? It should be downloaded automatically when the working folder is set up via the set_kemet_working-directory.py script.

If it is missing, it needs to be placed there.
If it is present, I'll need to check whether that file is still formatted the way it was in 2022.

Best regards,
Matteo
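For readers who hit the same traceback: in general, Python raises UnboundLocalError when a local variable is only assigned inside a branch that never runs, for example when a loop looks for a marker line that is absent from the input. The sketch below is a generic illustration of that pattern and of a more readable failure mode. It is not the code of kemet.py; `find_start`, `lines_after` and `brite_file_present` are hypothetical names, and whether a missing or reformatted br08601.keg is the actual cause here is an assumption still to be confirmed.

```python
import os


def brite_file_present(folder, filename="br08601.keg"):
    """Return True if the KEGG BRITE file exists in the given folder."""
    return os.path.isfile(os.path.join(folder, filename))


def find_start(lines, marker):
    """Return the index of the first line containing marker, or None."""
    for i, line in enumerate(lines):
        if marker in line:
            return i
    return None


def lines_after(lines, marker):
    """Return the lines that follow the first line containing marker.

    The index is checked explicitly, so a missing marker produces a
    readable ValueError instead of an UnboundLocalError.
    """
    i_start = find_start(lines, marker)
    if i_start is None:
        raise ValueError(f"marker {marker!r} not found in input")
    return lines[i_start + 1:]


if __name__ == "__main__":
    # Made-up content, for illustration only.
    demo = ["header", "TAXON_X", "entry 1", "entry 2"]
    print(lines_after(demo, "TAXON_X"))
    print(brite_file_present("."))
```

Checking for the file first separates the "file is missing" case from the "file is present but the expected line was not found" case, which are the two possibilities raised in the reply above.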

@aljazdzy

aljazdzy commented Aug 7, 2024

I'm also having a very similar issue. For some reason the script won't match the genome names in genomes.instruction to the gtdb to ncbi output, even though the names are identical apart from the file extension (.fna).
I've spent a while trying to debug the script, but I can't find a solution or any reason for the problem. It just runs, and if I hadn't added some print statements reporting whether each name matched, all I would see is that nothing further gets added to the genomes.instruction file. It's clear that the names are not being matched correctly even though they are identical. I can't figure it out.
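When two names look identical on screen but still fail to compare equal, invisible differences such as a trailing carriage return, a tab, or a change in letter case are worth ruling out. The following is a minimal sketch under that assumption: `describe_mismatch` is a hypothetical helper, not part of KEMET, and it only covers whitespace and case differences.

```python
def describe_mismatch(a, b):
    """Explain why two names that look alike compare as unequal.

    repr() is used in the messages so that characters such as '\r'
    or '\t' become visible.
    """
    if a == b:
        return "identical"
    if a.strip() == b.strip():
        return f"differ only in surrounding whitespace: {a!r} vs {b!r}"
    if a.strip().casefold() == b.strip().casefold():
        return f"differ only in letter case: {a!r} vs {b!r}"
    return f"different content: {a!r} vs {b!r}"


if __name__ == "__main__":
    # Made-up names, for illustration only.
    print(describe_mismatch("bin_1", "bin_1\r"))
    print(describe_mismatch("bin_1", "Bin_1"))
    print(describe_mismatch("bin_1", "bin_1.fna"))
```

Printing names with repr() in the existing debug statements gives the same visibility without any extra helper.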
