Get gtdbtk to use dfast proteins as input. #142

Closed
LeeBergstrand opened this issue Feb 22, 2024 · 3 comments
Labels
enhancement (New feature or request), long_term

Comments

@LeeBergstrand
Collaborator

LeeBergstrand commented Feb 22, 2024

gtdbtk has a --genes parameter that lets it take the output of other gene prediction pipelines as input, rather than running prodigal itself.

This parameter causes gtdbtk to take proteins as input (Ecogenomics/GTDBTk#571).

I'm wondering whether the speed-up from skipping prodigal inside the GTDBtk rule is worth it, because with --genes gtdbtk skips the ANI and mash steps and immediately starts placing the found markers in the GTDB trees using pplacer. On my machine, pplacer runs out of memory. With my dataset, the pipeline skips the pplacer step most of the time: if the ANI screen finds a close match (i.e., the organism is already in the tree), I think it skips the marker gene insertion step, which speeds up the pipeline.
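To make the trade-off concrete, here is a toy model of the step selection as it is described in this comment. It is only a sketch of the behaviour reported here, not GTDB-Tk's actual code; the names `RunPlan` and `plan_classification` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunPlan:
    """Which steps run for one genome (hypothetical structure, for illustration)."""
    uses_external_proteins: bool
    runs_ani_screen: bool
    runs_pplacer: bool


def plan_classification(use_genes: bool, ani_match_found: bool = False) -> RunPlan:
    """Model the behaviour described in this thread; not taken from GTDB-Tk."""
    if use_genes:
        # Protein input: the ANI/mash screen is skipped, so every genome
        # goes straight to marker placement with pplacer.
        return RunPlan(uses_external_proteins=True,
                       runs_ani_screen=False,
                       runs_pplacer=True)
    # Nucleotide input: the ANI screen runs first, and pplacer is assumed
    # to be needed only when no close match is found.
    return RunPlan(uses_external_proteins=False,
                   runs_ani_screen=True,
                   runs_pplacer=not ani_match_found)
```

Under this model, a dataset where most genomes have a close ANI match rarely reaches pplacer without --genes, but always reaches it with --genes.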

@jmtsuji Would using --genes be useful for you as an optional approach?

LeeBergstrand added the enhancement and long_term labels on Feb 22, 2024
@LeeBergstrand
Collaborator Author

To do the conversion, I had to change the batch file to point at the dfast proteins file rather than the genome file and add the --genes flag.

I also got the following warning:

The final classification predicted may be less accurate due to the use of amino acid files instead of nucleotide files as input to the pipeline. Without nucleotides files, the ANI classification step of the workflow has been skipped and therefore no ANI matches with existing species in GTDB could be reported. 
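The conversion described above could be sketched roughly as follows. This assumes the two-column, tab-separated batch file layout (FASTA path, then genome ID) and the `classify_wf`, `--batchfile`, `--out_dir`, and `--genes` options of gtdbtk; the helper names and the example paths are hypothetical, and other options that a given gtdbtk version may require are left out.

```python
from pathlib import Path


def write_batchfile(entries, batchfile):
    """Write a tab-separated batch file: one 'FASTA path<TAB>genome ID' per line."""
    lines = [f"{fasta}\t{genome_id}" for fasta, genome_id in entries]
    Path(batchfile).write_text("\n".join(lines) + "\n")


def build_classify_command(batchfile, out_dir, use_genes=False):
    """Build the gtdbtk command line as a list; --genes is added only on request."""
    cmd = ["gtdbtk", "classify_wf",
           "--batchfile", str(batchfile),
           "--out_dir", str(out_dir)]
    if use_genes:
        # The batch file must then point at protein (amino acid) FASTA files.
        cmd.append("--genes")
    return cmd


if __name__ == "__main__":
    # Hypothetical paths: a dfast protein file instead of the genome file.
    print(build_classify_command("batch.tsv", "gtdbtk_out", use_genes=True))
```

Keeping `use_genes` as an argument that defaults to off would make the protein input an optional approach rather than the default.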

@jmtsuji
Collaborator

jmtsuji commented Feb 23, 2024

@LeeBergstrand Thanks for this idea and the extra context! Just to confirm, are the key reasons for exposing the --genes flag to provide an annotation speedup (by skipping Prodigal) and to force execution of pplacer (skipping the ANI search, if a user wants to skip this, e.g., for running tests)?

On my end, aside from those possible benefits, I can see the following possible disadvantages:

  • Because GTDB-Tk was designed with Prodigal-based annotations in mind, I wonder if providing annotations from other tools like DFAST might affect benchmarks a little bit. (For example, it seems like the GTDB-Tk team did not want to switch from Prodigal to Pyrodigal without some testing; see "Use Pyrodigal as Prodigal alternative", Ecogenomics/GTDBTk#456.)
  • Am I correct that setting --genes skips the ANI step entirely? If so, I wonder if using --genes might ultimately cost more time (by forcing pplacer) compared to just re-annotating the genome with Prodigal, for average use cases.

Weighing these advantages and disadvantages, I wonder if most users would not need to use --genes. What do you think? Are there other advantages you foresee? Or do you think this setting might be useful for end-to-end tests and benchmarking of rotary? Thanks!

@LeeBergstrand
Collaborator Author

> On my end, aside from those possible benefits, I can see the following possible disadvantages:

I think, given the issues you brought up and the fact that Prodigal runs fast enough, most of the advantages of running with --genes would be offset by the slowdown from skipping the ANI search. We can explore this later. Closing for now.
