Get gtdbtk to use dfast proteins as input. #142

LeeBergstrand · 2024-02-22T22:06:08Z

gtdbtk has a --genes parameter that lets it use the output of an external gene prediction pipeline as input instead of running Prodigal itself.

This parameter causes gtdbtk to take proteins as input (Ecogenomics/GTDBTk#571).

I'm wondering whether the speed-up from skipping Prodigal inside the GTDB-Tk rule is worth it, because --genes also skips the ANI and Mash steps and immediately starts placing the identified markers into the GTDB trees with pplacer. On my machine, pplacer runs out of memory. With my dataset, the pipeline skips the pplacer step most of the time: if the ANI screen finds a close match (i.e., the organism is already in the tree), I think it skips the marker gene placement step, which speeds up the pipeline.

@jmtsuji Would using --genes be useful for you as an optional approach?


LeeBergstrand · 2024-02-22T22:13:29Z

To do the conversion, I had to change the batch file to point at the dfast proteins file rather than the genome file and add the --genes flag.
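For reference, the change described above could be sketched roughly as follows. This is only an illustration, not the actual rotary rule: the helper names, the `annotation/<genome>/protein.faa` path layout, and the output directory are hypothetical, and the two-column, tab-separated batch file layout (FASTA path, then genome ID) is my understanding of what GTDB-Tk expects. The command is only built, not run, and depending on the GTDB-Tk version other options may also be required.

```python
from pathlib import Path


def write_protein_batchfile(genomes, batchfile):
    """Write a tab-separated batch file with one line per genome.

    `genomes` maps a genome ID to the path of its protein FASTA file
    (hypothetical layout; assumed here to be the DFAST protein output).
    Each line is: <FASTA path><TAB><genome ID>.
    """
    lines = [f"{faa}\t{genome_id}" for genome_id, faa in genomes.items()]
    batchfile = Path(batchfile)
    batchfile.write_text("\n".join(lines) + "\n")
    return batchfile


def build_classify_command(batchfile, out_dir, cpus=1):
    """Build (but do not run) a classify_wf command that takes proteins as input."""
    return [
        "gtdbtk", "classify_wf",
        "--batchfile", str(batchfile),
        "--out_dir", str(out_dir),
        "--cpus", str(cpus),
        "--genes",  # treat the input files as predicted proteins; Prodigal is skipped
    ]


if __name__ == "__main__":
    batch = write_protein_batchfile(
        {"genome_a": "annotation/genome_a/protein.faa"},  # hypothetical path
        "gtdbtk_batchfile.tsv",
    )
    print(" ".join(build_classify_command(batch, "gtdbtk_out", cpus=4)))
```

The only differences from the nucleotide case are the file each batch file line points at and the extra --genes flag.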

I also got the following warning:

The final classification predicted may be less accurate due to the use of amino acid files instead of nucleotide files as input to the pipeline. Without nucleotides files, the ANI classification step of the workflow has been skipped and therefore no ANI matches with existing species in GTDB could be reported.

jmtsuji · 2024-02-23T14:16:59Z

@LeeBergstrand Thanks for this idea and the extra context! Just to confirm, are the key reasons for exposing the --genes flag to provide an annotation speedup (by skipping Prodigal) and to force execution of pplacer (skipping the ANI search, if a user wants to skip this, e.g., for running tests)?

On my end, aside from those possible benefits, I can see the following possible disadvantages:

Because GTDB-Tk was designed with Prodigal-based annotations in mind, I wonder if providing annotations from other tools like DFAST might affect its benchmarked accuracy a little. (For example, the GTDB-Tk team did not seem to want to switch from Prodigal to Pyrodigal without some testing; see "Use Pyrodigal as Prodigal alternative", Ecogenomics/GTDBTk#456.)
Am I correct that setting --genes skips the ANI step entirely? If so, I wonder if using --genes might ultimately cost more time (by forcing pplacer) compared to just re-annotating the genome with Prodigal, for average use cases.

Weighing these advantages and disadvantages, I wonder if most users would not need to use --genes. What do you think? Are there other advantages you foresee? Or do you think this setting might be useful for end-to-end tests and benchmarking of rotary? Thanks!

LeeBergstrand · 2024-04-22T19:32:29Z

> On my end, aside from those possible benefits, I can see the following possible disadvantages:

I think, given the issues you brought up and the fact that Prodigal runs fast enough, most of the advantages of using --genes would be offset by the slowdown from not doing the ANI search. We can explore this later. Closing for now.

LeeBergstrand added the enhancement (New feature or request) and long_term labels on Feb 22, 2024

LeeBergstrand closed this as completed Apr 22, 2024
