-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
⬆️ 🎨 Allow the use of gtdb taxonomy in Autometa #284
Conversation
🎨 Decouple taxonomy functions
Replace dbdir with dbpath in vote.py line 103
|
One thing I noticed while testing the scripts: |
🎨🐍 Add ability to download and format GTDB databases 🎨📝 Add documentation for using GTDB databases 🎨🐚🐛 Update entrypoints in bash workflows to use dbdir and dbtype 🎨🐍 Add gtdb section to default.config
Add databases.rst
Codecov ReportBase: 27.40% // Head: 27.35% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## dev #284 +/- ##
==========================================
- Coverage 27.40% 27.35% -0.06%
==========================================
Files 48 50 +2
Lines 5469 5733 +264
==========================================
+ Hits 1499 1568 +69
- Misses 3970 4165 +195
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
🐛 Replace NCBI init kwarg dirpath to dbdir ✅ Update test_vote.py main and assign tests with main logic and updated NCBI call ✅ Update NCBI fixture in conftest.py 🔥 Remove unused types in NCBI.py
@@ -26,7 +26,7 @@ jobs: | |||
strategy: | |||
matrix: | |||
os: [ubuntu-latest] | |||
python-version: [3.7] | |||
python-version: [3.8] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the python version for tests being changed here? Is this necessary for gtdb_to_taxdump
installation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was giving an error with 3.7
, link. from typing import Union, List, Literal
is only supported in versions >=3.8
🎨🔥 Remove logger comment in taxonomy/database.py 🎨 Add abstractmethod decorator to mandatory methods in TaxonomyDatabase.py 📝 Update documentation for working copy/paste for database configuration 📝 Update documentation for working copy/paste when following bash workflow
…oad proteins_aa_reps
autometa-env.yml
Outdated
- pip: | ||
- gtdb_to_taxdump # https://github.com/nick-youngblut/gtdb_to_taxdump |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be replaced by a conda distribution of gtdb_to_taxdump
(will need to be made) to follow the bioconda recipe guidelines . This will help ensure autometa remains up-to-date and compatible with other bioinformatic tools. By the way, this tool can be submitted to bioconda on the behalf of the author. You should be able to submit a gtdb_to_taxdump
bioconda recipe by following the bioconda contributing docs.
- pip: | |
- gtdb_to_taxdump # https://github.com/nick-youngblut/gtdb_to_taxdump | |
- gtdb_to_taxdump>=0.1.8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Database download and formatting appears to be working. Just need gtdb_to_taxdump
to be available on bioconda then we can update the Autometa recipe and get this merged in
With gtdb_to_taxdump
(v 0.1.8):
Install this branch (and environment)
# In root of Autometa repo
# update env
mamba env update -n autometa -f=autometa-env.yml
# install autometa comands
make clean && make install
Set config to point to where we want formatted GTDB database files
autometa-config \
--section databases \
--option gtdb \
--value $HOME/autometa-testing/databases/gtdb
Download/format GTDB databases to configured path
NOTE:
$HOME
is not actually emitted. I've replaced the actual path with$HOME
for security reasons.
(autometa) evan@userserver:~/Autometa$ autometa-update-databases --update-gtdb
[08/04/2022 05:10:33 PM DEBUG] autometa.taxonomy.gtdb: Running gtdb_to_taxdump.py: gtdb_to_taxdump.py https://data.gtdb.ecogenomic.org/releases/latest/ar53_taxonomy.tsv.gz https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz --outdir $HOME/autometa-testing/databases/gtdb
2022-08-04 17:10:34,175 - Loading: https://data.gtdb.ecogenomic.org/releases/latest/ar53_taxonomy.tsv.gz
2022-08-04 17:10:35,548 - Loading: https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz
2022-08-04 17:10:44,434 - File written: $HOME/autometa-testing/databases/gtdb/names.dmp
2022-08-04 17:10:44,434 - File written: $HOME/autometa-testing/databases/gtdb/nodes.dmp
2022-08-04 17:10:45,451 - File written: $HOME/autometa-testing/databases/gtdb/delnodes.dmp
2022-08-04 17:10:45,452 - File written: $HOME/autometa-testing/databases/gtdb/merged.dmp
[08/04/2022 05:10:45 PM DEBUG] autometa.config.databases: starting GTDB databases download
[08/04/2022 05:10:45 PM DEBUG] autometa.config.databases: wget https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_reps/gtdb_proteins_aa_reps.tar.gz -O $HOME/autometa-testing/databases/gtdb/gtdb_proteins_aa_reps.tar.gz
[08/04/2022 06:11:46 PM DEBUG] autometa.taxonomy.gtdb: Running gtdb_to_diamond.py: gtdb_to_diamond.py $HOME/autometa-testing/databases/gtdb/gtdb_proteins_aa_reps.tar.gz $HOME/autometa-testing/databases/gtdb/names.dmp $HOME/autometa-testing/databases/gtdb/nodes.dmp --outdir $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond
2022-08-04 18:11:46,417 - Read nodes.dmp file: $HOME/autometa-testing/databases/gtdb/nodes.dmp
2022-08-04 18:11:46,500 - File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/nodes.dmp
2022-08-04 18:11:46,500 - Reading dumpfile: $HOME/autometa-testing/databases/gtdb/names.dmp
2022-08-04 18:11:47,561 - File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/names.dmp
2022-08-04 18:11:47,561 - No. of accession<=>taxID pairs: 401808
2022-08-04 18:11:47,561 - Extracting tarball: $HOME/autometa-testing/databases/gtdb/gtdb_proteins_aa_reps.tar.gz
2022-08-04 18:11:47,561 - Extracting to: gtdb_to_diamond_TMP
2022-08-04 18:24:08,234 - No. of .faa(.gz) files: 65703
2022-08-04 18:24:08,256 - Creating accession2taxid table...
2022-08-04 18:24:08,376 - File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/accession2taxid.tsv
2022-08-04 18:24:08,376 - Formating & merging faa files...
2022-08-04 18:44:45,876 - File written: $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/gtdb_all.faa
2022-08-04 18:44:45,876 - No. of seqs. written: 200505361
2022-08-04 18:44:51,908 - Temp-dir removed: gtdb_to_diamond_TMP
[08/04/2022 06:44:51 PM DEBUG] autometa.common.external.diamond: diamond makedb --in $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/gtdb_all.faa --db $HOME/autometa-testing/databases/gtdb/gtdb_to_diamond/gtdb_all.dmnd -p 96
(autometa) evan@userserver:~/Autometa$
Looking at submitting a bioconda recipe for gtdb_to_taxdump and in the README it suggests using TaxonKit for retrieving stable taxids across GTDB releases (rather than arbitrarily assigning them as noted). Below is a screenshot from the gtdb_to_taxdump summary section: Next Steps Towards StabilityTaxonKit is available via bioconda so would be easy to include in TaxonKit conda page: https://anaconda.org/bioconda/taxonkit NOTE: TaxonKit v0.12 or greater should be used. (ref) The taxdump files may also be downloaded directly from the releases page: https://github.com/shenwei356/gtdb-taxdump/releases which reduces the compute requirements. That being said, if we would like to generate the GTDB taxdump files using this, the steps are outlined here: https://github.com/shenwei356/gtdb-taxdump#steps Looking in to the future, this is probably the more appropriate route as taxids may be relied up across future GTDB releases.
|
Not sure if this affects what y'all are doing: |
Only |
Thanks @shenwei356 |
@WiscEvan The scripts have been updated to use taxonkit's already generated files. I decided to use Setting up the database
Run LCA
Run Majority vote
Run taxonomy
Run summary
By using this we don't need to add any additional dependency as well. Let me know what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It does not appear that you are using
delnodes.dmp
ingtdb.py
other than reading the file. - It does not appear that you are using
merged.dmp
ingtdb.py
other than reading the file. - Tests are currently failing and should be double-checked
There are some other comments attached. I'll review more after these are addressed. 👍 🚧 👷
autometa/taxonomy/gtdb.py
Outdated
for dirpath, dirs, files in os.walk(reps_faa): | ||
for file in files: | ||
if file.endswith("_protein.faa"): | ||
faa_file = os.path.join(dirpath, file) | ||
# Regex to get the genome accession from the file path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than walking with multiple for loops, why not just glob _protein.faa
?
genome_protein_faa_filepaths = glob.glob(os.path.join(reps_faa, '*_protein.faa'), recursive=True)
faa_index = {
re.search(r"GCA_\d+.\d?|GCF_\d+.\d?", fpath).group() : fpath
for fpath in glob.glob(os.path.join(reps_faa, '*_protein.faa'), recursive=True)
}
from typing import Dict
genome_protein_faa_filepaths = glob.glob(os.path.join(reps_faa, '*_protein.faa'), recursive=True)
faa_index: Dict[str, str] = {}
for genome_protein_faa_filepath in genome_protein_faa_filepaths:
genome_acc_search = re.search(r"GCA_\d+.\d?|GCF_\d+.\d?", genome_protein_faa_filepath)
if not genome_acc_search:
continue
genome_accession = genome_acc_search.group()
faa_index[genome_accession] = genome_protein_faa_filepath
Co-authored-by: Evan Rees <[email protected]>
🐛📝 Change densne choice to densmap 🎨🔥 Remove uncecessary if/then logic for incorporation of gtdb taxonomy sub-workflow TODO: Update or revert changes to autometa-large-data-mode.sh
🔥🎨🐚 Rename cache name to use $simple_name and $dbtype rather than $prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not sure why you are running in to an error with the autometa-large-data-mode.sh
workflow. From inspecting your output directory, it looks like only 11 contigs were classified as bacteria, so maybe this is causing the issue? Large data mode was not intended for 11 contigs
cache="${outdir}/${simpleName}_${kingdom}_cache" | ||
output_binning="${outdir}/${simpleName}.${kingdom}.${clustering_method}.tsv" | ||
output_main="${outdir}/${simpleName}.${kingdom}.${clustering_method}.main.tsv" | ||
cache="${outdir}/${simple_name}_${kingdom}_cache" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was not a typo, this should read out outdir/sample_name_gtdb_bacteria_cache
or outdir/sample_name_ncbi_bacteria_cache
cache="${outdir}/${simple_name}_${kingdom}_cache" | |
cache="${outdir}/${simple_name}_${dbtype}_${kingdom}_cache" |
echo "Running GTDB taxon assignment step." | ||
|
||
# output | ||
gtdb_orfs="${outdir}/${prefix}.orfs.faa" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think gtdb_orfs
is misleading? These are bacterial and archael ORFs identified against NCBI's nr database. So maybe gtdb_input_orfs
or something else that would be more appropriate? What do you think?
gtdb_orfs="${outdir}/${prefix}.orfs.faa" | |
gtdb_input_orfs="${outdir}/${prefix}.orfs.faa" |
workflows/autometa.sh
Outdated
# $blast --> Generated from step 5.2 | ||
# $gtdb --> User Input | ||
# $dbtype --> Updated according to $taxa_routine | ||
dbtype="gtdb" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dbtype="gtdb"
is also on line 247. Is this necessary, or can we delete this?
workflows/autometa.sh
Outdated
# e.g. ${outdir}/${prefix}.gtdb.bacteria.fna | ||
# e.g. ${outdir}/${prefix}.gtdb.archaea.fna | ||
# e.g. ${outdir}/${prefix}.gtdb.taxonomy.tsv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gtdb
, i.e. $dbtype
is included in $prefix
# e.g. ${outdir}/${prefix}.gtdb.bacteria.fna | |
# e.g. ${outdir}/${prefix}.gtdb.archaea.fna | |
# e.g. ${outdir}/${prefix}.gtdb.taxonomy.tsv | |
# e.g. ${outdir}/${prefix}.bacteria.fna | |
# e.g. ${outdir}/${prefix}.archaea.fna | |
# e.g. ${outdir}/${prefix}.taxonomy.tsv |
workflows/autometa.sh
Outdated
|
||
kingdom_fasta="${outdir}/${prefix}.${kingdom}.fna" | ||
|
||
contig_ids="${outdir}/${prefix}.${kingdom}.contigIDs.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This writes sequence headers with the _
suffix. This should be more appropriately named {...}.orfPrefixes.txt
and the variable name changed to orf_id_prefixes
contig_ids="${outdir}/${prefix}.${kingdom}.contigIDs.txt" | |
orf_prefixes="${outdir}/${prefix}.${kingdom}.orfPrefixes.txt" |
NOTE: contig_ids
will need to be changed to orf_prefixes
where appropriate as well
workflows/autometa.sh
Outdated
sed 's/$/_/' | \ | ||
cut -f1 -d" " > $contig_ids | ||
# Retrieve ORF IDs from contig IDs | ||
grep -f $contig_ids $orfs | sed 's/^>//' | cut -f1 -d" " > $orf_ids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: See comment above
grep -f $contig_ids $orfs | sed 's/^>//' | cut -f1 -d" " > $orf_ids | |
grep -f $orf_prefixes $orfs | sed 's/^>//' | cut -f1 -d" " > $orf_ids |
workflows/autometa.sh
Outdated
# Step 5.2: Run blastp | ||
# input: | ||
# $gtdb_input_orfs --> Generated from step 5.1 | ||
gtdb_dmnd_db="${gtdb}/gtdb.dmnd" # generated using autometa-setup-gtdb (Must be performed prior to using this script) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up using find
here on CHTC, which may be a little more robust. Do you think this is appropriate?
gtdb_dmnd_db="${gtdb}/gtdb.dmnd" # generated using autometa-setup-gtdb (Must be performed prior to using this script) | |
gtdb_dmnd_db=$(find $gtdb -name "gtdb.dmnd") # generated using autometa-setup-gtdb (Must be performed prior to using this script) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a bad idea. I can do that.
📝 Fix incorrect unclustered.fasta path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. 👍 Release forthcoming
autometa-setup-gtdb
entrypoint for creating compatible GTDB databasesgtdb_to_taxdump
is now installedTaxonomyDatabase
classCommands to replicate:
Test GTDB
Running
autometa-setup-gtdb
entrypointRunning
autometa-taxonomy-lca
entrypointRunning
autometa-taxonomy-majority-vote
entrypointRunning
autometa-taxonomy
entrypointRunning
autometa-summary
entrypointTest NCBI
Running
autometa-taxonomy-lca
entrypointRunning
autometa-taxonomy-majority-vote
entrypointRunning
autometa-taxonomy
entrypointRunning
autometa-summary
entrypointTODO:
gtdb.py
PR checklist