Database Download & Documentation #141

Open
gbouras13 opened this issue Aug 29, 2023 · 5 comments

gbouras13 commented Aug 29, 2023

G'day @psj1997 @luispedro and other SemiBin developers,

Firstly, thanks for SemiBin(2): it works amazingly well and recovers so many more bins than other binning methods :)

I want to share some feedback regarding the database download and SemiBin's documentation.

The HPC cluster I use at my institution blocks internet access on compute nodes, so the lazy download of the SemiBin2 database failed when I ran the command below (SemiBin v1.5.1, Linux installation via bioconda).

SemiBin2 multi_easy_bin -i {input.catalogue} -b {input.bams} -o {params.outdir} -s {params.separator} --minfasta-kbs {params.minfasta}

It was hard to work out that the missing database was the cause: the database is not mentioned in the README, only in the FAQ section of the docs, and the error message was not informative (apologies, I have since overwritten the log file, otherwise I would quote it).

I then followed the FAQ in the docs to download the updated GTDB database, but the following command does not work in MMseqs2 v13.45111 (a known MMseqs2 error, see soedinglab/MMseqs2#561):

mmseqs databases GTDB GTDB tmp

Then, after looking at the SemiBin codebase, I was able to install the database manually:

wget 'https://zenodo.org/record/4751564/files/GTDB_v95.tar.gz?download=1'
mv 'GTDB_v95.tar.gz?download=1' GTDB_v95.tar.gz
tar -xzvf GTDB_v95.tar.gz

From there, I specified the database location with -r {params.db}, and SemiBin worked perfectly.
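Since the original failure was hard to diagnose, a small pre-flight check in the job script can make a missing database obvious before SemiBin starts. The following is only a sketch: `check_db_dir` is a hypothetical helper, not part of SemiBin, and it only verifies that the directory exists and is non-empty, not that its contents are a valid database.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SemiBin): fail early with a clear message
# if the reference database directory is missing or empty.
check_db_dir() {
    local db_dir="$1"
    if [ -z "$db_dir" ]; then
        echo "error: no database directory given" >&2
        return 2
    fi
    if [ ! -d "$db_dir" ]; then
        echo "error: database directory '$db_dir' does not exist; download it on a node with internet access first" >&2
        return 1
    fi
    # An empty directory usually means the extraction step was skipped or failed.
    if [ -z "$(ls -A "$db_dir")" ]; then
        echo "error: database directory '$db_dir' is empty" >&2
        return 1
    fi
    echo "ok: using database in '$db_dir'"
}

# Example use in a job script (DB_DIR is a placeholder for your own path):
#   check_db_dir "$DB_DIR" || exit 1
#   ... then run SemiBin with -r "$DB_DIR" as described above
```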

So perhaps either adding a dedicated --download_database flag or script, or simply documenting a manual install method, would help future users like me whose compute nodes have no internet access.
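As an illustration of what such a script might look like, here is a minimal sketch that wraps the manual steps above so they can be run once on a login node (or any machine with internet access). The function name `fetch_gtdb` and the `.download_complete` marker file are assumptions made for this example and are not part of SemiBin; the URL is the one from the manual steps. The layout inside the tarball is not checked, so the path to pass to -r may be a subdirectory of the destination.

```shell
#!/usr/bin/env bash
# Hypothetical helper script, not part of SemiBin. Run it where internet
# access is available, then point SemiBin at the result with -r.

# URL taken from the manual steps above.
GTDB_URL='https://zenodo.org/record/4751564/files/GTDB_v95.tar.gz?download=1'

fetch_gtdb() {
    local dest="$1"                          # directory to extract into
    local url="${2:-$GTDB_URL}"              # optional override of the URL
    local marker="$dest/.download_complete"  # our own marker file (assumption)

    # Skip the download if a previous run finished successfully.
    if [ -f "$marker" ]; then
        echo "database already present in '$dest'"
        return 0
    fi

    mkdir -p "$dest" || return 1
    # -O gives the file a clean name, avoiding the '?download=1' suffix.
    wget -O "$dest/GTDB_v95.tar.gz" "$url" || return 1
    tar -xzf "$dest/GTDB_v95.tar.gz" -C "$dest" || return 1
    touch "$marker"
    echo "database extracted into '$dest'"
}
```

Calling `fetch_gtdb /path/to/db` a second time returns immediately, so the call can safely sit at the top of a setup script.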

George

@luispedro
Member

If you are calling SemiBin2, it should not be downloading the MMseqs2 database anymore. I will check again whether we mistakenly kept that in.


jolespin commented Dec 5, 2024

@luispedro I'm trying to integrate SemiBin2 into my VEBA (https://github.com/jolespin/veba) metagenomics software suite for the prokaryotic binning module. A few questions:

  1. Can I skip the database download step? If so, how?
  2. If I have a predownloaded GTDB (which I use for GTDB-Tk), how can I create an MMseqs2 taxonomy version of it that is compatible with SemiBin2?

@luispedro
Member

SemiBin2 should not be downloading the database anymore and, in our tests, it does not.

@luispedro
Member

@jolespin: is SemiBin2 downloading the database? With default parameters, it should not.


jolespin commented Dec 9, 2024

> @jolespin: is SemiBin2 downloading the database? With default parameters, it should not.

I did some test runs, and the newest version does not download it. I was mostly going through all the parameters before my first run to make sure I was using them correctly.

Can you recommend a way to precompute abundance tables, for 1 sample and for <5 samples, that will be compatible with SemiBin2? I typically use CoverM but am open to other methods.
