Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seroba isnt working on coverage levels below 500X. #82

Open
Gayathri-Guduru opened this issue Dec 4, 2024 · 5 comments
Open

Seroba isnt working on coverage levels below 500X. #82

Gayathri-Guduru opened this issue Dec 4, 2024 · 5 comments

Comments

@Gayathri-Guduru
Copy link

Hi,
seroba_results.xlsx
I tried and tested seroba tool on 8 S.pnuemoniae genome assemblies (isolates) for coverage levels - 1X,5X,10X,50X,70X,100X-1000X.
The steps I followed:
Download genome assemblies from NCBI.
Downsample the reads to 1X,5X,10X,50X,70X,100X-1000X coverage levels for each isolate.
Then cloned the repo: git clone https://github.com/GlobalPneumoSeq/seroba.git
Used the reference database from the latest updated github page: (https://github.com/GlobalPneumoSeq/seroba/). This has a pre-built database with default kmer size 71.
Ran the tool using docker.
Also tested with different kmer sizes (51 and 31) - database setup: seroba createDBs my_database/ 51
The tool worked on 600X and above coverage levels and not below that.
I have attached the results in the below document.

Do you have any advice on parameters that could be tuned in order to reduce high coverage dependency of the tool, to get towards the 10X that is mentioned?

@Oliver-Lorenz-dev
Copy link

Hello!

I will need a bit more information about how you did the downsampling so I can investigate why this is the case.

If you could answer the following questions, I can do some testing:

  • What command did you use to simulate the reads from the genome assemblies?
  • What command did you use to downsample the reads?
  • What are the numbers in the spreadsheet directly below the percentages referring to? - e.g. below 1% there is the number 139 on line 2 of the spreadsheet

We have recently ran SeroBA successfully on a database containing ~26,000 samples where the coverage is much lower than 600X, so it is odd that you cannot get SeroBA to work on such high coverage samples.

Thanks.
Oli

@Gayathri-Guduru
Copy link
Author

Gayathri-Guduru commented Dec 5, 2024

Hi.. Thankyou for quick response.

  1. The following command was used to generate simulated reads from genome assemblies:
    insilicoseq - iss generate --genome "$temp_dir/${filename}.fna" --model novaseq --n_reads 1000000 --output "$output_dir/${filename}_simulated_reads.fastq"

  2. Python Script for Downsampling:
    The Python script utilized for downsampling reads is attached below:

Filename: downsample_reads_sp.txt
downsample_reads_sp.txt

  1. Coverage Levels for Downsampling:
    The input.csv file contains the desired coverage levels for downsampling the original reads.
    input.csv

Example:
In the seroba_results.xlsx spreadsheet:

For GCF_000019825.1_ASM1982v1_genomic.fna, the total bases in the genome are 2,078,953.
Calculating coverage:
1% of total bases: 20,789.
Reads for 1% coverage: 139.
Reads for 5% coverage: 693, and so on.
Reads were downsampled to these coverage levels before running the tools on the downsampled datasets.

Please feel free to ask if you have any questions.
Thanks,
Gayathri Guduru.

@Oliver-Lorenz-dev
Copy link

Hi Gayathri,

I think there is an issue with how you are downsampling the reads and this is causing problems for SeroBA.

The number of reads you have here are so low that almost no bioinformatics tools would work with them.

It would be more appropriate to downsample at the read level rather than the base count level (a genome assembly will always have less bases in it than in the raw reads files). e.g. If you have simulated 1,000,000 reads from the assembly, 1% of this would be 10,000 reads.

I would recommend an alternative approach using real read data for benchmarking a tool. This would mean you don't need to simulate an artificial number of reads from a genome assembly - you can use some samples from the Global Pneumococcal Sequencing (GPS) dataset: https://data-viewer.monocle.sanger.ac.uk/project/gps
You can use the "ERS" column to find the reads on ENA (https://www.ebi.ac.uk/ena/browser/home)

Hope this helps.

Best wishes,
Oli

@Gayathri-Guduru
Copy link
Author

Hi Oli,

Thanks for your response.
I ran the tool on 5 ERS IDs of different years, continents, and countries.
After downloading the fa files of the corresponding ERS ID's, I simulated the reads to 1 million and ran the seroBA tool on those files.

Out of these:

  • 3 ERS IDs accurately predicted the correct serotypes.
  • 1 ERS ID was identified as "untypable."
  • Another ERS ID showed an in silico serotype as "24F," but the tool predicted it as "24B."

I am attaching the results file below.
seroBA_results_S.pnuemoniae.xlsx

@Oliver-Lorenz-dev
Copy link

Hi Gayathri,

Good to hear that SeroBA is running!

The genetic basis of serogroup 24 is heterogenous which makes it difficult to determine for SeroBA. So, for all serogroup 24 isolates that are not serotype 24A, SeroBA types the isolates as 24B/24C/24F.

Hope this helps.

Best wishes,
Oli

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants