Seroba isnt working on coverage levels below 500X. #82

Gayathri-Guduru · 2024-12-04T18:04:41Z

Hi,
seroba_results.xlsx
I tried and tested seroba tool on 8 S.pnuemoniae genome assemblies (isolates) for coverage levels - 1X,5X,10X,50X,70X,100X-1000X.
The steps I followed:
Download genome assemblies from NCBI.
Downsample the reads to 1X,5X,10X,50X,70X,100X-1000X coverage levels for each isolate.
Then cloned the repo: git clone https://github.com/GlobalPneumoSeq/seroba.git
Used the reference database from the latest updated github page: (https://github.com/GlobalPneumoSeq/seroba/). This has a pre-built database with default kmer size 71.
Ran the tool using docker.
Also tested with different kmer sizes (51 and 31) - database setup: seroba createDBs my_database/ 51
The tool worked on 600X and above coverage levels and not below that.
I have attached the results in the below document.

Do you have any advice on parameters that could be tuned in order to reduce high coverage dependency of the tool, to get towards the 10X that is mentioned?

Oliver-Lorenz-dev · 2024-12-05T16:00:45Z

Hello!

I will need a bit more information about how you did the downsampling so I can investigate why this is the case.

If you could answer the following questions, I can do some testing:

What command did you use to simulate the reads from the genome assemblies?
What command did you use to downsample the reads?
What are the numbers in the spreadsheet directly below the percentages referring to? - e.g. below 1% there is the number 139 on line 2 of the spreadsheet

We have recently ran SeroBA successfully on a database containing ~26,000 samples where the coverage is much lower than 600X, so it is odd that you cannot get SeroBA to work on such high coverage samples.

Thanks.
Oli

Gayathri-Guduru · 2024-12-05T17:42:45Z

Hi.. Thankyou for quick response.

The following command was used to generate simulated reads from genome assemblies:
insilicoseq - iss generate --genome "$temp_dir/${filename}.fna" --model novaseq --n_reads 1000000 --output "$output_dir/${filename}_simulated_reads.fastq"
Python Script for Downsampling:
The Python script utilized for downsampling reads is attached below:

Filename: downsample_reads_sp.txt
downsample_reads_sp.txt

Coverage Levels for Downsampling:
The input.csv file contains the desired coverage levels for downsampling the original reads.
input.csv

Example:
In the seroba_results.xlsx spreadsheet:

For GCF_000019825.1_ASM1982v1_genomic.fna, the total bases in the genome are 2,078,953.
Calculating coverage:
1% of total bases: 20,789.
Reads for 1% coverage: 139.
Reads for 5% coverage: 693, and so on.
Reads were downsampled to these coverage levels before running the tools on the downsampled datasets.

Please feel free to ask if you have any questions.
Thanks,
Gayathri Guduru.

Oliver-Lorenz-dev · 2024-12-06T09:47:05Z

Hi Gayathri,

I think there is an issue with how you are downsampling the reads and this is causing problems for SeroBA.

The number of reads you have here are so low that almost no bioinformatics tools would work with them.

It would be more appropriate to downsample at the read level rather than the base count level (a genome assembly will always have less bases in it than in the raw reads files). e.g. If you have simulated 1,000,000 reads from the assembly, 1% of this would be 10,000 reads.

I would recommend an alternative approach using real read data for benchmarking a tool. This would mean you don't need to simulate an artificial number of reads from a genome assembly - you can use some samples from the Global Pneumococcal Sequencing (GPS) dataset: https://data-viewer.monocle.sanger.ac.uk/project/gps
You can use the "ERS" column to find the reads on ENA (https://www.ebi.ac.uk/ena/browser/home)

Hope this helps.

Best wishes,
Oli

Gayathri-Guduru · 2024-12-10T14:35:08Z

Hi Oli,

Thanks for your response.
I ran the tool on 5 ERS IDs of different years, continents, and countries.
After downloading the fa files of the corresponding ERS ID's, I simulated the reads to 1 million and ran the seroBA tool on those files.

Out of these:

3 ERS IDs accurately predicted the correct serotypes.
1 ERS ID was identified as "untypable."
Another ERS ID showed an in silico serotype as "24F," but the tool predicted it as "24B."

I am attaching the results file below.
seroBA_results_S.pnuemoniae.xlsx

Oliver-Lorenz-dev · 2024-12-11T10:20:38Z

Hi Gayathri,

Good to hear that SeroBA is running!

The genetic basis of serogroup 24 is heterogenous which makes it difficult to determine for SeroBA. So, for all serogroup 24 isolates that are not serotype 24A, SeroBA types the isolates as 24B/24C/24F.

Hope this helps.

Best wishes,
Oli

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seroba isnt working on coverage levels below 500X. #82

Seroba isnt working on coverage levels below 500X. #82

Gayathri-Guduru commented Dec 4, 2024

Oliver-Lorenz-dev commented Dec 5, 2024

Gayathri-Guduru commented Dec 5, 2024 •

edited

Loading

Oliver-Lorenz-dev commented Dec 6, 2024

Gayathri-Guduru commented Dec 10, 2024

Oliver-Lorenz-dev commented Dec 11, 2024

Seroba isnt working on coverage levels below 500X. #82

Seroba isnt working on coverage levels below 500X. #82

Comments

Gayathri-Guduru commented Dec 4, 2024

Oliver-Lorenz-dev commented Dec 5, 2024

Gayathri-Guduru commented Dec 5, 2024 • edited Loading

Oliver-Lorenz-dev commented Dec 6, 2024

Gayathri-Guduru commented Dec 10, 2024

Oliver-Lorenz-dev commented Dec 11, 2024

Gayathri-Guduru commented Dec 5, 2024 •

edited

Loading