SeroBA isn't working on coverage levels below 500X. #82
Comments
Hello! I will need a bit more information about how you did the downsampling so I can investigate why this is the case. If you could answer the following questions, I can do some testing:
We have recently run SeroBA successfully on a database containing ~26,000 samples where the coverage is much lower than 600X, so it is odd that you cannot get SeroBA to work on such high-coverage samples. Thanks.
Hi, thank you for the quick response.
Filename: downsample_reads_sp.txt
Example: For GCF_000019825.1_ASM1982v1_genomic.fna, the total bases in the genome are 2,078,953. Please feel free to ask if you have any questions.
Hi Gayathri, I think there is an issue with how you are downsampling the reads, and this is causing problems for SeroBA. The number of reads you have here is so low that almost no bioinformatics tool would work with them. It would be more appropriate to downsample at the read level rather than the base-count level (a genome assembly will always contain fewer bases than the raw read files). For example, if you have simulated 1,000,000 reads from the assembly, 1% of this would be 10,000 reads. I would recommend an alternative approach that uses real read data for benchmarking a tool. This would mean you don't need to simulate an artificial number of reads from a genome assembly; you can use some samples from the Global Pneumococcal Sequencing (GPS) dataset: https://data-viewer.monocle.sanger.ac.uk/project/gps Hope this helps. Best wishes,
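As a rough illustration of the read-level arithmetic suggested above, the sketch below converts between read counts and coverage using coverage ≈ reads × read length / genome size. The genome size is the figure quoted earlier in this thread; the 150 bp read length and the function names are assumptions for illustration only, not values taken from the reporter's script.

```python
import math

GENOME_SIZE = 2_078_953  # bases, as quoted in this thread for GCF_000019825.1
READ_LENGTH = 150        # assumption: read length in bp, not stated in the thread


def reads_for_coverage(genome_size: int, target_coverage: float, read_length: int) -> int:
    """Smallest number of reads whose total bases reach the target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def coverage_from_reads(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage obtained from n_reads reads of a fixed length."""
    return n_reads * read_length / genome_size


if __name__ == "__main__":
    # Reads needed for a few of the coverage levels mentioned in the issue.
    for cov in (1, 5, 10, 50, 100):
        print(f"{cov}X needs {reads_for_coverage(GENOME_SIZE, cov, READ_LENGTH)} reads")

    # Coverage given by 10,000 reads (1% of 1,000,000 simulated reads).
    print(f"{coverage_from_reads(10_000, READ_LENGTH, GENOME_SIZE):.2f}X")
```

Under these assumptions, 10,000 reads give well under 1X coverage, which shows why sampling a percentage without checking the resulting read count can leave too little data for a tool to work with.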
Hi Oli, Thanks for your response. Out of these:
I am attaching the results file below.
Hi Gayathri, Good to hear that SeroBA is running! The genetic basis of serogroup 24 is heterogeneous, which makes it difficult for SeroBA to determine. So, for all serogroup 24 isolates that are not serotype 24A, SeroBA types the isolates as 24B/24C/24F. Hope this helps. Best wishes,
Hi,
seroba_results.xlsx
I tested the SeroBA tool on 8 S. pneumoniae genome assemblies (isolates) at coverage levels of 1X, 5X, 10X, 50X, 70X, and 100X-1000X.
The steps I followed:
Download genome assemblies from NCBI.
Downsample the reads to 1X,5X,10X,50X,70X,100X-1000X coverage levels for each isolate.
Then cloned the repo: git clone https://github.com/GlobalPneumoSeq/seroba.git
Used the reference database from the latest version of the GitHub repository (https://github.com/GlobalPneumoSeq/seroba/). This has a pre-built database with the default kmer size of 71.
Ran the tool using docker.
Also tested with different kmer sizes (51 and 31) - database setup: seroba createDBs my_database/ 51
The tool worked at coverage levels of 600X and above, but not below that.
I have attached the results in the document below.
Do you have any advice on parameters that could be tuned in order to reduce high coverage dependency of the tool, to get towards the 10X that is mentioned?
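For the downsampling step listed above, one way to downsample at the read level is to pick read pairs at random and keep the same pairs in both FASTQ files. The following is a minimal sketch using only the Python standard library; the function names and the fixed seed are assumptions for illustration, and it is not the script (downsample_reads_sp.txt) used in this issue.

```python
import random
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-line tuples from an iterable of lines."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def downsample_pairs(r1_lines, r2_lines, n_keep, seed=0):
    """Return n_keep randomly chosen read pairs, keeping mates together.

    Note: zip() stops at the shorter input, so R1 and R2 are assumed to
    contain the same number of records in the same order.
    """
    pairs = list(zip(read_fastq(r1_lines), read_fastq(r2_lines)))
    if n_keep > len(pairs):
        raise ValueError("cannot keep more pairs than are available")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    keep = sorted(rng.sample(range(len(pairs)), n_keep))
    return [pairs[i] for i in keep]


if __name__ == "__main__":
    # Tiny in-memory example with five made-up read pairs.
    r1, r2 = [], []
    for i in range(5):
        r1 += [f"@read{i}/1", "ACGT", "+", "IIII"]
        r2 += [f"@read{i}/2", "TGCA", "+", "IIII"]
    kept = downsample_pairs(r1, r2, n_keep=2, seed=1)
    print(len(kept), "pairs kept")
```

For real data, the line iterables would typically come from opened FASTQ files, and the number of pairs to keep would be derived from the target coverage, the read length, and the genome size. This sketch holds all reads in memory, so it is only suitable for small inputs.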