Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature request: Get demultiplexing script to calculate coverage for each sample #68

Open
1 of 2 tasks
karinlag opened this issue May 30, 2023 · 18 comments
Open
1 of 2 tasks
Assignees
Labels
enhancement New feature or request

Comments

@karinlag
Copy link
Member

karinlag commented May 30, 2023

Other things that need to change:

@CathrineAB
Copy link

The calculation should be 2x[read length]x[number of reads]/[genome size] and only be performed on R1.

[read length] can be found in the upper parts of the SampleSheet.csv, or in the multiqc_general_stats.txt
[number of reads] from the multiqc_general_stats.txt
[genome size] from the SampleSheet.csv

@georgemarselis-nvi
Copy link
Collaborator

georgemarselis-nvi commented May 30, 2023

/data/rawdata/220603_M06578_0105_000000000-KB7MY/SampleSheet.csv

image

is reads "read length" or is it "reads"? In either case, why is it written twice?

if that is "reads" where is read length, please?

or do i have to process R1?

if i process R1, can I calculate the genome size? that will help not change SampleSheet.csv much.

also 2x[read length]x[number of reads]/[genome size] : that means that reads are always even, but there might be a small discrepancy due to integer division in computers:

 result = 2 * ( ( demux.readLength * numberOfReads ) // genomeSize ) # // is integer division, essentially divide and take math.floor( ) of result

i'm doing the multiplication first and then dividing by genomeSize, which in turn is rounded down instead of up, and then multiplied by 2.

The result will always be even, but it might differ depending on the order of calculation and whether we round up or down in the division.

@CathrineAB
Copy link

In the Samplesheet.csv read length is for some reason called reads. It is written twice since it points to both the forward and reverse read. In the multiqc_general_stats.txt however [read length] is used.

You cannot calculate genome size from anything in the Samplesheet or demuxed stats files. We could provide the constant list of genome sizes for each genus separately if that helps (related to the VIGAS-PID, but then VIGAS-PID needs to be included in the SampleSheet).

I would optimally want the resulting number with one decimal.

@magnulei
Copy link
Collaborator

I have made modifications to the sample sheet template. Changes only applied to the template for the new workflow. Have not done anything with the "old-workflow-template".

This is an example of the updated data section:

image.png

Note that I have also removed "i7_index_id" and "i5_index_id" and replaced it with "index_id".
I have also removed the "Description" column (which will resolve https://github.com/NorwegianVeterinaryInstitute/nvi_lims_epps/issues/28)

@magnulei
Copy link
Collaborator

@georgemarselis-nvi
Copy link
Collaborator

georgemarselis-nvi commented May 31, 2023

awesome! can you please generate a fake SampleSheet.csv for me?

edit: @magnulei send a made-up new SampleSheet , attaching here .

MiSeq_MS-TEST3.csv

@georgemarselis-nvi georgemarselis-nvi added the enhancement New feature or request label May 31, 2023
@georgemarselis-nvi
Copy link
Collaborator

this is a bit unexpected, but it looks I may have to create a SampleSheet class.

Not a lot of work, but basically instead of just searching for 1-2 strings inside the file, looks like i have to properly parse it and then fill in appropriate fields.

https://support-docs.illumina.com/SHARE/SampleSheetv2/Content/SHARE/FrontPages/SampleSheetv2.htm

and

https://support.illumina.com/content/dam/illumina-support/documents/documentation/system_documentation/miseq/miseq-sample-sheet-quick-ref-guide-15028392-j.pdf

@georgemarselis-nvi georgemarselis-nvi changed the title Get demultiplexing script to calculate coverage for each sample feature request: Get demultiplexing script to calculate coverage for each sample May 31, 2023
@georgemarselis-nvi
Copy link
Collaborator

@magnulei can you please give me the dimension of a Plate Well?

A to I ?

and 00 to 10 ?

am i correct?

@georgemarselis-nvi
Copy link
Collaborator

@magnulei same thing as above but for Sample_Well

i want to put in a check that the well is never out of bounds.

@georgemarselis-nvi
Copy link
Collaborator

@magnulei also can you please explain what is index an dhow does that differ from index2 ?

@magnulei
Copy link
Collaborator

@georgemarselis-nvi

Index_Plate_Well and Sample_Well is in the same format.

A01 in the upper left corner.
H12 in the bottom right corner

Like this:

image

Btw:
Sample_Well will definitely never be out of bounds. Its defined in clarity. Index_Plate_Well could in theory be out of bounds since it is getting the position from the index name. A typo here could cause it to be out of bounds in the sample sheet.

@magnulei
Copy link
Collaborator

@magnulei also can you please explain what is index an dhow does that differ from index2 ?

This should answer your question:
https://support.illumina.com/content/dam/illumina-support/documents/documentation/system_documentation/miseq/indexed-sequencing-overview-guide-15057455-08.pdf

If needed we can have a quick call later this week and go trough it.

@georgemarselis-nvi
Copy link
Collaborator

Btw:
Sample_Well will definitely never be out of bounds. Its defined in clarity. Index_Plate_Well could in theory be out of bounds since it is getting the position from the index name. A typo here could cause it to be out of bounds in the sample sheet.

Hm. Is it ok if I add a secondary check, just in case? Just to satisfy my paranoia.

@georgemarselis-nvi
Copy link
Collaborator

Like this:

are there different sizes of plates or is this industry standard?

@magnulei
Copy link
Collaborator

Btw:
Sample_Well will definitely never be out of bounds. Its defined in clarity. Index_Plate_Well could in theory be out of bounds since it is getting the position from the index name. A typo here could cause it to be out of bounds in the sample sheet.

Hm. Is it ok if I add a secondary check, just in case? Just to satisfy my paranoia.

Thats of course perfectly fine by me. And good to catch any well-id-typos in the index names.

@magnulei
Copy link
Collaborator

Like this:

are there different sizes of plates or is this industry standard?

When you say size, I assume you mean format. The 96 well format (8x12) is industry standard. But the specific dimensions could be different depending on the application. You have low-profile, high-profile and deep-well.

You also have 24-well format plates (8x3). But these are not used in clarity.
And you have 8-well strips (8x1). Strickly not a plate though. Not used in clarity

In clarity we are using three different formats (see configuration -> consumables -> containers in the clarity gui):

@georgemarselis-nvi
Copy link
Collaborator

i got knowledge today. thanks!

@magnulei
Copy link
Collaborator

magnulei commented Aug 9, 2023

Hi @georgemarselis-nvi.

I have now created the script for extracting sample metadata at the sequencing step:

https://github.com/NorwegianVeterinaryInstitute/nvi_lims_epps/blob/main/excel-logging/run_log_samples.py

At the moment, the metadata is saved into an excel file. With time it might be pushed to an SQL database instead.
For writing the data, I have written a sorting function to sort the samples in the same order as in the sample sheet. And I have an empty column in the dataframe/excel-file where @CathrineAB can paste the coverage numbers calculated by the demux script.

So this is just to let you know that we would like the list of coverage numbers to be sorted in the same order as in the sample sheet (but guess that was the plan anyway?).

Feel free to pop by my office if you want to discuss the above script and/or the temporary metadata "excel-logging" solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants