Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

question about difference of detected cytosine sites between CX_report.txt and bismark.cov.gz #647

Open
yinzhuo0322 opened this issue Jan 3, 2024 · 4 comments

Comments

@yinzhuo0322
Copy link

yinzhuo0322 commented Jan 3, 2024

Hi FelixKrueger,

I am little confused about the number of detected Cytosine in CX_report.txt and bismark.cov.gz.

With the option --CX, the coverage file (bismark.cov.gz.) is extended to cytosines in any sequence context, containing <chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>.
As to CX_report.txt, it contains<chromosome> <position> <strand> <count methylated> <count unmethylated> <C-context> <trinucleotide context> and will output every cytosine in the whole genome irrespective of detected or not. But, when I filter CX_report.txt by $4+$5 >0, the detected Cytosine number in the CX_report.txt is still not the same with the Cytosine number in bismark.cov.gz.

An example is listed here:

$ less example.bismark.cov.gz|wc -l
24865358
$  wc -l example.CX_report.txt
1103157779 example.CX_report.txt
$ awk '{if ($4+$5 >0 ) print }' example.CX_report.txt|wc -l
24859916

$ head -10 ../bismark_diff/diff.txt
146572a146573
> chr1  19099661
146576a146578
> chr1  19099667
146577a146580
> chr1  19099676
146578a146582
> chr1  19099680
146579a146584
> chr1  19099689

I checked the extra cytosine in bismark.cov.gz. and found it was the cytosine in mm10 genome. So, I am very confused why the number of detected cytosines in bismark.cov.gz. is 5442 more than that in CX_report.txt. I wonder if I misunderstood something. I am not sure whether I have expressed my thoughts clearly. Please let me know if you require any additional information to assist with your response.

Thanks for your attention and patience!
Looking forward to your reply!

Best regards,

Yingchuo Hu

@FelixKrueger
Copy link
Owner

Dear Yingchuo Hu,

I believe you are running into an issue that has come up before, and I have tried to summarise it here: https://felixkrueger.github.io/Bismark/faq/context_change/

Could you take a look and see it this answers your questions?

@yinzhuo0322
Copy link
Author

Dear FelixKrueger,

Thank you for your prompt and thoughtful response!

This is quite intriguing! I have never thought the possibility that some detected sites have a different cytosine context between the coverage and genome-wide cytosine reports due to deletions. But, I still get confused about it. I mean, if a sequence has been successfully mapped, then the cytosine context is determined, so why would deletion happen? Could it be that mapping and methylation calling employ different strategies for context determination?

Furthermore, even in cases where a detected site displays a disparate cytosine context between the coverage and genome-wide cytosine reports, the total number of detected cytosines in the CX_report should theoretically be the same with that in the coverage file. After all, cytosine in both reports extend to cytosines in any sequence context. The cytosine context should not affect the number of sites detection.

Sorry to bother you again, but I'm still want to know where I might have made a misunderstanding.

Sincerely,

Yingchuo Hu

@FelixKrueger
Copy link
Owner

I am not quite sure I understand the question "so why would deletion happen". Generally, the mapping is carried out first, and then the methylation call is performed afterwards.

MMMMMMMMMMMMMMMMMMCAATGGGAMMMMMMMMMMMMMMMMMM genome
MMMMMMMMMMMMMMMMMMC---GGGAMMMMMMMMMMMMMMMMMM  read

In this example, M can be just any matching genomic base, but the read has a 3bp deletion in the middle. These things just happen. In case of the cytosine report the C would be in CHH context (CAA), but in the coverage file it would count as CpG (CGG).

Regarding the second issue, I tend to agree that in non-CG context the coverage and CX reports should have the same number of positions. So I went to have a look at the code and spotted that there may have been a bug in non-CG mode that dates back a long time! I have now tried to change that (d90d47d), would you mind cloning the latest version and using the dev branch [email protected]:FelixKrueger/Bismark.git to see if it fixes the discrepancy in the number?

Kudos for spotting this, it might have gone unnoticed for a very long time, most likely because mammalian data is typically never run in non-CG mode...)

@yinzhuo0322
Copy link
Author

Dear FelixKrueger,

I apologize for the delay in my response. Thank you for your encouragement. Due to the special experiment design, we are curious about the modification of mammalian genome in the CH context . I've been struggled with the modified Bismark_methylation_extractor and have tried lots of ways to resolve the issue. Despite creating a new conda environment and cloning the latest dev branch from Git, I consistently encounter the same results.

I am reaching out to seek your expertise to identify any potential mistakes on my part or something else. Your insights on this matter would be highly valuable.

Thank you for your time and patience.

Best regards,

Yingchuo Hu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants