Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

single-end mode is broken in v1.1.0 #247

Open
niederhuth opened this issue Sep 4, 2024 · 17 comments
Open

single-end mode is broken in v1.1.0 #247

niederhuth opened this issue Sep 4, 2024 · 17 comments

Comments

@niederhuth
Copy link

I recently updated my conda installation of pairtools to v1.1.0 and it appears that the --single-end mode no longer works for single-end reads in pairtools parse2:

Comparing two versions:

pairtools 1.0.3 py310hb45ccb3_0 bioconda
pairtools 1.1.0 py312hac03d35_1 bioconda

Same command:
pairtools parse2 --single-end --nproc-in 12 --nproc-out 32 --min-mapq 1 --flip --add-columns mapq --assembly B73 -c ../ref/B73.chrom_sizes.tsv --output-stats test.tsv -o test.pairs test.input.bam

v1.0.3 gives the expected result, but v1.1.0 reports everything as unmapped:

v.1.10 results are entirely NN:
read1 ! 0 ! 0 - - NN 0 0

@Henrikkoe
Copy link

I just encountered the same problem. Furthermore, the parse2 --expand / --no-expand mode in pairtools 1.0.3 does not seem to perform combinatorial expansion and only outputs the unmodified pairtype. Maybe this could be relevant when looking into pairtools parse2.

@golobor
Copy link
Member

golobor commented Sep 9, 2024

hi folks!
Could either of you share a sample .sam file with a few reads?
This could be extremely useful in reproducing the issue!

@Salvacasani
Copy link

I am having the same issue! Did you ever figure out what the problem was? If not, I am happy to provide a sample with a subset of the alignment I am using.

@agalitsyna
Copy link
Member

@Salvacasani That will be excellent! Please provide the command that you run on that sample file and a description of your expectations vs what you observe.

@niederhuth
Copy link
Author

hi folks! Could either of you share a sample .sam file with a few reads? This could be extremely useful in reproducing the issue!

Unfortunately I'm working with private data, so unable to share anything.

@Salvacasani
Copy link

Here it is, I don't know how many of these should have ligated junctions, but there should be a good bunch. If there are no examples with separate mapping sites in this subset, let me know the best way to get them. I am processing the data in the previous version of pairtools. I will let you know if that works.
subsample.zip

@Salvacasani
Copy link

Hi! Have you been able to look into it? I tried to run it with the previous version of pairtools, but I get the same result.

@agalitsyna
Copy link
Member

Hi @Salvacasani ,

You can check out the latest pull request here: #251. It seems that some API changes didn’t fully align with the old functionality, which caused an issue with read side detection in certain datasets.

By the way, in your sample data, most of the reads are NU (unmapped-mapped) pairs, so I wasn’t able to fully test the pair expansion. However, pair expansion should be working now as well.

@Phlya, since you had questions about the read sides in your recent dataset, could you test if any part of your data produces the same results after this fix?

In theory, this issue should have affected both paired-end and single-end data, but it’s only appearing with single-end data, which is a bit puzzling. There might be another underlying issue that remains unresolved.

@Salvacasani
Copy link

Thank you for looking into it. I am still getting very strange results. This is the code that I am using
pairtools parse2 --min-mapq 20 --single-end --max-inter-align-gap 30 --nproc-in 5 --nproc-out 5 --add-columns mapq,pos5,pos3 --chroms-path /seq/epiprod02/reference_genomes/hg38/juicertools/hg38.genome -o subsampled_sorted_chr8.pairs.gz subsampled_sorted_chr8.bam
practically all the reads look like NN pairs, which is strange based on the bam file itself.

To use one specific example. I looked into the results of read 150627-BC36-0251488282

Which seems to have two mapping sites in the bam file:
150627-BC36-0251488282 0 chr8 199436 0 149M61S * 0 0 AGACATAATATTATAATTCAAATGCAACTCATGGCTTCTCTTTGTACTCTTTCTCTAGCTTTTGAATTATTTATTCTAATACCAGTTTTAATTCTGACACAAAATCATGGGAGTTCTAATCAAAATCCAACCTTTTATCATAAAAACTAGCTGTCACTATTGTTAGCGGTGATGGGGCCCTCGGGGACTTTGGAGGGAAGCTCAGGCTGG 48IIII77IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;IIIIIIIII9III1II1IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII<IIIIII:I:IIIIIIIIIIIIIIIIIIIII;II;IIIIIIIIIIIIIIIIIIIIIIIFFI@@iii**IIII/==/DIDII,II,IIFIF77;I>998IIIIII5I,, NM:i:0 MD:Z:149 AS:i:149 XS:i:149 SA:Z:chr8,635368,-,63M147S,43,5; MQ:i:43 ip:i:199584 mp:i:635430 ep:i:199436 rt:A:0 cb:Z:chr8_chr8_0_1_0_0_000199584_000635430 RG:Z:HIC13907
150627-BC36-0251488282 272 chr8 635368 43 63M147H * 0 0 CCAGCCTGAGCTTCCCTCCAAAGTCCCCGAGGGCCCCATCACCGCTAACAATAGTGACAGCTA ,,I5IIIIII899>I>;77FIFII,II,IIDID/==/IIII**III@@IFFIIIIIIIIIIII NM:i:5 MD:Z:29C0A2G1T7A19 AS:i:38 XS:i:0 SA:Z:chr8,199436,+,149M61S,0,0; MQ:i:0 ip:i:635430 mp:i:199584 ep:i:635368 rt:A:1 cb:Z:chr8_chr8_0_1_0_0_000199584_000635430 RG:Z:HIC13907

but in the pairs file it shows as unmapped:

150627-BC36-0251488282 ! 0 ! 0 - - NN 0 0 0 0 0 0
150627-BC36-0251488282 ! 0 ! 0 - - NN 0 0 0 0 0 0

Do you know why this may be?

Thank you

@agalitsyna
Copy link
Member

Hi @Salvacasani,

Hmmm, can you try with pairtools-v1.1.0-fix branch of pairtools?
Basically, you can install that version with:

git clone https://github.com/open2c/pairtools
cd pairtools
git checkout checkout pairtools-v1.1.0-fix
pip install -e ./

With that version in progress, I receive "correct" results with your command:

pairtools parse2 --min-mapq 20 --single-end --max-inter-align-gap 30 --nproc-in 5 --nproc-out 5 --add-columns mapq,pos5,pos3 -c hg38.chrom.sizes.reduced ./subsampled2_sorted_chr8.bam | grep "150627-BC36-0251488282"

Output two pairs, both have some kind of issues (multi-mapper and unmapped segment):

150627-BC36-0251488282  !       0       !       0       -       -       MN              150627-BC36-02514882820chr81994360149M61S*00AGACATAATATTATAATTCAAATGCAACTCATGGCTTCTCTTTGTACTCTTTCTCTAGCTTTTGAATTATTTATTCTAATACCAGTTTTAATTCTGACACAAAATCATGGGAGTTCTAATCAAAATCCAACCTTTTATCATAAAAACTAGCTGTCACTATTGTTAGCGGTGATGGGGCCCTCGGGGACTTTGGAGGGAAGCTCAGGCTGG48IIII77IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;IIIIIIIII9III1II1IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII<II<IIII:I:IIIIIIIIIIIIIIIIIIIII;II;IIIIIIIIIIIIIIIIIIIIIIIFFI@@III**IIII/==/DIDII,II,IIFIF77;>I>998IIIIII5I,,NM:i:0MD:Z:149AS:i:149XS:i:149SA:Z:chr8,635368,-,63M147S,43,5;MQ:i:43ip:i:199584mp:i:635430ep:i:199436rt:A:0cb:Z:chr8_chr8_0_1_0_0_000199584_000635430RG:Z:HIC13907Yt:Z:MN      0       0       0       0       0       0
150627-BC36-0251488282  !       0       chr8    635368  -       +       NU              150627-BC36-0251488282272chr86353684363M147H*00CCAGCCTGAGCTTCCCTCCAAAGTCCCCGAGGGCCCCATCACCGCTAACAATAGTGACAGCTA,,I5IIIIII899>I>;77FIFII,II,IIDID/==/IIII**III@@IFFIIIIIIIIIIIINM:i:5MD:Z:29C0A2G1T7A19AS:i:38XS:i:0SA:Z:chr8,199436,+,149M61S,0,0;MQ:i:0ip:i:635430mp:i:199584ep:i:635368rt:A:1cb:Z:chr8_chr8_0_1_0_0_000199584_000635430RG:Z:HIC13907Yt:Z:NU    0       43      0       635368  0       635430

Is that what you would expect for that read?

@Salvacasani
Copy link

Thanks I can reproduce this result. Since there is a multimaper (not sure why since I don't find it in the alignment) maybe this is not the best example.

If we look at this other example 180786-UGAv3-23-4079533886

This is the bam file:

180786-UGAv3-23-4079533886 0 chr8 310577 42 22S162M57S * 0 0 CTACACGACGCTCTTCCGATCTACTATACCTTCATTCATTAATGTGTCATTTCTTTCAGGAACTATTTTCTGAGTCTCAAACATATTTCATAGCACTCTCAAGCTTGAGGTTCTGCCTGAACATGCTCCTCCTCCTTTTCTTTTGTGACTCTTCATTTCGTATGGAGTTAAACCTGAGCTCCTAAGCTTTGTCTGTTTCCACTTTTGTCAGATTCGATGGCAACAAGATGAATGCGCCCTG 88IIIIIIIIIIIIIIIIIIII;II7IIIIIIIIIIIIIIIIIII:IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICCIIIIII@DD@I9==9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII7I7I"III?I?==II'88'&-IIII11II,I++1//I++86,$99I6#I&?&I4 NM:i:0 MD:Z:162 AS:i:162 XS:i:142 SA:Z:chr8,312049,+,191S37M1D13M,48,2; XA:Z:chr1,-848200,57S162M22S,4; MQ:i:48 ip:i:310738 mp:i:312049 ep:i:310555 rt:A:0 cb:Z:chr8_chr8_0_1_0_16_000310738_000312049 RG:Z:HIC13904
180786-UGAv3-23-4079533886 256 chr8 312049 48 191H37M1D13M * 0 0 TCTGTTTCCACTTTTGTCAGATTCGATGGCAACAAGATGAATGCGCCCTG "III?I?==II'88'&-IIII11II,I++1//I++86,$99I6#I&?&I4 NM:i:2 MD:Z:37^T4G8 AS:i:38 XS:i:21 SA:Z:chr8,310577,+,22S162M57S,42,0; MQ:i:42 ip:i:312049 mp:i:310738 ep:i:312099 rt:A:1 cb:Z:chr8_chr8_0_1_0_16_000310738_000312049 RG:Z:HIC13904

this is the pairtools result:

180786-UGAv3-23-4079533886 chr8 310577 ! 0 + - UN 180786-UGAv3-23-40795338860chr83105774222S162M57S00CTACACGACGCTCTTCCGATCTACTATACCTTCATTCATTAATGTGTCATTTCTTTCAGGAACTATTTTCTGAGTCTCAAACATATTTCATAGCACTCTCAAGCTTGAGGTTCTGCCTGAACATGCTCCTCCTCCTTTTCTTTTGTGACTCTTCATTTCGTATGGAGTTAAACCTGAGCTCCTAAGCTTTGTCTGTTTCCACTTTTGTCAGATTCGATGGCAACAAGATGAATGCGCCCTG88IIIIIIIIIIIIIIIIIIII;II7IIIIIIIIIIIIIIIIIII:IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICCIIIIII@DD@I9==9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII7I7I"III?I?==II'88'&-IIII11II,I++1//I++86,$99I6#I&?&I4NM:i:0MD:Z:162AS:i:162XS:i:142SA:Z:chr8,312049,+,191S37M1D13M,48,2;XA:Z:chr1,-848200,57S162M22S,4;MQ:i:48ip:i:310738mp:i:312049ep:i:310555rt:A:0cb:Z:chr8_chr8_0_1_0_16_000310738_000312049RG:Z:HIC13904Yt:Z:UN 42 0 310577 0 310738 0
180786-UGAv3-23-4079533886 ! 0 chr8 312099 - - NU 180786-UGAv3-23-4079533886256chr831204948191H37M1D13M
00TCTGTTTCCACTTTTGTCAGATTCGATGGCAACAAGATGAATGCGCCCTG"III?I?==II'88'&-IIII11II,I++1//I++86,$99I6#I&?&I4NM:i:2MD:Z:37^T4G8AS:i:38XS:i:21SA:Z:chr8,310577,+,22S162M57S,42,0;MQ:i:42ip:i:312049mp:i:310738ep:i:312099rt:A:1cb:Z:chr8_chr8_0_1_0_16_000310738_000312049RG:Z:HIC13904Yt:Z:NU 0 48 0 312099 0 312049

Maybe I misunderstand the alignment, but isn't it indicating that these two are chimeric alignments from the same read? If that is the case, shouldn't these be pairs and be in the same line of the pairs file as linked fragments?

Thank you!

Salva

@agalitsyna
Copy link
Member

agalitsyna commented Sep 28, 2024

@Salvacasani , that happens because your bam file is not sorted by read ID, and pairtools parse relies on parts of the same read arranged next to each other. Simple reordering of your .bam file with:

samtools sort -n subsampled2_sorted_chr8.bam > subsampled2_re-sorted_chr8.bam

produces these three nice pairs for the read 180786-UGAv3-23-4079533886:

parse2 --no-flip --drop-sam --drop-seq --single-end --max-inter-align-gap 20 --nproc-in 5 --nproc-out 5 --add-columns mapq -c hg38.chrom.sizes.reduced ./subsampled2_re-sorted_chr8.bam
180786-UGAv3-23-4079533886      !       0       chr8    310738  -       -       NU      0       42
180786-UGAv3-23-4079533886      chr8    310577  chr8    312099  +       -       UU      42      48 

I thought that we mention in the manual that the input alignments shall be sorted by the read ID, but I cannot find any mention of that. Attention @golobor , it might be worthy to add it.

@Salvacasani
Copy link

Gotcha, Thank you so much for the help!!

@Henrikkoe
Copy link

I tried it with the new version pairtools-v1.1.0-fix and it still does not work for me.

This is the command I used:

pairtools parse2 --output-stats file.pairs.stats  -c GRCh38.rg.fa.fai  --single-end --readid-transform 'readID.split(":")[0]' --drop-sam --drop-seq --add-pair-index --add-columns mapq,pos5,pos3,cigar,read_len,matched_bp,algn_ref_span,algn_read_span,dist_to_5,dist_to_3,mismatches file.bam > file.pairs 

Output example for pairtools-v1.1.0-fix

62716bd6-e872-4f1c-a787-00906707b280	chr3	38712540	!	0	+	-	UN	1	R1	60	0	38712540	0	38713313	0	1S140M1I4M1I76M1D240M3D86M1I16M2I1M1I2M1I2M1D148M2D52M	*	775	0	767	0	774	0	774	0	1	0	0	0		
62716bd6-e872-4f1c-a787-00906707b280	chr3	38711912	!	0	-	-	UN	1	R1	11	0	38711912	0	38711850	0	52M1D10M	*	62	0	62	0	63	0	62	0	0	0	0	0		
62716bd6-e872-4f1c-a787-00906707b280	chr3	38711849	!	0	-	-	UN	1	R1	60	0	38711849	0	38711626	0	155M1I16M1D3M1I49M	*	225	0	223	0	224	0	225	0	0	0	0	0		
62716bd6-e872-4f1c-a787-00906707b280	chr3	38711625	!	0	-	-	UN	1	R1	27	0	38711625	0	38711538	0	88M	*	88	0	88	0	88	0	88	0	0	0	0	0		

Output example for pairtools 1.0.3

62716bd6-e872-4f1c-a787-00906707b280	chr3	38711912	chr3	38711626	-	+	UU	1	R1	11	60	38711912	38711626	38711850	38711849	52M1D10M	155M1I16M1D3M1I49M	62	225	62	223	63	224	62	225	0	0	0	0		
62716bd6-e872-4f1c-a787-00906707b280	chr3	38711849	chr3	38711538	-	+	UU	2	R1	60	27	38711849	38711538	38711626	38711625	155M1I16M1D3M1I49M	88M	225	88	223	88	224	88	225	88	0	0	0	0		
62716bd6-e872-4f1c-a787-00906707b280	chr3	38711625	chr3	38712008	-	+	UU	3	R1	27	60	38711625	38712008	38711538	38712146	88M	86M1I53M	88	140	88	139	88	139	88	140	0	0	0	0		
62716bd6-e872-4f1c-a787-00906707b280	chr3	38712146	chr3	38710966	-	-	UU	4	R1	60	60	38712146	38710966	38712008	38710569	86M1I53M	137M1I110M1I151M	140	400	139	398	139	398	140	400	0	0	0	0		

I can as well provide a small subset bam file. Just tell me where I should send it.

@agalitsyna
Copy link
Member

@Henrikkoe , thanks! That might be due to readID transformation. I believe I have brought it now to the working state in the same branch:
#251

Do you mind pulling the most recent version of the branch and checking the output?

If the problem persists, you can run the following:

samtools view -h file.bam | head -n 1000 | samtools view -bS - > sample.bam

and post the file either through zip attachements to github messages or though OSF for further debug.

@Henrikkoe
Copy link

Thanks, this seems to be working now.

I have a follow up question to the --single-end mode but I can also shift this into a new issue. But this might be also related.

Could I use the --single-end command together with --expand? Because in combination the output is the same to --single-end only without any combinatorial expansion and using only--expand ends up in a weird combination and small number of interactions.

--single-end only or --single-end + --expand

b5bc7155-a245-4aeb-9c79-d231e824fcaa	chr8	81338701	chr8	81387024	+	+	UU	20	R1	60	60	81338701	81387024	81339221	81387212	17S102M1D23M1D20M3D5M1D17M1I5M1D57M1D35M1I2M1D13M1I1M2I2M1D34M3D20M1D5M1D6M1D38M1D1M1D95M3D2M1D16M	14M1I28M1I47M1D2M1D96M43S	521	232	499	187	521	189	504	189	17	43	0	0		
b5bc7155-a245-4aeb-9c79-d231e824fcaa	chr8	81387212	chr8	81384871	-	+	UU	21	R1	60	60	81387212	81384871	81387024	81385252	14M1I28M1I47M1D2M1D96M43S	160M1D114M3D104M94S	232	472	187	378	189	382	189	378	43	94	0	0		
b5bc7155-a245-4aeb-9c79-d231e824fcaa	chr8	81385252	chr8	81623567	-	-	UU	22	R1	60	60	81385252	81623567	81384871	81623350	160M1D114M3D104M94S	348S63M1D154M1S	472	566	378	217	382	218	378	217	94	348	0	1		

--expand only

b5bc7155-a245-4aeb-9c79-d231e824fcaa	!	0	chr8	81623350	-	+	NU	1	R1-2	0	60	0	81623350	0	81623567	*	348S63M1D154M1S	0	566	0	217	0	218	0	217	0	348	0	1		
b5bc7155-a245-4aeb-9c79-d231e824fcaa	!	0	chr8	81369851	-	+	NU	1	E1_R1-2	0	60	0	81369851	0	81370157	*	80M1I2M3D127M1D12M2D49M1D10M3D17M22S	0	320	0	297	0	307	0	298	0	0	0	22		
b5bc7155-a245-4aeb-9c79-d231e824fcaa	!	0	chr8	81369851	+	+	MU	2	R2	0	60	0	81369851	0	81370157	75M1I93M	80M1I2M3D127M1D12M2D49M1D10M3D17M22S	169	320	168	297	168	307	169	298	0	0	0	22		

@agalitsyna
Copy link
Member

@Henrikkoe Good catch! This issue was inherited from earlier versions of pairtools, so thanks for pointing it out. Now --single-end + --expand should work as expected. Based on your feedback, I've also added tests for single-end and expanded regimes. Let us know if you notice any unexpected behavior!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants