Identical read_len1 and read_len2 reported with different real read lengths in R1&2 pairs #262

Phlya · 2025-01-14T10:31:21Z

I have some data with different lengths of read1 (81 nt) and read2 (221 nt) and I want to report the length of the read in the pairs. I add

    --add-columns pos5,pos3,read_len,mapq \

to parse2, but in the pairs, I think depending on walk pair type, the read length is sometimes reported incorrectly.
In the case of R1-2 reads both values are reported, in case R1 or R2 reads only the appropriate value is reported. I think this is correct behaviour, but perhaps should be clarified in the docs?

However, in R1&2 pairs both read lengths are reported as 81 nt, while I would expect both 81 and 221 nt reported. This is not what I would expect in this situation, is this intended behaviour?

The text was updated successfully, but these errors were encountered:

agalitsyna · 2025-01-14T10:53:17Z

It looks like your use case implies that read lengths for left and right alignments in the pair should have their lengths reported separately. Do you think that would be more expected behavior if there were read_len1 and read_len2?

Phlya · 2025-01-14T10:57:23Z

There are read_len1 and read_len2 for each pair already, and they are reported separately - and correctly in case of R1-2 pairs.
Just seems like in case of R1&2 it always reports the value from read1.

agalitsyna · 2025-01-14T11:16:34Z

Yeah, I think it's because when the pair is present on both R1 and R2, we report all the properties of this pair on read1. It's not only about the read length, but all the properties of the alignment. Reporting all the info about the alignments originating from both read pairs is a logistic nightmare, and I do not see the reason why read_length2 shall reflect the properties of read 2 instead of read2. I would just accept this fundamental ambiguity of R1&2 pairs and don't consider reporting R1 info as incorrect.
Will be glad to hear other opinions, the line block responsible for this implementation starts here:

pairtools/pairtools/lib/parse.py

Lines 1093 to 1105 in 9a0b894

    
               pair_index = (idx_left + 1, "R1&2") 
        
               output_pairs.append( 
        
                   format_pair( 
        
                       algns1[idx_left], 
        
                       algns1[idx_left + 1], 
        
                       pair_index=pair_index, 
        
                       algn2_pos3=algns2[idx_right - 1]["pos5"], 
        
                       report_position=report_position, 
        
                       report_orientation=report_orientation, 
        
                   ) 
        
               ) 
        
           # Report all the sequential chimeric pairs in the right read, but not the overlap:

Phlya · 2025-01-14T12:12:33Z

If this is very painful to implement, I understand, no problem. But still I think in principle it would be better/cleaner and less confusing... Off the top of my head I can't imagine a scenario where it actually makes a practical difference, but possibly it exists.

agalitsyna · 2025-01-14T12:23:40Z

Yes, for now we have to make choice whether to report info for only one read if the same pair is present on both. You can always report readIDs of R1&2 and make a more thorough analysis of these reads.
Maybe having an option to change between read1 or read2 reporting is a way to go?
Another thought: maybe there can be an option choice where pairtools would report stacked info for read1 and read2 (e.g., comma-separated)? It's not particularly elegant, but if you really need it, you at least can have both in the .pairs file.

Phlya assigned agalitsyna Jan 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Identical read_len1 and read_len2 reported with different real read lengths in R1&2 pairs #262

Identical read_len1 and read_len2 reported with different real read lengths in R1&2 pairs #262

Phlya commented Jan 14, 2025 •

edited

Loading

agalitsyna commented Jan 14, 2025

Phlya commented Jan 14, 2025

agalitsyna commented Jan 14, 2025

Phlya commented Jan 14, 2025

agalitsyna commented Jan 14, 2025

Identical read_len1 and read_len2 reported with different real read lengths in R1&2 pairs #262

Identical read_len1 and read_len2 reported with different real read lengths in R1&2 pairs #262

Comments

Phlya commented Jan 14, 2025 • edited Loading

agalitsyna commented Jan 14, 2025

Phlya commented Jan 14, 2025

agalitsyna commented Jan 14, 2025

Phlya commented Jan 14, 2025

agalitsyna commented Jan 14, 2025

Phlya commented Jan 14, 2025 •

edited

Loading