Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identical read_len1 and read_len2 reported with different real read lengths in R1&2 pairs #262

Open
Phlya opened this issue Jan 14, 2025 · 5 comments
Assignees

Comments

@Phlya
Copy link
Member

Phlya commented Jan 14, 2025

I have some data with different lengths of read1 (81 nt) and read2 (221 nt) and I want to report the length of the read in the pairs. I add

    --add-columns pos5,pos3,read_len,mapq \

to parse2, but in the pairs, I think depending on walk pair type, the read length is sometimes reported incorrectly.
In the case of R1-2 reads both values are reported, in case R1 or R2 reads only the appropriate value is reported. I think this is correct behaviour, but perhaps should be clarified in the docs?

However, in R1&2 pairs both read lengths are reported as 81 nt, while I would expect both 81 and 221 nt reported. This is not what I would expect in this situation, is this intended behaviour?

@agalitsyna
Copy link
Member

It looks like your use case implies that read lengths for left and right alignments in the pair should have their lengths reported separately. Do you think that would be more expected behavior if there were read_len1 and read_len2?

@Phlya
Copy link
Member Author

Phlya commented Jan 14, 2025

There are read_len1 and read_len2 for each pair already, and they are reported separately - and correctly in case of R1-2 pairs.
Just seems like in case of R1&2 it always reports the value from read1.

@agalitsyna
Copy link
Member

Yeah, I think it's because when the pair is present on both R1 and R2, we report all the properties of this pair on read1. It's not only about the read length, but all the properties of the alignment. Reporting all the info about the alignments originating from both read pairs is a logistic nightmare, and I do not see the reason why read_length2 shall reflect the properties of read 2 instead of read2. I would just accept this fundamental ambiguity of R1&2 pairs and don't consider reporting R1 info as incorrect.
Will be glad to hear other opinions, the line block responsible for this implementation starts here:

pair_index = (idx_left + 1, "R1&2")
output_pairs.append(
format_pair(
algns1[idx_left],
algns1[idx_left + 1],
pair_index=pair_index,
algn2_pos3=algns2[idx_right - 1]["pos5"],
report_position=report_position,
report_orientation=report_orientation,
)
)
# Report all the sequential chimeric pairs in the right read, but not the overlap:

@Phlya
Copy link
Member Author

Phlya commented Jan 14, 2025

If this is very painful to implement, I understand, no problem. But still I think in principle it would be better/cleaner and less confusing... Off the top of my head I can't imagine a scenario where it actually makes a practical difference, but possibly it exists.

@agalitsyna
Copy link
Member

Yes, for now we have to make choice whether to report info for only one read if the same pair is present on both. You can always report readIDs of R1&2 and make a more thorough analysis of these reads.
Maybe having an option to change between read1 or read2 reporting is a way to go?
Another thought: maybe there can be an option choice where pairtools would report stacked info for read1 and read2 (e.g., comma-separated)? It's not particularly elegant, but if you really need it, you at least can have both in the .pairs file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants