Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update to work with vcf 4.4 prefixed phasing info #1861

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

vasudeva8
Copy link
Contributor

@vasudeva8 vasudeva8 commented Nov 26, 2024

update to work with vcf 4.4 specification, on prefixed phasing indicators.

vcf files with version < 4.4 are handled as earlier and changes apply only to v4.4 and later.
when explicit phasing indicator present, it is used.
when there are no explicit phasing indicator, phasing status is identified as given in 6.3.3 of vcf4.4 format.
haploids considered implicitly phased and others based on presence of other unphased alleles.

outputs will have prefixed phasing indicator when phasing can not be correctly inferred without it being present.
e.g. unphased haploid - though not found usually, unphased 1st allele with all other alleles phased, phased 1st allele with another unphased allele etc.

A new bcf_format_gt1 method, derived from bcf_format_gt does this output formatting. bcftools may have to use this method to handle vcf 4.4 with prefixed phasing info.
A version enumeration type is added and a method to convert file format header to enumerated version also added.

Fixes #1847

htslib/vcf.h Outdated
* Explicit / prefixed phasing for 1st allele is used only when it is a must to
* correctly express phasing.
*/
static inline int bcf_format_gt1(const bcf_hdr_t *hdr, bcf_fmt_t *fmt, int isample, kstring_t *str)
Copy link
Member

@pd3 pd3 Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suffix 1, as used here, is confusing and somewhat conflicting with the convention used throughout the rest of the code. Can you rename so that the 'phase' keyword is included? Or, even better, create an extended function where various flags can be turned on, starting with PHASE_1ST_ALLELE, or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name should not be made more specific, and if possible made more generic!

An implementation which supports the newer format should be the 1st choice going forward.
Best would have been to update the current method itself but not possible as it would break the compatibility due to signature change.
Hence chose a name which is in sync with current method. Also we have a few others like index_load/index_build/ etc in same lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the name to bcf_format_gt_v2

@jmarshall
Copy link
Member

Re bcf_get_version(), I think the “htslib way” would be to accept this notation regardless of input file VCF level — trusting the users only to use it where appropriate.

If it is decided to keep bcf_get_version(), I think it would be better to return the VCF level as an integer like 4005000 and avoid introducing an enum like bcf_version.

@vasudeva8
Copy link
Contributor Author

updated based on review discussions.
moved version to bcf_hdr_aux_t to cache it and avoid repeated retrieval from header
updated bcf_format_gt to use bcf_format_gt_v2
changed enumeration to an integer variable
uses 4.2 as the default, 4002000

vcf.c Show resolved Hide resolved
vcf.c Outdated
goto fail;
}
ptr = hdr ? (void*)hdr : (void*)verstr;
if (ptr != src) { //reset log status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than use a static variable, which can cause problems with threads etc., it would be better to use the value in get_hdr_aux(hdr)->version to check if the version has been set. The bcf_hdr_init() function would need changing slightly to make this work though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

vcf.c Outdated
@@ -1420,6 +1492,7 @@ bcf_hdr_t *bcf_hdr_init(const char *mode)
if ( (aux->gen = kh_init(hdict))==NULL ) { free(aux); goto fail; }
aux->key_len = NULL;
aux->dict = *((vdict_t*)h->dict[0]);
aux->version = VCF_DEF;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would need to start as something like 0 to allow it to be used to detect if the file version was set. It should always be set to something else before the value is used because the fileformat header is compulsory, but even if it isn't it should still work as we check for version >= VCF44.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@vasudeva8 vasudeva8 force-pushed the vcf44GT branch 2 times, most recently from 81f814d to 3f89c0a Compare January 7, 2025 15:02
htslib/vcf.h Show resolved Hide resolved
vcf.c Outdated
}
#undef BRANCH

if (get_hdr_aux(hdr)->version >= VCF44) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs if (hdr && ... else Bad Things happen if you use the old bcf_format_gt() interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, required as bcf_format_gt is still available.
updated.

vcf.c Outdated
Comment on lines 6106 to 6134
if (get_hdr_aux(hdr)->version >= VCF44) {
//output which supports prefixed phasing

/* update 1st allele's phasing if required and append rest to it.
use prefixed phasing only when it is a must. i.e. without which the
inferred value will be incorrect */
if (val0 & 1) {
/* 1st one is phased, if ploidy is > 1 and an unphased allele exists
need to specify explicitly */
e |= (ploidy > 1 && anyunphased) ?
(kputc('|', &tmp2) < 0) :
(ploidy <= 1 && !((val0 >> 1)) ? //|. needs explicit o/p
(kputc('|', &tmp2) < 0) :
0);
} else {
/* 1st allele is unphased, if ploidy is = 1 or allele is '.' or
ploidy > 1 and no other unphased allele exist, need to specify
explicitly */
e |= ((ploidy <= 1 && val0 != 0) || (ploidy > 1 && !anyunphased)) ?
(kputc('/', &tmp2) < 0) :
0;
}
e |= kputsn(tmp1.s, tmp1.l, &tmp2) < 0; //append rest with updated one
ks_free(&tmp1);
tmp1 = tmp2;
}
//updated v44 string or <v44 without any update
e |= kputsn(tmp1.s, tmp1.l, str) < 0;
ks_free(&tmp1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be made mode efficient, especially in the case where the phase prefix isn't needed, by writing directly into str. Then if you do need to add a prefix you can extend the string by one byte, use memmove() to shuffle everything down and then pop the prefix at the front. This would avoid having to allocate and free temporary buffers. Assuming the prefix mostly isn't needed I suspect it would also be faster than the alternative of scanning the GT data up-front to see if you have to add it.

As it could be useful elsewhere, it might be good to add a ks_insert_char(char c, size_t pos) function to htslib/kstring.h to do the memmove() and character insertion business.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added kinsert_char and kinsert_str.
Uses kinsert_char to insert leading phase indicator when required.

@vasudeva8 vasudeva8 force-pushed the vcf44GT branch 3 times, most recently from 084325b to d9d1922 Compare January 13, 2025 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTSlib does not support VCF v4.4's explicit first phasing indicator
4 participants