-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update to work with vcf 4.4 prefixed phasing info #1861
base: develop
Are you sure you want to change the base?
Conversation
htslib/vcf.h
Outdated
* Explicit / prefixed phasing for 1st allele is used only when it is a must to | ||
* correctly express phasing. | ||
*/ | ||
static inline int bcf_format_gt1(const bcf_hdr_t *hdr, bcf_fmt_t *fmt, int isample, kstring_t *str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suffix 1, as used here, is confusing and somewhat conflicting with the convention used throughout the rest of the code. Can you rename so that the 'phase' keyword is included? Or, even better, create an extended function where various flags can be turned on, starting with PHASE_1ST_ALLELE, or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name should not be made more specific, and if possible made more generic!
An implementation which supports the newer format should be the 1st choice going forward.
Best would have been to update the current method itself but not possible as it would break the compatibility due to signature change.
Hence chose a name which is in sync with current method. Also we have a few others like index_load/index_build/ etc in same lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the name to bcf_format_gt_v2
Re If it is decided to keep |
updated based on review discussions. |
vcf.c
Outdated
goto fail; | ||
} | ||
ptr = hdr ? (void*)hdr : (void*)verstr; | ||
if (ptr != src) { //reset log status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than use a static variable, which can cause problems with threads etc., it would be better to use the value in get_hdr_aux(hdr)->version
to check if the version has been set. The bcf_hdr_init()
function would need changing slightly to make this work though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
vcf.c
Outdated
@@ -1420,6 +1492,7 @@ bcf_hdr_t *bcf_hdr_init(const char *mode) | |||
if ( (aux->gen = kh_init(hdict))==NULL ) { free(aux); goto fail; } | |||
aux->key_len = NULL; | |||
aux->dict = *((vdict_t*)h->dict[0]); | |||
aux->version = VCF_DEF; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would need to start as something like 0 to allow it to be used to detect if the file version was set. It should always be set to something else before the value is used because the fileformat
header is compulsory, but even if it isn't it should still work as we check for version >= VCF44
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
81f814d
to
3f89c0a
Compare
vcf.c
Outdated
} | ||
#undef BRANCH | ||
|
||
if (get_hdr_aux(hdr)->version >= VCF44) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs if (hdr && ...
else Bad Things happen if you use the old bcf_format_gt()
interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, required as bcf_format_gt is still available.
updated.
vcf.c
Outdated
if (get_hdr_aux(hdr)->version >= VCF44) { | ||
//output which supports prefixed phasing | ||
|
||
/* update 1st allele's phasing if required and append rest to it. | ||
use prefixed phasing only when it is a must. i.e. without which the | ||
inferred value will be incorrect */ | ||
if (val0 & 1) { | ||
/* 1st one is phased, if ploidy is > 1 and an unphased allele exists | ||
need to specify explicitly */ | ||
e |= (ploidy > 1 && anyunphased) ? | ||
(kputc('|', &tmp2) < 0) : | ||
(ploidy <= 1 && !((val0 >> 1)) ? //|. needs explicit o/p | ||
(kputc('|', &tmp2) < 0) : | ||
0); | ||
} else { | ||
/* 1st allele is unphased, if ploidy is = 1 or allele is '.' or | ||
ploidy > 1 and no other unphased allele exist, need to specify | ||
explicitly */ | ||
e |= ((ploidy <= 1 && val0 != 0) || (ploidy > 1 && !anyunphased)) ? | ||
(kputc('/', &tmp2) < 0) : | ||
0; | ||
} | ||
e |= kputsn(tmp1.s, tmp1.l, &tmp2) < 0; //append rest with updated one | ||
ks_free(&tmp1); | ||
tmp1 = tmp2; | ||
} | ||
//updated v44 string or <v44 without any update | ||
e |= kputsn(tmp1.s, tmp1.l, str) < 0; | ||
ks_free(&tmp1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be made mode efficient, especially in the case where the phase prefix isn't needed, by writing directly into str
. Then if you do need to add a prefix you can extend the string by one byte, use memmove()
to shuffle everything down and then pop the prefix at the front. This would avoid having to allocate and free temporary buffers. Assuming the prefix mostly isn't needed I suspect it would also be faster than the alternative of scanning the GT data up-front to see if you have to add it.
As it could be useful elsewhere, it might be good to add a ks_insert_char(char c, size_t pos)
function to htslib/kstring.h
to do the memmove()
and character insertion business.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added kinsert_char and kinsert_str.
Uses kinsert_char to insert leading phase indicator when required.
084325b
to
d9d1922
Compare
update to work with vcf 4.4 specification, on prefixed phasing indicators.
vcf files with version < 4.4 are handled as earlier and changes apply only to v4.4 and later.
when explicit phasing indicator present, it is used.
when there are no explicit phasing indicator, phasing status is identified as given in 6.3.3 of vcf4.4 format.
haploids considered implicitly phased and others based on presence of other unphased alleles.
outputs will have prefixed phasing indicator when phasing can not be correctly inferred without it being present.
e.g. unphased haploid - though not found usually, unphased 1st allele with all other alleles phased, phased 1st allele with another unphased allele etc.
A new bcf_format_gt1 method, derived from bcf_format_gt does this output formatting. bcftools may have to use this method to handle vcf 4.4 with prefixed phasing info.
A version enumeration type is added and a method to convert file format header to enumerated version also added.
Fixes #1847