-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove ga4gh:
prefix from VCF annotations?
#470
Comments
@jsstevenson Yes, this makes sense. Since VRS is included in the key name, I think it's safe to do this. We may want to consider renaming to |
We should think about compatibility with AnyVar as well. I would consider this to be a breaking change conceptually. AnyVar subclasses the I think this would make vrs-python and anyvar behave differently when it comes to VCF annotation. Seems like consistency would be better. |
@ehclark that's a fair point, it definitely is breaking. What if something like this was an optional flag? |
First, a couple comments on the proposed change separate from any concern related to AnyVar. From my perspective, I would not want the VRS IDs to be shortened. I see vrs-python (via AnyVar) as a way to outsource the calculation of VRS IDs to an external service. Right now it's just variant alleles. But as the types of variants supported by VRS expands, that will change. Having a stable interface between our application and VRS ID computation would be preferable. Second, VRS ID computation is presently does not consume lots of memory per CPU. In my load testing in an OpenShift cluster, 4 pods each running 4 worker processes peaked out at a total of 14 CPUs and 8 GB RAM over a 30 minute period. So the memory savings may be real, but its inconsequential given the general RAM/CPU ratios in standard hardware. Second, this may be an opportunity to improve the relationship between AnyVar and vrs-python. The AnyVar class VCFAnnotator:
# ...existing implementation...
def on_vrs_object(self, vcf_coords: str, vrs_object: VrsObject) -> bool:
"""Called for each calculated VRS object prior to it being added to the VCF file.
:param vcf_coords: The VCF coordinates of the VRS object
:param vrs_object: The computed VRS object
:return: True if the VRS object should be added to the output VCF, False otherwise
"""
return True
# And in AnyVar
class VcfRegistrar(VCFAnnotator):
def on_vrs_object(self, vcf_coords: str, vrs_object: VrsObject) -> bool:
self.av.put_object(vrs_object)
return True All that said, I think providing an option to drop the leading And lastly, I think that tagging/releasing AnyVar as a beta might help with managing the dependencies between the two. biocommons/anyvar#111 |
Yeah, maybe this was my own ignorance about the kinds of variations that VCFs can report. If we expect to be pulling copy number changes out of there in the future then we clearly need to leave in the type prefix. Obviously these savings are at the margins but I was just thinking of ways that we might be able to score small improvements for size on disk. 👍 for refactoring VRS-Python. I wonder if it'd be helpful to preserve the ability to modify the VRS object on passthrough, eg to add some tailored annotations or mappings to it, but I agree that it'd be good to isolate this part so that we can add hooks there in subclasses. |
Currently, the VRS VCF annotator can be used to add full VRS IDs for all REF and ALT alleles to VCF records, e.g. (you'll probably need to scroll right):
These include
ga4gh:
prefixes. In a file with a lot of rows, this could add a small but nontrivial amount of memory; can we infer (or maybe state explicitly in the corresponding header description) the prefix and leave it off to conserve space?Similarly, can we leave the
VA.
prefix? Will VCF attributions with this tool always produce alleles?The text was updated successfully, but these errors were encountered: