Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ga4gh: prefix from VCF annotations? #470

Open
jsstevenson opened this issue Dec 9, 2024 · 5 comments
Open

Remove ga4gh: prefix from VCF annotations? #470

jsstevenson opened this issue Dec 9, 2024 · 5 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@jsstevenson
Copy link
Contributor

jsstevenson commented Dec 9, 2024

Currently, the VRS VCF annotator can be used to add full VRS IDs for all REF and ALT alleles to VCF records, e.g. (you'll probably need to scroll right):

chr1	783006	.	A	G	50	PASS	platforms=4;platformnames=PacBio,Illumina,10X,CG;datasets=4;datasetnames=CCS15kb_20kb,HiSeqPE300x,10XChromiumLR,CGnormal;VRS_Allele_IDs=ga4gh:VA.dwwiZdvVtfAmomu0OBsiHue1O-bw5SpG,ga4gh:VA.MiasxyXMXtOpsZgGelL3c4QgtflCNLHD;VRS_Starts=783005,783005;VRS_Ends=783006,783006;VRS_States=A,G	GT:PS:DP:ADALL:AD:GQ	1/1:.:652:16,234:0,82:312

These include ga4gh: prefixes. In a file with a lot of rows, this could add a small but nontrivial amount of memory; can we infer (or maybe state explicitly in the corresponding header description) the prefix and leave it off to conserve space?

Similarly, can we leave the VA. prefix? Will VCF attributions with this tool always produce alleles?

@jsstevenson jsstevenson added question Further information is requested enhancement New feature or request labels Dec 9, 2024
@korikuzma
Copy link
Contributor

@jsstevenson Yes, this makes sense. Since VRS is included in the key name, I think it's safe to do this. We may want to consider renaming to VRS_Allele_Digest

@ehclark
Copy link
Contributor

ehclark commented Dec 9, 2024

We should think about compatibility with AnyVar as well. I would consider this to be a breaking change conceptually. AnyVar subclasses the VCFAnnotator and overrides the _get_vrs_object() method, which is what populates the variant VRS fields into the INFO field object.

I think this would make vrs-python and anyvar behave differently when it comes to VCF annotation. Seems like consistency would be better.

@jsstevenson
Copy link
Contributor Author

@ehclark that's a fair point, it definitely is breaking. What if something like this was an optional flag?

@ehclark
Copy link
Contributor

ehclark commented Dec 10, 2024

First, a couple comments on the proposed change separate from any concern related to AnyVar. From my perspective, I would not want the VRS IDs to be shortened. I see vrs-python (via AnyVar) as a way to outsource the calculation of VRS IDs to an external service. Right now it's just variant alleles. But as the types of variants supported by VRS expands, that will change. Having a stable interface between our application and VRS ID computation would be preferable. Second, VRS ID computation is presently does not consume lots of memory per CPU. In my load testing in an OpenShift cluster, 4 pods each running 4 worker processes peaked out at a total of 14 CPUs and 8 GB RAM over a 30 minute period. So the memory savings may be real, but its inconsequential given the general RAM/CPU ratios in standard hardware.

Second, this may be an opportunity to improve the relationship between AnyVar and vrs-python. The AnyVar VcfRegistrar extends VCFAnnotator and overrides the internal _get_vrs_object() method. The only reason for doing so is to integrate with the AnyVar object store. But then AnyVar does not actually re-implement all the vrs-python functionality, which means there will be continual drift between the two implementations. It seems like whether you annotate a VCF with VRS IDs directly in vrs-python or via AnyVar that the behavior and result should be the same. That would make the tools easier to adopt and use. So perhaps a re-design of the vrs-python VCFAnnotator to provide a listener or callback. For example:

class VCFAnnotator:
  # ...existing implementation...

  def on_vrs_object(self, vcf_coords: str, vrs_object: VrsObject) -> bool:
    """Called for each calculated VRS object prior to it being added to the VCF file.
    :param vcf_coords: The VCF coordinates of the VRS object
    :param vrs_object: The computed VRS object
    :return:  True if the VRS object should be added to the output VCF, False otherwise
    """
    return True

# And in AnyVar
class VcfRegistrar(VCFAnnotator):
  def on_vrs_object(self, vcf_coords: str, vrs_object: VrsObject) -> bool:
    self.av.put_object(vrs_object)
    return True

All that said, I think providing an option to drop the leading ga4gh: would be OK.

And lastly, I think that tagging/releasing AnyVar as a beta might help with managing the dependencies between the two. biocommons/anyvar#111

@jsstevenson
Copy link
Contributor Author

But as the types of variants supported by VRS expands, that will change. Having a stable interface between our application and VRS ID computation would be preferable.

Yeah, maybe this was my own ignorance about the kinds of variations that VCFs can report. If we expect to be pulling copy number changes out of there in the future then we clearly need to leave in the type prefix. Obviously these savings are at the margins but I was just thinking of ways that we might be able to score small improvements for size on disk.

👍 for refactoring VRS-Python. I wonder if it'd be helpful to preserve the ability to modify the VRS object on passthrough, eg to add some tailored annotations or mappings to it, but I agree that it'd be good to isolate this part so that we can add hooks there in subclasses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants