Allow not normalizing of bases in the MAF with command line option #232

Closed
inodb opened this issue Oct 17, 2022 · 0 comments · Fixed by #233
inodb commented Oct 17, 2022

This is related to this issue:

mskcc/vcf2maf#279

Occasionally you might want to keep the ref/alt allele bases as they are, because they give you more information about the surrounding bases. There are three options for the base normalization:

  1. Strip off only the very first matching base from ref+alt (first)
  2. Strip off all matching starting bases from ref+alt (all) -- this is the current behavior
  3. Don't do any harmonization, but still store all the appropriate fields as if it were normalized (none)
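
A minimal Python sketch of the three modes listed above, for illustration only. The function name `strip_matching_bases` and its signature are hypothetical and are not taken from annotation-tools or genome-nexus-annotation-pipeline. It assumes a VCF-style 1-based position, writes an empty allele as `-`, and simply shifts the position by the number of stripped bases; MAF-specific start/end conventions for indels are not modeled.

```python
def strip_matching_bases(pos, ref, alt, mode="all"):
    """Strip shared leading bases from ref/alt (hypothetical helper).

    mode="first": strip at most one shared leading base
    mode="all":   strip every shared leading base
    mode="none":  return the input unchanged
    """
    if mode not in ("first", "all", "none"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "none":
        return pos, ref, alt

    # Treat "-" as an empty allele so already-stripped input is handled.
    ref = "" if ref == "-" else ref
    alt = "" if alt == "-" else alt

    # Upper bound on how many leading bases may be removed.
    limit = 1 if mode == "first" else min(len(ref), len(alt))
    limit = min(limit, len(ref), len(alt))

    n = 0
    while n < limit and ref[n] == alt[n]:
        n += 1

    # Shift the position past the stripped bases; empty alleles become "-".
    return pos + n, ref[n:] or "-", alt[n:] or "-"


print(strip_matching_bases(100, "GCA", "GCAT", "first"))  # (101, 'CA', 'CAT')
print(strip_matching_bases(100, "GCA", "GCAT", "all"))    # (103, '-', 'T')
print(strip_matching_bases(100, "GCA", "GCAT", "none"))   # (100, 'GCA', 'GCAT')
```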

This could be relevant for both annotation-tools and genome-nexus-annotation-pipeline. The former does the vcf2maf conversion, but the latter also harmonizes bases (the API returns a harmonized version of chrom/pos/ref/alt). We should probably add options around this to both tools, so the annotation pipeline can have some option like this:

--strip-matching-bases {first,all,none}

And the annotation-tools could have something like:

--strip-matching-bases {first,all}

For annotation-tools it probably doesn't make sense to have the "none" option, since you are starting from the VCF file, which by definition lists the additional base in ref and alt for indels.
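
As a rough sketch of how the proposed flag could be declared, here is an `argparse` example. The `build_parser` helper and its `allow_none` parameter are hypothetical, and defaulting to `all` is an assumption based on "all" being the current behavior; neither tool's actual command line is shown here.

```python
import argparse


def build_parser(allow_none):
    """Build a parser with the proposed --strip-matching-bases option.

    allow_none=True  mimics the annotation pipeline ({first,all,none}).
    allow_none=False mimics annotation-tools ({first,all}).
    """
    choices = ["first", "all"]
    if allow_none:
        choices.append("none")

    parser = argparse.ArgumentParser(description="hypothetical example")
    parser.add_argument(
        "--strip-matching-bases",
        choices=choices,
        default="all",  # assumption: keep the current behavior by default
        help="how many matching leading bases to strip from ref/alt",
    )
    return parser


args = build_parser(allow_none=True).parse_args(["--strip-matching-bases", "none"])
print(args.strip_matching_bases)  # none
```

With `allow_none=False`, passing `--strip-matching-bases none` is rejected by `argparse` as an invalid choice.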

Note that the issue with using "first" is that if you run the MAF through multiple times, it will change every time until all matching bases are stripped off. This is not a big deal if you start from the source VCF, which is how it works for most internal pipelines at MSK, but it can be an issue when you use the MAF as the source of truth file. Some way to capture immutable genomic locations was implemented previously but never merged, so it might be good to revisit that. Another option is to add a feature like that to the conversion script from VCF to MAF, i.e. add the original VCF fields to the resulting MAF to make sure you don't lose the source of truth. Then whenever you re-annotate, you use the source of truth fields rather than the potentially harmonized fields.
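
To make the repeated-run problem concrete, here is a small sketch. Everything in it is hypothetical: `strip_first`, `reannotate`, and the `orig_pos`/`orig_ref`/`orig_alt` keys are illustrative names, not fields that either tool currently writes. The first part shows a record drifting across runs; the second shows how re-annotating from preserved original fields keeps the result stable. Empty alleles are left as empty strings here for brevity.

```python
def strip_first(pos, ref, alt):
    """Strip at most one shared leading base (the "first" mode)."""
    if ref and alt and ref[0] == alt[0]:
        return pos + 1, ref[1:], alt[1:]
    return pos, ref, alt


def reannotate(record):
    """Re-derive pos/ref/alt from the preserved original fields.

    Because the input is always the untouched originals, running this
    any number of times yields the same record.
    """
    pos, ref, alt = strip_first(
        record["orig_pos"], record["orig_ref"], record["orig_alt"]
    )
    return {**record, "pos": pos, "ref": ref, "alt": alt}


# Naive re-runs on the already-stripped values keep changing the variant.
variant = (100, "GCA", "GCAT")
for _ in range(3):
    variant = strip_first(*variant)
    print(variant)
# (101, 'CA', 'CAT')
# (102, 'A', 'AT')
# (103, '', 'T')

# Re-annotating from the preserved originals is stable.
record = reannotate({"orig_pos": 100, "orig_ref": "GCA", "orig_alt": "GCAT"})
print(reannotate(record) == record)  # True
```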

Note: need to figure out what to do with matching ending bases
