cdot responsible for fixing bad HGVS, allow warnings etc #27

Open
davmlaw opened this issue Nov 21, 2022 · 5 comments

Comments

@davmlaw
Contributor

davmlaw commented Nov 21, 2022

There are plenty of bad HGVS strings out there, especially when people are typing into a search box, e.g. they put spaces in, forget the colon, leave brackets unbalanced etc.

VariantGrid has a lot of functionality to handle sloppy/bad HGVSs - mostly in search and HGVS Matcher

Ideally we should move all of this functionality into cdot, so that it is generally useful.

Would be nice to have a framework where you return a list of well structured warnings / errors etc.
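As a rough illustration of what "a list of well structured warnings / errors" could look like, here is a minimal sketch. None of these names (`Severity`, `HGVSMessage`, `check_hgvs`, the message codes) exist in cdot or VariantGrid; they are made up for the example, and the three checks only mirror the problems mentioned above (spaces, missing colon, unbalanced brackets).

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Severity(Enum):
    WARNING = "warning"  # problem was fixed automatically
    ERROR = "error"      # problem could not be fixed safely


@dataclass(frozen=True)
class HGVSMessage:
    severity: Severity
    code: str  # short machine-readable identifier (hypothetical names)
    text: str  # human-readable explanation


def check_hgvs(hgvs_string: str) -> Tuple[str, List[HGVSMessage]]:
    """Return the cleaned string plus every problem found along the way."""
    messages = []
    cleaned = hgvs_string

    # Whitespace is safe to remove, so fix it and report a warning.
    if re.search(r"\s", cleaned):
        cleaned = re.sub(r"\s+", "", cleaned)
        messages.append(HGVSMessage(
            Severity.WARNING, "whitespace_removed", "Removed whitespace"))

    # Unbalanced brackets are only reported; guessing a fix is risky.
    if cleaned.count("(") != cleaned.count(")"):
        messages.append(HGVSMessage(
            Severity.ERROR, "unbalanced_brackets", "Brackets do not balance"))

    # Without a colon there is no way to split reference from change.
    if ":" not in cleaned:
        messages.append(HGVSMessage(
            Severity.ERROR, "missing_colon", "No ':' between reference and change"))

    return cleaned, messages
```

A caller can then decide what to do per severity, for example show warnings next to the search result and stop on errors.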

@davmlaw
Contributor Author

davmlaw commented Feb 27, 2023

It would be good to collect a large set of test cases of bad HGVS strings (from search bars around the place) and then work out how to resolve them.

@TheMadBug
Member

The two big issues we see in Shariant search (the examples aren't valid, they just show the kinds of issues):

  • Incorrect case (both the NM and the c), e.g. nm_002342.3(BRCA2):C.5094-11G>A
  • Trailing quotes or leading tabs, e.g. NM_010934.4:c.3112A>G' though I believe fixing that should probably be done at the application layer before it gets to cdot, as those kinds of issues apply to all searches.
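Both issues can be sketched in a few lines. This is only an illustration, assuming the string has the shape `<two-letter prefix>_<accession>:<kind>.<change>`; the function names `strip_stray_characters` and `fix_case` are hypothetical and not part of cdot. Only the accession prefix and the kind letter are touched, since gene symbols and the change itself may legitimately contain mixed case.

```python
import re

# Two-letter prefix with underscore, anything up to the colon,
# a single kind letter, a dot, then the change.
_HGVS_RE = re.compile(
    r"^(?P<prefix>[A-Za-z]{2}_)(?P<accession>[^:]*):(?P<kind>[A-Za-z])\.(?P<change>.*)$"
)


def strip_stray_characters(text: str) -> str:
    """Drop leading/trailing whitespace (including tabs) and quote marks."""
    return text.strip(" \t\r\n'\"")


def fix_case(hgvs_string: str) -> str:
    """Upper-case the accession prefix and lower-case the kind letter."""
    match = _HGVS_RE.match(hgvs_string)
    if match is None:
        # Not the shape we expect, so leave it alone rather than guess.
        return hgvs_string
    return (
        f"{match['prefix'].upper()}{match['accession']}"
        f":{match['kind'].lower()}.{match['change']}"
    )
```

With the two examples above, `fix_case("nm_002342.3(BRCA2):C.5094-11G>A")` gives `NM_002342.3(BRCA2):c.5094-11G>A`, and `strip_stray_characters` removes the trailing quote from `NM_010934.4:c.3112A>G'`.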

@davmlaw
Contributor Author

davmlaw commented Mar 1, 2023

Will run this on each environment:

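import socket

import pandas as pd
from django.db.models import Q
from eventlog.models import Event

# Tag the output file with the host so per-environment dumps can be merged later
hostname = socket.gethostname()

# Keep only search events whose details look like a coding HGVS ("c." or ":c")
search_qs = Event.objects.filter(name='search')
search_hgvs = search_qs.filter(Q(details__icontains='c.') | Q(details__icontains=':c'))

# Name the columns explicitly; from_records would otherwise label them 0 and 1
columns = ["date", "details"]
df = pd.DataFrame.from_records(search_hgvs.values_list(*columns), columns=columns)

df.to_csv(f"/tmp/{hostname}_search_hgvs.csv", index=False)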

Then collect them all together. Have put the scripts in the "paper" directory in the cdot GitHub repo.

Emailed the csv to James and myself to continue the analysis (need to clean out data from private servers before I share it)

@davmlaw
Contributor Author

davmlaw commented Mar 1, 2023

Web developers know to clean their user text, but I think the main use case of cdot would be bioinformaticians hacking together scripts

We could run an evaluation of how many HGVSs resolve from the literature and ClinVar etc as well
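One way such an evaluation could be tallied is sketched below. It assumes nothing about cdot itself: `resolve` is whatever callable you want to evaluate (a parser, a parser plus cleaning step, etc.), and `resolution_summary` is a hypothetical helper name. Failures are grouped by exception class so the most common failure modes stand out.

```python
from collections import Counter
from typing import Callable, Iterable


def resolution_summary(
    hgvs_strings: Iterable[str],
    resolve: Callable[[str], object],
) -> Counter:
    """Count how many strings resolve, grouping failures by exception type."""
    outcomes = Counter()
    for hgvs_string in hgvs_strings:
        try:
            resolve(hgvs_string)
        except Exception as exc:
            # Key on the exception class name, e.g. "ValueError"
            outcomes[type(exc).__name__] += 1
        else:
            outcomes["resolved"] += 1
    return outcomes
```

Running the same list through the resolver with and without a cleaning step would show how many extra strings the cleaning rescues.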

@davmlaw
Contributor Author

davmlaw commented Apr 20, 2023

Few thoughts:

  • At the moment, no modification is done on HGVS import
  • This change proposes to add one cleaning op
  • On search, a number of cleaning ops are performed

Search currently works via:

  • Cleaning is automatically done, but ad-hoc
  • Messages are put into the search result object
  • Exceptions are returned and stop further search - can be rendered as warnings/errors

Few ideas:

  • If you add a general "clean hgvs" method, it would be good to be able to pass the subset of cleaning operations you want done, or maybe even expose the individual cleaning functions or classes, however we end up doing it
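A minimal sketch of that idea, assuming each cleaning op is a plain `str -> str` function kept in a registry. The names `CLEANING_OPS`, `clean_hgvs` and the three ops are hypothetical and only for illustration; they are not an existing cdot API.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def remove_whitespace(hgvs_string: str) -> str:
    return re.sub(r"\s+", "", hgvs_string)


def strip_quotes(hgvs_string: str) -> str:
    return hgvs_string.strip("'\"")


def lowercase_kind(hgvs_string: str) -> str:
    # ":C." -> ":c." (same for the other single-letter kinds listed here)
    return re.sub(r":([CGMNPR])\.", lambda m: ":" + m.group(1).lower() + ".", hgvs_string)


# Registry of individually usable ops; dict order is the default run order.
CLEANING_OPS: Dict[str, Callable[[str], str]] = {
    "remove_whitespace": remove_whitespace,
    "strip_quotes": strip_quotes,
    "lowercase_kind": lowercase_kind,
}


def clean_hgvs(
    hgvs_string: str,
    ops: Optional[Iterable[str]] = None,
) -> Tuple[str, List[str]]:
    """Apply the requested ops (all by default); return result and ops that changed it."""
    names = list(CLEANING_OPS) if ops is None else list(ops)
    applied = []
    for name in names:
        cleaned = CLEANING_OPS[name](hgvs_string)
        if cleaned != hgvs_string:
            applied.append(name)
            hgvs_string = cleaned
    return hgvs_string, applied
```

Import code could then call `clean_hgvs(text, ops=["strip_quotes"])` to opt in to a single op, while search could run the full set and turn the returned list of applied ops into messages.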

@davmlaw changed the title from "Resolving / fixing bad HGVS, allow warnings etc" to "cdot responsible for fixing bad HGVS, allow warnings etc" Apr 20, 2023