Speed up loading a `MoleculeStore` from a `QCArchiveDataset` #81
yammbs/_store.py
    if record.qcarchive_id in seen:
        continue
One of these lines isn't hit in code coverage - I think `record.qcarchive_id` is always unique, so skipping over duplicates isn't necessary (here), whereas this is not the case for molecules de-duplicated by SMILES?
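To illustrate the distinction, here is a minimal sketch of the `seen`-set pattern applied to two different keys. The `Record` class, its `smiles` field, and the `unique_by` helper are hypothetical stand-ins, not the project's actual types; the point is only that a duplicate check can fire for a key that repeats (SMILES) and never fires for a key that is already unique (the record ID).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical stand-in for a QCArchive record, for illustration only.
    qcarchive_id: str
    smiles: str


records = [
    Record("1", "CCO"),
    Record("2", "CCO"),  # a second conformer of the same molecule
    Record("3", "CC"),
]


def unique_by(items, key):
    """Keep the first item seen for each key value."""
    seen = set()
    kept = []
    for item in items:
        value = key(item)
        if value in seen:
            continue  # only reachable when the key repeats
        seen.add(value)
        kept.append(item)
    return kept


molecules = unique_by(records, key=lambda r: r.smiles)
by_id = unique_by(records, key=lambda r: r.qcarchive_id)
```

If the IDs really are unique, the `continue` branch is dead code for the ID case, which would match the coverage report.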
That sounds right, I'll just delete the check. Thanks!
I'm not sure how this works in the ORM, but possibly a `UNIQUE` constraint could be added on that field to the database itself if you wanted.
I hadn't thought that far - do you think it makes sense to add it?
I'm not sure. Knowing `UNIQUE` exists is near the edge of my database expertise. I think it will cause an error if multiple records with the same ID try to be inserted, so if other parts of the code rely on the uniqueness assumption, it might be a good thing to add. Otherwise I don't think it should have any effect.
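For reference, a small sketch of that behavior using Python's built-in `sqlite3` module rather than the ORM. The table and column names are hypothetical, and whether the project's database would behave identically through its ORM layer is an assumption; in plain SQLite, a second insert with the same value in a `UNIQUE` column raises `sqlite3.IntegrityError`.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE qm_records ("
    " id INTEGER PRIMARY KEY,"
    " qcarchive_id TEXT UNIQUE"
    ")"
)

conn.execute("INSERT INTO qm_records (qcarchive_id) VALUES (?)", ("1001",))

try:
    # A second row with the same qcarchive_id violates the UNIQUE constraint.
    conn.execute("INSERT INTO qm_records (qcarchive_id) VALUES (?)", ("1001",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The failed statement is backed out, so only the first row remains.
row_count = conn.execute("SELECT COUNT(*) FROM qm_records").fetchone()[0]
```

With unique IDs the constraint never triggers, so it would only matter as a guard against a broken assumption elsewhere in the code.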
LGTM, modulo a couple of questions to help me better follow the flow of information.
(Feel free to merge if I forget to loop back to this!)
yammbs/_store.py
        inchi_key=molecule.to_inchi(fixed_hydrogens=True),
    )
    db.db.add(db_record)
    db.db.commit()
Curious why this is here but there isn't the same call around L590?
I don't quite remember, and `git log` (or `blame`) isn't helping me find it. I vaguely remember there being some issue with IDs not incrementing, but that would seem to affect the call below too, as you point out. I think that had more to do with closing and reopening the session, as the comment says. Removing the extra `commit` call doesn't cause any tests to fail, so maybe it was just overly cautious.
I'll go ahead and push that commit to avoid that confusion in the future, if that sounds good to you.
(This is mostly unrelated to this PR, but I'd be happy to delete the `CachedResultCollection` stuff too now that YDS is updated. I accidentally deleted the same line in that constructor first since they look so similar.)
> I'd be happy to delete the `CachedResultCollection` stuff

Please do! I keep forgetting to put that request into text.
Feel free to do that here or in a follow-up.
And re: `commit()`, I leaned towards thinking it was harmless; I was just confused that it wasn't in both places.
I agree it was probably pretty harmless, but calling it each time likely slows things down a bit. I'm also realizing that it was probably my first attempt to fix the session issue that I never went back and cleaned up after I found the real fix. Since I couldn't come up with a reason to comment why it was there, and the tests pass without it, I went ahead and deleted it.
Do you have any preference on deleting the `CachedResultCollection` here or in a separate PR? Separate seems cleanest, but I'm happy to do it here if that's easier. Doing it here is marginally easier for me anyway.
My preference (only strong enough to serve as a tie-breaker) is for it to be in a separate PR
The changes here are analogous to (and copied from) those in #21. As before, the source of the improvements is holding the DB session for multiple insertions instead of inserting one record at a time. I felt a little bad copy-pasting, but I think `CachedResultCollection` is deprecated and likely to be removed soon anyway.

I also modified one of the tests because it wasn't failing when I added an early return to `from_qcarchive_dataset` that returned an empty `MoleculeStore`.

In terms of speedup, I'm seeing almost 20x in the following script:
The old version of the code took 2751.4 seconds to load cache.json from YDS, while the new version only took 143.7 seconds. These numbers are from my desktop, but the "old" value aligns well with storing the molecules taking about 45 minutes on AWS in YDS too.
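As a rough illustration of where this kind of speedup comes from, the sketch below contrasts committing after every record with holding one transaction open for all insertions. It uses the standard-library `sqlite3` module and a hypothetical `records` table rather than the project's ORM session, so it is an analogy to the change in this PR, not the code from it.

```python
import sqlite3


def make_db():
    # In-memory SQLite database with a hypothetical `records` table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, qcarchive_id TEXT)")
    return conn


def insert_one_at_a_time(conn, ids):
    # One transaction per record: every row pays the commit overhead.
    for qcarchive_id in ids:
        conn.execute("INSERT INTO records (qcarchive_id) VALUES (?)", (qcarchive_id,))
        conn.commit()


def insert_in_one_transaction(conn, ids):
    # Hold one transaction open for all rows and commit once at the end.
    with conn:
        conn.executemany(
            "INSERT INTO records (qcarchive_id) VALUES (?)",
            [(qcarchive_id,) for qcarchive_id in ids],
        )


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]


ids = [str(i) for i in range(1000)]

slow_db = make_db()
insert_one_at_a_time(slow_db, ids)

fast_db = make_db()
insert_in_one_transaction(fast_db, ids)

slow_count = count_rows(slow_db)
fast_count = count_rows(fast_db)
```

Both functions store the same rows; the difference is only in how often the database is asked to finalize a transaction. On an on-disk database each commit typically forces a write, so the per-record version tends to scale much worse, though the exact ratio depends on the storage and the ORM overhead.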