Speed up loading a MoleculeStore from a QCArchiveDataset #81

Merged
merged 5 commits into from
Nov 15, 2024

Conversation

ntBre
Contributor

@ntBre ntBre commented Nov 8, 2024

The changes here are analogous to (and copied from) those in #21. As before, the source of the improvements is holding the DB session for multiple insertions instead of inserting one record at a time. I felt a little bad copy-pasting, but I think CachedResultCollection is deprecated and likely to be removed soon anyway.
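The difference between inserting one record at a time and holding a session open for the whole batch can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the yammbs implementation: the `records` table, the helper names, and the use of raw `sqlite3` in place of the ORM session are all assumptions made for the example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical records table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (id INTEGER PRIMARY KEY, qcarchive_id INTEGER)"
    )
    return conn


def insert_one_at_a_time(conn, ids):
    """Commit after every row: one transaction per record."""
    for qcarchive_id in ids:
        conn.execute(
            "INSERT INTO records (qcarchive_id) VALUES (?)", (qcarchive_id,)
        )
        conn.commit()


def insert_in_one_transaction(conn, ids):
    """Insert the whole batch and commit once."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO records (qcarchive_id) VALUES (?)",
            ((qcarchive_id,) for qcarchive_id in ids),
        )


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

Both helpers leave the same rows in the table; the second pays the per-transaction cost once instead of once per record, which matters most for an on-disk database.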

I also modified one of the tests because it wasn't failing when I added an early return to from_qcarchive_dataset that returned an empty MoleculeStore.
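One common way a test can keep passing against an empty store is a loop whose assertions never run. The sketch below shows that pattern and a stricter variant; `FakeStore`, `smiles_list`, and both check functions are hypothetical stand-ins and are not taken from the yammbs test suite, so this may not be the reason the original test passed.

```python
class FakeStore:
    """Hypothetical stand-in for a store; not the yammbs API."""

    def __init__(self, smiles):
        self._smiles = list(smiles)

    def smiles_list(self):
        return list(self._smiles)


def weak_check(store):
    # Passes vacuously on an empty store: the loop body never executes.
    for smiles in store.smiles_list():
        assert isinstance(smiles, str)
    return True


def strict_check(store, expected_count):
    # Asserting the size first makes an empty store fail immediately.
    smiles = store.smiles_list()
    assert len(smiles) == expected_count
    for item in smiles:
        assert isinstance(item, str)
    return True
```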

In terms of speedup, I'm seeing almost 20x in the following script:

import time

from yammbs import MoleculeStore
from yammbs.inputs import QCArchiveDataset

ds = (
    "/home/brent/omsf/clone/yammbs-dataset-submission/datasets/"
    "OpenFF-Industry-Benchmark-Season-1-v1.1/cache.json"
)

with open(ds) as inp:
    qca = QCArchiveDataset.model_validate_json(inp.read())

start = time.time()
MoleculeStore.from_qcarchive_dataset(qca, "try.sqlite")

print(time.time() - start)

The old version of the code took 2751.4 seconds to load cache.json from YDS, while the new version only took 143.7 seconds. These numbers are from my desktop, but the "old" value aligns well with storing the molecules taking about 45 minutes on AWS in YDS too.
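As a quick check of the figures quoted above, the speedup and the old runtime in minutes follow directly from the two timings. Only the two measured values come from the comment; the variable names are illustrative.

```python
old_seconds = 2751.4  # old code, as reported above
new_seconds = 143.7   # new code, as reported above

speedup = old_seconds / new_seconds
old_minutes = old_seconds / 60

print(f"speedup: {speedup:.1f}x")             # speedup: 19.1x
print(f"old runtime: {old_minutes:.1f} min")  # old runtime: 45.9 min
```

This matches both the "almost 20x" claim and the roughly 45 minutes observed on AWS.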

@codecov-commenter

codecov-commenter commented Nov 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.68%. Comparing base (5342a12) to head (000d91c).
Report is 9 commits behind head on main.

yammbs/_store.py Outdated
Comment on lines 586 to 587
if record.qcarchive_id in seen:
    continue
Member

One of these lines isn't hit in code coverage - I think record.qcarchive_id is always unique, so skipping over duplicates isn't necessary (here), whereas this is not the case for molecules de-duplicated by SMILES?

Contributor Author

That sounds right, I'll just delete the check. Thanks!

I'm not sure how this works in the ORM, but possibly a UNIQUE constraint could be added on that field to the database itself if you wanted.

Member

I hadn't thought that far - do you think it makes sense to add it?

Contributor Author

I'm not sure. Knowing UNIQUE exists is near the edge of my database expertise. I think it will cause an error if multiple records with the same ID try to be inserted, so if other parts of the code rely on the uniqueness assumption, it might be a good thing to add. Otherwise I don't think it should have any effect.
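The behavior described here can be checked with a small `sqlite3` sketch: with a UNIQUE constraint on the column, a second insert of the same ID raises an error instead of silently adding a duplicate row. The `records` table and the ID value are assumptions for illustration, and the real schema is managed by the ORM rather than raw SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; UNIQUE is declared directly on qcarchive_id.
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, qcarchive_id INTEGER UNIQUE)"
)
conn.execute("INSERT INTO records (qcarchive_id) VALUES (?)", (1234,))

try:
    # Same qcarchive_id again: SQLite rejects the row.
    conn.execute("INSERT INTO records (qcarchive_id) VALUES (?)", (1234,))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

In SQLAlchemy a constraint like this is typically declared with `unique=True` on the column; whether that fits the yammbs models is not checked here.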

Member

@mattwthompson mattwthompson left a comment

LGTM % a couple of questions for me to better follow the flow of information

(Feel free to merge if I forget to loop back to this!)

yammbs/_store.py Outdated
inchi_key=molecule.to_inchi(fixed_hydrogens=True),
)
db.db.add(db_record)
db.db.commit()
Member

Curious why this is here but there isn't the same call around L590?

Contributor Author

I don't quite remember, and git log (or blame) isn't helping me find it. I vaguely remember there being some issue with IDs not incrementing, but that would seem to affect the call below too, as you point out. I think that had more to do with closing and reopening the session, as the comment says. Removing the extra commit call doesn't cause any tests to fail, so maybe it was just overly cautious.

I'll go ahead and push that commit to avoid that confusion in the future, if that sounds good to you.

(This is mostly unrelated to this PR, but I'd be happy to delete the CachedResultCollection stuff too now that YDS is updated. I accidentally deleted the same line in that constructor first since they look so similar)
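On the question of whether a commit is needed after every `add` for IDs to increment: in plain SQLite, at least, autoincrementing primary keys are assigned at insert time, before any commit. The sketch below uses the standard-library `sqlite3` module with a hypothetical `molecules` table; it does not reproduce the ORM session behavior in yammbs, so treat it only as a sanity check of the underlying database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE molecules (id INTEGER PRIMARY KEY, smiles TEXT)")

ids = []
for smiles in ["C", "CC", "CCO"]:
    cursor = conn.execute(
        "INSERT INTO molecules (smiles) VALUES (?)", (smiles,)
    )
    ids.append(cursor.lastrowid)  # row ID is available before any commit

conn.commit()  # single commit for the whole batch
print(ids)  # [1, 2, 3]
```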

Member

> I'd be happy to delete the CachedResultCollection stuff

Please do! I keep forgetting to put that request to text

Feel free to do that here or in a follow-up

Member

And re: commit() I leaned towards thinking it was harmless, was just confused that it wasn't in both places

Contributor Author

I agree it was probably pretty harmless, but calling it each time likely slows things down a bit. I'm also realizing that it was probably my first attempt to fix the session issue that I never went back and cleaned up after I found the real fix. Since I couldn't come up with a reason to comment why it was there, and the tests pass without it, I went ahead and deleted it.

Do you have any preference on deleting the CachedResultCollection here or in a separate PR? Separate seems cleanest, but I'm happy to do it here if that's easier. Doing it here is marginally easier for me anyway.

Member

My preference (only strong enough to serve as a tie-breaker) is for it to be in a separate PR

@mattwthompson mattwthompson merged commit 8bc45d2 into main Nov 15, 2024
7 checks passed