Speed up loading a `MoleculeStore` from a `QCArchiveDataset` #81

ntBre · 2024-11-08T21:38:22Z

The changes here are analogous to (and copied from) those in #21. As before, the source of the improvements is holding the DB session for multiple insertions instead of inserting one record at a time. I felt a little bad copy-pasting, but I think CachedResultCollection is deprecated and likely to be removed soon anyway.
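To illustrate the idea only: the sketch below uses the standard-library `sqlite3` module rather than the ORM that yammbs actually uses, and the `molecule` table and both helper functions are hypothetical. It contrasts committing after every insert with keeping one transaction open and committing once.

```python
import sqlite3


def make_db():
    # Hypothetical schema, only for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE molecule (id INTEGER PRIMARY KEY, smiles TEXT)")
    return conn


def insert_one_at_a_time(conn, rows):
    # Slow pattern: every record gets its own transaction and commit.
    for row in rows:
        conn.execute("INSERT INTO molecule (smiles) VALUES (?)", row)
        conn.commit()


def insert_in_one_session(conn, rows):
    # Faster pattern: hold the transaction open and commit once on exit.
    with conn:
        conn.executemany("INSERT INTO molecule (smiles) VALUES (?)", rows)


rows = [("C" * (i + 1),) for i in range(1000)]

slow, fast = make_db(), make_db()
insert_one_at_a_time(slow, rows)
insert_in_one_session(fast, rows)

count_slow = slow.execute("SELECT COUNT(*) FROM molecule").fetchone()[0]
count_fast = fast.execute("SELECT COUNT(*) FROM molecule").fetchone()[0]
```

Both paths end with the same rows; only the number of commits differs. With an on-disk database each commit typically forces a sync, which is where most of the per-record cost tends to come from.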

I also modified one of the tests because it wasn't failing when I added an early return to from_qcarchive_dataset that returned an empty MoleculeStore.
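A rough sketch of why the original test could not catch this, using a hypothetical `FakeStore` in place of the real `MoleculeStore` (none of these names come from yammbs): a check that only looks at the returned type passes for an empty store, while a check on the molecule count does not.

```python
class FakeStore:
    """Hypothetical stand-in for a molecule store."""

    def __init__(self):
        self.molecule_ids = []

    @classmethod
    def from_dataset(cls, dataset, early_return=False):
        store = cls()
        if early_return:
            # Simulates the bug: return before anything is stored.
            return store
        store.molecule_ids = sorted(set(dataset))
        return store


def weak_check(store):
    # Only verifies that a store object came back.
    return isinstance(store, FakeStore)


def strong_check(store, dataset):
    # Verifies that every unique input molecule was stored.
    return len(store.molecule_ids) == len(set(dataset))


dataset = [1, 2, 3, 4]
good = FakeStore.from_dataset(dataset)
buggy = FakeStore.from_dataset(dataset, early_return=True)
```

Here `weak_check(buggy)` still passes, while `strong_check(buggy, dataset)` fails, which is the behavior the modified test is meant to have.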

In terms of speedup, I'm seeing almost 20x in the following script:

import time

from yammbs import MoleculeStore
from yammbs.inputs import QCArchiveDataset

ds = (
    "/home/brent/omsf/clone/yammbs-dataset-submission/datasets/"
    "OpenFF-Industry-Benchmark-Season-1-v1.1/cache.json"
)

with open(ds) as inp:
    qca = QCArchiveDataset.model_validate_json(inp.read())

start = time.time()
MoleculeStore.from_qcarchive_dataset(qca, "try.sqlite")

print(time.time() - start)

The old version of the code took 2751.4 seconds to load cache.json from YDS, while the new version only took 143.7 seconds. These numbers are from my desktop, but the "old" value aligns well with storing the molecules taking about 45 minutes on AWS in YDS too.
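For reference, the arithmetic behind those figures, using only the two timings quoted above:

```python
old_seconds = 2751.4  # old code, loading cache.json
new_seconds = 143.7   # new code, same input

speedup = old_seconds / new_seconds
old_minutes = old_seconds / 60

print(f"speedup: {speedup:.1f}x, old runtime: {old_minutes:.1f} min")
```

That works out to a bit over 19x, and the old runtime of roughly 46 minutes is consistent with the 45 minutes or so observed on AWS.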

codecov-commenter · 2024-11-08T21:41:42Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.68%. Comparing base (5342a12) to head (000d91c).
Report is 9 commits behind head on main.

mattwthompson · 2024-11-08T21:50:25Z

yammbs/_store.py

+                if record.qcarchive_id in seen:
+                    continue


One of these lines isn't hit in code coverage - I think record.qcarchive_id is always unique, so skipping over duplicates isn't necessary (here), whereas this is not the case for molecules de-duplicated by SMILES?

That sounds right, I'll just delete the check. Thanks!

I'm not sure how this works in the ORM, but possibly a UNIQUE constraint could be added on that field to the database itself if you wanted.

I hadn't thought that far - do you think it makes sense to add it?

I'm not sure. Knowing UNIQUE exists is near the edge of my database expertise. I think it will cause an error if multiple records with the same ID try to be inserted, so if other parts of the code rely on the uniqueness assumption, it might be a good thing to add. Otherwise I don't think it should have any effect.
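As a quick check of that behavior, here is a sketch with plain `sqlite3` and a hypothetical `qm_record` table (not the actual yammbs schema). In SQLAlchemy the equivalent would usually be declared with `unique=True` on the column, but that part is an assumption about how it would be wired in here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the UNIQUE constraint is on qcarchive_id.
conn.execute(
    "CREATE TABLE qm_record ("
    "id INTEGER PRIMARY KEY, "
    "qcarchive_id INTEGER UNIQUE, "
    "energy REAL)"
)

insert = "INSERT INTO qm_record (qcarchive_id, energy) VALUES (?, ?)"
conn.execute(insert, (101, -1.5))

try:
    # Same qcarchive_id again: the database rejects the row.
    conn.execute(insert, (101, -2.5))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

n_rows = conn.execute("SELECT COUNT(*) FROM qm_record").fetchone()[0]
```

The second insert raises `sqlite3.IntegrityError` and the table keeps only the first row, so a duplicate ID would surface as an error instead of being silently stored.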

mattwthompson

LGTM % a couple of questions for me to better follow the flow of information

(Feel free to merge if I forget to loop back to this!)

mattwthompson · 2024-11-15T20:26:59Z

yammbs/_store.py

+                    inchi_key=molecule.to_inchi(fixed_hydrogens=True),
+                )
+                db.db.add(db_record)
+                db.db.commit()


Curious why this is here but there isn't the same call around L590?

I don't quite remember, and git log (or blame) isn't helping me find it. I vaguely remember there being some issue with IDs not incrementing, but that would seem to affect the call below too, as you point out. I think that had more to do with closing and reopening the session, as the comment says. Removing the extra commit call doesn't cause any tests to fail, so maybe it was just overly cautious.

I'll go ahead and push that commit to avoid that confusion in the future, if that sounds good to you.

(This is mostly unrelated to this PR, but I'd be happy to delete the CachedResultCollection stuff too now that YDS is updated. I accidentally deleted the same line in that constructor first since they look so similar)

> I'd be happy to delete the CachedResultCollection stuff

Please do! I keep forgetting to put that request to text

Feel free to do that here or in a follow-up

And re: commit() I leaned towards thinking it was harmless, was just confused that it wasn't in both places

I agree it was probably pretty harmless, but calling it each time likely slows things down a bit. I'm also realizing that it was probably my first attempt to fix the session issue that I never went back and cleaned up after I found the real fix. Since I couldn't come up with a reason to comment why it was there, and the tests pass without it, I went ahead and deleted it.
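On the ID question, a small sketch with plain `sqlite3` (not the ORM session used in `_store.py`, and with a hypothetical `molecule` table) suggests that auto-assigned IDs are available right after each insert, before any commit. With an ORM session, flushing is the usual way to get IDs populated without committing, though whether that applies to this code path is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE molecule (id INTEGER PRIMARY KEY, smiles TEXT)")

ids = []
for smiles in ["C", "CC", "CCC"]:
    cursor = conn.execute("INSERT INTO molecule (smiles) VALUES (?)", (smiles,))
    # The row ID is assigned at insert time, inside the open transaction.
    ids.append(cursor.lastrowid)

# A single commit at the end persists all three rows.
conn.commit()
```

The IDs increment across the three inserts even though nothing is committed until the end, which fits with the extra per-record commit being unnecessary.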

Do you have any preference on deleting the CachedResultCollection here or in a separate PR? Separate seems cleanest, but I'm happy to do it here if that's easier. Doing it here is marginally easier for me anyway.

My preference (only strong enough to serve as a tie-breaker) is for it to be in a separate PR

ntBre added 3 commits November 8, 2024 15:07

test that the number of molecules is the same as the input

4df62d0

follow from_cached_result_collection instead of from_qcsubmit

26d7213

run pre-commit

79a7c26

mattwthompson reviewed Nov 8, 2024

delete unused seen check on qcarchive_id

7f8bc88

mattwthompson approved these changes Nov 15, 2024

delete unnecessary db.commit() call

000d91c

mattwthompson merged commit 8bc45d2 into main Nov 15, 2024
7 checks passed
