Release sage 2.0.0 torsion #419

Open: wants to merge 11 commits into base: master
Changes from 3 commits
@@ -0,0 +1,53 @@
# OpenFF Sage 2.0.0 Torsion Drive Training Dataset v1.0

## Description

A quantum chemical (QC) dataset curated to train the OpenFF 2.0.0 Sage torsion potentials. This QC dataset, computed at the OpenFF default level of theory (B3LYP-D3BJ/DZVP), is used to benchmark Sage geometries and energetics. The optimized conformer geometries were used to train one-dimensional torsional profiles. This Generation 2 dataset increases chemical diversity relative to Generation 1, which is of value to our industry partners. Large molecules (>20 heavy atoms) were also included, among them more flexible molecules with a greater degree of conformational variation, which provide intramolecular interactions. This is the complete torsion drive dataset used for training OpenFF 2.0.0 Sage, consisting of the following datasets:

* 'OpenFF Gen 2 Torsion Set 1 Roche'
* 'OpenFF Gen 2 Torsion Set 2 Coverage'
* 'OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy'
* 'OpenFF Gen 2 Torsion Set 4 eMolecules - Discrepancy'
* 'OpenFF Gen 2 Torsion Set 5 Bayer'
* 'OpenFF Gen 2 Torsion Set 6 supplemental 2'

The `HydrogenBondFilter(method='baker-hubbard')` filter was applied, and the following record IDs were dropped due to issues with ForceBalance: 6098580, 2703504, 2703505, 18045478. Further information can be found in the curation scripts for the linked repositories.
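
Dropping the listed record IDs amounts to a set-membership filter. The following is a minimal sketch only: it assumes records are available as dictionaries with a hypothetical `id` key, and `drop_records` is an illustrative helper, not a function from QCSubmit or the curation script, which may do this differently.

```python
# Record IDs reported above as dropped due to issues with ForceBalance.
DROPPED_RECORD_IDS = {6098580, 2703504, 2703505, 18045478}


def drop_records(records, dropped_ids=DROPPED_RECORD_IDS):
    """Return the records whose ID is not in ``dropped_ids``.

    ``records`` is assumed to be an iterable of dicts with an ``id`` key;
    this layout is hypothetical and used for illustration only.
    """
    return [record for record in records if int(record["id"]) not in dropped_ids]


# Toy input: two of these IDs are on the exclusion list, two are made up.
example_records = [{"id": 6098580}, {"id": 1234567}, {"id": 2703505}, {"id": 7654321}]
kept = drop_records(example_records)
```

Casting to `int` lets the same helper accept IDs stored as strings, which is an assumption about how the IDs might be serialized.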
In the optimization training set PR you linked the OpenFF Sage repo, as well as the directories of each of the source torsion drive sets. That was quite nice, could you please do that here as well?


## General Information

* Date: 2024-12-17
* Class: OpenFF TorsionDrive Dataset
* Purpose: Complete set of training data for OpenFF 2.0.0 Sage
* Name: OpenFF Sage 2.0.0 Torsion Drive Training Dataset v1.0
* Number of unique molecules: 562
* Number of filtered molecules: 0
* Number of driven torsions: 713
* Number of conformers: 563
* Number of conformers (min, mean, max): 1.00, 1.00, 2.00
* Molecular weight (min, mean, max): 46.07, 224.91, 503.42
* Charges: -1.0, 0.0, 1.0
* Submitter: Jennifer A Clark
* Dataset generator: Hyesu Jang
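
The conformer statistics above can be sanity-checked with a few lines of arithmetic. This sketch assumes every molecule has at least one conformer, in which case 562 molecules and 563 conformers imply exactly one molecule with two conformers. The per-molecule list is reconstructed from the reported totals, not read from the dataset.

```python
from statistics import mean

# Hypothetical per-molecule conformer counts consistent with the reported
# totals: 561 molecules with one conformer and one molecule with two.
conformer_counts = [1] * 561 + [2]

n_molecules = len(conformer_counts)   # unique molecules
n_conformers = sum(conformer_counts)  # total conformers

# Format (min, mean, max) with two decimals, as in the list above.
summary = "{:.2f}, {:.2f}, {:.2f}".format(
    min(conformer_counts), mean(conformer_counts), max(conformer_counts)
)
```

The mean is 563/562, roughly 1.0018, which rounds to the reported 1.00.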
jaclark5 marked this conversation as resolved.

## QCSubmit Generation Pipeline

* `generate-combined-dataset.py`: A Python script that shows how the dataset was prepared from the input files.
jaclark5 marked this conversation as resolved.


## QCSubmit Manifest

* `generate-combined-dataset.py`: A Python script that shows how the dataset was prepared from the input files.
* `dataset.json.bz2`: The basic dataset ready for submission.
* `dataset.pdf`: A PDF file containing molecule 2D structures.
* `dataset.smi`: SMILES for every molecule in the submission.
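
The `.json.bz2` extension suggests a bz2-compressed JSON file, which the Python standard library can read directly. The sketch below makes that assumption and does not rely on any particular schema. `load_compressed_json` is an illustrative helper, and the demo payload keys are hypothetical.

```python
import bz2
import json
import tempfile
from pathlib import Path


def load_compressed_json(path):
    """Read a bz2-compressed JSON file and return the decoded object."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Round-trip demonstration with a stand-in payload; these keys are
# hypothetical and do not reflect the real schema of dataset.json.bz2.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json.bz2"
    with bz2.open(demo_path, "wt", encoding="utf-8") as handle:
        json.dump({"dataset_name": "demo", "n_entries": 2}, handle)
    loaded = load_compressed_json(demo_path)
```

For the real file, the call would be `load_compressed_json("dataset.json.bz2")` run from the dataset directory.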


## Metadata

* Elements: {Br, C, Cl, F, H, I, N, O, P, S}
* QC Specifications: default
  * basis: DZVP
  * implicit_solvent: None
  * keywords: {}
  * maxiter: 200
  * method: B3LYP-D3BJ
  * program: psi4
  * SCF Properties:
    * dipole
    * quadrupole
    * wiberg_lowdin_indices
    * mayer_indices
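
For scripts that need to refer to these settings, the metadata can be mirrored in a small container. This is a sketch under stated assumptions: `QCSpecification` here is a plain illustrative dataclass, not the QCSubmit class used to build the dataset, and the field names simply copy the list above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class QCSpecification:
    """Plain container mirroring the metadata fields (illustrative only)."""

    name: str = "default"
    method: str = "B3LYP-D3BJ"
    basis: str = "DZVP"
    program: str = "psi4"
    implicit_solvent: Optional[str] = None
    maxiter: int = 200
    keywords: dict = field(default_factory=dict)
    scf_properties: tuple = (
        "dipole",
        "quadrupole",
        "wiberg_lowdin_indices",
        "mayer_indices",
    )

    def level_of_theory(self):
        """Return the conventional method/basis label."""
        return f"{self.method}/{self.basis}"


spec = QCSpecification()
label = spec.level_of_theory()
```

The label matches the level of theory named in the description, B3LYP-D3BJ/DZVP.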
Git LFS file not shown
Binary file not shown.