Release sage 2.0.0 (#418)
* Initial files

* Generate dataset descriptive files

* Update method to use requests to pull relevant file from github

* Add openeye installation to instructions

* Address dataset formatting feedback

* Added forcefield table to repo README
jaclark5 authored Dec 18, 2024
1 parent 41297b0 commit 7f8ed2a
Showing 8 changed files with 1,544 additions and 0 deletions.
7 changes: 7 additions & 0 deletions README.md
@@ -35,6 +35,9 @@ Datasets must be submitted as pull requests.
conda env create -f qca-dataset-submission/devtools/prod-envs/qcarchive-user-submit.yaml
conda activate qcarchive-user-submit
```
You may also need to install OpenEye:\
`conda install -c openeye openeye-toolkits`
4. Choose a starting notebook and README based on the type of dataset you wish to submit:
@@ -202,6 +205,10 @@ The status only refers to the `default` specification which is required for all
[![Running](https://img.shields.io/badge/Status-Running-orange)](https://img.shields.io/badge/Status-Running-orange) the dataset is currently running and may have some incomplete jobs.
# Forcefield Release Datasets
| Forcefield | Repo | Optimization | Torsion Drive | Elements | Zenodo |
|-------------|----------|-------------------|--------------------|----------|--------|
| Release OpenFF 2.0.0 Sage | [openff-sage](https://github.com/openforcefield/openff-sage) | [2024-12-12-OpenFF-Sage-2.0.0-Training-Optimization-Dataset-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-12-12-OpenFF-Sage-2.0.0-Training-Optimization-Dataset-v1.0) | [Coming Soon]() | H, C, N, O, S, P, F, Cl, Br, I | [Coming Soon]() |
# Basic Datasets
@@ -0,0 +1,69 @@
# OpenFF Sage 2.0.0 Training Optimization v1.0

### Description

A quantum chemical (QC) dataset curated to train the [OpenFF 2.0.0 Sage](https://github.com/openforcefield/openff-sage) forcefield, which has reparametrized Lennard-Jones (LJ) and valence parameters; only the valence parameters are relevant to this dataset. This QC dataset, computed at the OpenFF default level of theory, B3LYP-D3BJ/DZVP, is used to benchmark Sage geometries and energetics. These optimized conformer geometries were used in conjunction with the QC dataset used to train one-dimensional torsional profiles. This Generation 2 dataset increases chemical diversity compared to Generation 1, which is of value to our industry partners. Large molecules (>20 heavy atoms) were also included, among them more flexible molecules with a greater degree of conformational variation, which provide intramolecular interactions.

This is the complete optimization dataset used for training OpenFF 2.0.0 Sage, consisting of the following datasets:

- [OpenFF Gen 2 Opt Set 1 Roche](https://github.com/openforcefield/qca-dataset-submission/tree/0e6e6da930118e2a2d6402b93c3e3e93830600cc/submissions/2020-03-20-OpenFF-Gen-2-Optimization-Set-1-Roche)
- [OpenFF Gen 2 Opt Set 2 Coverage](https://github.com/openforcefield/qca-dataset-submission/tree/0e6e6da930118e2a2d6402b93c3e3e93830600cc/submissions/2020-03-20-OpenFF-Gen-2-Optimization-Set-2-Coverage)
- [OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy](https://github.com/openforcefield/qca-dataset-submission/tree/0e6e6da930118e2a2d6402b93c3e3e93830600cc/submissions/2020-03-20-OpenFF-Gen-2-Optimization-Set-3-Pfizer-Discrepancy)
- [OpenFF Gen 2 Opt Set 4 eMolecules - Discrepancy](https://github.com/openforcefield/qca-dataset-submission/tree/0e6e6da930118e2a2d6402b93c3e3e93830600cc/submissions/2020-03-20-OpenFF-Gen-2-Optimization-Set-4-eMolecules-Discrepancy)
- [OpenFF Gen 2 Opt Set 5 Bayer](https://github.com/openforcefield/qca-dataset-submission/tree/0e6e6da930118e2a2d6402b93c3e3e93830600cc/submissions/2020-03-20-OpenFF-Gen-2-Optimization-Set-5-Bayer)

The following filters were applied to those datasets:

- `RecordStatusFilter(status=RecordStatusEnum.complete)`
- `ConnectivityFilter(tolerance=1.2)`
- `UndefinedStereoFilter()`
- `ElementFilter(allowed_elements=["H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"])`
- `ConformerRMSDFilter(max_conformers=10)`

Further information can be found in the curation scripts for the linked repositories.
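
The filters listed above are applied in sequence, each one narrowing the set of records passed to the next. The snippet below is a minimal, standard-library sketch of that idea only. It does not use the QCSubmit API: `ToyRecord`, `status_is_complete`, `elements_allowed`, and `apply_filters` are hypothetical names introduced for illustration, and only the status and element checks are mimicked.

```python
from dataclasses import dataclass

# Element list copied from the ElementFilter entry above.
ALLOWED_ELEMENTS = {"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"}


@dataclass(frozen=True)
class ToyRecord:
    """Hypothetical stand-in for one optimization record (not a QCSubmit type)."""

    record_id: str
    status: str
    elements: frozenset


def status_is_complete(record):
    """Keep only records whose calculation finished."""
    return record.status == "complete"


def elements_allowed(record):
    """Keep only records whose elements are all in the allowed set."""
    return record.elements <= ALLOWED_ELEMENTS


def apply_filters(records, filters):
    """Apply each predicate in order; a record must pass all of them."""
    kept = list(records)
    for keep in filters:
        kept = [record for record in kept if keep(record)]
    return kept


records = [
    ToyRecord("a", "complete", frozenset({"C", "H", "O"})),
    ToyRecord("b", "error", frozenset({"C", "H"})),
    ToyRecord("c", "complete", frozenset({"C", "H", "Si"})),
]
kept = apply_filters(records, [status_is_complete, elements_allowed])
print([record.record_id for record in kept])  # ['a']
```

Record `b` is dropped by the status check and record `c` by the element check (Si is not in the allowed list), so only `a` survives. The actual filtering is done by the curation scripts referenced above.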

### General Information

- Date: 2024-12-12
- Class: OpenFF Optimization Dataset
- Purpose: Complete set of training data for OpenFF 2.0.0 Sage
- Dataset Type: optimization
- Name: OpenFF Sage 2.0.0 Training Optimization Dataset v1.0
- Number of unique molecules: 1025
- Number of filtered molecules: 0
- Number of conformers: 3663
- Number of conformers (min mean max): 1.00, 3.53, 10.00
- Mean molecular weight: 261.38
- Max molecular weight: 544.64
- Set of charges: -2.0, -1.0, 0.0, 1.0
- Dataset Submitter: Jennifer A. Clark
- Dataset Curator: Simon Boothroyd
- Dataset Generator: Hyesu Jang

### QCSubmit generation pipeline

- `generate-combined-dataset.py`: A Python script which shows how the dataset was prepared from the input files.
- `output.txt`: A text file containing the printed output of `generate-combined-dataset.py`.

### QCSubmit Manifest

- `generate-combined-dataset.py`
- `dataset.json.bz2`: The basic dataset ready for submission.
- `dataset.pdf`: A PDF file containing molecule 2D structures.
- `dataset.smi`: SMILES for every molecule in the submission.
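
A bzip2-compressed JSON file such as `dataset.json.bz2` can be inspected with the Python standard library alone. The sketch below is an assumption-laden illustration rather than the QCSubmit loading path: `load_bz2_json` is a hypothetical helper, and the round-trip demo uses a toy payload because the internal schema of the real file is not described here.

```python
import bz2
import json
import tempfile
from pathlib import Path


def load_bz2_json(path):
    """Decompress a .json.bz2 file and parse its contents as JSON."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


# Round-trip demo with a toy payload (the key below is made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json.bz2"
    with bz2.open(demo_path, mode="wt", encoding="utf-8") as handle:
        json.dump({"dataset_name": "toy"}, handle)
    demo = load_bz2_json(demo_path)

print(demo)  # {'dataset_name': 'toy'}
```

For the real file, `load_bz2_json("dataset.json.bz2")` would return the parsed JSON object, whose top-level keys can then be listed before digging further.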

### Metadata

* Elements: {F, I, N, C, P, Cl, S, Br, O, H}
* QC Specifications: default
  * basis: DZVP
  * implicit_solvent: None
  * keywords: {}
  * maxiter: 200
  * method: B3LYP-D3BJ
  * program: psi4
  * SCF Properties:
    * dipole
    * quadrupole
    * wiberg_lowdin_indices
    * mayer_indices
@@ -0,0 +1,161 @@
name: qcarchive-user-submit
channels:
- conda-forge
- openeye
dependencies:
- annotated-types=0.7.0=pyhd8ed1ab_1
- apsw=3.47.0.0=py311hde754ab_0
- argcomplete=3.5.2=pyhd8ed1ab_0
- attrs=24.2.0=pyh71513ae_1
- basis_set_exchange=0.10=pyhd8ed1ab_1
- brotli=1.1.0=hd74edd7_2
- brotli-bin=1.1.0=hd74edd7_2
- brotli-python=1.1.0=py311h3f08180_2
- bson=0.5.9=py_0
- bzip2=1.0.8=h99b78c6_7
- ca-certificates=2024.8.30=hf0a4a13_0
- cached-property=1.5.2=hd8ed1ab_1
- cached_property=1.5.2=pyha770c72_1
- cachetools=5.5.0=pyhd8ed1ab_1
- cairo=1.18.2=h6a3b0d2_1
- certifi=2024.8.30=pyhd8ed1ab_0
- cffi=1.17.1=py311h3a79f62_0
- chardet=5.2.0=py311h267d04e_2
- charset-normalizer=3.4.0=pyhd8ed1ab_1
- colorama=0.4.6=pyhd8ed1ab_1
- contourpy=1.3.1=py311h210dab8_0
- cycler=0.12.1=pyhd8ed1ab_1
- dill=0.3.9=pyhd8ed1ab_1
- exceptiongroup=1.2.2=pyhd8ed1ab_1
- font-ttf-dejavu-sans-mono=2.37=hab24e00_0
- font-ttf-inconsolata=3.000=h77eed37_0
- font-ttf-source-code-pro=2.038=h77eed37_0
- font-ttf-ubuntu=0.83=h77eed37_3
- fontconfig=2.15.0=h1383a14_1
- fonts-conda-ecosystem=1=0
- fonts-conda-forge=1=0
- fonttools=4.55.3=py311h4921393_0
- freetype=2.12.1=hadb7bae_2
- freetype-py=2.3.0=pyhd8ed1ab_0
- greenlet=3.1.1=py311h3f08180_0
- h2=4.1.0=pyhd8ed1ab_1
- hpack=4.0.0=pyhd8ed1ab_1
- hyperframe=6.0.1=pyhd8ed1ab_1
- icu=75.1=hfee45f7_0
- idna=3.10=pyhd8ed1ab_1
- importlib-metadata=8.5.0=pyha770c72_1
- importlib_resources=6.4.5=pyhd8ed1ab_1
- iniconfig=2.0.0=pyhd8ed1ab_1
- jsonschema=4.23.0=pyhd8ed1ab_1
- jsonschema-specifications=2024.10.1=pyhd8ed1ab_1
- kiwisolver=1.4.7=py311h2c37856_0
- krb5=1.21.3=h237132a_0
- lcms2=2.16=ha0e7c42_0
- lerc=4.0.0=h9a09cb3_0
- libblas=3.9.0=25_osxarm64_openblas
- libboost=1.84.0=hc9fb7c5_7
- libboost-python=1.84.0=py311h8fc16d6_7
- libbrotlicommon=1.1.0=hd74edd7_2
- libbrotlidec=1.1.0=hd74edd7_2
- libbrotlienc=1.1.0=hd74edd7_2
- libcblas=3.9.0=25_osxarm64_openblas
- libcxx=19.1.5=ha82da77_0
- libdeflate=1.22=hd74edd7_0
- libedit=3.1.20191231=hc8eb9b7_2
- libexpat=2.6.4=h286801f_0
- libffi=3.4.2=h3422bc3_5
- libgfortran=5.0.0=13_2_0_hd922786_3
- libgfortran5=13.2.0=hf226fd6_3
- libglib=2.82.2=h07bd6cf_0
- libiconv=1.17=h0d3ecfb_2
- libintl=0.22.5=h8414b35_3
- libjpeg-turbo=3.0.0=hb547adb_1
- liblapack=3.9.0=25_osxarm64_openblas
- liblzma=5.6.3=h39f12f2_1
- libopenblas=0.3.28=openmp_hf332438_1
- libpng=1.6.44=hc14010f_0
- libpq=16.6=hb008251_1
- librdkit=2024.03.5=h54a62e4_3
- libsqlite=3.47.0=hbaaea75_1
- libtiff=4.7.0=ha962b0a_2
- libwebp-base=1.4.0=h93a5062_0
- libxcb=1.17.0=hdb1d25a_0
- libzlib=1.3.1=h8359307_2
- llvm-openmp=19.1.5=hdb05f8b_0
- matplotlib-base=3.9.3=py311h031da69_0
- msgpack-python=1.1.0=py311h2c37856_0
- multiprocess=0.70.17=py311h917b07b_1
- munkres=1.1.4=pyh9f0ad1d_0
- ncurses=6.5=h7bae524_1
- networkx=3.4.2=pyh267e887_2
- numpy=1.26.4=py311h7125741_0
- openeye-toolkits=2024.2.0=py311_0
- openff-amber-ff-ports=0.0.4=pyhca7485f_0
- openff-forcefields=2024.09.0=pyhff2d567_0
- openff-qcsubmit=0.54.0=pyhd8ed1ab_0
- openff-toolkit-base=0.16.7=pyhd8ed1ab_0
- openff-units=0.2.2=pyhca7485f_0
- openff-utilities=0.1.13=pyhd8ed1ab_0
- openjpeg=2.5.3=h8a3d83b_0
- openssl=3.4.0=h39f12f2_0
- packaging=24.2=pyhd8ed1ab_2
- pandas=2.2.2=py311h4b4568b_1
- pcre2=10.44=h297a79d_2
- pillow=11.0.0=py311h3894ae9_0
- pint=0.23=pyhd8ed1ab_1
- pip=24.3.1=pyh8b19718_0
- pixman=0.44.2=h2f9eb0b_0
- pkgutil-resolve-name=1.3.10=pyhd8ed1ab_2
- pluggy=1.5.0=pyhd8ed1ab_1
- pthread-stubs=0.4=hd74edd7_1002
- pycairo=1.27.0=py311h84a5a08_0
- pycalverter=1.6.1=pyhd8ed1ab_1
- pycparser=2.22=pyh29332c3_1
- pydantic=2.10.3=pyh3cfb1c2_0
- pydantic-core=2.27.1=py311h3ff9189_0
- pyjwt=2.10.1=pyhd8ed1ab_0
- pyparsing=3.2.0=pyhd8ed1ab_2
- pysocks=1.7.1=pyha55dd90_7
- pytest=8.3.4=pyhd8ed1ab_1
- python=3.11.11=hc22306f_1_cpython
- python-constraint=1.4.0=py_0
- python-dateutil=2.9.0.post0=pyhff2d567_1
- python-tzdata=2024.2=pyhd8ed1ab_1
- python_abi=3.11=5_cp311
- pytz=2024.2=pyhd8ed1ab_1
- pyyaml=6.0.2=py311h460d6c5_1
- qcelemental=0.28.0=pyhd8ed1ab_1
- qcportal=0.56=pyhd8ed1ab_1
- qhull=2020.2=h420ef59_5
- rdkit=2024.03.5=py311h8a4e316_3
- readline=8.2=h92ec313_1
- referencing=0.35.1=pyhd8ed1ab_1
- regex=2024.11.6=py311h917b07b_0
- reportlab=4.2.5=py311h460d6c5_0
- requests=2.32.3=pyhd8ed1ab_1
- rlpycairo=0.2.0=pyhd8ed1ab_0
- rpds-py=0.22.3=py311h3ff9189_0
- setuptools=75.6.0=pyhff2d567_1
- six=1.17.0=pyhd8ed1ab_0
- smirnoff99frosst=1.1.0=pyh44b312d_0
- sqlalchemy=2.0.36=py311hae2e1ce_0
- sqlite=3.47.0=hcd14bea_1
- tabulate=0.9.0=pyhd8ed1ab_2
- tk=8.6.13=h5083fa2_1
- tomli=2.2.1=pyhd8ed1ab_1
- tqdm=4.67.1=pyhd8ed1ab_0
- typing-extensions=4.12.2=hd8ed1ab_1
- typing_extensions=4.12.2=pyha770c72_1
- tzdata=2024b=hc8b5060_0
- unicodedata2=15.1.0=py311hae2e1ce_1
- unidecode=1.3.8=pyh29332c3_1
- urllib3=2.2.3=pyhd8ed1ab_1
- wheel=0.45.1=pyhd8ed1ab_1
- xmltodict=0.14.2=pyhd8ed1ab_1
- xorg-libxau=1.0.11=hd74edd7_1
- xorg-libxdmcp=1.1.5=hd74edd7_0
- yaml=0.2.5=h3422bc3_2
- zipp=3.21.0=pyhd8ed1ab_1
- zstandard=0.23.0=py311ha60cc69_1
- zstd=1.5.6=hb46c0d2_0

Git LFS file not shown
Binary file not shown.