This repository supports the following manuscript
Debra J. Audus, Austin McDannald, and Brian DeCost, "Leveraging Theory for Enhanced Machine Learning" ACS Macro Letters 2022 11 (9), 1117-1122 DOI: 10.1021/acsmacrolett.2c00369,
which explores methods for incorporating imperfect theory into machine learning for improved prediction and explainability. Specifically, it focuses on a case study of the size of a polymer chain, as measured by its radius of gyration, in solvents of different quality. Three machine learning models are considered: Gaussian Process Regression with heteroscedastic noise, Gaussian Process Regression with homoscedastic noise, and Random Forest. Of the three, we encourage use of Gaussian Process Regression with heteroscedastic noise, as it provides accurate uncertainty estimates.
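As a rough illustration of what "incorporating imperfect theory" can look like, the sketch below fits a model only to the discrepancy between observations and a theoretical estimate, then multiplies the theory by the learned correction. This is one generic approach and is not claimed to be any of the specific methods compared in the manuscript. The ideal-chain formula Rg = b * sqrt(N / 6) plays the role of the imperfect theory; the synthetic observations, the prefactor 0.35, and the helper names `rg_theory`, `fit_line`, and `predict_rg` are assumptions made for this example.

```python
import math

def rg_theory(n, b=1.0, nu=0.5):
    """Ideal-chain estimate of the radius of gyration: Rg = b * N**nu / sqrt(6)."""
    return b * n ** nu / math.sqrt(6.0)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic, noise-free stand-in for measurements (NOT data from the manuscript):
# chains in a good solvent, where Rg scales roughly as N**0.588.
chain_lengths = [50, 100, 200, 400, 800]
rg_observed = [0.35 * n ** 0.588 for n in chain_lengths]

# Learn only the log-ratio between observation and theory as a function of log N.
log_n = [math.log(n) for n in chain_lengths]
log_ratio = [math.log(obs / rg_theory(n)) for obs, n in zip(rg_observed, chain_lengths)]
slope, intercept = fit_line(log_n, log_ratio)

def predict_rg(n):
    """Theory prediction multiplied by the learned correction factor."""
    return rg_theory(n) * math.exp(slope * math.log(n) + intercept)

print("learned exponent correction:", round(slope, 3))
```

Because the theory already captures most of the dependence on chain length, the model only has to learn a small correction, which is the intuition behind combining the two.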
Gaussian Process Regression with heteroscedastic noise relies on the GPFlow Python package. However, since heteroscedastic noise is not natively implemented, we implement a derived class to add this functionality (see taml/GPRhetero.py). Gaussian Process Regression with homoscedastic noise is implemented natively with GPFlow. Random Forest is implemented using Scikit-learn.
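To make the distinction between the two noise models concrete, here is a minimal, dependency-free sketch of Gaussian Process Regression in which each training point carries its own noise variance. It is not the GPFlow-based class in taml/GPRhetero.py; the kernel choice, the toy data, and the helper names `rbf`, `solve`, and `gp_predict` are assumptions made for illustration only.

```python
import math

def rbf(x1, x2, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return variance * math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)

def solve(a, b):
    """Solve a @ x = b for a small dense system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def gp_predict(x_train, y_train, noise_var, x_new):
    """Posterior mean and latent variance at x_new; noise_var has one entry per training point."""
    n = len(x_train)
    # Kernel matrix with the per-point noise variances added on the diagonal.
    k = [[rbf(x_train[i], x_train[j]) + (noise_var[i] if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_new) for x in x_train]
    alpha = solve(k, y_train)
    beta = solve(k, k_star)
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(ks * bt for ks, bt in zip(k_star, beta))
    return mean, var

x_train = [0.0, 1.0, 2.0]
y_train = [0.0, 1.0, 0.0]

# Homoscedastic: every observation shares one noise variance.
mean_homo, var_homo = gp_predict(x_train, y_train, [0.1, 0.1, 0.1], 1.0)
# Heteroscedastic: the middle observation is known to be much noisier than the others.
mean_hetero, var_hetero = gp_predict(x_train, y_train, [0.1, 2.0, 0.1], 1.0)

print("homoscedastic:  ", mean_homo, var_homo)
print("heteroscedastic:", mean_hetero, var_hetero)
```

In this toy case the heteroscedastic model trusts the noisy middle point less, so its prediction there is pulled less strongly toward that observation and comes with a larger variance.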
The repository is intended for the following use cases:
- Illustrate key ideas from the manuscript, including incorporating theory and using Gaussian Process Regression with heteroscedastic noise (see notebooks/MethodComparison_GPR_HeteroscedasticNoise and the companion notebook without heteroscedastic noise, notebooks/MethodComparison_GPR_HomoscedasticNoise).
- Provide code for Gaussian Process Regression with heteroscedastic noise (which can be used after installation with from taml.GPRhetero import GPRhetero).
- Reproduce figures from our manuscript (see the notebooks folder).
- Allow for full reproducibility of the data in the manuscript.
All code is written in Python and requires Python >= 3.7. It can be used on any operating system. Other requirements are listed in requirements.txt.
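If you are unsure which interpreter is active, the version requirement stated above can be checked from Python itself. This is a small convenience sketch; the names `MINIMUM_PYTHON` and `meets_minimum` are made up for this example and are not part of the repository.

```python
import sys

# Minimum version stated in this README.
MINIMUM_PYTHON = (3, 7)

def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True when the (major, minor) version is at least the stated minimum."""
    return tuple(version_info[:2]) >= minimum

if meets_minimum():
    print("Python version OK")
else:
    print("Python %d.%d or later is required" % MINIMUM_PYTHON)
```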
If you are only interested in running the Jupyter Notebooks in Google Colab, you can skip ahead to Notebooks.
First clone the code via
git clone https://github.com/usnistgov/TaML.git
and navigate to the directory where the repository lives
cd TaML
Next, create a virtual environment. This can be done using Python virtual environments or with Anaconda. Both options are listed below.
To use a Python virtual environment, first make sure you are using Python 3.7 or later, then create the environment with
python3 -m venv env
where env is the location of the virtual environment.
Activate the virtual environment (on Windows, run env\Scripts\activate instead of the command below):
source env/bin/activate
Install the dependencies:
python3 -m pip install -r requirements.txt
To use Anaconda instead, first install conda, then create the environment with
conda env create -f environment.yml
If you are using conda>=4.6, activate the virtual environment via
conda activate TaML
Otherwise, see the conda documentation for the appropriate activation command.
GPFlow 2.2.1 is not available on conda channels and must be installed via pip:
pip install gpflow==2.2.1
For users who wish to use the source code or import functions, the TaML package can be installed via
pip install .
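After installation, it can be useful to confirm that the pinned GPFlow version and the taml package are visible to the active environment. The sketch below is one way to do that; it assumes Python 3.8 or later (for `importlib.metadata`) and that the PyPI distribution is named `gpflow`. The helper names `installed_version` and `is_importable` are made up for this example.

```python
from importlib import metadata, util

def installed_version(distribution):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None

def is_importable(module):
    """Return True when a top-level module can be located, without importing it."""
    return util.find_spec(module) is not None

gpflow_version = installed_version("gpflow")
print("gpflow:", gpflow_version or "not installed", "(this README pins 2.2.1)")
print("taml importable:", is_importable("taml"))
```

If both checks pass, the import from taml.GPRhetero mentioned earlier should be available in the same environment.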
The repository includes four notebooks:
- DataVisualization: visualizes the input data used for machine learning.
- MethodComparison_GPR_HeteroscedasticNoise: compares different methods for incorporating theory into machine learning using Gaussian Process Regression with heteroscedastic noise.
- MethodComparison_GPR_HomoscedasticNoise: compares different methods for incorporating theory into machine learning using Gaussian Process Regression with homoscedastic noise.
- ViewResults: plots the relative performance of different methods for incorporating theory into machine learning for three different machine learning models.
For users interested in testing ideas, we recommend focusing on the MethodComparison_GPR_HeteroscedasticNoise notebook, as it explores the different methods and takes into account the known uncertainties in the input data.
If you cloned the repository, the Jupyter notebooks can be run by navigating to the notebooks folder and using the command
jupyter notebook
If you are interested in running one or more notebooks in Google Colab, first click on the relevant link below. Note that these links were generated by navigating to the notebook of interest on the TaML GitHub page, for example, https://github.com/usnistgov/TaML/blob/main/notebooks/MethodComparison_GPR_HeteroscedasticNoise.ipynb, and then replacing github.com with githubtocolab.com.
- MethodComparison_GPR_HeteroscedasticNoise
- MethodComparison_GPR_HomoscedasticNoise
- DataVisualization
- ViewResults
This should open the notebook in Google Colab. For the DataVisualization and ViewResults notebooks, all dependencies are likely already available, so you should be able to run them directly. For the MethodComparison_GPR_HeteroscedasticNoise and MethodComparison_GPR_HomoscedasticNoise notebooks, you must first install GPFlow. This can be accomplished by
(1) uncommenting the code block
!pip install gpflow==2.2.1
(2) executing the code block
(3) restarting the runtime (there should be a button at the bottom of the output for that code block).
Then you can run the notebook as normal.
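The host substitution used to generate the Colab links above can also be scripted, which may help if you want to open a notebook from a fork or a different branch. This is a minimal sketch; the function name `to_colab_url` is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def to_colab_url(github_url):
    """Swap the github.com host for githubtocolab.com, leaving the rest of the URL unchanged."""
    parts = urlsplit(github_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL, got host %r" % parts.netloc)
    return urlunsplit(parts._replace(netloc="githubtocolab.com"))

notebook = (
    "https://github.com/usnistgov/TaML/blob/main/notebooks/"
    "MethodComparison_GPR_HeteroscedasticNoise.ipynb"
)
print(to_colab_url(notebook))
```

Replacing only the host, rather than doing a plain string substitution, avoids accidentally rewriting "github.com" if it appears elsewhere in the path.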
The source code (see the taml folder) compares a variety of methods for incorporating theory into machine learning for three different machine learning models: Gaussian Process Regression with heteroscedastic noise, Gaussian Process Regression with homoscedastic noise, and Random Forest. The output can be plotted by modifying the ViewResults notebook so that the data files are read from a local run rather than from the stored data.
To run the source code:
python3 -m taml
Debra J. Audus, PhD
Polymer Analytics Project
Materials Science and Engineering Division
Material Measurement Laboratory
National Institute of Standards and Technology
Email: [email protected]
GithubID: @debraaudus
Project website: https://www.nist.gov/programs-projects/polymer-analytics
Staff website: https://www.nist.gov/people/debra-audus
If you use the code, please cite our manuscript:
Debra J. Audus, Austin McDannald, and Brian DeCost, "Leveraging Theory for Enhanced Machine Learning" ACS Macro Letters 2022 11 (9), 1117-1122 DOI: 10.1021/acsmacrolett.2c00369
If you use the data, please cite:
Audus, Debra, McDannald, Austin, DeCost, Brian (2022), Theory aware Machine Learning (TaML), National Institute of Standards and Technology, https://doi.org/10.18434/mds2-2637 (Accessed YYYY-MM-DD)