Theory aware Machine Learning (TaML)

This repository supports the following manuscript

Debra J. Audus, Austin McDannald, and Brian DeCost, "Leveraging Theory for Enhanced Machine Learning" ACS Macro Letters 2022 11 (9), 1117-1122 DOI: 10.1021/acsmacrolett.2c00369,

which explores methods for incorporating imperfect theory into machine learning for improved prediction and explainability. Specifically, it focuses on a case study of the dimensions of a polymer chain, here the radius of gyration, in different solvent qualities. Three machine learning models are considered: Gaussian Process Regression with heteroscedastic noise, Gaussian Process Regression with homoscedastic noise, and Random Forest. Of the three, we encourage use of Gaussian Process Regression with heteroscedastic noise as it provides accurate uncertainty estimates.
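
As a rough illustration of what "incorporating imperfect theory" can mean, the sketch below lets a simple model learn only the correction to a theoretical prediction. It is a minimal example under stated assumptions and not necessarily one of the approaches compared in the manuscript: the data are synthetic, the "theory" is ideal-chain scaling (radius of gyration proportional to N^0.5), the "truth" follows the good-solvent exponent of approximately 0.588, and all names (rg_theory, rg_predict, n_monomers) are hypothetical.

```python
import numpy as np

# Synthetic "measurements" (assumption: made-up data, not from the manuscript).
rng = np.random.default_rng(0)
n_monomers = np.logspace(1, 3, 20)      # chain lengths N from 10 to 1000
true_exponent = 0.588                   # approximate good-solvent exponent
log_noise = rng.normal(0.0, 0.02, n_monomers.size)
rg_measured = n_monomers**true_exponent * np.exp(log_noise)


def rg_theory(n):
    """Imperfect theory: ideal-chain scaling, Rg ~ N^0.5 (unit prefactor)."""
    return n**0.5


# Learn only the correction to the theory: a straight line in log-log space
# fitted to the ratio of measurement to theory.
log_n = np.log(n_monomers)
log_ratio = np.log(rg_measured / rg_theory(n_monomers))
slope, intercept = np.polyfit(log_n, log_ratio, deg=1)


def rg_predict(n):
    """Theory multiplied by the learned correction factor."""
    n = np.asarray(n, dtype=float)
    return rg_theory(n) * np.exp(slope * np.log(n) + intercept)


# Extrapolate one decade beyond the training range.
n_new = 10_000.0
print("theory only:", rg_theory(n_new))
print("theory + learned correction:", float(rg_predict(n_new)))
print("noise-free value:", n_new**true_exponent)
```

Because the model only has to capture the mismatch between theory and data, a very simple correction is enough here; with real data the correction would typically be learned by one of the models listed above.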

Gaussian Process Regression with heteroscedastic noise relies on the GPFlow Python package. Because GPFlow does not natively support heteroscedastic noise, we implement a derived class that adds this functionality (see taml/GPRhetero.py). Gaussian Process Regression with homoscedastic noise uses GPFlow's native implementation. Random Forest is implemented using Scikit-learn.
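
To make the distinction between the two noise models concrete, here is a minimal NumPy sketch of Gaussian Process Regression in which the noise variance may differ for each training point. It is not the code in taml/GPRhetero.py and does not use GPFlow; the function names (rbf_kernel, gp_predict), the fixed kernel hyperparameters, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(xa, xb, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel between two 1D input arrays."""
    sqdist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)


def gp_predict(x_train, y_train, noise_var, x_test, variance=1.0, lengthscale=1.0):
    """Posterior mean and variance of the latent function at x_test.

    noise_var may be a scalar (homoscedastic noise) or an array with one
    variance per training point (heteroscedastic noise).
    """
    noise = np.broadcast_to(np.asarray(noise_var, dtype=float), x_train.shape)
    # The only difference between the two noise models is this diagonal term.
    K = rbf_kernel(x_train, x_train, variance, lengthscale) + np.diag(noise)
    K_s = rbf_kernel(x_train, x_test, variance, lengthscale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = variance - np.sum(v**2, axis=0)  # prior variance minus explained part
    return mean, var


# Toy data: the first four points are measured precisely, the last four are not.
x_train = np.linspace(0.0, 5.0, 8)
y_train = np.sin(x_train)
noise_hetero = np.array([1e-4] * 4 + [0.5] * 4)

x_test = np.array([x_train[0], x_train[-1]])
mean, var = gp_predict(x_train, y_train, noise_hetero, x_test)
print("posterior mean:", mean)
print("posterior variance:", var)
```

With per-point noise, the posterior variance is small near the precisely measured points and larger near the noisy ones, which a single shared noise variance cannot express.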

The repository is intended for the following use cases:

  • Illustrate key ideas from the manuscript, including incorporating theory and using Gaussian Process Regression with heteroscedastic noise (see notebooks/MethodComparison_GPR_HeteroscedasticNoise and the companion notebook without heteroscedastic noise, notebooks/MethodComparison_GPR_HomoscedasticNoise).
  • Provide code for Gaussian Process Regression with heteroscedastic noise (which can be used after installation with from taml.GPRhetero import GPRhetero).
  • Reproduce figures from our manuscript (see the notebooks folder).
  • Allow for full reproducibility of the data in the manuscript.

Running the code

All code is written in Python and requires Python >= 3.7. It can be used on any operating system. Other requirements are listed in requirements.txt.
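
If you are unsure whether your interpreter meets the version requirement, a small check such as the following can be run first. This is only a convenience sketch; the helper name python_is_supported is hypothetical and not part of the repository.

```python
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


if python_is_supported():
    print("Python version is supported.")
else:
    print("Python 3.7 or later is required; found", sys.version.split()[0])
```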

If you are only interested in running the Jupyter Notebooks in Google Colab, you can skip ahead to Notebooks.

First clone the code via

git clone https://github.com/usnistgov/TaML.git

and navigate to the directory where the repository lives

cd TaML

Next, one needs to create a virtual environment. This can be done using Python virtual environments or with Anaconda. Both options are listed below.

Create a Python virtual environment (option 1)

First, make sure you are using Python 3.7 or later.

python3 -m venv env

where env is the location of the virtual environment

Activate the virtual environment

source env/bin/activate

Install dependencies

python3 -m pip install -r requirements.txt

Create a virtual environment with Anaconda (option 2)

First, install conda. Then create the environment defined in environment.yml via

conda env create -f environment.yml

If you are using conda>=4.6, activate the virtual environment via

conda activate TaML

Otherwise, see the conda documentation.

GPFlow 2.2.1 is not available on conda channels and must be installed via pip

pip install gpflow==2.2.1

Install the TaML package

For users who wish to use the source code or import functions, the TaML package can be installed via

pip install .
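
After installation, a quick way to confirm that the environment is usable is to check that the main packages can be found for import. The sketch below assumes the import names taml (as used elsewhere in this README), gpflow, and sklearn; the helper missing_packages is hypothetical and not part of the repository.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names assumed for the TaML package, GPFlow, and Scikit-learn.
required = ["taml", "gpflow", "sklearn"]
missing = missing_packages(required)
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All packages found.")
```

This only checks that the packages are discoverable; it does not verify their versions.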

Notebooks

The repository includes the following notebooks:

  • DataVisualization visualizes the input data used for machine learning.
  • MethodComparison_GPR_HeteroscedasticNoise compares different methods for incorporating theory into machine learning using Gaussian Process Regression with heteroscedastic noise.
  • MethodComparison_GPR_HomoscedasticNoise compares different methods for incorporating theory into machine learning using Gaussian Process Regression with homoscedastic noise.
  • ViewResults plots the relative performance of different methods for incorporating theory into machine learning for three different machine learning models.

Running notebooks locally (option 1)

For users interested in testing ideas, we recommend focusing on the MethodComparison_GPR_HeteroscedasticNoise notebook as it explores the different methods and takes into account the known uncertainties in the input data.

If you cloned the repository, the Jupyter notebooks can be run by navigating to the notebooks folder and using the command

jupyter notebook

Running notebooks in Google Colab (option 2)

If you are interested in running one or more notebooks in Google Colab, first click on the relevant link below. Note that these links were generated by navigating to the notebook of interest on the TaML GitHub page, for example, https://github.com/usnistgov/TaML/blob/main/notebooks/MethodComparison_GPR_HeteroscedasticNoise.ipynb, and then replacing github.com with githubtocolab.com.

This should open the notebook in Google Colab. For the DataVisualization and ViewResults notebooks, all dependencies are likely available and you should be able to directly run them. For the MethodComparison_GPR_HeteroscedasticNoise and MethodComparison_GPR_HomoscedasticNoise notebooks, you must install GPFlow. This can be accomplished by

(1) uncommenting the code block

!pip install gpflow==2.2.1

(2) executing the code block

(3) restarting the runtime (there should be a button at the bottom of the output for that code block).

Then you can run the notebook as normal.
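
The Colab links described at the start of this section follow a simple rule: take the notebook's GitHub URL and swap the host. The sketch below applies that rule; the helper name colab_url is hypothetical, and it assumes every notebook lives under notebooks/ on the main branch with an .ipynb extension, as in the example URL above.

```python
GITHUB_HOST = "github.com"
COLAB_HOST = "githubtocolab.com"


def colab_url(github_url):
    """Replace the github.com host with githubtocolab.com."""
    prefix = "https://" + GITHUB_HOST + "/"
    if not github_url.startswith(prefix):
        raise ValueError("expected a URL starting with " + prefix)
    return "https://" + COLAB_HOST + "/" + github_url[len(prefix):]


# Notebook names taken from the Notebooks section; the path pattern is assumed.
base = "https://github.com/usnistgov/TaML/blob/main/notebooks/"
notebooks = [
    "DataVisualization",
    "MethodComparison_GPR_HeteroscedasticNoise",
    "MethodComparison_GPR_HomoscedasticNoise",
    "ViewResults",
]
for name in notebooks:
    print(colab_url(base + name + ".ipynb"))
```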

Source code

The source code (see the taml folder) compares a variety of methods for incorporating theory into machine learning for three different machine learning models: Gaussian Process Regression with heteroscedastic noise, Gaussian Process Regression with homoscedastic noise, and Random Forest. The output files can be plotted by modifying the ViewResults notebook so that the data files are read from a local run rather than from the stored data.

To run the source code

python3 -m taml

Contact

Debra J. Audus, PhD
Polymer Analytics Project
Materials Science and Engineering Division
Material Measurement Laboratory
National Institute of Standards and Technology

Email: [email protected]
GitHub ID: @debraaudus
Project website: https://www.nist.gov/programs-projects/polymer-analytics
Staff website: https://www.nist.gov/people/debra-audus

How to cite

If you use the code, please cite our manuscript:

Debra J. Audus, Austin McDannald, and Brian DeCost, "Leveraging Theory for Enhanced Machine Learning" ACS Macro Letters 2022 11 (9), 1117-1122 DOI: 10.1021/acsmacrolett.2c00369

If you use the data, please cite:

Audus, Debra, McDannald, Austin, DeCost, Brian (2022), Theory aware Machine Learning (TaML), National Institute of Standards and Technology, https://doi.org/10.18434/mds2-2637 (Accessed YYYY-MM-DD)