Skip to content

Commit

Permalink
Adding Reference to SMILES String Kernel
Browse files Browse the repository at this point in the history
  • Loading branch information
Ryan-Rhys authored Oct 11, 2024
1 parent d7facdd commit 6b3ba33
Showing 1 changed file with 7 additions and 5 deletions.
12 changes: 7 additions & 5 deletions notebooks/GP Regression on Molecules.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
"source": [
"# GP Regression on Molecules #\n",
"\n",
"An example notebook for basic GP regression on a molecular dataset. We showcase two different GP models on the Photoswitch Dataset --- one using a Tanimoto kernel applied to fingerprint representations of the molecules and another also using a Tanimoto kernel but applied to a bag-of-characters representations of molecular SMILES strings (a.k.a the bag-of-SMILES model). Towards the end of the tutorial it is shown that the GP's uncertainty estimates are correlated with prediction error and can thus act as a criteria for prioritising molecules for laboratory synthesis.\n",
"An example notebook for basic GP regression on a molecular dataset. We showcase two different GP models on the Photoswitch Dataset --- one using a Tanimoto kernel applied to fingerprint representations of the molecules and another also using a Tanimoto kernel but applied to a bag-of-characters representations of molecular SMILES strings (a.k.a the bag-of-SMILES model). It should be noted that the bag-of-SMILES is equivalent to the SMILES string kernel in [1]. Towards the end of the tutorial it is shown that the GP's uncertainty estimates are correlated with prediction error and can thus act as a criteria for prioritising molecules for laboratory synthesis.\n",
"\n",
"Paper: https://pubs.rsc.org/en/content/articlelanding/2022/sc/d2sc04306h\n",
"\n",
Expand Down Expand Up @@ -62,7 +62,7 @@
"k_{\\text{Tanimoto}}(\\mathbf{x}, \\mathbf{x}') = \\sigma^2_{f}\\frac{<\\mathbf{x}, \\mathbf{x}'>}{||\\mathbf{x}||^2 + ||\\mathbf{x}'||^2 \\: - <\\mathbf{x}, \\mathbf{x}'>},\n",
"\\end{equation}\n",
"\n",
"where $\\mathbf{x} \\in \\mathbb{R}^D$ is a D-dimensional binary fingerprint vector i.e. components $\\mathbf{x}_i \\in \\{0, 1\\}$, $<\\cdot, \\cdot>$ is the Euclidean inner product, $||\\cdot||$ is the Euclidean norm and $\\sigma_{f}$ is a scalar kernel signal amplitude (vertical lengthscale) hyperparameter. One of the first instances of the Tanimoto kernel being used in conjunction with GP regression was in [1]. While common GP kernels that operate on continuous spaces can be applied to molecules, there is evidence to suggest that using an appropriate similarity metric for bit vectors yields improved performance [2]. \n",
"where $\\mathbf{x} \\in \\mathbb{R}^D$ is a D-dimensional binary fingerprint vector i.e. components $\\mathbf{x}_i \\in \\{0, 1\\}$, $<\\cdot, \\cdot>$ is the Euclidean inner product, $||\\cdot||$ is the Euclidean norm and $\\sigma_{f}$ is a scalar kernel signal amplitude (vertical lengthscale) hyperparameter. One of the first instances of the Tanimoto kernel being used in conjunction with GP regression was in [2]. While common GP kernels that operate on continuous spaces can be applied to molecules, there is evidence to suggest that using an appropriate similarity metric for bit vectors yields improved performance [4]. \n",
"\n"
]
},
Expand Down Expand Up @@ -654,12 +654,14 @@
"metadata": {},
"source": [
"## References \n",
"[1] D-S Cao, J-C Zhao, Y-N Yang, C-X Zhao, J Yan, S Liu, Q-N Hu, Q-S Xu, and Y-Z Liang. [In silico toxicity prediction by support vector machine and SMILES representation-based string kernel](https://pubmed.ncbi.nlm.nih.gov/22224501/). SAR and QSAR in Environmental Research, 2012.\n",
"\n",
"[1] Griffiths, R.R., Greenfield, J.L., Thawani, AR, Jamasb, A., Moss, H.B, Bourached, A., Jones, P., McCorkindale, W., Aldrick, A.A. Fuchter, M.J. and Lee, A.A., [Data-driven discovery of molecular photoswitches with multioutput Gaussian processes](https://pubs.rsc.org/en/content/articlehtml/2022/sc/d2sc04306h). Chemical Science 2022.\n",
"\n",
"[2] Bajusz, D., Rácz, A. and Héberger, K., 2015. [Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-015-0069-3). Journal of Cheminformatics, 7(1), pp.1-13.\n",
"[2] Griffiths, R.R., Greenfield, J.L., Thawani, AR, Jamasb, A., Moss, H.B, Bourached, A., Jones, P., McCorkindale, W., Aldrick, A.A. Fuchter, M.J. and Lee, A.A., [Data-driven discovery of molecular photoswitches with multioutput Gaussian processes](https://pubs.rsc.org/en/content/articlehtml/2022/sc/d2sc04306h). Chemical Science 2022.\n",
"\n",
"[3] Moriwaki, H., Tian, Y.S., Kawashita, N. and Takagi, T., 2018. [Mordred: a molecular descriptor calculator](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y?ref=https://githubhelp.com). Journal of Cheminformatics, 10(1), pp.1-14."
"[3] Bajusz, D., Rácz, A. and Héberger, K., 2015. [Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-015-0069-3). Journal of Cheminformatics, 7(1), pp.1-13.\n",
"\n",
"[4] Moriwaki, H., Tian, Y.S., Kawashita, N. and Takagi, T., 2018. [Mordred: a molecular descriptor calculator](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y?ref=https://githubhelp.com). Journal of Cheminformatics, 10(1), pp.1-14."
]
}
],
Expand Down

0 comments on commit 6b3ba33

Please sign in to comment.