Determining the number of training points required for machine learning of potential energy surfaces.

In recent years, there has been an explosion in the use of machine learning, with applications across many fields. One application of interest to the computational chemistry field is the use of a method known as Gaussian processes to accurately derive a system's Potential Energy Surface (PES)...

Full description

Bibliographic Details
Main Author: Pearson, Matt
Format: Thesis (University of Nottingham only)
Language: English
Published: 2021
Subjects: machine learning; support vector machines; Gaussian processes
Online Access:https://eprints.nottingham.ac.uk/65589/
Description: In recent years, there has been an explosion in the use of machine learning, with applications across many fields. One application of interest to the computational chemistry field is the use of a method known as Gaussian processes to accurately derive a system's Potential Energy Surface (PES) from ab-initio input-output data. A Gaussian process (GP) is a stochastic process, that is, a collection of random variables, every finite subset of which follows a multivariate normal distribution. When modelling the PES of a system with GPs, the cost of computation is proportional to the number of sample points, so in the interests of economy it becomes imperative to use no more computing time than is necessary. When examining the $H_2O-H_2S$ system, 10,000 sample points were found to be insufficient to accurately model the PES, raising the question: how many points are needed, and what makes this system so challenging? The root mean squared error (RMSE) provides a non-negative measure of the absolute fit of a model to sample data. PESs for a selection of different dimers were modelled using a Latin hypercube (LHC) sampling regime and a GP, and the RMSE evaluated against a set of test data; an LHC is a multidimensional sampling method used to generate a near-random sample of parameter values. From the RMSE data, a parametric regression was implemented to find the number of sample points, $n_{req}$, required to achieve a benchmark precision of $10^{-5}$ Hartrees $(E_h)$, and from a collection of these a correlation was observed between the relative difficulty of a system and its geometric and chemical characteristics. An exponential correlation was observed between $n_{req}$ and the number of Degrees of Freedom (DoF) of a system, making DoF the principal determinant of difficulty. A strong negative correlation was also observed between the number of permutations in a system's symmetry group and its difficulty, with a distinction made between the effects of 'flip' and 'interchange' symmetries, which reduce the points required by 50% and 34% respectively. The difficulty of a system also correlates positively with energy well depth, atomic size and atomic size disparity, though these effects are not so easily unpicked and quantified. With DoF and symmetry in mind, a general equation for estimating $n_{req}$ was formulated, and a 6 DoF system was projected to require upwards of 32,000 sample points to achieve benchmark accuracy. Since the cost of calculating the PES of a system is proportional to the number of sample points included, and high-performance computer time is limited, the ability to estimate $n_{req}$ permits better management of the computational effort. Moving forward, the methodology outlined may be used to appraise further systems of interest before committing processor time.
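
The workflow the abstract describes (LHC sampling of configurations, GP regression of the energies, RMSE against held-out test data, and a parametric regression extrapolated to a target precision) can be sketched in a few lines of Python. The toy_pes surface, the sample sizes, and the power-law form of the regression below are assumptions made purely for illustration; they are not the thesis's ab-initio data, kernel choice, or fitted model.

import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

dof = 3  # degrees of freedom of the toy system


def toy_pes(x):
    """Smooth analytic stand-in for an ab-initio PES (energies in arbitrary units)."""
    return np.sum(np.sin(3.0 * x) ** 2 + 0.1 * x ** 2, axis=1)


# Fixed test set used to evaluate every model.
X_test = qmc.LatinHypercube(d=dof, seed=1).random(500)
y_test = toy_pes(X_test)

sizes, rmses = [], []
for n in (50, 100, 200, 400, 800):
    # Latin hypercube training sample of n points, GP fit with an RBF kernel.
    X_train = qmc.LatinHypercube(d=dof, seed=n).random(n)
    y_train = toy_pes(X_train)
    kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(dof))
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_train, y_train)
    rmse = np.sqrt(np.mean((gp.predict(X_test) - y_test) ** 2))
    sizes.append(n)
    rmses.append(rmse)

# Parametric regression of RMSE against n (a power law is assumed here only for
# illustration), solved for the n that reaches the benchmark precision.
benchmark = 1e-5
slope, intercept = np.polyfit(np.log(sizes), np.log(rmses), 1)
n_req = np.exp((np.log(benchmark) - intercept) / slope)
print(f"estimated n_req for RMSE = {benchmark:g}: {n_req:,.0f} points")

The same pattern, run over several systems with differing DoF and symmetry, is what allows the kind of cross-system comparison of $n_{req}$ that the thesis reports.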
Citation: Pearson, Matt (2021) Determining the number of training points required for machine learning of potential energy surfaces. MPhil thesis, University of Nottingham.
Full Text (PDF): https://eprints.nottingham.ac.uk/65589/1/03062021thesis.pdf
License: CC BY