Determining the number of training points required for machine learning of potential energy surfaces.

In recent years, there has been an explosion in the use of machine learning, with applications across many fields. One application of interest to the computational chemistry field is the use of a method known as Gaussian processes to accurately derive a system's Potential Energy Surface (PES)...

Full description

Bibliographic Details
Main Author: Pearson, Matt
Format: Thesis (University of Nottingham only)
Language: English
Published: 2021
Subjects: machine learning; support vector machines; Gaussian processes
Online Access:https://eprints.nottingham.ac.uk/65589/
Description: In recent years, there has been an explosion in the use of machine learning, with applications across many fields. One application of interest to the computational chemistry field is the use of a method known as Gaussian processes to accurately derive a system's Potential Energy Surface (PES) from ab-initio input-output data. A Gaussian process (GP) is a stochastic process, that is, a collection of random variables, every finite subset of which follows a multivariate normal distribution. When modelling the PES of a system with GPs, the cost of computation is proportional to the number of sample points, so in the interests of economy it becomes imperative to use no more computing time than is necessary. When examining the $H_2O-H_2S$ system, 10,000 sample points were found to be insufficient to accurately model the PES, raising the question: how many points are needed, and what makes this system so challenging? The root mean squared error (RMSE) provides a non-negative measure of the absolute fit of a model to sample data. PESs for a selection of different dimers were modelled using a Latin hypercube (LHC) sampling regime and a GP, and the RMSE evaluated against a set of test data; an LHC is a multidimensional sampling method used to generate a near-random sample of parameter values. From the RMSE data, a parametric regression was implemented to find the number of sample points, $n_{req}$, required to achieve a benchmark precision of $10^{-5}$ Hartrees $(E_h)$, and from a collection of these a correlation was observed between the relative difficulty of a system and its geometric and chemical characteristics. An exponential correlation was observed between $n_{req}$ and the number of Degrees of Freedom (DoF) of a system, making DoF the principal determinant of difficulty. A strong negative correlation was also observed between the number of permutations in a system's symmetry group and its difficulty, with a distinction made between the effects of 'flip' and 'interchange' symmetries, which reduce the points required by 50% and 34% respectively. The difficulty of a system also correlates positively with energy well depth, atomic size and atomic size disparity, though these effects are not so easily unpicked and quantified. With DoF and symmetry in mind, a general equation for estimating $n_{req}$ was formulated, and a 6 DoF system was projected to require upwards of 32,000 sample points to achieve benchmark accuracy. Since the cost of calculating the PES of a system is proportional to the number of sample points included, and high-performance computer time is limited, the ability to estimate $n_{req}$ permits better management of the computational effort. Moving forward, the methodology outlined may be used to appraise further systems of interest before committing processor time.
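
The workflow the abstract describes (LHC sampling of configurations, GP regression of the energies, RMSE against held-out test data, and a parametric regression extrapolated to a target precision) can be sketched in a few lines of Python. The toy_pes surface, the sample sizes, and the power-law form of the regression below are assumptions made purely for illustration; they are not the thesis's ab-initio data, kernel choice, or fitted model.

import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

dof = 3  # degrees of freedom of the toy system


def toy_pes(x):
    """Smooth analytic stand-in for an ab-initio PES (energies in arbitrary units)."""
    return np.sum(np.sin(3.0 * x) ** 2 + 0.1 * x ** 2, axis=1)


# Fixed test set used to evaluate every model.
X_test = qmc.LatinHypercube(d=dof, seed=1).random(500)
y_test = toy_pes(X_test)

sizes, rmses = [], []
for n in (50, 100, 200, 400, 800):
    # Latin hypercube training sample of n points, GP fit with an RBF kernel.
    X_train = qmc.LatinHypercube(d=dof, seed=n).random(n)
    y_train = toy_pes(X_train)
    kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(dof))
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_train, y_train)
    rmse = np.sqrt(np.mean((gp.predict(X_test) - y_test) ** 2))
    sizes.append(n)
    rmses.append(rmse)

# Parametric regression of RMSE against n (a power law is assumed here only for
# illustration), solved for the n that reaches the benchmark precision.
benchmark = 1e-5
slope, intercept = np.polyfit(np.log(sizes), np.log(rmses), 1)
n_req = np.exp((np.log(benchmark) - intercept) / slope)
print(f"estimated n_req for RMSE = {benchmark:g}: {n_req:,.0f} points")

The same pattern, run over several systems with differing DoF and symmetry, is what allows the kind of cross-system comparison of $n_{req}$ that the thesis reports.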
Citation: Pearson, Matt (2021) Determining the number of training points required for machine learning of potential energy surfaces. MPhil thesis, University of Nottingham.
Full Text (PDF): https://eprints.nottingham.ac.uk/65589/1/03062021thesis.pdf
License: CC BY