Pearson, Matt
(2021)
Determining the number of training points required for machine learning of potential energy surfaces.
MPhil thesis, University of Nottingham.
Abstract
In recent years, there has been an explosion in the use of machine learning, with applications across many fields. One application of interest to computational chemistry is the use of Gaussian processes (GPs) to accurately derive a system's potential energy surface (PES) from ab initio input-output data. A Gaussian process is a stochastic process: a collection of random variables, every finite subset of which has a multivariate normal distribution.
When modelling the PES of a system with GPs, the cost of computation is proportional to the number of sample points, so in the interests of economy it is important to use no more computing time than is necessary.
When examining the $H_2O$-$H_2S$ system, 10,000 sample points were found to be insufficient to accurately model the PES, raising the question: how many points are needed, and what makes this system so challenging?
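For reference, the defining property of a Gaussian process can be written as below. The symbols $f$, $m$ and $k$ (the modelled function, its mean function and its covariance function) are generic notation assumed here and are not taken from the thesis.

$$
f \sim \mathcal{GP}(m, k)
\iff
\bigl(f(x_1), \dots, f(x_n)\bigr)^{\top} \sim \mathcal{N}(\boldsymbol{\mu}, K)
\quad \text{for every finite set } \{x_1, \dots, x_n\},
$$

$$
\mu_i = m(x_i), \qquad K_{ij} = k(x_i, x_j).
$$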
The root mean squared error (RMSE) provides a non-negative measure of the absolute fit of a model to sample data. A Latin hypercube (LHC) is a multidimensional sampling scheme used to generate a near-random sample of parameter values. PESs for a selection of different dimers were modelled using an LHC sampling regime and a GP, and the RMSE was evaluated against a set of test data. From the RMSE data, a parametric regression was used to find the number of sample points required, $n_{req}$, to achieve a benchmark precision of $10^{-5}$ Hartrees $(E_h)$. From a collection of these results, a correlation was observed between the relative difficulty of a system and its geometric and chemical characteristics.
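The workflow described above (LHC sampling, RMSE evaluation, and extrapolation to a target precision) might be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names are hypothetical, the GP fitting step is omitted, and the abstract does not state the form of the parametric regression, so a power law $\mathrm{RMSE} \approx a\,n^{-b}$ fitted in log-log space is assumed here.

```python
import numpy as np


def latin_hypercube(n_points, bounds, seed=0):
    """Draw n_points Latin hypercube samples inside per-dimension bounds.

    bounds is a sequence of (low, high) pairs, one per dimension.
    Each dimension is split into n_points equal strata and every
    stratum receives exactly one sample.
    """
    bounds = np.asarray(bounds, dtype=float)
    n_dims = bounds.shape[0]
    rng = np.random.default_rng(seed)
    # Independently shuffle the stratum indices for each dimension.
    strata = np.array([rng.permutation(n_points) for _ in range(n_dims)]).T
    # Place each sample uniformly at random inside its stratum.
    unit = (strata + rng.random((n_points, n_dims))) / n_points
    return bounds[:, 0] + unit * (bounds[:, 1] - bounds[:, 0])


def rmse(predicted, reference):
    """Root mean squared error between predictions and reference values."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))


def estimate_n_req(n_train, rmse_values, target=1e-5):
    """Extrapolate the training-set size needed to reach a target RMSE.

    ASSUMPTION: RMSE follows a power law a * n**(-b), which is a straight
    line in log-log space. The thesis may use a different parametric form.
    """
    slope, intercept = np.polyfit(np.log(n_train), np.log(rmse_values), 1)
    return float(np.exp((np.log(target) - intercept) / slope))
```

In use, one would train a GP on `latin_hypercube(n, bounds)` for several values of `n`, record `rmse` on held-out test energies for each, and pass the resulting pairs to `estimate_n_req`. The extrapolation is only as trustworthy as the assumed functional form over the range being extrapolated.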
An exponential relationship was observed between $n_{req}$ and the number of degrees of freedom (DoF) of a system, making DoF the principal determinant of difficulty. A strong negative correlation was also observed between the number of permutations in a symmetry group and the difficulty of that system, with a distinction made between the effects of 'flip' and 'interchange' symmetries, which reduce the points required by 50% and 34% respectively. The difficulty of a system also correlates positively with energy well depth, atomic size and atomic size disparity, though these effects are harder to separate and quantify. With DoF and symmetry in mind, a general equation for estimating $n_{req}$ was formulated, and a 6 DoF system was projected to require upwards of 32,000 sample points to achieve benchmark accuracy.
Since the cost of calculating the PES of a system is proportional to the number of sample points included, and high-performance computing time is limited, the ability to estimate $n_{req}$ permits better management of computational effort. Moving forward, the methodology outlined may be used to appraise further systems of interest before committing processor time.