In recent years, rapid increases in computing power have allowed for the use of quantum mechanical calculations in larger and more complex chemical systems. However, even on dedicated research computing clusters, such calculations can be quite time-consuming. Additionally, the computation time scales quadratically or cubically with the size of the chemical system being modelled. So when the research question demands thousands or even hundreds of thousands of calculations, one reasonaable question to ask is whether it is necessary to do every calculation. Instead of doing 100,000 calculations, what if we make 30,000 calculations, and then interpolate the rest by some means? If this proves feasible in practice, it would be an effective time-saving measure. The questions that naturally arise would be: what methods are most suitable, and how good a job could they do to predict quickly what would take a supercomputer 15-20 minutes or more to compute?
In this project I will attempt to answer this question by seeing how accurately the binding energy of the certain chemical system can be predicted using neural networks and linear regression.
The aforementioned chemical system is one that is significant to understanding gold catalysis. It consists of a cluster of gold atoms (on the nanometer scale - a nanocluster) and a single oxygen atom above the surface of the gold. The nanocluster is a slab made of layers of gold atoms packed together in a lattice; there are different possible arrangements for this packing. The slab extends infinitely in two dimensions. In the simplest calculations, all atoms are fixed, and the binding potential is calculated. Such a calculation currently takes around 20 minutes. The oxygen is then moved incrementally over the surface of the gold, and another calculation is done at each increment. Calculating the energy at many coordinates ultimately yields a spatial profile for the gold-oxygen interaction.
Figure 1: The chemical system being modelled is a gold nanocluster surface, with a lone oxygen atom situated above it. The purple spheres represent gold atoms in the slab, which extends infinitely on a horizontal plane (only one cell is shown, but this cell is repeated in the actual calculations by SeqQuest). The The top surface of the nanocluster has a step-like shape. The red sphere is an oxygen atom located at a specific position above the gold surface.
I intend to use locally weighted linear regression (LWLR) and artificial neural networks (ANN) to predict the binding energy. The inputs will, in the simple case, be the coordinates (x, y, z) of the oxygen atom, since the gold atoms on the surface are fixed. I also intend to try using the coordinates of the k nearest gold atoms as features, and see if the added information yields improved results. It will be interesting to see if the coordinates of the gold atoms improves the predictions, and if so,to then see what values of k yield the best results.
I will be using a dataset compiled by April Daigle, a graduate student in chemistry at Dartmouth College. The data was obtained using SeqQuest, a quantum chemistry code used to model periodic chemical systems. The dataset includes the coordinates for the all the gold atoms in the cell, the coordinates of the oxygen atom as it is moved around, and the binding energy at each location. The dataset includes approximately 100,000 datapoints, so having data for cross validation and testing will not be an issue.
By the milestone (5/11/10), I plan to have completed a literature review of machine learning techniques used for quantum mechanical computation. I also intend to have a basic implementation of the two algorithms (LWLR and neural networks). I will hopefully have some results for the simple case of just using the coordinates of the oxygen atom. This means having results for both methods, using cross validation for training and measuring performance on the test set.