Using Hidden Markov Models for Protein Sequence Prediction

by David Wagner

http://www.cs.dartmouth.edu/~dwagn/compbio/proj/



This Project is an attempt to develop a new algorithm to predict the Amino Acid sequence of a protein. The algorithm is given information about its structure, and must decide which amino acid appears in which location of the sequence. Several algorithms exist already which do similar tasks, including (Dahiyat) and (Jiang).

The objectives of these algoritms are similar. Given the structure of a protein, such as this insulin:

find the amino acids that make up that protein:

(Dahiyat) builds a 'de Novo' sequence through a combination of an energy function, and dead-end elimination (Desmet). The sequence constructed from this algorithm may bear little resemblance to any native protein sequence. In fact this may be more stable than a similarly shaped native protein.

(Jiang) develops a new energy function which combines the forces of 'bumps' or steric hinderance, with conformational entropy, and thermodynamics. The use of a simulated annealing algorithm helps to arrive at a mostly optimal energy. Here it is assumed that the exposed amino acids are known. and the core amino acid are left for prediction. This algorithm is mostly successful in this task. If allowances are made for some amino acids which are quite similar in feature, then its prediction is quite accurate.

Our algorithm attempts to apply a Hidden Markov Model (HMM) to this problem. Instead of using an energy function, we gather statistical data to 'train' the hidden markov model to be able to predict the next amino acid in a sequence. The HMM uses information about the Phi and Psi angles, as well as the adjacent amino acid, in its prediction.

The Program

To begin this project we downloaded several sequences from the Brookhaven Protein Data Bank. We then wrote a program to read these files and calculate the phi and psi angles. This program turned out to be useful in its own right for reading information about these files. This information can be obtained with the command:

SeqCalc -report filename

This also proved to be a rather informative experience as we were required to learn about exactly how phi and psi are calculated. (Bernstein) describes the format of these files, and (Foley) gives an excellent description of how to perform rotations and translations of points in space. These procedures were used to calculate the phi and psi angles. (Stryer) shows a pictoral example of Phi and Psi. Read samplerun0.txt for an example of this part of the program displaying information about a zinc finger .

Hidden Markov Models

A Hidden Markov Model can be characerized generally as a finite automata with probabilities on its edges. The specifics of this vary, so we will use the definition from (Rabiner). The states of a HMM correspond to some hidden feature that we cannot see but might like to predict. All we can see are the output symbols. In this version of the HMM, there is a series of output symbols associated with each state. All output symbols are displayed as a result of being in a state which can display that symbol.

The edges represent transitions between states. Between each output symbol a transition is taken. All the outgoing edges from a state are weighted probabilistically, as are all the output symbols associated with a state. The challenge of this model is to find the most likely series of states that yields the observed output symbols.

Training

In order to assign weights to the edges and output symbols of our HMM, we decided to collect sample data and use it as a representative of the probabilities. This program was next extended to output these training samples for the HMM. A training sequence consists of a quadruple containing:

Preceding AA, Phi, Psi, Current AA

The objective here is to predict the Current AA, given the three preceding pieces of information. In an ideal model, each quadruple will eventually be assigned a probability, such that all quadruples with the same three initial parameters will sum to 1. This would give us a complete probability model for every input. Our model will differ slightly, by separating Preceding AA, from from Psi and Phi, so as to conform with the HMM specifications. To speed up the calculation, the Psi and Phi angles will be divided into reasonable intervals of 18 degrees each. The following command will create a training file with all the quadurples from the input file:

SeqCalc -train filename

Multiple invocations of this command will extend the training file.

Implementation

Next the Hidden Markov Model itself was implemented. There is a state for each Amino Acid, and the edges between the states with probabilities as specified by the training. The hidden Markov model applies to this situation, because we do not know which state (amino acid) we are in at any time, and we are relying on an observed sequence of values (psi and phi angles) to predict the sequence of states. We also need to add the probability of seeing a particular pair of angles in each state. (Rabiner) gives a good description of building HMM's. Here is what a part of our HMM might look like:


A Hidden Markov Model
States represent Amino Acids while edges represent the probability
that one Amino Acid will follow another

The Viterbi Algoritm

A well know algorithm used with HMM's is the Viterbi algorithm. This is a dynamic programming algorthm which finds the most probable sequence of states, given an observed sequence. This algorithm is ideally suited for our task. We implemented the Viterbi algorithm, and recorded the most probable sequence as our prediction. (Rabiner) gives a good description of the Viterbi algorithm.

Results

Single pass prediction

To begin testing this program, we started by training the HMM with a single protein, and then tried to use this to predict the same protein. This test yielded mixed results. The prediction can be performed with the command:

SeqCalc -predict filename

For some proteins, the sequence was able to match the original sequence over a substantial percentage. The example samplerun1.txt shows the Equine Herpes Virus is predicted with fair accuracy. Although the algorithm frequently gets lost, it is often able to pick up the sequence again. Usually, where it loses the sequence corresponds with alternate training on the same starting amino acid. Notice here that GLU is incorrectly predicted in the third position following an ALA, because there is a ALA-GLU sequence shortly afterwards in the native sequence. Similar ALA-GLU mispredictions recur elsewhere.

The algorithm performed much more poorly with other sequences. For example, in samplerun2.txt the algorithm predicts LYS in almost every position of the zinc finger. However, this may be understandable, due to the multiple LYS-LYS pairs in the original sequence.

Although these results are less than satisfactory at prediction, as we should expect without much training, they serve to show that the algorithm is performing as intended. We can hope that some global relation between adjacent pairs, and the phi and psi angles appearing after substantial training, will improve the prediction record.

Prediction after training

Finally, we trained the HMM on some 30 proteins chosen at random from the Brookhaven protein data bank. This testbed contains over 6000 amino acids. Although the HMM contains 400 edges, and 4000 angle entries, it is our hope that the clumpy nature of the data will fill in most of the important areas sufficiently. A printout of the HMM revaeals that the possible angles tend to be clustered around the region where phi=[-80, -120] and psi=[-40,180]. Indeed only Glysine (Node 5) and Asparagine (Node 11) deviate from this region with any frequency.

When the trained HMM is tested on a new protein ( samplerun4.txt) it seems to accurately predict the Glysine's, but little else. Perhaps it is this unique flexility of Glysine that makes it readily identifiable.

When run on one of the proteins used in training ( samplerun3.txt) it does not perform much better. There are a few short strings of accurate prediction, interspersed by mostly incorrect entries.

This example also demonstrates a remarkable tendency to fixate on a particular amino acid, in this case Alanine. In fact, it seems to get stuck on Alanine for the last three quarters of the sequence. Regrettably, the algorithm seem unable to even predict the sequence it was trained on.

Conclusions

Althouth the tendency to fixate may point to an error in the algorithm, or even the need for more training, it appears so far that the hidden markov model is not useful for predicting an amino acid sequence when the only information known is its Psi and Phi angle. However the semi-accurate prediction of Glysine points to the ability of the algorithm to take advantage of a unique feature of a particular amino acid. If the amino acids could each be characterized differently then there may be some hope for this method. Some additional parameters that may help with this are, whether the residue lies on a solvent accessible surface (which would classify the residue as hydrophobic or hydrophilic), the tendency of the C-beta to point towards the core or not, the amount of space available for the side chain (some amino acids have much more massive side chains than others), and position within alpha-helices or beta-sheets.

Source, etc.

References