
Research
Abstracts  2006 
Learning BiophysicallyMotivated Parameters for Alpha Helix PredicationBlaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk & Srinivas DevadasIntroductionOur goal is to develop a stateoftheart protein secondary structure predictor, with an intuitive and biophysicallymotivated energy model. The lack of experimentally determined free energy values makes it difficult to design accurate cost functions that can be optimized by predictors. Our technique uses a cost function comprised of unknown parameters, and applies Support Vector Machines (SVMs) to learn parameters that correctly predict known protein structures. So far, we have focused on the prediction of allalpha proteins and have shown that a model with 302 parameters can achieve a Q_{alpha} value (percent of correctly predicted residues) of 77.6% and a SOV99_{alpha} (see [1]) value of 73.4%. As detailed in a recent technical report [2] and poster [3], these performance numbers are among the best for techniques that do not rely on additional databases (for example, multiple sequence alignments). MethodOur method assumes that a protein's secondary structure can be found by minimizing a freeenergy function G' that is computed as a sum of elementary freeenergies. For example, an elementary freeenergy might represent the energetic cost of a given residue appearing in an alpha helix. The predicted structure minimizes the sum of these costs. For fixed values of the freeenergy parameters, there are wellknown algorithms to perform such a minimization (for example, dynamic programming). Thus, our main task is to find the unknown elementary freeenergies by using a database of known protein structures. A general Support Vector Machine algorithm has been proposed that can be applied to this [4]. The key ideas of the algorithm are illustrated in the figure below. First, the problem is converted into an exponentially large system of inequalities that the elementary freeenergies must satisfy: for each sequence x_{i}, the correct secondary structure y_{i} must have a lower freeenergy G'(x_{i}, y_{i}) than for any of the incorrect secondary structures y^{j}. Next, a tractable subset of these inequalities is selected. This subset may not have any solutions, because there might not exist a set of freeenergies that is compatible with the whole database of training structures. Alternately, if the problem does have solutions, it will probably have many. The SVM techniques of margin maximization and slack variables are used to translate the reduced problem into a quadratic program that has a unique solution. The quadratic program is solved to produce a candidate set of elementary freeenergies, which is used by a structure predictor (not shown) to predict a new structure for each sequence x_{i}. If any of these structures are incorrect, they are added to the subset of incorrect structures y^{j}, a new quadratic program is generated, and a new candidate solution is found. We repeat these steps until all generated structures are correct (or in practice, a suitable termination condition is satisfied). ResultsWe have applied this method to the case of allalpha protein secondary structure prediction. We worked with a set of 300 nonhomologous allalpha proteins taken from EVA's largest sequenceunique subset of the PDB at the end of July 2005. For each run of our algorithm, we randomly selected a 150 protein training set and an independent 150 protein test set. The training set is used to learn the elementary freeenergies, and the test set is used to evaluate the result. Our predictor minimizes the freeenergy function G' using the Viterbi algorithm on a simple 7state Finite State Machine. The following table summarizes our results. The prediction accuracy is competitive with other stateoftheart predictors that do not rely on sequence alignment data. Moreover, while some techniques require upwards of 10,000 parameters, our predictor uses only 302 parameters in the form of elementary freeenergies [2].
ConclusionsThis work is a promising first pass at using SVM techniques to find the elementary freeenergies needed to predict protein secondary structure. The method we use is general and can be extended beyond the allalpha case described here. In future work, we plan to extend this method to supersecondary structure prediction, generating contact maps of individual hydrogen bonds in beta sheets. References:[1] A. Zemla, Ceslovas Venclovas, K. Fidelis, and B. Rost. A Modified Definition of Sov, a SegmentBased Measure for Protein Secondary Structure Prediction Assessment. Proteins, 34(2), 1999. [2] Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, Srinivas Devadas. Secondary Structure Prediction of AllHelical Proteins Using Hidden Markov Support Vector Machines MIT CSAIL Technical Report 1003 (MITLCSTR1003) [3] Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, Srinivas Devadas. Learning BiophysicallyMotivated Parameters for Alpha Helix Prediction. Poster at the 10th Annual International Conference on Research in Computational Molecular Biology, Venice Lido, Italy, 2006. [4] I. Tsochantaridis, T. Hofmann, T. Joachims and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. 

