Identifying functionally important regions in a protein
Problem statement: Given a protein, determine the regions (sets of residues) that are believed to be critical to the functioning of the protein. Besides identifying such regions, determine the relative importance of these regions and the specificity of the functions that they confer.
Hypotheses: Our method rests on two basic hypotheses:
- functionally important regions are conserved across a family of proteins.
- differences observed across subfamilies within a family correspond to mutations that confer subfamily-specific functions.
Methodology: We are given a target protein, a multiple sequence alignment (MSA) of this protein and its homologs, and a phylogenetic classification of these proteins. We use several signals to assess functional importance:
- evolutionary trace - how conserved are residues at a given level of the tree [1]?
- 3D cluster analysis - do conserved residues occur near each other [2]?
- tree traversal - at different levels in a tree, what does it mean for a residue to be conserved?
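As an illustration of the first signal, the following is a minimal sketch of how alignment columns might be labelled for one cut of the tree, in the spirit of the evolutionary trace method [1]. The function name `classify_columns`, the dictionary-based input format and the toy alignment are assumptions made for this example and are not taken from our Perl system. Gap characters are simply treated as one more residue type.

```python
from collections import defaultdict

def classify_columns(msa, groups):
    """Label each alignment column for one partition of the tree.

    msa    : dict mapping sequence name to aligned sequence (equal lengths)
    groups : dict mapping sequence name to its subfamily at this tree level
    Returns one label per column:
      'conserved'      : the same residue in every sequence
      'class-specific' : invariant within each subfamily, differs between them
      'neutral'        : varies inside at least one subfamily
    """
    length = len(next(iter(msa.values())))
    labels = []
    for col in range(length):
        # Collect the residues seen in this column, per subfamily.
        by_group = defaultdict(set)
        for name, seq in msa.items():
            by_group[groups[name]].add(seq[col])
        if any(len(residues) > 1 for residues in by_group.values()):
            labels.append("neutral")
        elif len(set.union(*by_group.values())) == 1:
            labels.append("conserved")
        else:
            labels.append("class-specific")
    return labels

# Toy alignment (hypothetical): two subfamilies, A and B.
msa = {"a1": "GKTA", "a2": "GKTC", "b1": "GRTA", "b2": "GRTD"}
groups = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
print(classify_columns(msa, groups))
# ['conserved', 'class-specific', 'conserved', 'neutral']
```

Repeating the call with the partitions obtained at successive levels of the tree shows at which level a column stops being neutral, which is one way to read the tree traversal signal.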
Each of these signals gives us a set of scores that we would like to integrate to infer a residue's importance. To do this, we are considering the following probabilistic approaches:
- Use a simple boosting predictor to determine the weights for each signal.
- Design a graphical model for the problem. Our modelling results in an HMM-like framework, except that the hidden states form a graph rather than a chain. Empirically, this graph is a small-world graph, which means that there is enough locality to still compute marginals efficiently. The remaining problems are to decide which approximations can be made to the graph for tractability and how to estimate the parameters.
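To make the graph of hidden states concrete, here is a small sketch of one way such a graph could be built. The page does not specify the construction, so the rule used here is an assumption: two residues are linked if they are adjacent in the chain or if their representative atoms lie within a distance cutoff. The function name `contact_graph`, the 8 angstrom default and the toy coordinates are likewise illustrative only.

```python
import math

def contact_graph(coords, cutoff=8.0):
    """Build an adjacency map over residues.

    coords : list of (x, y, z) tuples, one per residue in chain order
    cutoff : distance threshold in angstroms (assumed value)
    Residues are linked if they are chain neighbours or spatially close.
    """
    n = len(coords)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            sequential = (j == i + 1)
            close = math.dist(coords[i], coords[j]) <= cutoff
            if sequential or close:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors

# Toy hairpin (hypothetical coordinates): residues 0 and 3 are far apart
# in sequence but close in space, so the graph gains a non-chain edge.
coords = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 5.0, 0.0), (0.0, 5.0, 0.0)]
graph = contact_graph(coords)
print(graph[0])
# {1, 3}
```

With only the sequential edges this reduces to the chain of an ordinary HMM; the spatial edges are what turn the hidden states into a graph.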
Validation: Test on proteins with known critical residues, e.g. the SH2 domain.
Status Update: We built a system in Perl to perform evolutionary trace, tree traversal and 3D clustering. We also wrote the probabilistic models in Scilab. We implemented three models: a full-blown graphical model, which models the protein as a graph; a Naive Bayes model, in which the residues are considered independent and the spatial clustering is moved into the observation vector; and a continuous HMM, in which we treat the residues as a simple chain.
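The Naive Bayes variant can be sketched as follows. This is not our Scilab code: the Gaussian form of the per-signal likelihoods, the parameter values, the prior and the names `gaussian_logpdf` and `posterior_important` are all assumptions chosen to keep the example short. Each residue is scored independently from its observation vector of signal scores.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def posterior_important(obs, params, prior=0.1):
    """P(important | obs) under naive Bayes with Gaussian features.

    obs    : list of signal scores for one residue
    params : {'important': [(mean, var), ...], 'other': [(mean, var), ...]}
    prior  : assumed prior probability that a residue is important
    """
    log_imp = math.log(prior) + sum(
        gaussian_logpdf(x, m, v) for x, (m, v) in zip(obs, params["important"]))
    log_oth = math.log(1 - prior) + sum(
        gaussian_logpdf(x, m, v) for x, (m, v) in zip(obs, params["other"]))
    # Normalise in log space so that tiny likelihoods do not underflow.
    top = max(log_imp, log_oth)
    log_total = top + math.log(math.exp(log_imp - top) + math.exp(log_oth - top))
    return math.exp(log_imp - log_total)

# Hypothetical parameters for three signals: trace, 3D cluster, traversal.
params = {
    "important": [(0.8, 0.04), (0.7, 0.04), (0.6, 0.09)],
    "other": [(0.3, 0.04), (0.3, 0.04), (0.4, 0.09)],
}
print(posterior_important([0.9, 0.8, 0.7], params))
print(posterior_important([0.2, 0.3, 0.4], params))
```

Because the spatial clustering score is just another entry in the observation vector, this model needs no edges between residues, which is what makes it much simpler to fit than the graph model.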
The full-blown model had implementation issues due to underflow and problems with parameter estimation. The other two models were then tested on a dataset derived from the Ligand-Protein contact dataset. The dataset had a total of 5253 proteins in 12 superfamilies. We found that the Naive Bayes model and the continuous HMM performed quite close to each other, with both attaining a selectivity of 70% at 100% sensitivity. These figures are reasonable, but we believe they can be improved considerably. The first question is whether the tree traversal gives us a good signal of residue importance. The second is whether we can still work on the original protein graph. The third is to ensure that we have good quality data.
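Underflow of this kind is commonly avoided by carrying out the recursions in log space. The sketch below shows the idea for the forward pass of a chain HMM; it is a generic illustration rather than a fix we have applied to the graph model, and the names `logsumexp` and `log_forward` as well as the toy numbers are assumptions.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) for a list of log values."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_forward(log_init, log_trans, log_emit):
    """Log-likelihood of an observation sequence under a chain HMM.

    log_init  : log_init[s], log prior of state s at the first residue
    log_trans : log_trans[r][s], log P(state s | previous state r)
    log_emit  : log_emit[t][s], log-likelihood of observation t in state s
    """
    n_states = len(log_init)
    alpha = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    for t in range(1, len(log_emit)):
        alpha = [
            logsumexp([alpha[r] + log_trans[r][s] for r in range(n_states)])
            + log_emit[t][s]
            for s in range(n_states)
        ]
    return logsumexp(alpha)

# Two states, five residues, very small emission densities (toy values).
half = math.log(0.5)
log_init = [half, half]
log_trans = [[half, half], [half, half]]
log_emit = [[math.log(1e-200)] * 2 for _ in range(5)]
print(log_forward(log_init, log_trans, log_emit))
```

In this toy case the plain product of the emission densities is 1e-1000, which is rounded to zero in double precision, whereas the log-space recursion returns a finite value.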
Bibliography
- [1] O. Lichtarge et al., An Evolutionary Trace Method Defines Binding Surfaces Common to Protein Families, J. Mol. Biol. (1996) 257, 342-358.
- [2] R. Landgraf et al., Three-dimensional Cluster Analysis Identifies Interfaces and Functional Residue Clusters in Proteins, J. Mol. Biol. (2001) 307, 1487-1502.
Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.