Graphical model of proteins

Protein sequences are constrained both at individual residue positions (conservation) and between positions (covariation). Selective pressures to maintain structure and function have constrained sequences over time and across species. The constraints thus manifested in sequence-structure-function relationships can be inferred from the evolutionary record, along with information from available structural studies and functional assays. The identified relationships can then be employed in several "directions", e.g., to predict function from the sequence of a newly discovered protein, to discriminate among predicted structures for a sequence according to functional tests, and to design variant (homologous) protein sequences with related functions.

We are developing approaches to learn and use probabilistic graphical models (also known as Markov random fields) that capture the significant conservation and coupling observable in a multiple sequence alignment. By incorporating structural information, our models can provide mechanistic explanations for observed constraints. By incorporating functional class information, they can perform interpretable classification of new sequences, explaining decisions in terms of the underlying conservation and coupling constraints. By incorporating information about interacting proteins, they can identify "cross-coupling" constraints and make explainable predictions about novel interactions. Finally, the models can be used generatively, to design new sequences that are consistent with the modeled constraints and are thus predicted to fold and function.
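
To make the two kinds of signal concrete, the following minimal sketch computes per-column entropy (low entropy indicating conservation) and pairwise mutual information (high values suggesting coupling) from a toy alignment. It is an illustration only, not the learning procedure described above: the alignment `msa` and the helper names are hypothetical, and mutual information is a simple stand-in statistic rather than a fitted Markov random field.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy alignment: rows are sequences, columns are aligned positions.
msa = [
    "ACDA",
    "ACEA",
    "AGDK",
    "AGEK",
]

def column(msa, i):
    """Return the residues observed at aligned position i."""
    return [seq[i] for seq in msa]

def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of symbols."""
    n = len(symbols)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(msa, i, j):
    """Mutual information (in bits) between aligned positions i and j."""
    a, b = column(msa, i), column(msa, j)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

n_cols = len(msa[0])

# Conservation signal: entropy 0 means the column never varies.
conservation = [entropy(column(msa, i)) for i in range(n_cols)]

# Coupling signal: one score per pair of positions.
coupling = {
    (i, j): mutual_information(msa, i, j)
    for i, j in combinations(range(n_cols), 2)
}

for i, h in enumerate(conservation):
    print(f"position {i}: entropy = {h:.2f} bits")
for (i, j), mi in sorted(coupling.items(), key=lambda kv: -kv[1]):
    print(f"positions {i},{j}: MI = {mi:.2f} bits")
```

In this toy alignment, position 0 is fully conserved, while positions 1 and 3 always change together and so receive the highest pairwise score. A graphical model goes beyond such pairwise statistics by keeping only those couplings needed to explain the alignment jointly, which helps separate direct constraints from indirect correlations.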