next up previous contents
Next: Education Up: Computational Biology (Bruce Donald) Previous: Computational Biology (Bruce Donald)   Contents

Activities and Findings

A wealth of interesting computational problems arise in developing and applying information technology to understand the molecular machinery of the cell. We began to develop high-throughput, automated systems for structural and functional genomics. In particular, our goal is to design algorithms that require data measurements of only a few key biophysical parameters, which will be obtained from fast, minimal, cheap experiments. We utilize two classes of biophysical experiment: nuclear magnetic resonance (NMR) and mass spectrometry (MS). MS is highly amenable to automation and promises to scale to macromolecular complexes, but yields limited structural information. NMR yields a great deal of structure content, but is not easily automated. We began to bridge this gap through sophisticated information technology, obtaining more structural content from MS and bringing more automation to NMR. Our approach will allow the structure and function of biopolymers to be assayed at a fraction of the time and cost of current methods.

Structural genomics -- computing three-dimensional protein structure -- is a vital component of modern biology. Current techniques for computing structure using NMR spectroscopy require dozens of experiments and months of spectrometer time. We began to extend and adapt algorithms from computer vision, signal processing, and robotics in a system for determining structure from only a few key NMR spectra. We have begun by developing novel techniques for computing protein secondary structure and main-chain assignment from just four spectra. In our preliminary work, we represent NMR data with graphs encoding potential interactions between amino acid residues, and apply object recognition-like graph algorithms to uncover subgraphs encoding secondary structure, overcoming a 5-10% signal-to-noise ratio. We then use a probabilistic algorithmic framework to align identified secondary structure to the primary sequence. We plan to extend this system to a fully automated system for computing biopolymer structure from protein and DNA NMR data, using techniques from Expectation/Maximization, polyspectral analysis, and geometric reasoning to analyze input spectra, and graph algorithms to search for and exploit structural constraints. The novelty in our approach devolves to a minimalist emphasis designed for high-throughput assays, the employment of new techniques from computer science, and the application of computational methods from robotics, planning, and machine vision. JIGSAW views structure determination in noisy data as a form of object recognition and leverages algorithms from robotics and machine vision. It demonstrates two key insights: graph-based secondary structure pattern discovery, and assignment by alignment.

Structure is an integral part of function: an important task in functional genomics is the discovery of structure-activity relations (SAR), indicating binding modes of protein-protein and DNA-protein complexes. We began to develop techniques for determining structure-activity relations using new computational protocols for MS. SAR by MS enzymatically cleaves a cross-linked complex and analyzes the resulting mass spectrum for mass peaks of hypothesized fragments. Depending on the binding mode, some cleavage sites will be shielded; the absence of anticipated peaks for the corresponding fragments implicates those fragments as either part of the interaction region or inaccessible due to conformational change upon binding. Thus different mass spectra provide evidence for different structure-activity relations. An important problem in SAR by MS is the potential for mass degeneracy: when two possible fragments have approximately the same mass, the existence of one or the other cannot be uniquely inferred from a mass peak. Thus we began to study automated planning techniques, in the style of active sensing, that seek to maximize the potential information content of MS experiments and efficiently interpret the resulting data. These planning techniques determine isotopic mass manipulations that will yield unambiguous mass peaks for the resulting fragments. In preliminary work, we have proved the optimization form of the problem to be NP-complete, have tested a randomized planning algorithm and probabilistic framework for estimating interpretability, and have developed an efficient multi-spectral data analysis technique. We plan to apply computational models to predict binding modes, incorporate additional MS analysis techniques, extend the experiment planner to plan protease/endonuclease selection, and further validate and refine our techniques. We view experiment planning as a form of active sensing (from vision and robotics), where the the ``controls'' are selective isotopic labeling and limited proteolysis, and ``sensing'' is done by MS. This problem can be cast as a combinatorial optimization problem, with the objective of finding a labeling strategy that minimizes the amount of mass degeneracy (spectral overlap).

Fast structure determination and SAR by MS deliver complementary information, which we want to synergistically integrate into a complete high-throughput, automated system for structural and functional genomics. For example, a preliminary structure obtained from NMR data constrains the hypotheses on potential fragment interactions, allowing SAR by MS to focus on a smaller set of masses to disambiguate. Alternatively, SAR by MS can guide structure computation with spatial proximity information derived from binding mode. The information content in limited proteolysis provides additional synergy: potential cleavage sites in regular secondary structure have a smaller likelihood of being cleaved. Hence, secondary structure determination can guide prediction of fragments and interpretation of mass peaks, or vice-versa.

Our work integrates research and training of young scientists in the interdisciplinary field of computational molecular biology. We believe that information technology can provide leverage in structural and functional genomics, comparable to the impact of computational methods in gene sequencing and mapping. High-throughput automated structure and function will increase our understanding of biopolymer interactions in systems of significant biochemical and pharmacological interest, on a genomic scale. Our work will yield valuable results both in computational science, with the development and analysis of new algorithms, and in biochemistry, with computational systems useful for high-throughput structural and functional assays. This is a very fruitful area for the application of the robotics and vision techniques developed in our laboratory.

Two papers on the NMR work appear at the prestigious International Conference on Computational Molecular Biology (RECOMB'2000 and 2001). Another, on the MS work, appeared at the International Conference on Intelligent Systems for Molecular Biology in August, 2000. We also published two journal papers in the Journal of Computational Biology. One is on our NMR work, the other is on our Mass Spec work.

Based on this work, I was awarded a John Simon Guggenheim Memorial fellowship for 2001-2001 for my proposal entitled ``New Frontiers in Physical Geometric Algorithms,'' to apply robotics and vision algorithms to the challenges of structural proteomics.



Subsections
next up previous contents
Next: Education Up: Computational Biology (Bruce Donald) Previous: Computational Biology (Bruce Donald)   Contents
Last modified: 2005-04-06