Chris Bailey-Kellogg
Computer Science
Dartmouth


home
research
publications
lab
teaching
personal
Selected publication abstracts

  • X. Ye, A.M. Friedman, and C. Bailey-Kellogg, "Optimizing Bayes error for protein structure model selection by stability mutagenesis", Proc. CSB, 2008, to appear.

    Site-directed mutagenesis affects protein stability in a manner dependent on the local structural environment of the mutated residue; e.g., a hydrophobic to polar substitution would behave differently in the core vs. on the surface of the protein. Thus site-directed mutagenesis followed by stability measurement enables evaluation of and selection among predicted structure models, based on consistency between predicted and experimental stability changes (ΔΔG values). This paper develops a method for planning a set of individual site-directed mutations for protein structure model selection, so as to minimize the Bayes error, i.e., the probability of choosing the wrong model. While in general it is hard to calculate exactly the multi-dimensional Bayes error defined by a set of mutations, we leverage the structure of "ΔΔG space" to develop tight upper and lower bounds. We further develop a lower bound on the Bayes error of any plan that uses a fixed number of mutations from a set of candidates. We use this bound in a branch-and-bound planning algorithm to find optimal and near-optimal plans. We demonstrate the significance and effectiveness of this approach in planning mutations for elucidating the structure of the pTfa chaperone protein from bacteriophage lambda.

  • F. Xiong, G. Pandurangan, and C. Bailey-Kellogg, "Contact replacement for NMR resonance assignment", Proc. ISMB, 2008, to appear.

    Motivation: Complementing its traditional role in structural studies of proteins, nuclear magnetic resonance (NMR) spectroscopy is playing an increasingly important role in functional studies. NMR dynamics experiments characterize motions involved in target recognition, ligand binding, etc., while NMR chemical shift perturbation experiments identify and localize protein-protein and protein-ligand interactions. The key bottleneck in these studies is to determine the backbone resonance assignment, which allows spectral peaks to be mapped to specific atoms. This paper develops a novel approach to address that bottleneck, exploiting an available x-ray structure or homology model to assign the entire backbone from a set of relatively fast and cheap NMR experiments.

    Results: We formulate contact replacement for resonance assignment as the problem of computing correspondences between a contact graph representing the structure and an NMR graph representing the data; the NMR graph is a significantly corrupted, ambiguous version of the contact graph. We first show that by combining connectivity and amino acid type information, and exploiting the random structure of the noise, one can provably determine unique correspondences in polynomial time with high probability, even in the presence of significant noise (a constant number of noisy edges per vertex). We then detail an efficient randomized algorithm and show that, over a variety of experimental and synthetic datasets, it is robust to typical levels of structural variation (1-2 Angstrom), noise (250-600%) and missings (10-40%). Our algorithm achieves very good overall assignment accuracy, above 80% in alpha helices, 70% in beta sheets, and 60% in loop regions.

    preprint.

  • W. Zheng, A.M. Friedman, and C. Bailey-Kellogg, "Algorithms for joint optimization of stability and diversity in planning combinatorial libraries of chimeric proteins", Proc. RECOMB, 2008, pp. 300-314.

    In engineering protein variants by constructing and screening combinatorial libraries of chimeric proteins, two complementary and competing goals are desired: the new proteins must be similar enough to the evolutionarily-selected wild-type proteins to be stably folded, and they must be different enough to display functional variation. We present here the first method, Staversity, to simultaneously optimize stability and diversity in selecting sets of breakpoint locations for site-directed recombination. Our goal is to uncover all "undominated" breakpoint sets, for which no other breakpoint set is better in both factors. Our first algorithm finds the undominated sets serving as the vertices of the lower envelope of the two-dimensional (stability and diversity) convex hull containing all possible breakpoint sets. Our second algorithm identifies additional breakpoint sets in the concavities that are either undominated or dominated only by undiscovered breakpoint sets within a distance bound computed by the algorithm. Both algorithms are efficient, requiring only time polynomial in the numbers of residues and breakpoints, while characterizing a space defined by an exponential number of possible breakpoint sets. We applied Staversity to identify 2-10 breakpoint sets for three different sets of parent proteins from the purE family of biosynthetic enzymes. The average normalized distance between our plans and the lower bound for optimal plans is around 1 percent. Our plans dominate most 60-90% on average for each parent set) of the plans found by other possible approaches, random sampling or explicit optimization for stability with implicit optimization for diversity. The identified breakpoint sets provide a compact representation of good plans, enabling a protein engineer to understand and account for the trade-offs between two key considerations in combinatorial chimeragenesis.

    official version. preprint.

  • T.E. Williamson, B.A. Craig, E. Kondrashkina, C. Bailey-Kellogg, and A.M. Friedman, "Analysis of self-associating proteins by singular value decomposition of solution scattering data", Biophys. J., in press.

    We describe a method by which a single experiment can reveal both association model (pathway and constants) and low-resolution structures of a self-associating system. Small angle scattering data are collected from solutions at a range of concentrations. These scattering curves are mass-weighted linear combinations of the scattering from each oligomer. Singular value decomposition of the data yields a set of basis vectors, from which the scattering curve for each oligomer is reconstructed using coefficients that depend on the association model. A search identifies the association pathway and constants that provide the best agreement between reconstructed and observed data. Using simulated data with realistic noise, our method finds the correct pathway and association constants. Depending on the simulation parameters reconstructed curves for each oligomer differ from the ideal by 0.05-0.99% in median absolute relative deviation. The reconstructed scattering curves are fundamental to further analysis, including interatomic distance distribution calculation and low-resolution ab initio shape reconstruction of each oligomer in solution. This method can be applied to x-ray or neutron scattering data from small angles to modest (or higher) resolution. Data can be taken under physiological conditions or particular conditions (e.g. temperature) can be varied to extract fundamental association parameters (ΔHass, ΔSass).

  • L.V. Avramova, J. Desai, S. Weaver, A.M. Friedman, and C. Bailey-Kellogg, "Robotic hierarchical mixing for the production of combinatorial libraries of proteins and small molecules", J. Comb. Chem., 10:63-68, 2008.

    We present a method to automatically plan a robotic process to mix individual combinations of reactants in individual reaction vessels (vials or wells in a multi-well plate), mixing any number of reactants in any desired stoichiometry and ordering the mixing steps according to an arbitrarily complex tree-like assembly protocol. This process enables the combinatorial generation of complete or partial product libraries in individual reaction vessels from intermediates formed in the presence of different sets of reactants. It can produce either libraries of chimeric genes constructed by ligation of fragments from different parent genes, or libraries of chemical compounds constructed by convergent synthesis. Given concentrations of the input reactants and desired amounts or volumes of the products, our algorithm, RoboMix, computes the required reactant volumes and the resulting product concentrations, along with volumes and concentrations for all intermediate combinations. It outputs a sequence of robotic liquid transfer steps that ensures that each combination is correctly mixed even when individualized stoichiometries are employed and with any fractional yield for a product. It can also account for waste in robotic liquid handling and residual volume needed to ensure accurate aspiration. We demonstrate the effectiveness of the method in a test mixing dyes with different UV-Vis absorption spectra, verifying the desired combinations spectroscopically.

    paper.

  • F. Xiong and C. Bailey-Kellogg, "A hierarchical grow-and-match algorithm for backbone resonance assignments given 3D structure", Proc. IEEE BIBE, 2007, pp. 403-410.

    This paper develops an algorithm for NMR backbone resonance assignment given a 3D structure and a set of relatively sparse 15N-edited NMR data, with the through-space 15N-edited NOESY as the primary source of information. Our approach supports high-throughput solution studies of dynamics and interactions (e.g., ligand binding), when the structure has previously been determined by crystallography or modeled computationally. We employ a graph matching approach, identifying correspondence between a given contact graph and a corrupted version representing the NMR data. Our hierarchical grow-and-match algorithm decomposes the contact graph into sequential fragments with relatively dense interactions, and then combines possible assignments for the fragments, searching over the combinations with effective but conservative pruning. Our algorithm is complete, guaranteed to identify all solutions consistent with the data within a likelihood threshold of the optimal solution. It also deals correctly and uniformly with missing edges, which are quite common under this formulation. Tests on a number of experimental datasets and simulations with varying noise and sparsity demonstrate that our algorithm can handle significant data corruption (2.5-6.0 noisy edges per correct one) and sparsity (10-40% of the correct edges missing). In addition to the reference solution, the complete ensembles include a number (up to 30) of alternatives. We use these complete ensembles to characterize confidence in parts of an assignment.

    preprint.

  • W. Zheng, X. Ye, A.M. Friedman, and C. Bailey-Kellogg, "Algorithms for selecting breakpoint locations to optimize diversity in protein engineering by site-directed protein recombination", Proc. CSB, 2007, pp. 31-40.

    Protein engineering by site-directed recombination seeks to develop proteins with new or improved function, by accumulating multiple mutations from a set of homologous parent proteins. A library of hybrid proteins is created by recombining the parent proteins at specified breakpoint locations; subsequent screening/selection identifies hybrids with desirable functional characteristics. In order to improve the frequency of generating novel hybrids, this paper develops the first approach to explicitly plan for diversity in site-directed recombination, including metrics for characterizing the diversity of a planned hybrid library and efficient algorithms for optimizing experiments accordingly. The goal is to choose breakpoint locations to sample sequence space as uniformly as possible (which we argue maximizes diversity), under the constraints imposed by the recombination process and the given set of parents. A dynamic programming approach selects optimal breakpoint locations in polynomial time. Application of our method to optimizing breakpoints for an example biosynthetic enzyme, purE, demonstrates the significance of diversity optimization and the effectiveness of our algorithms.

    preprint.

  • J. Thomas, N. Ramakrishnan, and C. Bailey-Kellogg, "Graphical Models of Residue Coupling in Protein Families", IEEE/ACM Transactions on Computational Biology and Bioinformatics, in press.

    Many statistical measures and algorithmic techniques have been proposed for studying residue coupling in protein families. Generally speaking, two residue positions are considered coupled if, in the sequence record, some of their amino acid type combinations are significantly more common than others. While the proposed approaches have proven useful in finding and describing coupling, a significant missing component is a formal probabilistic model that explicates and compactly represents the coupling, integrates information about sequence, structure, and function, and supports inferential procedures for analysis, diagnosis, and prediction.

    We present an approach to learning and using probabilistic graphical models of residue coupling. These models capture significant conservation and coupling constraints observable in a multiply-aligned set of sequences. Our approach can place a structural prior on considered couplings, so that all identified relationships have direct mechanistic explanations. It can also incorporate information about functional classes, and thereby learn a differential graphical model that distinguishes constraints common to all classes from those unique to individual classes. Such differential models separately account for class-specific conservation and family-wide coupling, two different sources of sequence covariation. They are then able to perform interpretable functional classification of new sequences, explaining classification decisions in terms of the underlying conservation and coupling constraints. We apply our approach in studies of both G protein-coupled receptors and PDZ domains, identifying and analyzing family-wide and class-specific constraints, and performing functional classification. The results demonstrate that graphical models of residue coupling provide a powerful tool for uncovering, representing, and utilizing significant sequence-structure-function relationships in protein families.

    preprint.

  • X. Ye, A.M. Friedman, and C. Bailey-Kellogg, "Hypergraph model of multi-residue interactions in proteins: sequentially-constrained partitioning algorithms for optimization of site-directed protein recombination", J. Computational Biology, 2007, 14:777-790.

    Relationships among amino acids determine stability and function and are also constrained by evolutionary history. We develop a probabilistic hypergraph model of residue relationships that generalizes traditional pairwise contact potentials to account for the statistics of multi-residue interactions. Using this model, we detected non-random associations in protein families and in the protein database. We also use this model in optimizing site-directed recombination experiments to preserve significant interactions and thereby increase the frequency of generating useful recombinants. We formulate the optimization as a sequentially-constrained hypergraph partitioning problem; the quality of recombinant libraries with respect to a set of breakpoints is characterized by the total perturbation to edge weights. We prove this problem to be NP-hard in general, but develop exact and heuristic polynomial-time algorithms for a number of important cases. Application to the beta-lactamase family demonstrates the utility of our algorithms in planning site-directed recombination.

    official version. preprint.

  • S. Potluri, A.K. Yan, B.R. Donald, and C. Bailey-Kellogg, "A complete algorithm to resolve ambiguity for inter-subunit NOE assignment in structure determination of symmetric homo-oligomers", Protein Science, 2007, 16:69-81.

    Assignment of nuclear Overhauser effect (NOE) data is a key bottleneck in structure determination by nuclear magnetic resonance spectroscopy. NOE assignment resolves the ambiguity as to which pairs of protons generated the observed NOE peaks, and thus should be restrained in structure determination. In the case of inter-subunit NOEs in symmetric homo-oligomers, the ambiguity includes both the identities of the protons within a subunit, and the identities of the subunits to which they belong. This paper develops an algorithm to resolve ambiguity in inter-subunit NOE assignment and determine symmetric homo-oligomeric structures. Our algorithm is complete, in that it identifies structures representing, to within a user-defined similarity level, every structure consistent with the available data (ambiguous or not). However, while our approach is complete, it avoids explicit enumeration of the exponential number of combinations of possible assignments.

    In order to be complete yet efficient, we employ a configuration space framework. This allows us to represent the set of all possible structures in terms of symmetry axis parameters and all possible assignments of the NOEs in terms of atom and subunit identities. Our algorithm considers NOEs sequentially, i.e., one after the other, pruning regions of the symmetry axis configuration space and assignments that are mutually inconsistent. Pruning occurs only due to provable inconsistency and thereby avoids the pitfall of local minima that could arise from best-first sampling-based approaches. Ultimately, we return a mutually-consistent set of conformations and NOE assignments. Our algorithm can draw two types of conclusions not possible under existing methods: (a) that different assignments for an NOE would lead to different structural classes, or (b) that it is not necessary to assign an NOE since the choice of assignment would have little impact on structural precision. We demonstrate the effectiveness of our algorithm on two test cases: the homo-dimeric topological specificity domain of E. coli MinE and the homo-trimeric coiled-coil domain of chicken cartilage matrix protein (CCMP). Our method reduces the average number of possible assignments per NOE by a factor of 2.6 for MinE and 4.2 for CCMP. It results in high structural precision, reducing the average variance in atomic positions by a factor of 1.5 for MinE and 3.6 for CCMP.

    official version. preprint.

  • S. Potluri, A.K. Yan, J.J. Chou, B.R. Donald, and C. Bailey-Kellogg, "Structure determination of symmetric homo-oligomers by a complete search of symmetry configuration space using NMR restraints and van der Waals packing", Proteins, 2006, 65:203-219.

    Structural studies of symmetric homo-oligomers provide mechanistic insights into their roles in essential biological processes including cell signaling and cellular regulation. This paper presents a novel algorithm for homo-oligomeric structure determination, given the subunit structure, that is both complete, in that it evaluates all possible conformations and data-driven, in that it evaluates conformations separately for consistency with experimental data and for quality of packing. Completeness ensures that the algorithm does not miss the native conformation, and being data-driven enables it to assess the structural precision possible from data alone. Our algorithm performs a branch-and-bound search in the symmetry configuration space, the space of symmetry axis parameters (positions and orientations) defining all possible C_n homo-oligomeric complexes for a given subunit structure. It eliminates those symmetry axes inconsistent with inter-subunit nuclear Overhauser effect (NOE) distance restraints and then identifies conformations representing any consistent, well-packed structure to within a user-defined similarity level.

    For the human phospholamban pentamer in dodecylphosphocholine micelles, using the structure of one subunit determined from a subset of the experimental NMR data, our algorithm identifies a diverse set of complex structures consistent with the nine inter-subunit NOE restraints. The distribution of determined structures provides an objective characterization of structural uncertainty: backbone RMSD to the previously determined structure ranges from 1.07 Angstroms to 8.85 Angstroms, and variance in backbone atomic coordinates is an average of 12.32 Angstroms^2. Incorporating vdW packing reduces structural diversity to a maximum backbone RMSD of 6.24 Angstroms and an average backbone variance of 6.80 Angstroms^2. By comparing data consistency and packing quality under different assumptions of oligomeric number, our algorithm identifies the pentamer as the most likely oligomeric state of phospholamban, demonstrating that it is possible to determine the oligomeric number directly from NMR data. Additional tests on a number of homo-oligomers, from dimer to heptamer, similarly demonstrate the power of our method to provide unbiased determination and evaluation of homo-oligomeric complex structures.

    official version. preprint.

  • O. Vitek, C. Bailey-Kellogg, B. Craig, and J. Vitek, "Inferential backbone assignment for sparse data", J. Biomolecular NMR, 2006, 35:187-208.

    This paper develops an approach to protein backbone NMR assignment that effectively assigns large proteins while using limited sets of triple-resonance experiments. Our approach handles proteins with large fractions of missing data and many ambiguous pairs of pseudoresidues, and provides a statistical assessment of confidence in global and position-specific assignments. The approach is tested on an extensive set of experimental and synthetic data of up to 723 residues, with match tolerances of up to 0.5 ppm for CA and CB resonance types. The tests show that that the approach is particularly helpful when data contain experimental noise and require large match tolerances. The keys to the approach are an empirical Bayesian probability model that rigorously accounts for uncertainty in the data at all stages in the analysis, and a hybrid stochastic tree-based search algorithm that effectively explores the large space of possible assignments.

    official version.

  • L. Saftalov, P.A. Smith, A.M. Friedman, and C. Bailey-Kellogg, "Site-Directed Combinatorial Construction of Chimaeric Genes: General Method for Optimizing Assembly of Gene Fragments", Proteins, in press.

    Site-directed construction of chimaeric genes by in vitro recombination "mixes-and-matches" precise building blocks from multiple parent proteins, generating libraries of hybrids to be tested for structure-function relationships and/or screened for favorable properties and novel enzymatic activities. A direct annealing and ligation method can construct chimaeric genes without requiring sequence identity between parents, except for the short (approximately 3 nt) sequences of the fragment overhangs used for specific ligation. Careful planning of the assembly process is necessary, though, in order to ensure effective construction of desired fragment assemblies and to avoid undesired assemblies (e.g., repetition of fragments, fragments out of order).

    We develop algorithms for specific planned ligation (SPLISO) that efficiently explore possible assembly plans, varying the fragment overhangs and the order of ligation steps in the assembly pathway. While there is a combinatorial explosion in the number of possible assembly plans as the number of breakpoints and parent genes increases, we employ a dynamic programming approach to find globally optimal ones in low-order polynomial time (in practice, taking only seconds for basic assembly plans). We demonstrate the effectiveness of our algorithms in planning the assembly of hybrid libraries, under a variety of experimental options and restrictions, including flexibility in the position and amino acid sequence of breakpoints. Our method promises to enable more effective application of site-directed recombination to protein investigation and engineering.

    official version. preprint.

  • S. Potluri, A.K. Yan, J.J. Chou, B.R. Donald, and C. Bailey-Kellogg, "Extended Abstract: Structure determination of symmetric protein complexes by a complete search of symmetry configuration space using NMR distance restraints", Proc. WAFR, 2006.

    Structural studies of symmetric homo-oligomers (protein complexes with similar subunits arranged symmetrically) provide mechanistic insights into their roles in essential biological processes. We have developed a novel algorithm for homo-oligomeric structure determination that is both complete, in that it evaluates all possible conformations and data-driven, in that it evaluates conformations separately for consistency with experimental data and for quality of packing. Our algorithm determines homo-oligomeric structures by performing a branch-and-bound search in the symmetry configuration space, the space of symmetry axis parameters (positions and orientations) defining all possible C_n homo-oligomeric complexes for a fixed subunit structure. It eliminates those symmetry axes inconsistent with experimental inter-subunit distance restraints, and then identifies conformations representing any consistent, well-packed structure to within a user-defined similarity level. Our tests demonstrate the power of our method to provide an unbiased determination and evaluation of homo-oligomeric complex structures.

    preprint.

  • X. Ye, A.M. Friedman, and C. Bailey-Kellogg, "Hypergraph model of multi-residue interactions in proteins: sequentially-constrained partitioning algorithms for optimization of site-directed protein recombination", Proc. RECOMB, 2006, 15-29.

    Relationships among amino acids determine stability and function and are also constrained by evolutionary history. We develop a probabilistic hypergraph model of residue relationships that generalizes traditional pairwise contact potentials to account for the statistics of multi-residue interactions. Using this model, we detected non-random associations in protein families and in the protein database. We also use this model in optimizing site-directed recombination experiments to preserve significant interactions and thereby increase the frequency of generating useful recombinants. We formulate the optimization as a sequentially-constrained hypergraph partitioning problem; the quality of recombinant libraries wrt a set of breakpoints is characterized by the total perturbation to edge weights. We prove this problem to be NP-hard in general, but develop exact and heuristic polynomial-time algorithms for a number of important cases. Application to the beta-lactamase family demonstrates the utility of our algorithms in planning site-directed recombination.

    official version.

  • Z. Yi, O. Vitek, M. A. Qasim, S.M. Lu, W. Lu, M. Ranjbar, J. Li, M.C. Laskowski, C. Bailey-Kellogg, and M. Laskowski, Jr., "Functional evolution within a protein superfamily", Proteins, 2006, 63:697-708.

    The ability to predict and characterize distributions of reactivities over families and even superfamilies of proteins opens the door to an array of analyses regarding functional evolution. In this paper, insights into functional evolution in the Kazal inhibitor superfamily are gained by analyzing and comparing predicted association free energy distributions against six serine proteinases, over a number of groups of inhibitors: all possible Kazal inhibitors, natural avian ovomucoid first and third domains, and sets of Kazal inhibitors with statistically-weighted combinations of residues. The results indicate that, despite the great hypervariability of residues in the ten proteinase-binding positions, avian ovomucoid third domains evolved to inhibit enzymes similar to the six enzymes selected, while the orthologous first domains are not inhibitors of these enzymes on purpose. Hypervariability arises due to similarity in energetic contribution from multiple residue types; conservation is in terms of functionality, with "good" residues, which make positive or less deleterious contributions to the binding, selected more frequently and yielding overall the same distributional characteristics. Further analysis of the distributions indicates that while nature did optimize inhibitor strength, the objective may not have been the strongest possible inhibitor against one enzyme but rather an inhibitor that is relatively strong against a number of enzymes.

    preprint. official version.

  • H. Kamisetty, C. Bailey-Kellogg, and G. Pandurangan, "An efficient randomized algorithm for contact-based NMR backbone resonance assignment", Bioinformatics, 2006, 22:172-80.

    Motivation: Backbone resonance assignment is a critical bottleneck in studies of protein structure, dynamics, and interactions by nuclear magnetic resonance (NMR) spectroscopy. A minimalist approach to assignment, which we call "contact-based," seeks to dramatically reduce experimental time and expense by replacing the standard suite of through-bond experiments with the through-space (NOESY) experiment. In the contact-based approach, spectral data are represented in a graph with vertices for putative residues (of unknown relation to the primary sequence) and edges for hypothesized NOESY interactions, such that observed spectral peaks could be explained if the residues were "close enough." Due to experimental ambiguity, several incorrect edges can be hypothesized for each spectral peak. An assignment is derived by identifying consistent patterns of edges (e.g., for alpha-helices and beta-sheets) within a graph, and mapping the vertices to the primary sequence. The key algorithmic challenge is to be able to uncover these patterns even when they are obscured by significant noise.

    Results: This paper develops, analyzes, and applies a novel algorithm for the identification of polytopes representing consistent patterns of edges in a corrupted NOESY graph. Our randomized algorithm aggregates simplices into polytopes and fixes inconsistencies with simple local modifications, called rotations, that maintain most of the structure already uncovered. In characterizing the effects of experimental noise, we employ an NMR-specific random graph model in proving that our algorithm gives optimal performance in expected polynomial time, even when the input graph is significantly corrupted. We confirm this analysis in simulation studies with graphs corrupted by up to 500 percent noise. Finally, we demonstrate the practical application of the algorithm on several experimental beta-sheet data sets. Our approach is able to eliminate a large majority of noise edges and uncover large consistent sets of interactions.

    paper.

  • J. Thomas, N. Ramakrishnan, and C. Bailey-Kellogg, "Graphical models of residue coupling in protein families", Proc. BioKDD, 2005.

    Identifying residue coupling relationships within a protein family can provide important insights into the family's evolutionary record, and has significant applications in analyzing and optimizing sequence-structure-function relationships. We present the first algorithm to infer an undirected graphical model representing residue coupling in protein families. Such a model serves as a compact description of the joint amino acid distribution, focused on the independences among residues. This stands in contrast to current methods, which manipulate dense representations of co-variation and are focused on assessing dependence, which can conflate direct and indirect relationships. Our graphical model provides a sound basis for predictive (will this newly designed protein be folded and functional?), diagnostic (why is this protein not stable or functional?), and abductive reasoning (what if I attempt to graft features of one protein family onto another?). Further, our algorithm can readily incorporate, as priors, hypotheses regarding possible underlying mechanistic/energetic explanations for coupling. The resulting approach constitutes a powerful and discriminatory mechanism to identify residue coupling from protein sequences and structures. Analysis results on the G-protein coupled receptor (GPCR) and PDZ domain families demonstrate the ability of our approach to effectively uncover and exploit models of residue coupling.

    pdf (Copyright 2005, ACM).

  • O. Vitek, C. Bailey-Kellogg, B. Craig, P. Kuliniewicz, and J. Vitek, "Reconsidering complete search algorithms for protein backbone NMR assignment," Proc. ECCB, published in Bioinformatics, 2005, 21:ii230-236.

    Motivation: Nuclear magnetic resonance (NMR) spectroscopy is widely used to determine and analyze protein structures. An essential step in NMR studies is determining the backbone resonance assignment, which maps individual atoms to experimentally measured resonance frequencies. Performing assignment is challenging due to the noise and ambiguity in NMR spectra. While automated procedures have been investigated, by-and-large they are still struggling to gain acceptance due to inherent limits in scalability and/or unacceptable levels of assignment error.

    To have confidence in the results, an algorithm should be complete, i.e., able to identify all solutions consistent with the data, including all arbitrary configurations of extra and missing peaks. The ensuing combinatorial explosion in the space of possible assignments has led to the perception that complete search is hopelessly inefficient and cannot scale to realistic data sets.

    Results: This paper presents a complete branch-contract-and-bound search algorithm for backbone resonance assignment. The algorithm controls the search space by hierarchically agglomerating partial assignments and employing statistically sound pruning criteria. It considers all solutions consistent with the data, and uniformly treats all combinations of extra and missing data.

    We demonstrate our approach on experimental data from 5 proteins ranging in size from 70 to 154 residues. The algorithm assigns over 95% of the positions with over 98% accuracy. We also present results on simulated data from 259 proteins from the RefDB database, ranging in size from 25 to 257 residues. The median computation time for these cases is 1 minute, and the assignment accuracy is over 99%.

    These results demonstrate that complete search not only has the advantage of guaranteeing fair treatment of all feasible solutions, but is efficient enough to be employed effectively in practice.

    pdf.

  • N. Ramakrishnan, C. Bailey-Kellogg, S. Tadepalli, and V.N. Pandey, "Gaussian Processes for Active Data Mining of Spatial Aggregates", Proc. SIAM Data Mining Conference, 2005.

    Active data mining is becoming prevalent in applications requiring focused sampling of data relevant to a high-level mining objective. It is especially pertinent in scientific and engineering applications where we seek to characterize a configuration space or design space in terms of spatial aggregates, and where data collection can become costly. Examples abound in domains such as aircraft design, wireless system simulation, fluid dynamics, and sensor networks. This paper develops an active mining mechanism, using Gaussian processes, for uncovering spatial aggregates from only a sparse set of targeted samples. Gaussian processes provide a unifying framework for building surrogate models from sparse data, reasoning about the uncertainty of estimation at unsampled points, and formulating objective criteria for closing-the-loop between data collection and data mining. Our mechanism optimizes sample selection using entropy-based functionals defined over spatial aggregates instead of the traditional approach of sampling to minimize estimated variance. We apply this mechanism on a global optimization benchmark comprising a testbank of 2D functions, as well as on data from wireless system simulations. The results reveal that the proposed sampling strategy makes more judicious use of data points by selecting locations that clarify high-level structures in data, rather than choosing points that merely improve quality of function approximation.

    pdf.

  • J. Li, Z.-P. Yi, M.C. Laskowski, M. Laskowski Jr., and C. Bailey-Kellogg, "Analysis of sequence-reactivity space for protein-protein interactions", Proteins, 2005, 58:661-671.

    Sequence-reactivity space is defined by the relationships between amino acid type choices at some residue positions in a protein and the reactivities of the resulting variants. We are studying Kazal superfamily serine proteinase inhibitors, under substitution of any combination of residue types at ten binding-region positions. Reactivities are defined by the standard free energy of association for an inhibitor against an enzyme, and we are interested in both the strength (the free energy value) and specificity (relative free energy values for one inhibitor against different enzymes). Characterizing the structure of such a space poses several interesting questions: (1) how many sequences achieve particular strength and specificity characteristics? (2) what is the best such sequence? (3) what are some nearly-as-good alternatives? (4) what are their common residue type characteristics (e.g. conservation and correlation)? Although these problems are all highly combinatorial in nature, this paper develops an efficient, integrated mechanism to address them under a data-driven model that predicts reactivity for given sequences. We employ sampling and a novel deterministic distribution propagation algorithm, in order to determine both the reactivity distribution and sequence composition statistics; integer programming and a novel branch-and-bound search algorithm, in order to optimize sequences and enumerate near-optimal sequences; and correlation-based sequence decomposition, in order to identify sequence motifs. We demonstrate the value of our mechanism in analyzing the Kazal superfamily sequence-reactivity space, providing insights into the underlying biochemistry and suggesting hypotheses for further experimental consideration. In general, our mechanism offers a valuable tool for investigating the available degrees of freedom in protein design within a combined computational-experimental framework.

    paper.

  • X. Ye, P.K. O'Neil, A.N. Foster, M.J. Gajda, J. Kosinski, M.A. Kurowski, A.M. Friedman, and C. Bailey-Kellogg, "Probabilistic cross-link analysis and experiment planning for high-throughput elucidation of protein structure", Protein Sci., 2004, 13:3298-3313.

    Emerging high-throughput techniques for the characterization of protein and protein complex structures yield noisy data with sparse information content, placing a significant burden on computation to properly interpret the experimental data. One such technique employs cross-linking (chemical or by cysteine oxidation) to confirm or select among proposed structural models (e.g. from fold recognition, ab initio prediction, or docking) by testing the consistency between cross-linking data and model geometry. This paper develops a probabilistic framework for analyzing the information content in cross-linking experiments, accounting for anticipated experimental error. This framework supports a mechanism for planning experiments to optimize the information gained. We evaluate potential experiment plans using explicit trade-offs among key properties of practical importance --- discriminability, coverage, balance, ambiguity, and cost. We devise a greedy algorithm that considers those properties and, from a large number of combinatorial possibilities, rapidly selects sets of experiments expected to efficiently discriminate pairs of models. In an application to residue-specific chemical cross-linking, we demonstrate the ability of our approach to effectively plan experiments involving combinations of cross-linkers and introduced mutations. We also describe an experiment plan for the bacteriophage lambda Tfa chaperone protein in which we plan dicysteine mutants for discriminating threading models by disulfide formation. Preliminary results from a subset of the planned experiments are consistent and demonstrate the practicality of planning. Our methods provide the experimenter with a valuable tool (available from the authors) for understanding and optimizing cross-linking experiments.

    paper.

  • A. Kuzminykh and C. Bailey-Kellogg, "Polyhedral approximations of the sphere with unusual geometric and topological properties", Geombinatorics, 2004, 14(2):55-61.

    In the present paper we would like to draw attention to some unusual geometric and topological properties that polyhedral approximations of even so simple a surface as a 2-dimensional sphere may have. We prove the existence of infinitely many polyhedral surfaces (with vertices belonging to the sphere) such that the polytopes bounded by them are mutually disjoint and, for each epsilon > 0, one can find, among these surfaces, any number of congruent surfaces of any genus, each of which is an epsilon-approximation of the sphere and has area less than epsilon. Such approximations may be interesting not only from a geometric point of view but also as an illustration of possible geometric and topological difficulties for surface polyhedral approximation algorithms.

  • N. Ramakrishnan, C. Bailey-Kellogg, S. Tadepalli, and V.N. Pandey, "Gaussian Processes for Active Data Mining of Spatial Aggregates", Proc. 18th International Workshop on Qualitative Reasoning, Chicago, 2004.

    We present an active data mining mechanism for qualitative analysis of spatial datasets, integrating identification and analysis of structures in spatial data with targeted collection of additional samples. The mechanism is designed around the spatial aggregation language (SAL) for qualitative spatial reasoning, and seeks to uncover high-level spatial structures from only a sparse set of samples. This approach is important for applications in domains such as aircraft design, wireless system simulation, fluid dynamics, and sensor networks. The mechanism employs Gaussian processes, a formal mathematical model for reasoning about spatial data, in order to build surrogate models from sparse data, reason about the uncertainty of estimation at unsampled points, and formulate objective criteria for closing-the-loop between data collection and data analysis. It optimizes sample selection using entropy-based functionals defined over spatial aggregates instead of the traditional approach of sampling to minimize estimated variance. We apply this mechanism on a global optimization benchmark comprising a testbank of 2D functions, as well as on data from wireless system simulations. The results reveal that the proposed sampling strategy makes more judicious use of data points by selecting locations that clarify high-level structures in data, rather than choosing points that merely improve quality of function approximation.

    pdf

  • C. Bailey-Kellogg and N. Ramakrishnan, "Spatial Aggregation for Qualitative Assessment of Scientific Computations", Proc. AAAI, 2004.

    Qualitative assessment of scientific computations is an emerging application area that applies a data-driven approach to characterize, at a high level, phenomena including conditioning of matrices, sensitivity to various types of error propagation, and algorithmic convergence behavior. This paper develops a spatial aggregation approach that formalizes such analysis in terms of model selection utilizing spatial structures extracted from matrix perturbation datasets. We focus in particular on the characterization of matrix eigenstructure, both analyzing sensitivity of computations with spectral portraits and determining eigenvalue multiplicity with Jordan portraits. Our approach employs spatial reasoning to overcome noise and sparsity by detecting mutually reinforcing interpretations, and to guide subsequent data sampling. It enables quantitative evaluation of properties of a scientific computation in terms of confidence in a model, explainable in terms of the sampled data and domain knowledge about the underlying mathematical structure. Not only is our methodology more rigorous than the common approach of visual inspection, but it also is often substantially more efficient, due to well-defined stopping criteria. Results show that the mechanism efficiently samples perturbation space and successfully uncovers high-level properties of matrices.

    pdf (Copyright 2004, American Association for Artificial Intelligence (www.aaai.org)).

  • R.H. Lilien, C. Bailey-Kellogg, A.A. Anderson, and B.R. Donald, "A subgroup algorithm to identify cross-rotation peaks consistent with non-crystallographic symmetry", Acta Crystallographica D: Biological Crystallography, 2004, D60:1057-1067.

    Molecular replacement (MR) often plays a prominent role in determining initial phase angles for structure determination by X-ray crystallography. In this paper, an efficient quaternion-based algorithm is presented for analyzing peaks from a cross-rotation function to identify model orientations consistent with non-crystallographic symmetry (NCS), and to generate NCS-consistent orientations missing from the list of cross-rotation peaks. Our algorithm, CRANS, analyzes the rotation differences between each pair of cross-rotation peaks to identify finite subgroups of NCS. Sets of rotation differences satisfying the subgroup axioms correspond to orientations compatible with the correct NCS. The CRANS algorithm was first tested using cross-rotation peaks computed from structure factor data for three test systems, and then used to assist in the de novo structure determination of dihydrofolate reductase-thymidylate synthase (DHFR-TS) from Cryptosporidium hominis. In every case, the CRANS algorithm runs in seconds to identify orientations consistent with the observed NCS and to generate missing orientations not present in the cross-rotation peak list. The CRANS algorithm has application in every molecular replacement phasing effort with NCS.

    paper or local pdf (Copyright © International Union of Crystallography).

  • O. Vitek, J. Vitek, B. Craig, and C. Bailey-Kellogg, "Model-based assignment and inference of protein backbone nuclear magnetic resonances", Statistical Applications in Genetics and Molecular Biology, 2004, 3(1), article 6, 1-33.

    Nuclear Magnetic Resonance (NMR) spectroscopy is a key experimental technique used to study protein structure, dynamics, and interactions. NMR methods face the bottleneck of spectral analysis, in particular determining the resonance assignments, which help define the mapping between atoms in the protein and peaks in the spectra. A substantial amount of noise in spectral data, along with ambiguities in interpretation, make this analysis a daunting task, and there exists no generally accepted measure of uncertainty associated with the resulting solutions. This paper develops a model-based inference approach that addresses the problem of characterizing uncertainty in backbone resonance assignment. We argue that NMR spectra are subject to random variation, and ignoring this stochasticity can lead to false optimism and erroneous conclusions. We propose a Bayesian statistical model that accounts for various sources of uncertainty and provides an automatable framework for inference. While assignment has previously been viewed as a deterministic optimization problem, we demonstrate the importance of considering all solutions consistent with the data, and develop an algorithm to search this space within our statistical framework. Our approach is able to characterize the uncertainty associated with backbone resonance assignment in several ways: 1) it quantifies of uncertainty in the individually assigned resonances in terms of their posterior standard deviations; 2) it assesses the information content in the data with a posterior distribution of plausible assignments; and 3) it provides a measure of the overall plausibility of assignments. We demonstrate the value of our approach in a study of experimental data from two proteins, Human Ubiquitin and Cold-shock protein A from E. coli. In addition, we provide simulations showing the impact of experimental conditions on uncertainty in the assignments.

    paper. local pdf.

  • C. Bailey-Kellogg, S. Chainraj, and G. Pandurangan, "A Random Graph Approach to NMR Sequential Assignment", Proc. RECOMB, 2004, pp. 58-67.

    Nuclear magnetic resonance (NMR) spectroscopy allows scientists to study protein structure, dynamics, and interactions in solution. A necessary first step for such applications is determining the resonance assignment, mapping spectral data to atoms and residues in the primary sequence. Automated resonance assignment algorithms rely on information regarding connectivity (e.g. through-bond atomic interactions) and amino acid type, typically using the former to determine strings of connected residues and the latter to map those strings to positions in the primary sequence. Significant ambiguity exists in both connectivity and amino acid type, and different algorithms have combined the information in two phases (find short unambiguous strings then align) or simultaneously (align while extending strings). This paper focuses on the information content available in connectivity alone, allowing for ambiguity rather than handling only unambiguous strings, and complements existing work on the information content in amino acid type.

    In this paper, we develop a novel random-graph theoretic framework for algorithmic analysis of NMR sequential assignment. Our random graph model captures the structure of chemical shift degeneracy (a key source of connectivity ambiguity). We then give a simple and natural randomized algorithm for finding an optimum sequential cover. The algorithm naturally and efficiently reuses substrings while exploring connectivity choices; it overcomes local ambiguity by enforcing global consistency of all choices. We employ our random graph model to analyze our algorithm, and show that it can provably tolerate a relatively large ambiguity while still giving expected optimal performance in polynomial time. To study the algorithm's performance in practice, we tested it on experimental data sets from a variety of proteins and experimental set-ups. The algorithm was able to overcome significant noise and local ambiguity and consistently identify significant sequential fragments.

    pdf (Copyright 2004, ACM).

  • C. Bailey-Kellogg and F. Zhao, "Qualitative spatial reasoning: extracting and reasoning with spatial aggregates", AI Magazine, 2004, 24:47-60.

    Reasoning about spatial data is a key task in many applications, including geographic information systems, meteorological and fluid flow analysis, computer-aided design, and protein structure databases. Such applications often require the identification and manipulation of qualitative spatial representations, for example, to detect whether one "object" will soon occlude another in a digital image, or to efficiently determine relationships between a proposed road and wetland regions in a geographic data set. Qualitative spatial reasoning (QSR) provides representational primitives (a spatial "vocabulary") and inference mechanisms for these tasks. This paper first reviews representative work on QSR for data-poor scenarios, where the goal is to design representations that can answer qualitative queries without much numerical information. It then turns to the data-rich case, where the goal is to derive and manipulate qualitative spatial representations that efficiently and correctly abstract important spatial aspects of the underlying data, for use in subsequent tasks. This paper focuses on how a particular QSR system, Spatial Aggregation (SA), can help answer spatial queries for scientific and engineering data sets. A case study application of weather analysis illustrates the effective representation and reasoning supported by both data-poor and data-rich forms of QSR.

    pdf (Copyright 2004, American Association for Artificial Intelligence (www.aaai.org)).

  • S. Potluri, A.A. Khan, A. Kuzminykh, J.M. Bujnicki, A.M. Friedman, and C. Bailey-Kellogg, "Geometric analysis of cross-linkability for protein fold discrimination", Proc. Pacific Symposium on Biocomputing (PSB), 2004.

    Protein structure provides insight into the evolutionary origins, functions, and mechanisms of proteins. We are pursuing a minimalist approach to protein fold identification that characterizes possible folds in terms of consistency of their geometric features with restraints derived from relatively cheap, high-throughput experiments. One such experiment is residue-specific cross-linking analyzed by mass spectrometry. This paper presents a suite of novel lower- and upper-bounding algorithms for analyzing the distance between surface cross-link sites and thereby validating predicted models against experimental cross-linking results. Through analysis and computational experiments, using simulated and published experimental data, we demonstrate that our algorithms enable effective model discrimination.

    pdf

  • C. Bailey-Kellogg and N. Ramakrishnan, "Active data mining of correspondence for qualitative assessment of scientific computations," Proc. 17th International Workshop on Qualitative Reasoning, Brasilia, 2003, pp. 23-30.

    Active data mining constructs and evaluates possible models explaining a dataset, and reasons about the cost and impact of additional samples on refining and selecting among the models. It is particularly appropriate for applications characterized by expensive data collection, from either experiment or simulation. This paper develops an active mining mechanism based on a multi-level, qualitative analysis of correspondence. Correspondence operators presented here leverage domain knowledge to establish relationships among objects, evaluate implications for model selection, and leverage identified weaknesses to focus additional data collection. The utility of the qualitative framework is demonstrated in two scientific computing applications -- matrix spectral portrait analysis and graphical assessment of Jordan forms of matrices. Results show that the mechanism efficiently samples computational experiments and successfully uncovers high-level properties of data. The framework helps overcome noise and sparsity by leveraging domain knowledge to detect mutually reinforcing interpretations of spatial data.

    pdf

  • N. Ramakrishan and C. Bailey-Kellogg, "Gaussian process models of spatial aggregation algorithms", Proc. Int'l Joint Conf. on Artificial Intelligence (IJCAI), 2003, pp. 1045-1051.

    Multi-level spatial aggregates are important for data mining in a variety of scientific and engineering applications, from analysis of weather data (aggregating temperature and pressure data into ridges and fronts) to performance analysis of wireless systems (aggregating simulation results into configuration space regions exhibiting particular performance characteristics). In many of these applications, data collection is expensive and time consuming, so effort must be focused on gathering samples at locations that will be most important for the analysis. This requires that we be able to functionally model a data mining algorithm in order to assess the impact of potential samples on the mining of suitable spatial aggregates. This paper describes a novel Gaussian process approach to modeling multi-layer spatial aggregation algorithms, and demonstrates the ability of the resulting models to capture the essential underlying qualitative behaviors of the algorithms. By helping cast classical spatial aggregation algorithms in a rigorous quantitative framework, the Gaussian process models support diverse uses such as directed sampling, characterizing the sensitivity of a mining algorithm to particular parameters, and understanding how variations in input data fields percolate up through a spatial aggregation hierarchy.

    pdf (Copyright 2003, IJCAI (www.ijcai.org)).

  • F. Zhao, C. Bailey-Kellogg, and M. Fromherz, "Physics-based encapsulation in embedded software for distributed sensing and control applications," Proceedings of the IEEE, 2003, 91:40-63.

    Spatial Aggregation abstracts data arising from distributed embedded sensing and control applications as a set of so-called "spatio-temporal objects." Locality and continuity in the underlying physics of a problem domain give rise to spatially coherent and temporally contiguous objects in an appropriate metric space. Once parameterized by physical properties such as location, intensity (e.g., light, temperature, pressure), and motion (e.g., velocity), these objects can be aggregated and abstracted into more abstract descriptions. Applications are written as the creation and transformation of these abstract objects.

    We illustrate how these objects naturally arise from applications such as distributed sensing and actuation and use an air-jet table system to demonstrate how such a physics-based encapsulation modularizes the design of sensing and control software. Unlike in traditional software design, where objects and operations are defined mathematically and possess a semantics independent of possible implementations, the objects in distributed embedded software are defined by the physics of the application, algorithmic considerations, task requirements, as well as optimization criteria. The air-jet table example demonstrates that the grouping and abstraction of actuation devices are determined by laws of motion, the type of force allocation algorithms used, and the desired performance of the controller; this encapsulation greatly simplifies the design and implementation of a force allocation algorithm for the system and improves software modularity. Based on our practical experiences in designing several massively distributed sensing and actuation systems, we present a set of recommendations for distributed embedded software modeling and design.

    postscript

  • N. Ramakrishnan and C. Bailey-Kellogg, "Using physical properties for mining in data-scarce domains," IEEE Computing in Science & Engineering, 2002, 4:31-43.

    Data mining has traditionally been motivated by the desire to make inferences from vast repositories of data. However, many scientific domains are characterized by a scarcity of data, rather than abundance. This is especially true when data is being generated from costly computer simulations. In such data-scarce domains, it is advantageous to use a data collection and sampling strategy that focuses on only the most interesting regions, from a data mining perspective. In this paper, we describe how to design sampling strategies for data-scarce domains by exploiting knowledge of physical properties. The physical properties we concentrate on include continuity, correspondence, and locality and are especially relevant in spatial data interpretation applications. The methodology resulting from this approach is demonstrated in two diverse applications -- mining pockets in spatial data, and qualitative determination of Jordan forms of matrices.

    pdf

  • C. Bailey-Kellogg and F. Zhao, "Influence-based model decomposition," Artificial Intelligence, 2001, 130:125-166.

    Many important science and engineering applications, such as regulating the temperature distribution over a semiconductor wafer and controlling the noise from a photocopy machine, require interpreting distributed data and designing decentralized controllers for spatially distributed systems. Developing effective computational techniques for representing and reasoning about these systems, which are usually modeled with partial differential equations (PDEs), is one of the major challenge problems for qualitative and spatial reasoning research.

    This paper introduces a novel approach to decentralized control design, influence-based model decomposition, and applies it in the context of thermal regulation. Influence-based model decomposition uses a decentralized model, called an influence graph, as a key data abstraction representing influences of controls on distributed physical fields. It serves as the basis for novel algorithms for control placement and parameter design for distributed systems with large numbers of coupled variables. These algorithms exploit physical knowledge of locality, linear superposability, and continuity, encapsulated in influence graphs representing dependencies of field nodes on control nodes. The control placement design algorithms utilize influence graphs to decompose a problem domain so as to decouple the resulting regions. The decentralized control parameter optimization algorithms utilize influence graphs to efficiently evaluate thermal fields and to explicitly trade off computation, communication, and control quality. By leveraging the physical knowledge encapsulated in influence graphs, these control design algorithms are more efficient than standard techniques, and produce designs explainable in terms of problem structures.

    pdf

  • C. Bailey-Kellogg and N. Ramakrishnan, "Ambiguity-directed sampling for qualitative analysis of sparse data from spatially-distributed physical systems," Proc. Int'l Joint Conf. on Artificial Intelligence (IJCAI), 2001, pp. 43-50.

    A number of important scientific and engineering applications, such as fluid dynamics simulation and aircraft design, require analysis of spatially-distributed data from expensive experiments and complex simulations. In such data-scarce applications, it is advantageous to use models of given sparse data to identify promising regions for additional data collection. This paper presents a principled mechanism for applying domain-specific knowledge to design focused sampling strategies. In particular, our approach uses ambiguities identified in a multi-level qualitative analysis of sparse data to guide iterative data collection. Two case studies demonstrate that this approach leads to highly effective sampling decisions that are also explainable in terms of problem structures and domain knowledge.

    pdf (Copyright 2001, IJCAI (www.ijcai.org)).

  • C. Bailey-Kellogg, J.J. Kelley, III, R. Lilien, and B.R. Donald, "Physical geometric algorithms for structural molecular biology," invited paper at the Special Session on Computational Biology and Chemistry, International Conference on Robotics and Automation (ICRA), 2001.

    A wealth of interesting computational problems arises in proposed methods for discovering new pharmaceuticals. This paper surveys our recent work in three key areas, using a Physical Geometric Algorithm (PGA) approach to data interpretation, experiment planning, and drug design:

    1. Data-directed computational protocols for high-throughput protein structure determination. A key component of structure determination through nuclear magnetic resonance (NMR) is that of assigning spectral peaks. We are developing a novel approach, called Jigsaw, to automated secondary structure determination and main-chain assignment. Jigsaw consists of two main components: graph-based secondary structure pattern identification in unassigned heteronuclear (N15-labeled) NMR data, and assignment of spectral peaks by probabilistic alignment of identified secondary structure elements against the primary sequence.
    2. Experiment planning and data interpretation algorithms for reducing mass degeneracy in mass spectrometry (MS). MS offers many advantages for high-throughput assays (e.g. small sample size and large mass limits), but it faces the potential problem of mass degeneracy -- indistinguishable masses for multiple biopolymer fragments (e.g. from a limited proteolytic digest). We are studying the use of selective isotopic labeling to substantially reduce potential mass degeneracy, especially in the context of structural determination of protein-protein and protein-DNA complexes.
    3. Computer-aided drug design (CADD). We are developing new CADD tools and applying them to the design of an inhibitor for the Core-Binding Factor-Beta oncoprotein (CBF-Beta-MYH11), a fusion protein involved in some forms of Acute Myelomonocytic Leukemia (AMML). Computational-structural studies of CBF help determine the molecular basis for its function and assist in the development of therapeutic strategies. A key issue in such studies is geometric modeling of protein flexibility; our approach attempts to account for flexibility by using an ensemble of structures representing low-energy conformations as determined by solution NMR.

    Our long-range goal is the structural and functional understanding of biopolymer interactions in systems of significant biochemical as well as pharmacological interest. The research overviewed here represents a set of important steps towards that goal.

    pdf

  • C. Bailey-Kellogg, J.J. Kelley, III, C. Stein, and B.R. Donald, "Reducing Mass Degeneracy in SAR by MS by Stable Isotopic Labeling," J. Computational Biology, 2001, 8:19-36.

    Mass spectrometry (MS) promises to be an invaluable tool for functional genomics, by supporting low-cost, high-throughput experiments. However, large-scale MS faces the potential problem of mass degeneracy -- indistinguishable masses for multiple biopolymer fragments (e.g. from a limited proteolytic digest). This paper studies the tasks of planning and interpreting MS experiments that use selective isotopic labeling, thereby substantially reducing potential mass degeneracy. Our algorithms support an experimental-computational protocol called Structure-Activity Relation by Mass Spectrometry (SAR by MS), for elucidating the function of protein-DNA and protein-protein complexes. SAR by MS enzymatically cleaves a crosslinked complex and analyzes the resulting mass spectrum for mass peaks of hypothesized fragments. Depending on binding mode, some cleavage sites will be shielded; the absence of anticipated peaks implicates corresponding fragments as either part of the interaction region or inaccessible due to conformational change upon binding. Thus different mass spectra provide evidence for different structure-activity relations. We address combinatorial and algorithmic questions in the areas of data analysis (constraining binding mode based on mass signature) and experiment planning (determining an isotopic labeling strategy to reduce mass degeneracy and aid data analysis). We explore the computational complexity of these problems, obtaining upper and lower bounds. We report experimental results from implementations of our algorithms.

    pdf

  • C. Bailey-Kellogg, A. Widge, M.J. Berardi, J.H. Bushweller, and B.R. Donald, "The NOESY Jigsaw: automated protein secondary structure and main-chain assignment from sparse, unassigned NMR data," J. Computational Biology, 2000, 7:537-558.

    High-throughput, data-directed computational protocols for Structural Genomics (or Proteomics) are required in order to evaluate the protein products of genes for structure and function at rates comparable to current gene-sequencing technology. This paper presents the Jigsaw algorithm, a novel high-throughput, automated approach to protein structure characterization with nuclear magnetic resonance (NMR). Jigsaw applies graph algorithms and probabilistic reasoning techniques, enforcing first-principles consistency rules in order to overcome a 5-10% signal-to-noise ratio. It consists of two main components: (1) graph-based secondary structure pattern identification in unassigned heteronuclear NMR data, and (2) assignment of spectral peaks by probabilistic alignment of identified secondary structure elements against the primary sequence. Deferring assignment eliminates the bottleneck faced by traditional approaches, which begin by correlating peaks among dozens of experiments. Jigsaw utilizes only four experiments, none of which requires 13C-labeled protein, thus dramatically reducing both the amount and expense of wet lab molecular biology and the total spectrometer time. Results for three test proteins demonstrate that Jigsaw correctly identifies 79-100% of alpha-helical and 46-65% of beta-sheet NOE connectivities, and correctly aligns 33-100% of secondary structure elements. Jigsaw is very fast, running in minutes on a Pentium-class Linux workstation. This approach yields quick and reasonably accurate (as opposed to the traditional slow and extremely accurate) structure calculations. It could be useful for quick structural assays to speed data to the biologist early in an investigation, and could in principle be applied in an automation-like fashion to a large fraction of the proteome.

    pdf

  • C. Bailey-Kellogg, J.J. Kelley, III, C. Stein, and B.R. Donald, "Reducing Mass Degeneracy in SAR by MS by Stable Isotopic Labeling," Proc. ISMB, 2000, pp. 13-24.

    Mass spectrometry (MS) promises to be an invaluable tool for functional genomics, by supporting low-cost, high-throughput experiments. However, large-scale MS faces the potential problem of mass degeneracy -- indistinguishable masses for multiple biopolymer fragments (e.g. from a limited proteolytic digest). This paper studies the tasks of planning and interpreting MS experiments that use selective isotopic labeling, thereby substantially reducing potential mass degeneracy. Our algorithms support an experimental-computational protocol called Structure-Activity Relation by Mass Spectrometry (SAR by MS), for elucidating the function of protein-DNA and protein-protein complexes. SAR by MS enzymatically cleaves a crosslinked complex and analyzes the resulting mass spectrum for mass peaks of hypothesized fragments. Depending on binding mode, some cleavage sites will be shielded; the absence of anticipated peaks implicates corresponding fragments as either part of the interaction region or inaccessible due to conformational change upon binding. Thus different mass spectra provide evidence for different structure-activity relations. We address combinatorial and algorithmic questions in the areas of data analysis (constraining binding mode based on mass signature) and experiment planning (determining an isotopic labeling strategy to reduce mass degeneracy and aid data analysis). We explore the computational complexity of these problems, obtaining upper and lower bounds. We report experimental results from implementations of our algorithms.

    postscript

  • C. Bailey-Kellogg, A. Widge, M.J. Berardi, J.H. Bushweller, and B.R. Donald, "The NOESY Jigsaw: automated protein secondary structure and main-chain assignment from sparse, unassigned NMR data," Proc. RECOMB, 2000, pp. 33-44.

    High-throughput, data-directed computational protocols for Structural Genomics (or Proteomics) are required in order to evaluate the protein products of genes for structure and function at rates comparable to current gene-sequencing technology. This paper presents the Jigsaw algorithm, a novel high-throughput, automated approach to protein structure characterization with nuclear magnetic resonance (NMR). Jigsaw applies graph algorithms and probabilistic reasoning techniques, enforcing first-principles consistency rules in order to overcome a 5-10% signal-to-noise ratio. It consists of two main components: (1) graph-based secondary structure pattern identification in unassigned heteronuclear NMR data, and (2) assignment of spectral peaks by probabilistic alignment of identified secondary structure elements against the primary sequence. Deferring assignment eliminates the bottleneck faced by traditional approaches, which begin by correlating peaks among dozens of experiments. Jigsaw utilizes only four experiments, none of which requires cthir-labeled protein, thus dramatically reducing both the amount and expense of wet lab molecular biology and the total spectrometer time. Results for three test proteins demonstrate that Jigsaw correctly identifies 79-100% of ahelical and 46-65% of bsheet NOE connectivities, and correctly aligns 33-100% of secondary structure elements. Jigsaw is very fast, running in minutes on a Pentium-class Linux workstation. This approach yields quick and reasonably accurate (as opposed to the traditional slow and extremely accurate) structure calculations. It could be useful for quick structural assays to speed data to the biologist early in an investigation, and could in principle be applied in an automation-like fashion to a large fraction of the proteome.

    postscript

  • C. Bailey-Kellogg and F. Zhao, "Influence-based model decomposition," Proc. AAAI, 1999, pp. 402-409.

    Recent rapid advances in MEMS and information processing technology have enabled a new generation of AI robotic systems -- so-called Smart Matter systems -- that are sensor rich and physically embedded. These systems range from decentralized control systems that regulate building temperature (smart buildings) to vehicle on-board diagnostic and control systems that interrogate large amounts of sensor data. One of the core tasks in the construction and operation of these Smart Matter systems is to synthesize optimal control policies using data rich models for the systems and environment. Unfortunately, these models may contain thousands of coupled real-valued variables and are prohibitively expensive to reason about using traditional optimization techniques such as neural nets and genetic algorithms. This paper introduces a general mechanism for automatically decomposing a large model into smaller subparts so that these subparts can be separately optimized and then combined. The mechanism decomposes a model using an influence graph that records the coupling strengths among constituents of the model. This paper demonstrates the mechanism in an application of decentralized optimization for a temperature regulation problem. Performance data has shown that the approach is much more efficient than the standard discrete optimization algorithms and achieves comparable accuracy.

    postscript (Copyright 1999, American Association for Artificial Intelligence (www.aaai.org)).

  • C. Bailey-Kellogg, "The spatial aggregation language for modeling and controlling distributed physical systems," PhD thesis, Ohio State University, 1999.

    Many important science and engineering applications, such as predicting weather patterns, controlling the temperature distribution over a semiconductor wafer, and controlling the noise of a photocopy machine, require interpreting data and designing decentralized controllers for spatially distributed systems. This thesis describes the Spatial Aggregation Language (SAL), a novel programming language and environment supporting data interpretation and control tasks for distributed physical systems. SAL provides a set of powerful, high-level components that make explicit use of domain-specific physical knowledge, such as metrics, adjacency relations, and equivalence predicates, in order to uncover and exploit structures in distributed physical data at multiple levels of abstraction. The language data types and operators manipulate structured representations of spatial objects in distributed physical systems at multiple levels of abstraction. The programming environment supports rapid prototyping of application programs and interactive manipulation of the resulting structures. In comparison with existing tools, the Spatial Aggregation Language offers high level programming abstractions explicitly encoding physical knowledge; this approach supports a variety of inference, explanation, tutoring, and design tasks.

    This thesis presents as a case study novel approaches to decentralized control design, in the context of thermal regulation. This case study develops novel algorithms for control placement and parameter design for systems with large numbers of coupled variables. These algorithms exploit physical knowledge of locality, linear superposability, and continuity, encapsulated in influence graphs representing dependencies of field nodes on control nodes. The control placement design algorithms utilize influence graphs to decompose a problem domain so as to decouple the resulting regions. The decentralized control parameter optimization algorithms utilize influence graphs to efficiently evaluate thermal fields and to explicitly trade off computation, communication, and control quality. By leveraging the physical knowledge encapsulated in influence graphs, these control design algorithms are more efficient than standard techniques, and produce designs explainable in terms of problem structures. This case study demonstrates the utility of the Spatial Aggregation Language operators in supporting the programming of these computations in a vocabulary natural for the domain.

    thesis document

  • C. Bailey-Kellogg and F. Zhao, "The SAL Interpreter for Large-Scale Optimization in Distributed Control Systems," Proc. IEEE Symposium on Computer-Aided Control Systems Design, 1999.

    Many sensor-rich control systems interact with spatially distributed physical environments. This paper describes the Spatial Aggregation Language (SAL) interpreter, a programming environment to support data interpretation and control tasks for distributed physical systems. SAL provides a set of powerful, abstract components to represent and exploit spatial structures in distributed physical data at multiple levels of abstraction. The programming environment supports rapid prototyping of application programs and interactive manipulation of spatial structures. In comparison with existing tools, the SAL environment is centered on geometric and topological representations that encode and utilize domain-specific knowledge, such as metrics, adjacency relations, and equivalence predicates. We illustrate the use of SAL in decentralized control design for thermal regulation.

    postscript

  • F. Zhao and C. Bailey-Kellogg, "Intelligent simulation." AAAI Tutorial Forum, 1998.

    Intelligent simulation is a new problem-solving paradigm for data interpretation and control tasks in science and engineering. Because of rapid advances in information processing and microelectronics, many practical applications require real-time interpretation of information in order to effectively interact with the environment. The information is often in a data-rich form such as images, videos, or spatially distributed measurements of physical processes. For example, a network of computational agents embedded in a "smart building" must stitch together local measurements in order to ensure occupant comfort while minimizing energy consumption.

    This tutorial introduces a body of computational theories, techniques, and languages collectively called intelligent simulation. We will develop imagistic reasoning techniques for finding structures in large scientific and engineering data sets and the spatial aggregation (SA) language for rapid prototyping of imagistic problem solvers. SA draws upon the experience gained in developing applications in a number of challenging domains such as data analysis and visualization (KAM), control (MAPS), and mechanical design (HIPAIR); it incorporates techniques from computer vision, qualitative reasoning, scientific computing, and computational geometry. We will demonstrate how new applications can be prototyped with the SA language, using case studies including weather data interpretation, fluid simulation, and nonlinear maglev control design. No previous knowledge of intelligent simulation is required.

    slides handout (2 per page)

  • F. Zhao, C. Bailey-Kellogg, X. Huang, and I. Ordonez, "Intelligent simulation for mining large scientific data sets," New Generation Computing, 1999, 17:333-347.

    This paper describes problems, challenges, and opportunities for intelligent simulation of physical systems. Prototype intelligent simulation tools have been constructed for interpreting massive data sets from physical fields and for designing engineering systems. We identify the characteristics of intelligent simulation and describe several concrete application examples. These applications, which include weather data interpretation, distributed control optimization, and spatio-temporal diffusion-reaction pattern analysis, demonstrate that intelligent simulation tools are indispensable for the rapid prototyping of application programs in many challenging scientific and engineering domains.

    postscript

  • C. Bailey-Kellogg and F. Zhao, "Qualitative analysis of distributed physical systems with applications to control synthesis," Proc. AAAI, 1998, pp. 232-239.

    Many important physical phenomena, such as temperature distribution, air flow, and acoustic waves, are described as continuous, distributed parameter fields. Analyzing and controlling these physical processes and systems are common tasks in many scientific and engineering domains. However, the challenges are multifold: distributed fields are conceptually harder to reason about than lumped parameter models; computational methods are prohibitively expensive for complex spatial domains; the underlying physics imposes severe constraints on observability and controllability.

    This paper develops an ontological abstraction and a structure-based design mechanism, in a framework collectively known as spatial aggregation (SA), for reasoning about and synthesizing distributed control schemes for physical fields. The ontological abstraction models a physical field as a hierarchy of networks of spatial objects. SA applies a small number of generic operators to a field to compute concise structural descriptions such as iso-contours, gradient trajectories, and influence graphs. The design mechanism uses these representations to find feasible control configurations. We illustrate the mechanism using a thermal control problem from industrial heat treatment and demonstrate that the active exploitation of structural knowledge in physical fields yields a significant computational advantage.

    postscript (Copyright 1998, American Association for Artificial Intelligence (www.aaai.org)).

  • C. Bailey-Kellogg and F. Zhao, "Reasoning about and optimizing distributed parameter physical systems using influence graphs," Proc. 12th International Workshop on Qualitative Reasoning, Cape Cod, 1998.

    We develop the influence graph mechanism for reasoning about and optimizing decentralized controls for distributed parameter physical systems. Distributed parameter systems, such as air flow around an airplane wing, temperature over a semiconductor wafer, and noise from a photocopy machine, are common physical phenomena. The influence graph mechanism encodes the structural dependency information in a distributed parameter system and exploits the information to (1) alleviate redundant computation and (2) reduce communication and support cooperation among local control processes. Using the mechanism, we obtained a dramatic computational speed-up in optimizing control design for a distributed temperature field.

    postscript

  • C. Bailey-Kellogg and F. Zhao, "Spatial aggregation: modeling and controlling physical fields," Proc. 11th International Workshop on Qualitative Reasoning, Italy, 1997.

    Many important physical phenomena, such as temperature distribution, air flow, and acoustic waves, are described as continuous, distributed parameter fields. Controlling and optimizing these physical processes and systems are common design tasks in many scientific and engineering domains. However, the challenges are multifold: distributed fields are conceptually harder to reason about than lumped parameter models; computational methods are prohibitively expensive for complex spatial domains; the underlying physics imposes severe constraints on observability and controllability.

    This paper develops an ontological abstraction and an aggregation-disaggregation mechanism, in a framework collectively known as spatial aggregation (SA), for reasoning about and synthesizing distributed control schemes for physical fields. The ontological abstraction models physical fields as networks of spatial objects. The aggregation-disaggregation mechanism employs a set of data types and generic operators to find a feasible control structure, specifying control placement and associated actions that satisfy given constraints. SA abstracts common computational patterns of control design and optimization in a small number of operators to support modular programming; it builds concise and articulable structural descriptions for physical fields. We illustrate the use of the SA ontological abstraction and operators in an example of regulating a thermal field in industrial heat treatment.

    postscript

  • C. Bailey-Kellogg, F. Zhao, and K. Yip, "Spatial aggregation: language and applications," Proc. AAAI, 1996, pp. 517-522.

    Spatial aggregation is a framework for organizing computations around image-like, analogue representations of physical processes in data interpretation and control tasks. It conceptualizes common computational structures in a class of implemented problem solvers for difficult scientific and engineering problems. It comprises a mechanism, a language, and a programming style. The spatial aggregation mechanism transforms a numerical input field to successively higher-level descriptions by applying a small, identical set of operators to each layer given a metric, neighborhood relation and equivalence relation. This paper describes the spatial aggregation language and its applications.

    The spatial aggregation language provides two abstract data types -- neighborhood graph and field -- and a set of interface operators for constructing the transformations of the field, together with a library of component implementations from which a user can mix-and-match and specialize for a particular application. The language allows users to isolate and express important computational ideas in different problem domains while hiding low-level details. We illustrate the use of the language with examples ranging from trajectory grouping in dynamics interpretation to region growing in image analysis. Programs for these different task domains can be written in a modular, concise fashion in the spatial aggregation language.

    postscript (Copyright 1996, American Association for Artificial Intelligence (www.aaai.org)).