K.S.M.T. Hossain, D. Patnaik, S. Laxman, P. Jain, C. Bailey-Kellogg, and N. Ramakrishnan, "Improved Multiple Sequence Alignments using Coupled Pattern Mining", Proc. ACM-BCB, 2012, to appear.

We present ARMiCoRe, a novel approach to a classical bioinformatics problem, viz. multiple sequence alignment (MSA) of gene and protein sequences. Aligning multiple biological sequences is a key step in elucidating evolutionary relationships, annotating newly sequenced segments, and understanding the relationship between biological sequences and functions. Classical MSA algorithms are designed to primarily capture conservations in sequences whereas couplings, or correlated mutations, are well known as an additional important aspect of sequence evolution. (Two sequence positions are coupled when mutations in one are accompanied by compensatory mutations in another.) As a result, better exposition of couplings is sometimes one of the reasons for hand-tweaking of MSAs by practitioners. ARMiCoRe introduces a distinctly pattern mining approach to improving MSAs: using frequent episode mining as a foundational basis, we define the notion of a coupled pattern and demonstrate how the discovery and tiling of coupled patterns using a max-flow approach can yield MSAs that are better than conservation-based alignments. Although we were motivated to improve MSAs for the sake of better exposing couplings, we demonstrate that our MSAs are also improvements in terms of traditional metrics of assessment. We demonstrate the effectiveness of ARMiCoRe on a large collection of datasets.
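
To make the distinction between conservation and coupling concrete, the following minimal sketch scores pairs of alignment columns by mutual information, a commonly used proxy for correlated mutations. It is an illustration only and is not the coupled-pattern definition or the max-flow tiling used by ARMiCoRe; the toy alignment `toy_msa` and the helper names `column` and `mutual_information` are assumptions introduced for this example.

```python
from collections import Counter
from math import log2


def column(msa, j):
    """Return the residues found at position j of every aligned sequence."""
    return [seq[j] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    High values mean that the residue at column i helps predict the residue
    at column j, which is one simple signal of coupling.
    """
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2((c * n) / (count_i[a] * count_j[b]))
    return mi


# Hypothetical toy alignment: columns 0 and 2 are fully conserved,
# while columns 1 and 3 mutate together (D with R, E with K).
toy_msa = [
    "ADKR",
    "ADKR",
    "AEKK",
    "AEKK",
]

coupled_score = mutual_information(toy_msa, 1, 3)    # co-varying columns
conserved_score = mutual_information(toy_msa, 0, 1)  # conserved vs. varying
```

In this toy case the co-varying pair (columns 1 and 3) scores 1 bit, while a pair involving a fully conserved column scores 0. A conserved column carries no coupling signal under this measure, which is why a conservation-driven alignment does not by itself expose couplings.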