BIB-VERSION:: CS-TR-v2.0 ID:: ncstrl.dartmouthcs//TR94-214 ENTRY:: January 20, 1995 ORGANIZATION:: Dartmouth College, Computer Science TITLE:: A 2-3/4-Approximation Algorithm for the Shortest Superstring Problem TYPE:: Technical Report (paper) REVISION:: 1 AUTHOR:: Armen, Chris AUTHOR:: Stein, Clifford NOTE:: The 'January' in DATE is an arbitrary placeholder. DATE:: January 1994 ABSTRACT:: Given a collection of strings S={s_1,...,s_n} over an alphabet Sigma, a superstring alpha of S is a string containing each s_i as a substring, that is, for each i, 1<=i<=n, alpha contains a block of |s_i| consecutive characters that match s_i exactly. The shortest superstring problem is the problem of finding a superstring alpha of minimum length. The shortest superstring problem has applications in both computational biology and data compression. The problem is NP-hard [GallantMS80]; in fact, it was recently shown to be MAX SNP-hard [BlumJLTY91]. Given the importance of the applications, several heuristics and approximation algorithms have been proposed. Constant factor approximation algorithms have been given in [BlumJLTY91] (factor of 3), [TengY93] (factor of 2-8/9), [CzumajGPR94] (factor of 2-5/6) and [KosarajuPS94] (factor of 2-50/63). Informally, the key to any algorithm for the shortest superstring problem is to identify sets of strings with large amounts of similarity, or overlap. While the previous algorithms and their analyses have grown increasingly sophisticated, they reveal remarkably little about the structure of strings with large amounts of overlap. In this sense, they are solving a more general problem than the one at hand. In this paper, we study the structure of strings with large amounts of overlap and use our understanding to give an algorithm that finds a superstring whose length is no more than 2-3/4 times that of the optimal superstring. We prove several interesting properties about short periodic strings, allowing us to answer questions of the following form: given a string with some periodic structure, characterize all the possible periodic strings that can have a large amount of overlap with the first string. RETRIEVAL:: For a paper copy, email RETRIEVAL:: For a paper copy, write to Technical Report Librarian Department of Computer Science Dartmouth College 6211 Sudikoff Laboratory Hanover, NH 03755-3510 USA RETRIEVAL:: PDF at http://www.cs.dartmouth.edu/reports/TR94-214.pdf END::