A suffix array is a simple data structure that enables lookup of any
substring of a text and identification of repeated substrings.
It is more compact than a suffix tree and is amenable to storage
in secondary memory.
A suffix array is (notionally) a sorted list of the suffixes of
a given string. (A suffix is a substring that extends to the end of
the given string.) The sorted list is presented as an array of
integers that identify the suffixes in order.
The suffix array for the 11-letter text ABRACADABRA,
with an endmark # that collates low, is given below,
together with the corresponding suffixes.
Together, the suffix array and original string enable
binary search for any substring, e.g.
A useful auxiliary data structure is an `LCP array', an array of
lengths of the longest common prefix between each substring and
its predecessor in the suffix array.
The second column below gives the LCP array for the previous example.
The third column shows that a suffix array may
also be interpreted as an ordered list of circular shifts.
11 0 #ABRACADABRA
10 0 A#ABRACADABR
7 1 ABRA#ABRACAD
0 4 ABRACADABRA#
3 1 ACADABRA#ABR
5 1 ADABRA#ABRAC
8 0 BRA#ABRACADA
1 3 BRACADABRA#A
4 0 CADABRA#ABRA
6 0 DABRA#ABRACA
9 0 RA#ABRACADAB
2 2 RACADABRA#AB
The last column of letters,
ARD#RCAAAABB, is Wheeler's
transform, which is central to B-W data compression.
An introduction to its use may be found in the
Wikipedia article, "Burrows-Wheeler Transform".
The archive sarray.tar contains
C functions that compute suffix arrays efficiently by methods that
build on work of Manber and Myers. The functions
handle strings of integers, which may contain codes for
characters, words, or other types of data.
Updated April 6, 2010
- Header file.
- A simplification of the
Myers/Manber technique, by Peter McIlroy and Douglas McIlroy.
This is the program to study, but sarray.c is the one
for heavy lifting.
- An extensively engineered
hybrid of ssarray.c and
quicksort, usually more than five times as fast as ssarray.c,
by Sean Quinlan and Sean Dorward.
Also contains a specialized interface,
bsarray, tailored to byte-string data.
A similar program
appeared in N. Jesper Larsson's PhD thesis.
- Computes an LCP array
for a given
suffix array, by a linear-time method due to Kasai et al.
- Encodes a string into a canonical
form for input to ssarray or sarray.
- SuffixArray.java and SuffixArray.c
- Java interface to the C functions.
- Unix-style man page, troff source, also available in
gzipped PostScript, and PDF.