The problem of automatically organizing vast quantities of electronic data is critical in our digital world. The amount of information stored in a typical library or on the World Wide Web, for instance, is immense and growing at a phenomenal rate. In order to efficiently extract relevant pieces of information, users must be provided with sophisticated search mechanisms. Traditional query-based retrieval systems (such as Google and Alta Vista on the World Wide Web) allow the user to submit a query to which the system returns a collection of ``relevant'' documents. These systems often fail in the face of the vastness of the data in the system: often hundreds or thousands of documents are returned to the user, large numbers of which are irrelevant, forcing the user to manually search for the information required. Another approach is to provide the user with a pre-organized database (such as Yahoo! on the World Wide Web) through which the user can efficiently search. Unfortunately, these systems often require vast human resources to maintain, and they cannot hope to keep pace with the rate at which information is being accumulated.
To help alleviate this problem, we have developed algorithms and systems which can automatically organize vast quantities of digital data and assist users in searching through this data. Given an appropriate pairwise similarity function, our star cover algorithm can be used to automatically organize a collection of objects into tight clusters of mutually relevant objects [APR04,APR00b,APR00a,ARR00,APR98b,APR98a,APR97]. Unlike many previous algorithms which tended to be either inefficient or lacking in performance guarantees, the star cover algorithm is both efficient (linear time in the size of the input problem) and provably accurate (a lower bound on the pairwise similarity of intra-cluster objects is guaranteed in a geometric model of the information space). As a pre-processor, it can be used to automatically organize large databases of information; as a post-processor, it can be used to organize the results returned by a traditional query-based retrieval system. We further developed an on-line version of the star cover algorithm which can be used to automatically organize dynamic databases of information, such as the World Wide Web, and we developed a probabilistic model based on random graphs to show that the expected running time of this on-line algorithm is within lower order factors of the corresponding off-line algorithm [APR99,APR98b,APR00a]. Finally, we demonstrated that the star cover algorithm can be made scalable to large object collections through sampling techniques [ARR00]. The star cover algorithm has also proven useful in other domains such as document filtering [APR00b] and organizing data from functional magnetic resonance imaging studies.