Self-Organizing Information Systems

A Project of the D'Agents Laboratory at Dartmouth College

Katya Pelekhov, Daniela Rus

The amount of electronic information available to us is now greater than ever, and it keeps growing. Old techniques do not provide adequate relief from the problem of Information Overload (the feeling of opening your mail box and being greeted by 103 new, urgent!, e-mail messages, or receiving hundreds of thousands of document titles from a search engine, with that one relevant document ranked in 6th thousand). To alleviate this problem we look at the information organization: organizing electronic documents according to their topics and subtopics. This imposed organization is based on natural topics in the collection, and can be used to build a browsable hierarchy of topics. Unlike manually constructed topic structure of Yahoo!, our hierarchy is created automatically.

We can use organizing algorithm as a front-end to a search engine: classifying documents according to their topics allows user to quickly identify relevant and irrelevant topics, and concentrate on the former. Topic organization can be used to organize dynamic collections, such as your overcrowded e-mail box, or news wires. In news streams topics can be used to identify and track interesting stories. A new big story may be located automatically or in response to a user-created profile. An automatically-generated topic summary is particularly appreciated by users, and can contribute to cataloging and summarizing large and distributed collections.

We pay special attention to distributed and dynamic applications, as they are particularly relevant to information retrieval in mobile agents environment. We will look at a distributed agent-based search, and agent-based filtering and retrieval of relevant documents from a continuous news stream.

With J. Aslam we have developed a clustering algorithm suitable for organizing static and dynamic collections. This algorithm computes clusters induced by the natural topic structure of the space, and does not impose the constraint to use fixed number of clusters. As a result we can guarantee a lower bound on topic similarity, defined in terms of standard Vector Space model, between documents in each cluster. The clustering algorithm is based on covering graphs with dense subgraphs that are star-shaped. This algorithm is fast, can efficiently update clustering structure in response to addition/deletion of a document, and outperforms existing popular algorithms such as Average Link and Single Link. We developed applications that utilize this clustering algorithm in the areas of filtering, topic tracking, organizing search results, and browsing.

We put this algorithm to test in Distributed Search application. GUI allows user to type in query keywords, and select one or more collections (located on remote machines) to search. Relevant documents are retrieved from each selected collection, and presented as a ranked list of document titles. Clustering algorithm is used to organize the retrieved set of documents, and user is presented with a graphical view of resulting clusters. The graphical and text representations of retrieved results are interconnected, and allow user to view document titles in a cluster, estimate quality of clustering, locate similar clusters, and change clustering parameter to move from general to more focused topics. Search Screen

At this time we are working on the implementation of Persistent Queries. In this application, a hierarchy of clusters is maintained in the collection. User supplies a query, which triggers a process similar to regular retrieval: user is presented with the results organized by topics, and can select most relevant topic clusters. These relevant clusters are registered with the query system, and monitored as the collection is updated with new documents. If addition of a new document triggers a change in a monitored clusters, the user is notified of the update. The user can then select to retrieve new relevant documents or the entire relevant cluster.

General structure of the persistent query system

Algorithm used to select documents relevant to a query

Relevant projects, inside D'Agents:

Resource-allocation project


Relevant Dartmouth papers:

A Practical Clustering Algorithm for Static and Dynamic Information Organization: Javed Aslam, Katya Pelekhov, Daniela Rus and Fred Reiss. Algorithmica, to appear, 1999.

Autonomous and Adaptive Agents that Gather Information: Daniela Rus and Robert Gray and David Kotz. In AAAI~'96 International Workshop on Intelligent Adaptive Agents, pages 107-116, August, 1996. Proceedings available as AAAI Technical Report WS-96-04.