We used transportable agents primarily for distributed information access, in which a distributed collection of corpora is searched based on a query and the results extracted from each site are fused into a coherent picture. The main advantages of using agents in distributed information access are flexibility and performance. With agents, distributed collections need to provide only primitive operations rather than all possible search operations; an agent can combine these primitives into efficient, multi-step searches. By moving a small computation to the location of the data (with transportable agents), both network traffic and overall computation time are reduced.
We built information-access agents that interface with the well-known ``Smart'' information retrieval system. The Smart system is a successful statistical information-retrieval system that uses the vector-space model to measure the textual similarity between documents. The idea of the vector-space model is that each word that occurs in a collection defines an axis in the space of all words in the collection. A document is represented as a weighted vector in this space. The premise of this system is that documents that use the same words map to neighboring points and that statistics capture content similarity. Our ``star'' algorithm organizes a document collection into clusters that are naturally induced by the topic structure of the collection, via a computationally efficient cover by dense subgraphs [APR04,APR00b,APR00a,ARR00,APR98b,APR98a,APR97]; the star algorithm is described in more detail in the following section.
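To make the vector-space premise concrete, the following is a minimal sketch rather than Smart's actual implementation. It assumes raw term-frequency weights and whitespace tokenization (Smart supports more refined weighting schemes), and the names `tf_vector` and `cosine_similarity` are hypothetical helpers introduced only for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map a document to a sparse vector: one axis per word, weight = raw count.

    Assumption: raw term frequency stands in for a real weighting scheme.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc1 = tf_vector("mobile agents search distributed collections")
doc2 = tf_vector("agents search distributed document collections")
doc3 = tf_vector("vector space retrieval model")

sim_related = cosine_similarity(doc1, doc2)    # documents share four words
sim_unrelated = cosine_similarity(doc1, doc3)  # documents share no words
```

In this toy example the two documents with overlapping vocabulary score higher than the pair with none, which is the sense in which documents that use the same words map to neighboring points.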
Our data is a distributed collection of document repositories, each running an information-retrieval system. In our prototype, each collection consists of computer science technical reports. For a given query, an information agent visits a sequence of sites; at each site, it interacts with the local Smart agent to search the local collection. The retrieved results are either brought home or used as relevance feedback to refine the query. We consider the advantages and disadvantages of mobile agents for this sort of task, and develop planning algorithms intended to minimize overhead [BGM+99].
We also conducted a series of related, but unpublished, experiments to measure the scalability and performance of persistent queries in a large document collection. Our Standing Query Server (SQS) received keyword-based queries from clients, performed an initial database search for the first 50 matching documents, clustered the resulting documents using an implementation of the ``star clustering'' algorithm mentioned above, and then stored the query. When new documents arrived in the database, the system ran each stored query over the new documents and recalculated the document clusters. If a new document joined one of the clusters selected as most relevant by the client, and was above a certain relevance threshold, the client was notified and could retrieve the document. We ran experiments to discover the effect of the relevance threshold, database size, and number of standing queries on the performance of the system, and determined that the algorithm was consuming significant memory resources to cache associations. In fact, the memory used was more than the combined size of the documents themselves, leading to performance tradeoffs between the number of new documents added and the number of standing queries that could be supported. Overall, the system's performance did not appear to be practical for the quickly growing document collections that we targeted.
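The standing-query workflow can be sketched as follows. This is a simplified illustration, not the SQS implementation: it omits the star-clustering step and the client's cluster selection entirely, and notifies on the relevance threshold alone. The class name `StandingQueryServer`, its methods, the raw term-frequency weighting, and the threshold value of 0.3 are all assumptions made for the example; only the limit of 50 initial matches comes from the description above.

```python
import math
from collections import Counter


def tf_vector(text):
    """Sparse term-frequency vector (assumed weighting, for illustration)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class StandingQueryServer:
    """Hypothetical sketch: stores queries and re-runs them on new documents."""

    def __init__(self, threshold=0.3, initial_limit=50):
        self.threshold = threshold          # assumed relevance cutoff
        self.initial_limit = initial_limit  # first 50 matches, as in the text
        self.documents = {}                 # doc_id -> vector
        self.queries = {}                   # client_id -> query vector

    def register(self, client_id, query_text):
        """Run the initial search, then store the query for later arrivals."""
        query = tf_vector(query_text)
        hits = [(cosine(query, vec), doc_id)
                for doc_id, vec in self.documents.items()]
        hits = [h for h in hits if h[0] > 0]
        hits.sort(key=lambda h: -h[0])      # best matches first
        self.queries[client_id] = query
        return [doc_id for _, doc_id in hits[:self.initial_limit]]

    def add_document(self, doc_id, text):
        """Store a new document and return the clients that should be notified."""
        vec = tf_vector(text)
        self.documents[doc_id] = vec
        return [client for client, query in self.queries.items()
                if cosine(query, vec) >= self.threshold]


server = StandingQueryServer(threshold=0.3)
server.add_document("tr-1", "mobile agents for distributed retrieval")
initial = server.register("alice", "mobile agents")
notified = server.add_document("tr-2", "planning algorithms for mobile agents")
ignored = server.add_document("tr-3", "cache coherence protocols")
```

Even in this reduced form, every arriving document is compared against every stored query, so the work per new document grows with the number of standing queries. The sketch keeps no cluster associations, which is where the experiments found the memory cost to be concentrated.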