A Project of the D'Agents Laboratory at Dartmouth College
Brian Brewington, Bob Gray, George Cybenko
Since its inception scarcely a decade ago, the World Wide Web has become a popular vehicle for disseminating scientific, commercial and personal information. The web consists of individual pages linked to and from other pages through Hypertext Markup Language (HTML) constructs. The web is decentralized. Web pages are created, maintained and modified at random times by thousands, perhaps millions, of people around the world.
This system can be thought of as defining an "information space" much like the one that forms the basis of traditional information retrieval. Our research addresses how to utilize an information space that is dynamic, such as the WWW. By "dynamic", we mean that documents are created and destroyed, or more loosely speaking, old topics disappear and new topics are created. When building an index of this content, it is understood that the information indexed has a certain lifetime during which it is useful. What are the best ways to deal with the practical reality of maintaining such an index?
The WWW is an excellent platform for investigating some of the fundamental ideas in this area, such as:
Towards these ends, we have been collecting data from a WWW clipping service developed at Dartmouth, called The Informant. We have a sample of about 1.6 million web pages, with an average of 5 observations of each page. This has enabled us to gain a great deal of insight into the dynamics of the Web.
The figure shows some of our data on the age distribution of documents on the World Wide Web. On the left, we show the cumulative distribution function (CDF) of web page ages. On the right is the corresponding probability density function. It has a sharp peak, but is quite broad. Note in the CDF that about 10% of web documents are younger than 3 days, half are younger than around 3 months, and 75% are younger than 3 years. We are using these data to estimate the bandwidth a search engine would need in order to monitor such a set of dynamic information sources.
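As a rough illustration of how a cumulative age distribution like the one described above can be read off a set of observations, the following Python sketch builds an empirical CDF from a list of page ages. The function name `empirical_cdf` and the sample ages are hypothetical and chosen only for this example; they are not the Informant data, and this is not necessarily how the project computed its figures.

```python
from bisect import bisect_left


def empirical_cdf(ages_days):
    """Return F, where F(t) is the fraction of pages younger than t days."""
    sorted_ages = sorted(ages_days)
    n = len(sorted_ages)
    if n == 0:
        raise ValueError("need at least one age observation")

    def cdf(t):
        # bisect_left counts the entries strictly smaller than t.
        return bisect_left(sorted_ages, t) / n

    return cdf


# Made-up page ages in days, for illustration only.
sample_ages = [1, 2, 10, 45, 80, 120, 400, 900, 1500, 2000]
F = empirical_cdf(sample_ages)

print(F(3))    # fraction of sample pages younger than 3 days
print(F(90))   # fraction younger than roughly 3 months
```

Evaluating `F` at a few thresholds (days, months, years) gives the kind of summary quoted from the CDF above; a real analysis would also need to decide how to treat pages whose modification date is unknown.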
Relevant Dartmouth papers:
Brian Brewington. Ph.D. Thesis Proposal: Optimal Observation with WWW Applications. 1998.
Related projects, inside D'Agents:
Self-Organizing Information Systems