Lab 4 - Crawler requirements spec
Requirements Specification
In a requirements spec, *shall do* means *must do*.
The TSE crawler is a standalone program that crawls the web and retrieves webpages starting from a “seed” URL. It parses the seed webpage, extracts any embedded URLs, then retrieves each of those pages, recursively, but limiting its exploration to a given “depth”.
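The following is a minimal, compilable sketch of that depth-limited exploration, for intuition only; `links_of()` and its hard-coded link table are hypothetical stand-ins for fetching a page over HTTP and extracting its embedded URLs, and nothing here is a required interface.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for "fetch the page at `url` and extract its
 * embedded URLs"; the real crawler downloads and parses HTML instead. */
static const char**
links_of(const char* url)
{
  static const char* seed_links[] = { "pageA.html", "pageB.html", NULL };
  static const char* no_links[]   = { NULL };
  return (strcmp(url, "seed.html") == 0) ? seed_links : no_links;
}

/* Explore `url` at `depth`, recursing into its links while depth < maxDepth. */
static void
crawl_sketch(const char* url, int depth, int maxDepth)
{
  printf("explored %s at depth %d\n", url, depth); // placeholder logging; the real format is in the Implementation Spec
  if (depth >= maxDepth) {
    return;                                        // depth limit reached
  }
  const char** links = links_of(url);
  for (int i = 0; links[i] != NULL; i++) {
    crawl_sketch(links[i], depth + 1, maxDepth);   // one level deeper
  }
}

int
main(void)
{
  crawl_sketch("seed.html", 0, 1);  // maxDepth=1: the seed and its direct links
  return 0;
}
```

Note that this sketch makes no attempt to avoid re-exploring a page reached by more than one link; how the real crawler handles that is a design decision, not a requirement.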
The crawler shall:
- execute from a command line with usage syntax
  `./crawler seedURL pageDirectory maxDepth`
  - where `seedURL` is an ‘internal’ URL, to be used as the initial URL,
  - where `pageDirectory` is the (existing) directory in which to write downloaded webpages, and
  - where `maxDepth` is an integer in range [0..10] indicating the maximum crawl depth.
- mark the `pageDirectory` as a ‘directory produced by the Crawler’ by creating a file named `.crawler` in that directory.
- crawl all “internal” pages reachable from `seedURL`, following links to a maximum depth of `maxDepth`; where `maxDepth=0` means that crawler only explores the page at `seedURL`, and `maxDepth=1` means that crawler only explores the page at `seedURL` and those pages to which `seedURL` links, and so forth inductively. It shall not crawl “external” pages.
- print nothing to stdout, other than logging its progress; see an example format in the crawler Implementation Spec.
- write each explored page to the `pageDirectory` with a unique document ID (see the sketch after this list), wherein
  - the document `id` starts at 1 and increments by 1 for each new page,
  - and the filename is of form `pageDirectory/id`,
  - and the first line of the file is the URL,
  - and the second line of the file is the depth,
  - and the rest of the file is the page content (the HTML, unchanged).
- exit zero if successful; exit with an error message to stderr and non-zero exit status if it encounters an unrecoverable error, including
  - out of memory
  - invalid number of command-line arguments
  - `seedURL` is invalid or not internal
  - `maxDepth` is invalid or out of range
  - unable to create a file of form `pageDirectory/.crawler`
  - unable to create or write to a file of form `pageDirectory/id`
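As a concrete illustration of the page-file format above, here is a small sketch that writes one fetched page to `pageDirectory/id`; the function name `pagefile_write()` and its parameters are hypothetical, not a required interface.

```c
#include <stdio.h>
#include <stdbool.h>

/* Write one explored page to pageDirectory/id in the required format:
 * line 1 is the URL, line 2 is the depth, and the rest is the raw HTML. */
static bool
pagefile_write(const char* pageDirectory, int id,
               const char* url, int depth, const char* html)
{
  char filename[512];
  snprintf(filename, sizeof(filename), "%s/%d", pageDirectory, id);

  FILE* fp = fopen(filename, "w");
  if (fp == NULL) {
    return false;              // caller treats this as an unrecoverable error
  }
  fprintf(fp, "%s\n%d\n%s", url, depth, html);
  fclose(fp);
  return true;
}

int
main(void)
{
  // writes ./1 containing the URL, depth 0, and a one-line HTML body
  return pagefile_write(".", 1, "http://cs50tse.cs.dartmouth.edu/tse/", 0,
                        "<html></html>\n") ? 0 : 1;
}
```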
Definitions:
- A normalized URL is the result of passing a URL through `normalizeURL()`; see the documentation of that function in `tse/libcs50/webpage.h`.
- An Internal URL is a URL that, when normalized, begins with `http://cs50tse.cs.dartmouth.edu/tse/`.
One example: `Http://CS50TSE.CS.Dartmouth.edu//index.html` becomes `http://cs50tse.cs.dartmouth.edu/index.html`.
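One possible way to express the ‘internal’ test, assuming the URL has already been normalized (for example by `normalizeURL()` from `libcs50/webpage.h`); `isInternal()` is a hypothetical helper, not a required interface.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* A URL is internal iff, once normalized, it begins with the CS50 TSE prefix. */
static bool
isInternal(const char* normalizedURL)
{
  static const char prefix[] = "http://cs50tse.cs.dartmouth.edu/tse/";
  return strncmp(normalizedURL, prefix, strlen(prefix)) == 0;
}

int
main(void)
{
  // the example above normalizes to a URL outside the /tse/ prefix, so it
  // is not internal; the second (illustrative) URL does begin with the prefix
  printf("%d\n", isInternal("http://cs50tse.cs.dartmouth.edu/index.html"));     // 0
  printf("%d\n", isInternal("http://cs50tse.cs.dartmouth.edu/tse/index.html")); // 1
  return 0;
}
```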
Assumption:
The `pageDirectory` does not already contain any files whose name is an integer (i.e., `1`, `2`, …).
Limitation: The Crawler shall pause at least one second between page fetches, and shall ignore non-internal and non-normalizable URLs. (The purpose is to avoid overloading our web server and to avoid causing trouble on any web servers other than the CS50 test server.)