Lab4 starts a series of three assignments that will build TinySearch. This lab will design, implement and test the crawler module.
The design of the crawler will be discussed in class as a collective design. We will develop the design and implementation in class. All students will implement the same DESIGN SPEC developed in class.
This is a challenging assignment. It uses two data structures: a linked list and a hash table (with its hash function). We will discuss both of these data structures in class.
This lab will cover many aspects of the C language discussed in class: structures, pointers, string processing, malloc/free, file operations and management, and interacting with the shell to execute wget via the system call (which you used in the last lab).
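As a reminder of how a C program can hand work to wget through the shell, here is a minimal sketch. The function names (build_wget_command, fetch_page), the buffer size, and the choice of the wget options -q (quiet) and -O (output file) are assumptions made for illustration; they are not part of the DESIGN SPEC developed in class.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "wget -q -O '<outfile>' '<url>'" into buf.
 * Returns 0 on success, -1 if an argument is unsafe or buf is too small.
 * (Hypothetical helper, not part of the class spec.) */
int build_wget_command(char *buf, size_t size, const char *url, const char *outfile)
{
    assert(buf != NULL && url != NULL && outfile != NULL);

    /* Refuse single quotes so the arguments cannot break out of the shell quoting. */
    if (strchr(url, '\'') != NULL || strchr(outfile, '\'') != NULL)
        return -1;

    int n = snprintf(buf, size, "wget -q -O '%s' '%s'", outfile, url);
    if (n < 0 || (size_t)n >= size)
        return -1;              /* output error or command truncated */
    return 0;
}

/* Fetch url into outfile by running wget through the shell.
 * Returns 0 if the command ran and reported success, -1 otherwise. */
int fetch_page(const char *url, const char *outfile)
{
    char cmd[2048];

    if (build_wget_command(cmd, sizeof cmd, url, outfile) != 0) {
        fprintf(stderr, "fetch_page: cannot build command for %s\n", url);
        return -1;
    }
    int status = system(cmd);   /* check the return status, as the coding style asks */
    if (status != 0) {
        fprintf(stderr, "fetch_page: wget failed (status %d) for %s\n", status, url);
        return -1;
    }
    return 0;
}
```

A call such as fetch_page("http://www.cs.dartmouth.edu/", "temp.html") would then leave the page in temp.html for the parser to read.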
Importantly, for this lab you will use multiple files. Single functions or groups of related functions will go in their own .c file (e.g., crawler.c, list.c, html.c). You will also write header files for this lab, which will be included in your various .c files.
We also use the GNU make command to help manage and compile these multiple files. Here is a snippet from man:
Due date: 5 PM, Monday April 28
Submitting assignment: The following sequence of Linux commands should be used to submit your work by the deadline:
Change to your labs directory:
cd ~/cs23/labs
This directory contains your lab4 directory where your solutions are found.
Please make sure that the lab4 directory contains a simple text file, named README, describing anything “unusual” about how your solutions should be located, executed, and considered.
Issue the commands:
tar czvf $USER-lab4.tar.gz lab4
mail the “tarball” to cs23@cs.dartmouth.edu
You can use the pine text mailer (Program for Internet News and Email) to mail cs23@cs.dartmouth.edu
or
mutt -s "subject" -a ./$USER-lab4.tar.gz cs23@cs.dartmouth.edu
mutt is The Mutt Mail User Agent
Coding style: Please put comments in your programs to make them easier to understand. Include a header in each C file giving the file name, a brief description of the program, its inputs, and its outputs. It is important that you write C programs defensively; that is, you need to check the program’s input and give the user appropriate warning or error messages about the program usage. If you call functions from your code, make sure you check the return status (if available), print out meaningful error messages, and take suitable action. See testing below.
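To make "defensive" concrete, here is a minimal sketch of command-line checking. It assumes the crawler is invoked as crawler SEED_URL TARGET_DIRECTORY CRAWLING_DEPTH (the argument order is an assumption), that URLs must start with http://, and that the depth is capped by a hypothetical MAX_DEPTH of 4. The function names are illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define MAX_DEPTH 4   /* assumed upper bound; use the value from the class spec */

/* Parse a depth argument. Returns the depth, or -1 if arg is not
 * an integer in the range [0, MAX_DEPTH]. */
int parse_depth(const char *arg)
{
    char *end;

    errno = 0;
    long value = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0')
        return -1;                       /* overflow, empty, or trailing junk */
    if (value < 0 || value > MAX_DEPTH)
        return -1;                       /* out of range */
    return (int)value;
}

/* Returns 1 if path names an existing directory, 0 otherwise. */
int is_directory(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* Validate the command line. Prints a message to stderr and returns -1
 * on any problem; otherwise returns the crawling depth. */
int check_args(int argc, char *argv[])
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s SEED_URL TARGET_DIRECTORY CRAWLING_DEPTH\n",
                argc > 0 ? argv[0] : "crawler");
        return -1;
    }
    if (strncmp(argv[1], "http://", 7) != 0) {
        fprintf(stderr, "error: '%s' does not look like an http URL\n", argv[1]);
        return -1;
    }
    if (!is_directory(argv[2])) {
        fprintf(stderr, "error: '%s' is not a directory\n", argv[2]);
        return -1;
    }
    int depth = parse_depth(argv[3]);
    if (depth < 0) {
        fprintf(stderr, "error: depth must be an integer from 0 to %d\n", MAX_DEPTH);
        return -1;
    }
    return depth;
}
```

In main, a negative result from check_args(argc, argv) would typically lead to an immediate exit with a non-zero status, which is also convenient for a test script to detect.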
NEW: multiple files and make. Your directory will include a makefile that will be used to do a build of your system. We will discuss the make utility in class this week. As part of grading we will do a build of your code using make. There should be no warnings from the compiler.
NEW: writing a bash test script to automatically test your crawler program. You will also be asked to write a test script that calls your crawler program multiple times with different parameters to check the operation of the program. For example, how does your program deal with various inputs, some of which will be erroneous (e.g., bad URL, depth too large, etc.) and others good? Writing a test script has the benefit that if you have changed code you can quickly rerun the new build (the output of make) against your test script as a sanity check that nothing has been broken. While we will discuss unit testing in more detail for Lab5, we will start with Lab4 to get students to write simple bash scripts (tools that are specific to testing the program) to test their code. Note, we expect that the test script and the log from the tests will be included in the tarball. Call the test script crawler_test.sh, and direct the log of what the test prints out to a file called crawler_testlog.`date` (e.g., crawler_testlog.Wed Jan 30 02:17:20 EST 2008). Again, these two files must be included in your tarball. As part of your grade we will run your script against your code and look at the new log produced. Please make sure that your test script writes out the name of each test, the expected results, and what the system outputs.
____________________________________________________________________________
Design, implement, and test (but not exhaustively at this stage) a standalone crawler for TinySearch. The design of the crawler will be done in class. In the next lecture we will start to develop a DESIGN SPEC for the crawler module. We will also develop an IMPLEMENTATION SPEC. Based on these two specs you will have a blueprint of the system from which to develop your own implementation that you can test.
The crawler is a standalone program that crawls the web and retrieves webpages starting with a seed URL. It parses the seed webpage, extracts any embedded URLs that are tagged as discussed above, retrieves those pages, and so on. Once TinySearch has completed at least one complete crawling cycle (i.e., it has visited a target number of webpages, which is defined by a depth parameter on the crawler command line), the crawler process will complete its operation.
The REQUIREMENTS of the crawler are as follows. The crawler SHALL (a term that here denotes a requirement):
The TinySearch architecture is shown in Figure 1. Observe that the crawled webpages are saved with a unique document ID as the file name. The URL and the current depth of the search are stored in each file.
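The file naming and contents described here could be sketched as follows. The exact layout inside each file is an assumption made for this sketch (URL on the first line, depth on the second, page text after that), and save_page is a hypothetical name; follow the IMPLEMENTATION SPEC from class for the real format.

```c
#include <assert.h>
#include <stdio.h>

/* Write one crawled page to <dir>/<doc_id>.
 * Assumed layout: line 1 = URL, line 2 = depth, remainder = page HTML.
 * Returns 0 on success, -1 on failure. */
int save_page(const char *dir, int doc_id, const char *url, int depth, const char *html)
{
    char path[1024];

    assert(dir != NULL && url != NULL && html != NULL);

    int n = snprintf(path, sizeof path, "%s/%d", dir, doc_id);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;                      /* path would not fit in the buffer */

    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        perror(path);                   /* report why the file could not be created */
        return -1;
    }

    int ok = fprintf(fp, "%s\n%d\n%s", url, depth, html) >= 0;
    if (fclose(fp) != 0)
        ok = 0;                         /* a failed close can mean lost data */
    return ok ? 0 : -1;
}
```

The caller would keep a counter that starts at 1 and pass it as doc_id, incrementing it after each page that is saved successfully.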
When does it complete? The crawler cycle completes when either all the URLs in the URL list have been visited or an external stop command is issued. Note that the crawler stops retrieving new webpages once it has reached the depth given by the depth parameter. For example, if the depth = 1 then only the seed page and the pages of the URLs embedded in the seed page are retrieved. If the depth = 2 then all the pages pointed to by those pages are retrieved as well. The depth parameter tunes the number of pages that the crawler will retrieve.
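To see how the depth parameter bounds the crawl, here is a small self-contained sketch that runs a depth-limited, breadth-first traversal over a made-up six-page "web" held in an array. The link table, the names, and the use of a simple array queue are all assumptions for illustration; the real crawler works on URLs and the URL list from the class design.

```c
#include <assert.h>

#define NUM_PAGES 6
#define MAX_LINKS 3
#define NO_LINK  (-1)

/* Toy web: links[i] lists the pages that page i links to. */
static const int links[NUM_PAGES][MAX_LINKS] = {
    {1, 2, NO_LINK},              /* page 0 (the seed) links to pages 1 and 2 */
    {3, 0, NO_LINK},              /* page 1 links to 3 and back to the seed   */
    {3, 4, NO_LINK},              /* page 2 links to 3 and 4                  */
    {5, NO_LINK, NO_LINK},        /* page 3 links to 5                        */
    {NO_LINK, NO_LINK, NO_LINK},  /* page 4 has no links                      */
    {NO_LINK, NO_LINK, NO_LINK},  /* page 5 has no links                      */
};

/* Count the pages retrieved when crawling from seed down to max_depth.
 * A page at depth max_depth is retrieved, but its links are not followed. */
int pages_retrieved(int seed, int max_depth)
{
    int queue[NUM_PAGES], depth[NUM_PAGES];
    int visited[NUM_PAGES] = {0};
    int head = 0, tail = 0;

    assert(seed >= 0 && seed < NUM_PAGES);

    queue[tail] = seed;
    depth[tail] = 0;
    tail++;
    visited[seed] = 1;

    while (head < tail) {
        int page = queue[head];
        int d = depth[head];
        head++;

        if (d >= max_depth)
            continue;                       /* depth limit reached: do not expand */

        for (int i = 0; i < MAX_LINKS && links[page][i] != NO_LINK; i++) {
            int next = links[page][i];
            if (!visited[next]) {           /* each page is queued at most once */
                visited[next] = 1;
                queue[tail] = next;
                depth[tail] = d + 1;
                tail++;
            }
        }
    }
    return tail;                            /* every queued page was retrieved */
}
```

On this toy web, pages_retrieved(0, 1) gives 3 (the seed plus the two pages it links to) and pages_retrieved(0, 2) gives 5, which mirrors the depth = 1 and depth = 2 cases described above.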
Need to sleep. Because webservers DO NOT like crawlers (think about why), they will block your crawler based on its address. THIS is a real problem. Why? Imagine you launch your spiffy TinySearch crawler and crawl the New York Times website continuously and fast. The New York Times server will try to serve your pages as fast as it can. Now imagine 100s of crawlers launched against the server. Yes, it would spend an increasing amount of time serving crawlers and not customers, people like the lecturer.
But wait. What would the New York Times do if it detected you crawling too heavily from the domain dartmouth.edu? It would likely block the domain, i.e., the complete Dartmouth community! What would that mean? Probably Jim Wright wouldn’t be able to read his New York Times, and I’m toast. So what should we do? Well, let’s try not to look like a crawler to the New York Times website. Let’s introduce a delay. Just like spy (recall?), let’s sleep for a period INTERVAL_PER_FETCH. Sneaky, hey?
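The delay could be sketched like this, using the POSIX sleep() call between fetches. The value of INTERVAL_PER_FETCH (1 second), the function-pointer interface, and the stand-in fake_fetch are assumptions chosen so the loop can be tried without touching the network.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define INTERVAL_PER_FETCH 1   /* seconds between fetches; the value is an assumption */

/* A fetcher takes a URL and returns 0 on success. */
typedef int (*fetch_fn)(const char *url);

/* Fetch each URL in turn, pausing `interval` seconds between fetches.
 * Returns the number of URLs fetched successfully. */
int fetch_all_politely(const char *urls[], size_t count, fetch_fn fetch, unsigned interval)
{
    int fetched = 0;

    assert(fetch != NULL);

    for (size_t i = 0; i < count; i++) {
        if (fetch(urls[i]) == 0)
            fetched++;
        if (i + 1 < count)
            sleep(interval);           /* no need to wait after the last fetch */
    }
    return fetched;
}

/* Stand-in fetcher that only counts its calls (hypothetical, for trying the loop). */
static int fake_fetch_calls = 0;

int fake_fetch(const char *url)
{
    (void)url;
    fake_fetch_calls++;
    return 0;
}
```

In the crawler itself the fetcher would be the routine that runs wget, and the interval argument would be INTERVAL_PER_FETCH.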
The crawler command takes the following input:
The output of your crawler program should be:
Once the crawler starts to run, it gets wget to download the SEED_URL and then starts to process each webpage, hunting for new URLs. A parser function (which we will provide to you) runs through each webpage looking for URLs. These are stored in a URLList for later processing. The crawler must remove duplicate URLs that it finds (or, better, it marks that it has visited a webpage (URL) and does not visit it again even if it finds the same URL again). There are also cases of stale URLs that give “Page Not Found”.
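One way to avoid visiting the same URL twice is a hash table with chaining, sketched below. The table size, the type and function names, and the choice of the djb2 string hash are assumptions for illustration; the hash function and structures agreed on in class may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NUM_SLOTS 1024   /* assumed table size */

/* One entry in a slot's linked list. */
typedef struct url_node {
    char *url;
    struct url_node *next;
} url_node;

/* The set of visited URLs; must start out zeroed (all slots NULL). */
typedef struct {
    url_node *slots[NUM_SLOTS];
} url_set;

/* djb2 string hash, reduced to a slot index. */
unsigned long hash_url(const char *url)
{
    unsigned long h = 5381;
    for (const unsigned char *p = (const unsigned char *)url; *p != '\0'; p++)
        h = h * 33 + *p;
    return h % NUM_SLOTS;
}

/* Record url as visited.
 * Returns 1 if it was newly added, 0 if already present, -1 if out of memory. */
int url_set_add(url_set *set, const char *url)
{
    unsigned long slot = hash_url(url);

    for (url_node *n = set->slots[slot]; n != NULL; n = n->next)
        if (strcmp(n->url, url) == 0)
            return 0;                          /* duplicate: do not visit again */

    url_node *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->url = malloc(strlen(url) + 1);
    if (node->url == NULL) {
        free(node);
        return -1;
    }
    strcpy(node->url, url);
    node->next = set->slots[slot];             /* push onto the front of the chain */
    set->slots[slot] = node;
    return 1;
}

/* Free every node and URL string held by the set. */
void url_set_free(url_set *set)
{
    for (int i = 0; i < NUM_SLOTS; i++) {
        url_node *n = set->slots[i];
        while (n != NULL) {
            url_node *next = n->next;
            free(n->url);
            free(n);
            n = next;
        }
        set->slots[i] = NULL;
    }
}
```

The crawler would call url_set_add on each URL the parser returns and only append the URL to the URLList when the result is 1.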
Below is a snippet from when the program starts to crawl the CS webserver to a depth of 2, meaning it will attempt to visit all URLs in the main CS webpage and then all URLs in those pages. The crawler prints status information as it goes along (this could be used in debug mode to observe the operation of the crawler as it moves through its workload). Note, you should use a LOGSTATUS macro that can be set when compiling to switch these status printouts on or off. In addition, you should be able to write the output to a logfile to look at later should you wish.
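The compile-time switch for status output could look something like the sketch below. It assumes the switch is a preprocessor symbol named LOGSTATUS passed on the compiler command line (for example with gcc -DLOGSTATUS); the helper names LOG_ENABLED and LOG are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* LOG_ENABLED is 1 when the program is compiled with -DLOGSTATUS, else 0. */
#ifdef LOGSTATUS
#define LOG_ENABLED 1
#else
#define LOG_ENABLED 0
#endif

/* Print a status message to the stream `out` only when logging is enabled.
 * Evaluates to the number of characters written, or 0 when disabled.
 * The arguments are still type-checked when logging is off, but the
 * compiler can discard the call because LOG_ENABLED is a constant. */
#define LOG(out, ...) (LOG_ENABLED ? fprintf((out), __VA_ARGS__) : 0)
```

A call such as LOG(stderr, "crawling %s\n", url) then prints only in builds made with -DLOGSTATUS. Passing a FILE pointer obtained from fopen instead of stderr sends the same messages to a logfile.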
In the snippet, the program gets the SEED_URL page, prints out all the URLs it finds, and then crawls http://www.cs.dartmouth.edu/index.php next. PHP is a scripting language that produces HTML; a .php URL provides a valid webpage just like a .html one. In the snippet it only gets two webpages.
For each URL crawled, the program creates a file and places in the file the URL and filename. But for a CRAWLING_DEPTH = 2, as in this example, a large number of webpages are crawled and files created. For example, if we look at the files created in the [TARGET_DIRECTORY] (the pages directory in this case), the crawler creates 184 files (184 webpages) totaling 3.2 Megabytes. That means that for a depth of 2 on the departmental webpage there are 184 unique URLs (in fact there may be more, since some could have been stale; does someone want to check when they code the crawler?).
OK. You are done.
Note that each webpage is saved under a unique document ID, starting at 1 and incrementing by one. Below we use less to view three files (viz. 1, 5, 139). As you can see, the crawler has stored the URL and the current depth value when the page was crawled.
Tip: Make sure you always logout when you are done and see the prompt to login again before you leave the terminal.