A very common feeling when challenged with coding the crawler is to ask - “How the heck to I start this thing?”. In what follows, we propose a step by step approach to getting started wth coding the crawler. We first take a look at the pseudo code then detail each step. We break down each step into coding and debugging tips. You should use the gdb debugger to set break points and examin data structures such as the DICTIONARY, DNODES and URLNODES. The debugger is very powerful in this respect and allows you to print data structures and traverse the double link list of DNODES checking that pointers and data elements are consistent and correct - if not, you have a bug.
We suggest that once you fully understand the data flow and data structures you should then take a detailed look proat the peusdo code; that is, read the Design Spec and Implementation Spec from the notes.
In what follows, we discuss 8 steps that will get you started. We suggest that you code up the functions in crawler one by one and use the debugger to check for correctness. At the end of these notes we’ll give some tips about debugging tricky bugs.
Have fun. The crawler is cool. Once you have completed it you will have learnt a lot.
Let’s get started.
Remember to always code defensively – expect the unexpected with software!. Here the user enters a numer of parameters to the crawler program; for example:
$crawler www.cs.dartmouth.edu ../../data/ 2
Start by coding up the command line argument processing - you need to check that crawler is passed good parameters: has the user entered the right number of parameters, is the directorty valid (i.e., does it exist and is it writable?), is the depth parameter less than the maxmium allowed? If the user provides invalid paremeters then you need to exit gracefully with usage printout to the user to correct their input.
Use printf to print out error or usage messages to the user as discussed in the Input and output in the Design Spec.
Function to code: Code *initLists*: Initialize any data structure and variables.
Tips to code: The Dictionary should be malloced and initialized.
Tips to debug: use the debugger (dgb) to print out the dictionary once initialized. In the following snippet of a gdb session we start gdb by passing in the parameters. You can also do this by simply running gdb and the once the debugger is running type set args www.cs.dartmouth.edu ../../data/ 2. Use either method.
In all the gdb examples below we use the # to provide comments.
Function to code: Code and check *getPage(seedURL, current˙depth, target˙directory)* Get HTML page – this is a stricky function.
Tips to code: You have to use wget to store the seed page to a temp file. Determine the length of the temp file in terms of the number of bytes (+1 for the string termination 0 character that you need to append to the end of the string), malloc a buffer, read in the file into the buffer and add the 0 at the end of the string. Then return the pointer to the buffer.
Tips to debug: You can set a break point on the return from getPage and print out the string, you can make sure that the length of the buffer is the same as what wget prints outu, or, use the shell command “wc” on the temp file to check the number of characters/bytes - the length buffer should be the same the output of wget or wc + 1. This function also creates a file (viz 1..N) to store the URL, depth, hmtl. You can also use the shell command “less” on the file to check that it is the same as the temp file with the addition of the URL and depth – try using the shell command “diff” to do this smartly. These rudimentary checks give confidence that the getPage function is working correctly. It does not mean that bugs are not present. You need to run unit tests to determine this – we will discuss this later in the course.
Function to code: Code and check *extractURLs(page, SEED˙URL)* Extract all URLs from SEED˙URL page.
Tips to code: This function needs to use the parser GetNextURL (we give you the code to do this, phew) to extract the URLs from the page (the buffer string mentioned above). It should add each URL to the url˙list array if the URL has the same URL˙PREFIX (i.e., define URL_PREFIX ‘‘http://www.cs.dartmouth.edu’’). This is really important since we only want to search crawler CS pages and not the New York Times webpages! So you need to do a string comparison to check that the URL has the same prefix as above. The function needs to first initialize the complete url˙list to NULL; recall url˙list is an array of pointers to URLs (strings). Therefore, if a valid URL is found you have to malloc a string to copy the URL into and then add the pointer to the url˙list.
Tips to debug: Once you return from this function use gdb to look at the URLs in the url˙list – this function returns a pointer to url˙list which is simply the name of the array if you use a global variable - if you malloc url˙list then return the pointer.
Function to code: Code and check *updateListLinkToBeVisited(URLsLists, current˙depth + 1)* For all the URL in the URLsList that do not exist already exist in the dictionary add a DNODE and URLNODE (URL name, depth, visited) to dictionary list.
Tips to code: This is perhaps the most complicated function you will code for the crawler. At a high level it has to determine uniqueness of all URLs in the url˙list (it is very likely that your code will come across duplicates). We discussed in class how the dictionary can be used to determine uniqueness and how the dictionary can searched to determine uniqueness; and, if the URL is unique it shall be added at the end of the Dictionary list (i.e., dict->end) or at the end of cluster if there is a collision. The process of adding an unique URL requires the code to malloc and initialize a URLNODE and DNODE. Then, link the DNODE and URLNODE pair into the dictionary correctly.
Tips to debug: Set a break point after before and after this function is called and inspect the dictionary data structure. For the seed webpage you should get approximately 30 URLs (could be more or less). You should be able to run down the list and inspect pointers to DNODEs and URLNODEs and study their contents: is the structure intact? are all the pointers correct including start and end of the deictionary, and next prev for the DNODE? are all the elements in the datastructures that are not pointers correct (DNODE’s key, URLNODE’s url, visited, depth). If not you have a bug. Note, the debugger is very powerful. You can follow pointers and print out data structures with ease. Much better than using cumbersome printfs in your code.
Function to code: *setURLasVisited(SEED˙URL)* Mark the current URL visited in the URLNODE.
Tips to code: Simply set the current URL as visited.
Tips to debug: Simply use gdb to inspect the DNODE data structure before and after this function is called.
Function to code: *getAddressFromTheLinksToBeVisited(current˙depth)*. Get the next URL to be visited from the dictionary list (first one not visited from dict->start.
Tips to code: Simply traverse the dictionary’s DNODEs searching for a URLNODE not visited. If there aren’t any (all URLNODEs have been visited - we are done!) we return NULL and the crawler drops out of its “while loop” (see pseudo code) and cleans up.
Tips to debug: Simply check after the function is called that the URLNODE has not been visited. Also, find the URLNODE on the dictionary and make sure that the prev DNODE’s URLNODE has been visited, and that the next DNODE’s URLNODE is “not visited” (assuming that the current DNODE’s URLNODE is not the last item on the dictionary).
Function to code: *cleanup* Clean up data structures and make sure all files are closed, resources deallocated.
Tips to code: Simply breakdown all the data structures malloced by freeing them.
Tips to debug: Because free a chunk of memory still leaves pointers in data structures this is hard to check. But we will use a tool valgrind to check for memory leaks.
This is your first real encounter with dynamic memory management for data strcutures and pointers which drive the management and operation of the data structures. Many tricky bugs lead to segfaults that can be easily debugged – a pointer that is NULL when you expect an address to a malloced data structure. But bugs may manifest all sorts of edge cases leading to unstable code, infinite loops, fewer crawled files than expected. You might have a beautiful operational crawler at depth 2 but segfaults at depth 3; or you might crawl at level 3 but if you change the size of the hash table (MAX_HASH_SLOTS) from 10000 entries (which means the crawler nearly always finds a NULL pointer in the hash table) to 10 entries (which means your code is executing collision processing nearly all the time) then you might (or will likely ;-) segafult – these are edge cases. By changing a system parameter (i.e., MAX_HASH_SLOTS) we radically change the operation of the system forcing the program control through code not probably executed often. These edge case tests force pesky bugs out into the light so we can swat them – splat!.
Once you have coded each of the steps above and used the debugger to verify that the code functionally works you can go on to complete the crawler cycle which is just a while loop for all URLs to visit. You will likely experiennce some of the bugs mentioned above – many associated with your codes interaction with memory – that is data strcutures.
What do you do if you have a really non-obvious bug – that is, a tricky one?
1) run your program through valgrind, and note the line numbers of any errors. Examine the code around these lines and see if I can figure out what the problem is.
We haven’t covered valgrind yet other than the quick demo in class. Run this command on the command line once you have a clean make – on a single command line (it looks like it’s on two lines below but has to be one:
We will discuss valgrind more in class this week along with other techniques for testing and debugging. But here is a quick introduction to valgrind:
2) If you can’t simply figure out the problem by eyeballing your code (i.e., desk checking it), then start gdb and set breakpoints at some nearby lines – that is before and/or after the source code line numbers given by the valgrind error indicated in valout. Step through (potentially multiple times), examining variables (e.g., DNODEs) and the code until you understand “the problem” (always a good place to start) – and what valgrind is complaining about, such as, mememory leaks, writing/reading beyond malloced memory.
Start by looking at the valout file – this is the output of valgrind (see the command line). Note, valgrind runs much more slowly that non valgrind code. In addition, we set the verbose flag in the command. Ideally you should have no errors but if you do (which you will for sure) then clear all the illegal read and writes and memory leaks.
I would not continue with any further code development until you clear all the valgrind errors; your CLEAN (no errors) valout file should look like this for example:
3) Of course, there could always be edge cases where something completely unrelated to the code is at fault, but these are pretty rare
Once you have done 1-8 for the seed URL you will be in good shape to run in the “while loop” doing steps 3, 4, 5, 6 and 7 until no more URLNODE’s need to be visited. This is not to say there will not be bugs running more deeply in the list but you will now have a good sense of the “excuting data strcutures” and valgrind and gdb skills to attack those pesky bugs if lurking. Remember, printf is not your friend. Dump him now.