In the this lecture, we will discuss the detailed design of the crawler. Two specifications are discussed: (i) the Design Spec, which of a language independent specification of the crawler, and (ii), the Implementation Spec, which describes the prototype (APIs) and data structures representing the abstract data type for the crawler.

In the last lecture we discuss the detailed requirements of the Crawler. We identified three important Specs as part of our software design methodology (discussed in the same lecture); there are: Requirements Spec, Design Spec, and Implementation Spec. In today’s lecture we will translate the Requirement Spec into the Design Spec. Following that we will get into the implementation of the Crawler by translating the high-level and language/OS/HW independent Design Spec to the Implementation Spec which can (in our case) be compiled by gcc. Also, as part of Lab4 we will begin the notion of testing our code through writing a bash script that will automatically test the Crawler program as a single sub-system (at unit).

We will talk more about other important parts of the design methodology discussed in Lecture 11 such as version control (the cvs tool), automatic compilation and management of code with multiple files (the make tool) and debugging our code the (gdb tools) in subsequent lectures. We have reached a point in the course where we are writing complex (multiple file) programs: make and gdb tools will become our friends; more on make and gdb in the next lecture.

Toward the end of the lecture we will discuss the code we have given you to implement the Crawler. These include crawler.h which captures much of the Implementation Spec and crawler.c which includes material developed as part of the Design Spec (e.g., pseudo code, IO).

Goals

Crawler Design Spec

So let’s get started. Take a look at the Requirement Spec in the last lecture and then follow their translation into the top level design.

***Crawler Design Spec***
-------------------------

In the following Design Module we describe the input, data flow, and output specification
for the crawler module. The pseudo code for the crawler module is also given.

The Design Spec includes:

(1) Input: Any inputs to the module
(2) Output: An outputs of the module
(3) Data Flow: Any data flow through the module
(4) Data Structures: Major data structures used by the module
(5) Pseudo Code: Pseudo code description of the module.

Now, let’s look at how we define the inputs, outputs and the major data flow through the module.

(1) *Input*
-----------

Command Input

./crawler [SEED URL] [TARGET DIRECTORY WHERE TO PUT THE DATA] [MAX CRAWLING DEPTH]

Example command input

crawler www.cs.dartmouth.edu ./data/ 2

[SEED URL] www.cs.dartmouth.edu
Requirement: The URL must be valid.
Usage: The crawler needs to inform the user if URL is not found

[TARGET DIRECTORY] ./data/
Requirement: The directory must exist and be already created by the user.
Usage: The crawler needs to inform the user if the directory can not be found

[MAX CRAWLING DEPTH] 2
Requirement: The crawl depth cannot exceed 4
Usage: The crawler needs to inform the user the user exceeds the maximum depth

(2) *Output*
------------
For each webpage crawled the crawler program will create a file in the
[TARGET DIRECTORY]. The name of the file will start a 1 for the [SEED URL]
and be incremented for each subsequent HTML webpage crawled.

Each file (e.g., 10) will include the URL associated with the saved webpage and the
depth of search in the file. The URL will be on the first line of the file
and the depth on the second line. The HTML will for the webpage
will start on the third line.

(3) *Data Flow*
---------------
HTML webpages enter the module and are stored in the TARGET DIR as a temp file.
The HTML is read into a string (char *p) and parsed for URLs (looking for href)
producing a list of URLs from the page. The URLs are stored in a URLNODE data
structure and linked into a DNODE double linked list if they are unique.
The temp file is stored in the TARGET DIR as filename between 1..N (where N is
the number of unique webpages downloaded) with the first line in the file being the
URL, the second being the depth, followed by the contents of TEMP. The next page
is then retrieved and the process repeats.

Next, we define the major data structures. Here we are not interested in the details of the implementation of the data structures rather we define the major data structures, variables and data in a implementation independent manner. Later in the Implementation Spec we will write he C for the major data structures, variables and data.

(4) *Data Structures*
---------------------

current_depth - current depth of the page.

max_depth - maximum depth to crawl.

page - pointer to the current page (the string that holds the current webpage)

URLsLists - holds a list of URLs from the page by name only
(e.g., www.cs.dartmouth.edu/~campbell)

URLNODE structure holds the URL name, its depth and a visited flag (0 not visited
1 for visited)

URLToBeVisited - next URL to be visited.

Clearly, we need a data structure that will allow us to store URLs that are
returned iff they are unique. At this stage the definition of a double linked
list would be sufficient to manage the storing and searching (to see if a new URL
is already on the list. If so it is not added). We define a double link list
with a start and end.

Double Link NODE List
URL *start
URL *end

For now it is fine just to understand that that this is a list of some
data structure that holds information about the URL (such as URLNODE information).
In the Implementation Spec we will flush out the detailed C structures for all
of the above. But for now abstraction is fine. Details will come in the
Implementation Spec.

Now we move to formulating the code flow in terms of pseudo code - a plain non techie English like language which again is independent of the implementation language. The idea is that a programmer could take this DESIGN SPEC and code it in any language. The idea is that the complete Design Spec is defined in terms that are independent of any computer language. Therefore, you could code this is C, assembler, C++, JAVA. Granted that this does not call out any specific Object Oriented design methodology tools but the language and OS independent definition of the input, outputs, data flow, data structures, and pseudo code allow the programer to map this (conjecture) to any language (what do you think?).

(5) *Crawler* Pseudocode
------------------------

// Input command processing logic

(1) Command line processing on arguments
    Inform the user if arguments are not present
    IF target_directory does not exist OR depth exceeds max_depth THEN
       Inform user of usage and exit failed

// Initialization of any data structures

(2) *initLists* Initialize any data structure and variables

// Bootstrap part of Crawler for first time through with SEED_URL

(3) page = *getPage(seedURL, current_depth, target_directory)* Get HTML into a string and return as page,
            also save a file (1..N) with correct format (URL, depth, HTML)
    IF page == NULL THEN
       *log(PANIC: Cannot crawl SEED_URL)* Inform user
       exit failed
(4) URLsLists = *extractURLs(page, SEED_URL)* Extract all URLs from SEED_URL page.

(5) *free(page)* Done with the page so release it

(6) *updateListLinkToBeVisited(URLsLists, current_depth + 1)*  For all the URL
    in the URLsList that do not exist already in the dictionary then add a DNODE/URLNODE
    pair to the DNODE list.

(7) *setURLasVisited(SEED_URL)* Mark the current URL visited in the URLNODE.

// Main processing loop of crawler. While there are URL to visit and the depth is not
// exceeded keep processing the URLs.

(8) WHILE ( URLToBeVisited = *getAddressFromTheLinksToBeVisited(current_depth)* ) DO
        // Get the next URL to be visited from the DNODE list (first one not visited from start)

      IF current_depth > max_depth THEN

          // For URLs that are over max_depth, we just set them to visited
          // and continue on

          setURLasVisited(URLToBeVisited) Mark the current URL visited in the URLNODE.
          continue;

    page = *getPage(URLToBeVisited, current_depth, target_directory)* Get HTML into a
            string and return as page, also save a file (1..N) with correct format (URL, depth, HTML)

    IF page == NULL THEN
       *log(PANIC: Cannot crawl URLToBeVisited)* Inform user
       setURLasVisited(URLToBeVisited) Mark the bad URL as visited in the URLNODE.
       Continue; // We don’t want the bad URL to stop us processing the remaining URLs.

    URLsLists = *extractURLs(page, URLToBeVisited)* Extract all URLs from current page.

    *free(page)* Done with the page so release it

    *updateListLinkToBeVisited(URLsLists, current_depth + 1)* For all the URL
    in the URLsList that do not exist already in the dictionary then add a DNODE/URLNODE
    pair to the DNODE list.

    *setURLasVisited(URLToBeVisited)* Mark the current URL visited in the URLNODE.

    // You must include a sleep delay before crawling the next page
    // See note below for reason.

    *sleep(INTERVAL_PER_FETCH)* Sneak by the server by sleeping. Use the
     standard Linux system call

(9)  *log(Nothing more to crawl)

(10) *cleanup* Clean up data structures and make sure all files are closed,
      resources deallocated.

Because webservers do not like crawlers (think about why) they will block your crawler based on its address. This is a real problem. Why? Imagine your launch your spiffy TinySearch crawler to crawl the New York Times webpage continuously and fast. The New York Times server will try and serve your pages as fast as it can. Imagine 100s of crawlers launched against the server? Yes, it would spend an increasing amount of time serving crawlers and not customers - hey people like the lecturer.

But, wait. What would the New York Times do if it detects you crawling to heavily from a domain dartmouth.edu? It might block the domain, i.e., the complete dartmouth community! What would that mean? Probably, Jim Wright wouldn’t be able to read his New York Times and I’m toast. So what should we do? Well let’s try and not look like a crawler to the New York Times website. Let’s introduce a delay. Just like spy - recall? - lets sleep for a period INTERVAL_PER_FETCH. Sneaky hey

Implementation Spec

Given the Design Spec above we go the next level of detail with the Implementation Spec. The Implementation Spec should be language specific. It should compile without errors or warnings. It should present the code to implement the abstract data type of the crawler - i.e., the detailed data structures and the functions that operate on the data. The Implementation Spec is realized in the crawler.h file. In addition, the crawler.c file includes aspects of the Design Spec such as the high-level pseudo code for the module. Take a look at crawler.c and crawler.h linked into the end of this lecture notes.

***Crawler Implementation Spec***
---------------------------------

In this implementation specification we define the prototypes and data structures
in details. It defines the abstract data types (ADT) for the crawler design module.
This specification should code should compile without errors as the crawler.

The Implementation (crawler.h) includes:

(1) Prototype definitions: Any functions that are called from the main module.
(2) Data structures: Complete specification of major data structures and important
variables

//(1) *Prototype Definitions*

int initLists(void);

// FUNCTION: initLists. Initializes an data structures and resources
// GOOD:returns 0 if OK
// FAILURE: returns 1 in case of failure

char *getPage(char *url, int depth,  char *path);

// FUNCTION: Get HTML into a string and return as page, also save a file (1..N)
              with correct format (URL, depth, HTML)
// GOOD: returns pointer to a string which represents the complete HMTL webpage.
// FAILURE: return NULL string in case of failure

char **extractURLs(char *html_buffer, char *current);

// FUNCTION: extractURLs. Given pointer to a string of HTML and the current ULR
// GOOD:returns pointer to a set of pointers to URLs
// FAILURE: returns NULL string in case of failure

void updateListLinkToBeVisited(char **url_list, int depth);

// FUNCTION: updateListLinkToBeVisited. For all the URL
// in the URLsList that do not exist alread then add a URLNODE (URL name, depth,
// visited) to the NODE list (start, end).

void setURLasVisited(char *url);

// FUNCTION: setURLasVisited. Given pointer to a string of HTML and the current ULR

char *getAddressFromTheLinksToBeVisited(int *depth);

// FUNCTION:  getAddressFromTheLinksToBeVisited. Get the next URL to visit from the
// list that are stored already
// GOOD:returns next URL to be visited
// FAILURE: returns NULL string in case of failure

void cleanup(void);

// FUNCTION: cleanup.  Clean up data structures and make sure all files are closed,
// resources deallocated

// Standard Linux system calls used
//
// free(char *);
// unsigned int sleep(unsigned int seconds);
// void free(void *ptr);
// void *malloc(size_t size);
//
// Make sure you include the correct set of #include files for these sys calls
//

Now, we define the a number of important constants for the program. These at first-sight might seem arbitrary large. They are set for worst case scenarios.

// The max length of each URL path.

#define MAX_URL_LENGTH 1000

// Sleep interval of the crawler.

#define INTERVAL_PER_FETCH 1

// The URL we crawled should starts with this prefix.

#define URL_PREFIX "http://www.cs.dartmouth.edu"

// Set some magic numbers. These are large values.

// Unlikely to have more than an a URL longer that 1000 chars.
// The KEY is the same as a URL. The term KEY is just a
// general Dictionary/hash function term

#define MAX_URL_LENGTH 1000
#define KEY_LENGTH 1000

// Make the hash large in comparison to the number of URLs found
// for example depth of 2 on www.cs.dartmouth.edu approx. 200.
// This will minimize collisions and mostly likely slots will be
// empty of have one or two DNODES. Access is O(1). Fast.

#define MAX_HASH_SLOT 10000

// Unlikely to have more than an 1000 URLs in page

#define MAX_URL_PER_PAGE 10000

//(2) *DATA STRUCTURES AND VARIABLES*

// DATA STRUCTURES. All these structures should be malloc ’d

// This is the key data structure that holds the information of each URL.

typedef struct _URL{
  char url[MAX_URL_LENGTH];      // e.g., www.cs.dartmouth.edu
  int depth;                     //  depth associated with this URL.
  int visited;                   //  crawled or not, marked true(1), otherwise false(0)
} __URL;

typedef struct _URL URLNODE;

// Dictionary Node. This is a general double link list structure that
// holds the key (URL - we explained into today’s lecture why this is there)
// and a pointer to void that points to a URLNODE in practice.
// key is the same as URL recall.

typedef struct _DNODE {
  struct _DNODE  *next;
  struct _DNODE  *prev;
  void    *data;        //  actual data points to URLNODE
  char key[KEY_LENGTH]; //  actual (URL) key
} __DNODE;

typedef struct _DNODE DNODE;

// The DICTIONARY holds the hash table and the start and end pointers into a double
// link list. This is a unordered list with the exception that DNODES with the same key (URL)
// are clusters along the list. So you hash into the list. Check for uniqueness of the URL.
// If not found add to the end of the cluster associated wit the same URL. You will have
// to write an addElement function.

typedef struct _DICTIONARY {
  DNODE *hash[MAX_HASH_SLOT]; // the hash table of slots, each slot points to a DNODE
  DNODE *start;               // start of double link list of DNODES terminated by NULL pointer
  DNODE *end;                 // points to the last DNODE on this list
} __DICTIONARY;

typedef struct _DICTIONARY DICTIONARY;

// Define the dict structure that holds the hash table
// and the double linked list of DNODES. Each DNODE holds
// a pointer to a URLNODE. This list is used to store
// unique URLs. The search time for this list is O(n).
// To speed that up to O(1) we use the hash table. The
// hash table holds pointers into the list where
// DNODES with the same key are maintained, assuming
// the hash(key) is not NULL (which implies the URL has
// not been seen before). The hash table provide quick
// access to the point in the list that is relevant
// to the current URL search.

extern DICTIONARY *dict;

// This is the table that keeps pointers to a list of URL extracted
// from the current HTML page. NULL pointer represents the end of the
// list of URLs.

char *url_list[MAX_URL_PER_PAGE];

Source Files for Implementing the Crawler

crawler.h- Implementation Spec is captured in the crawler.h file. It defines the important defines, marcos, data structures and prototype functions.

crawler.c- Students add the code starting here. It includes, from the Design Spec (Pseudo code description of the crawler algorithm), inputs and outputs.

In the tarball sent you there is also a README file which we point to here for completeness.

CS 50 Software Design and Implementation

Lecture 12

TinySearch Engine: Crawler Design

Goals

Crawler Design Spec

Implementation Spec

Source Files for Implementing the Crawler