CS 23 Software Design and Implementation

Lab5

TinySearch: Indexer Lab

Lab5 is the second in a three part series of labs associated with building your very own search engine - TinySearch.

Now we have coded the crawler and have the files in our TARGET˙DIRECTORY we need to index them. But before you do another thing please read Section 4 on Indexing in “Searching the Web”, Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan (Stanford University). ACM Transactions on Internet Technology (TOIT), Volume 1 , Issue 1 (August 2001). This is important to get a good understanding of what indexing is all about. It represents a critical part of the search engines operation.

Note, you will have to come up with the Design Spec and the Implementation Spec for the Indexer in addition to coding, debugging, and testing it. We will discuss some aspects of the design below including data structures and functions but you are free to come up with your own design - which is a good excerise.

This is a another challenging assignment. It includes data structures of link lists and hash tables and functions. It certainly builds on Lab4. You will be able to reuse code. - refactor.

This lab will cover many aspects of the C that you are becoming more familiar with: structures, pointers, string processing, malloc/free, file operations and management. And, debugging seg faults!

This lab is different from the crawler since it runs as a standalone program and does not interact with other distributed components such as the crawler did (e.g., wget).

You need to provide functional decomposition of the problem into multiple files. You will submit a make file to build the indexer. Also include a bash script that tests your indexer program, as was the case with the crawler.

Due date: 5 PM, Monday May 5. (no extensions other than the optional one shot 48 hour extension)

Submitting assignment: The following sequence of Linux commands should be used to submit your work by the deadline:

Change to your labs directory cd ~/cs23/labs This directory contains your lab5 directory where your solutions are found.

Please make sure that the labs5 directory contains a simple text file, named README, describing anything “unusual” about how your solutions should be located, executed, and considered.

Issue the commands:

tar czvf $USER-lab5.tar.gz lab5

mail the “tarball” to cs23@cs.dartmouth.edu

We will always ACK your submission. If not ping Andrew.

You can use pine text mailer (Program for Internet News and Email) to mail cs23@cs.dartmouth.edu

or

  mutt -s "subject" -a ./$USER-lab5.tar.gz cs23@cs.dartmouth.edu

mutt is The Mutt Mail User Agent

Coding style: Please put comments in your programs to help increase its understanding. Include a header in each C file including the file name; brief description of the program; inputs; and outputs. It is important you write C programs defensively; that is, you need to check the program’s input and provide appropriate warning or error messages back to the user in terms of the program usage. If you call functions from your code make sure you check the return status (if available) and print out meaningful error messages and take suitable action. See testing below.

multiple files and make. Your directory will include a make file that will be used to do a build of your system. As part grading we will do a build using make of your code. There should be no warnings from the compiler.

Writing a bash test script to automatically test your crawler program. You will also be asked to write a test.sh that calls you indexer program. Note, we expect that the test.sh script and the log from the tests will be included in the tarball. Call the test script indexer˙test.sh and the log of what the test prints out should be directed to a file called indexer˙testlog.‘date‘ (i.e., crawler˙testlog.Wed Jan 30 02:17:20 EST 2008). Again, these two files must be included in your tarball. As part of your grade we will run you script against your code and look the new log produced. Please make sure that your test script writes out the name of the test, expected results, and what the systems outputs.

____________________________________________________________________________

Indexer Requirements

The indexer sub-system uses the files created by running your crawler to generate an inverted index (called index.dat) containing the occurrences of words found in the files. The goal of the indexer is to create an index containing identifiers to webpages where a certain word (e.g., fishing) appears including the frequency of the word found in webpages. The identifiers represent the names of the files (i.e., the progressive integer numbers used to name the files saved by the crawler).

The high level requirements of the indexer are as follows. The indexer SHALL (a term here that means requirement) do the following for each file/document found in the TARGET˙DIRECTORY

After all the documents have been processed, the indexer will save the information to a file. This file will be used by the query engine which will be the last in the TinySearch sequence of labs.

The indexer has to be able to recreate a the index for the query engine to use in Lab6. Therefore, the operation of the indexer is as follows:

Note, that the indexer has to create the index from the files (documents) and save it to a file (e.g., file.dat). But it also has to be able to “reload” the index data structure from the file.dat for the query engine to use. So you must write the code to “reload”. I suggest that you write a simple bash script that does the following because you need to test that a index can be written to a file and reloaded - and that it is the same index. The way to do that is to create the index and write the file.dat. Then reload the index and save it to another file copy.file.dat. We will discuss the indexer˙test.sh at the end of this note.

Extra Credit. Another important requirement that we will check is that there are no memory leaks. Your code should always free the structures it mallocs. In the crawler many submitted labs had memory leaks. In this lab we are giving extra credits for code that has no memory leaks which is good coding practice. In Lab6 there will be penalties for memory leaks.

The TinySearch architecture is shown in Figure 1. Observe the crawled webpages saved with unique document ID as the file name. The indexer uses this document ID to build the inverted index and parses the file for words and counts their occurrences which it stores in the data structures discussed in the next section.


PIC

Figure 1: Indexer: reads and parses the saved files and creates an inverted index before writing out the data to a stored file.


Command-line execution

The index command takes the following input as default. Please see the section on the bash script for automatically testing your code. That section describes an extension to the default command line:


./indexer  [TARGET_DIRECTORY WHERE TO PUT THE DATA] [RESULTS FILE NAME]

Example command input

./indexer ~atc/teaching/cs23/notes/tinysearch/pages words.dat

Data Structures

A key design issue is the choice of data structures for storing the index. We need a data structure capable of quickly retrieving the list of documents containing the words and their frequency.

A good design represents a list of elements (defined as WordList composed of WordNodes) where each element (WordNode) stores the information about a word and a list of the documents and the frequency of the word found in the document. The list represents another linked list (defined as DocumentList) where each element in the list (DocumentNodes) contains the document identifier and the number of occurrences inside the document.

The major data structures used by the indexer are as follows. (Note, that you are free design our own structures - these are just for guidance).


typedef struct _DocumentNode {
  struct _DocumentNode *next;        // pointer to the next member of the list.
  int document_id;                   // document identifier
  int page_word_frequency;           // number of occurrences of the word
} __DocumentNode;

typedef struct _DocumentNode DocumentNode;

typedef struct _WordNode {
  struct _WordNode *prev;           // pointer to the previous word
  struct _WordNode *next;           // pointer to the next word
  char word[WORD_LENGTH];           // the word
  DocumentNode  *page;              // pointer to the first element of the page list.
} __WordNode;
typedef struct _WordNode WordNode;

WordNode *WordList;

The inverted list is a WordList. In the following we assume the following mapping:


typedef struct _INVERTED_INDEX {
                                      // Start and end pointer of the dynamic links.
WordNode *start;                      // start of the list
WordNode *end;                        // end of the list
WordNode *hash[MAX_NUMBER_OF_SLOTS];  // hash slot
} __INVERTED_INDEX;

typedef struct _INVERTED_INDEX INVERTED_INDEX;


Logic

When a new word w in a document d is found then addWord() is called; the program checks if a WordNode exists in the list for the word.

If the element does not exist, a new WordNode is created (malloc) with a list composed of a single WordNode (malloc) containing the identifier of the document (d) and the number of occurrences of the word in the document.

If the element exists (i.e., a WordNode containing w exists), the program inserts a new DocumentNode containing the identifier of the document (d) and the number of occurrences of the work in the document.

The complexity of finding an element is linear. In order to speed up the retrieval of finding a word in the WordList, we use a hash table. The word itself can be used as hash key. The hash table itself is an array. Each element of the hash table contains a pointer to a linked list of WordNode (i.e., WordList). When a new word is created, a hash index is computed. The hash function returns a hash index between 0 and the maximum length of the hash table. Clearly, different words might hash to the same index value - this is called a collision which we saw happen when the URL was the key in the crawler program. A new WordNode is added to the hash table.

Consider an example. Assume that the hash index value of the word w corresponds to the position 45 in the hash table. When the element WordNode is created, the element is added at position 45. When the word w has to be retrieved, the hash function is computed, and the list in position 45 is checked (i.e., the program checks all the words in this list in order to find the WordNode with the field w).

The number of slots/bins MAX_NUMBER_OF_SLOTS should be large to to better performance (quicker access to the data structure).

To summarize, there are three possible solutions:

A dictionary (like the one used in crawler) is useful if we have a simple linked list and we want to simply add a hash table to access it without modifying the existing code. The use of the dictionary does not require any modification to the implementation of the linked list: we are only using pointers to the elements of the existing data structure.

Implementation

Start by thinking about the requirements, logic, and the design of the data structure. A good modular design of this sub-system could be based on the following set of prototype functions/pseudo code:


for d in storedDocuments {
 loadedDocument = loadDocument(d);
 documentId = getDocumentId(d);
 currentPosition = 0;
 while (currentPosition = getNextWordFromHTMLDocument(loadedDocument, word, position))
  updateIndex(index, word, documentId);
}

saveIndexToFile(index);

where:

Output

The indexer should write out the index to a file provided by the user.

Save the generated index to a text file in the following format:

cat 2 2 3 4 5  
moose 1 5 7  
...

Let us consider the entry cat 2 2 3 4 5: cat is the word, the number after cat is the number of documents containing the word cat; the following 2 3 means the document with identifier 2 has 3 occurrences of cat in it; the following 4 5 means that the document with identifier 4 has 5 occurrences of cat in it. moose 1 5 7 means that only one document with identifiers 5 has the word moose and moose occurs 7 times in this document.

A snippet of output could be:


address 19 2 1 8 4 21 1 22 1 37 1 38 1 54 1 61 1 72 2 79 3 84 1 85 1
88 1 90 1 109 2 126 1 153 1 159 1 188 1
addressed 2 138 1 140 1
addresses 1 101 1
adelfio 3 33 1 112 1 138 1
adelson 1 104 1
adhere 2 11 1 142 4
adhoc 1 88 1
adirondacks 1 191 1
aditikabir 1 51 1
adjacent 1 8 1
adjudication 1 137 2
adjunct 6 4 1 5 1 43 1 44 1 139 1 142 3
adjuncthood 1 44 1
adjuncts 1 139 1
adjust 2 106 1 153 1
adjusted 1 153 1
adjustment 1 8 1
adjustments 3 150 1 153 1 155 1
adlets 1 88 1
administered 1 3 1
administrative 48 1 1 2 1 3 2 4 1 5 2 6 1 7 2 8 1 9 1 10 1 12 1 13
1 14 1 15 1 16 1 18 1 19 1 20 1 21 1 25 1 55 1 100 1 1 20 3 121 1
137 1 138 1 139 1 140 1 141 1 142 1 143 1 169 2 171 1 172 1 173 1
174 1 175 1 176 1 177 1 178 1 179 1 180 1 18 1 1 182 1 183 1 184
1 185 1 186 1

Illustration of data structures in action

The main goal of the indexer is to keep an account of the words found in a document and its occurrence. If a word is common to a number of documents then the inverted index must keep a list of all documents that are word is found in and their occurence. The figure below shows an example inverted index. Note that the hash(“dog”) has a hash value of 47 and points to a WORKNODE that maintains the word (dog) and is linked into a double linked list where start and end are used to point to the start and end of the double linked list of WORDNODES, respectively.

Note that three DOCUMENTNODEs are linked off the WORKNODE. Note that this is an ordred single linked list (it is ordered by documnent ID). So three files has the word dog. Document 1 (file 1) has 11 instances and documents 2, and 5, 3 and 14 occurences, respectively. Note, that when you hash(“dog”) it returns a hash index o 47. The hash table holds pointers of WORDNODES.

Study this simple example.


PIC

Figure 2: Indexer: reads and parses the saved files and creates an inverted index before writing out the data to a stored file.


Similarity with the Crawler data structures

The INVERTED˙INDEX is the same as the DICTIONARY used by the crawler. I provides a general double linked list and a hash table. This time the key is a work such as dog or cat rather than the URl; for example hashIndex = hash(“dog”);

Start keeps a double linked list of WORDNODE, which are similar to DNODES. The WORDNODE fold the word (dog) and point to DOCUMENTNODES (like URLNODE in some sense). The DOCUMENTNODES maintains the document ID (file name 1 ...N) and the word frequency (how many dogs in document 1). Because other documents may also have the word dog DOCUMENTNODES are linked together in a single linked list for a particular word (dog). There is one WORDNODE per word. and the WORDNODE holds the start of the linked list of DOCUMENTNODES.

Similarity with the Crawler data structures

The index also has to be able to ”reload” the index data structure from the file.dat for the query engine to use. So you must write the code to ”reload”. I suggest that you write a simple bash script that does the following because you need to test that a index can be written to a file and reloaded - and that it is the same index. The way to do that is to create the index and write the file.dat. Then reload the index and save it to another file copy.file.dat

Then use the bash command diff to compare the two files. If they are the same then your code words like a dream.

Automatically testing indexer

You need to write the indexer˙test.sh to do this.

The flow of your script will be:

#!/bin/bash
make clean
make index

DATA˙PATH=../../pages
INDEX˙FILE=index.dat

# We index a directory, recorded it into a file(index.dat) and sort.
# Then we read index.dat to memory and write it back to see
# whether program can read in and write out index storage file correctly.

./index $DATA_PATH $INDEX_FILE $INDEX_FILE reloaded_written_file.dat

# you need to use sort - I’ll explain
echo ”Index have been builded, read and rewrited!”
sort [parameters]
diff [parameters) # check return value of diff
echo ”Index storage passed test!”
else echo ”Index storage didn’t pass test!”
fi
#clean up any files you used
rm -f

Have a think about the structure and implementation of these structures.

OK. You are done.

Tip: Make sure you always logout when you are done and see the prompt to login again before you leave the terminal.