Lab 6 - Querier
Due Wednesday May 18, at 10pm
In this lab you’ll continue the Tiny Search Engine (TSE) by coding the Querier according to the Requirements Spec in the starter kit.
Grading will focus on CS50 coding style - including consistent formatting, selection of identifier names, and use of meaningful comments - in addition to correctness, testing, and documentation.
Your C code must compile without producing any compiler warnings. You will lose points if the compiler produces warnings when using our CS50-standard compiler flags.
If your submitted code fails to compile, or triggers a segmentation fault, you will fail all/some of our correctness tests. Write defensive code: each function should check its pointer parameters for NULL, and take some appropriate (safe) action. Write solid unit tests, test drivers, and use regression testing as your development proceeds.
If valgrind reports memory errors or memory leaks when querier exits normally, you will lose points.
Preparation
- Start with the same repository you used for Lab 5. Before you begin, make sure you submitted Lab 5 correctly, and pushed the tag to your GitHub remote.
- Check to ensure your local repo is clean with `make clean` and that everything has been committed and pushed according to `git status`.
- Do not proceed if you have uncommitted changes or unpushed commits. Seek help if you need to sort out your repo or GitHub.
Assignment
Write the third sub-system of the Tiny Search Engine, the querier. Your design and implementation must follow the Querier Requirements Spec (aka “the Specs”), and make good use of our abstract data structures.
As you create files and directories, add them to your git repository.
Important: do not add any compiled files to git.
In short: your repo should not contain any data files, editor backup files, core dumps, or files built with make and removed by make clean.
Commit and push at least once a day.
In your tse-xxxx directory,
- Review the `README.md` file; add any top-level comments you need to communicate to the grader.
- Review `Makefile`; you may need to uncomment the lines for `querier`.
In the querier subdirectory,
- Write a file `DESIGN.md` to provide the Design Spec for querier, drawing inspiration from the lecture notes.
- Write a file `IMPLEMENTATION.md` to provide the implementation spec and testing plan for querier. See lecture notes about what should be in an implementation spec.
- Write a `README.md` file to provide any special information about how to compile or test your querier, any assumptions you made while writing the querier, and indicating how much of the functionality you implemented, as described below.
- Write a program `querier.c` according to the Specs and your Design.
- Your program shall leverage common code from `common.a` in the `../common` directory:
  - a module `index.c` that contains all the logic for saving and loading index files (this module is common between the Indexer, Querier, and indextest);
  - a module `pagedir.c` that contains all the logic for saving pages to a crawler output directory, and for loading pages from a crawler output directory (this module is common between the Crawler and Indexer);
  - a module `word.c` that implements `NormalizeWord` (this module is common between the Indexer and Querier);
  - a `Makefile` that builds `common.a` for use by the programs;
  - a short `README.md` file explaining how to build `common.a`;
  - a `.gitignore` file telling Git to ignore the binary files (object files) in this directory.
- Write a `Makefile` that will, by default, build the `querier` executable program.
- Add a `make clean` target that removes files produced by Make or your tests.
- Add a `make test` target that tests your querier.
- Write a `testing.sh` bash script that can be run by `make test`. This script must include good comments describing your tests. For best results, `make test` should run `bash -v testing.sh`.
- Save the output of your tests with `make test &> testing.out`.
- Write a `.gitignore` file telling Git to ignore the binary files (object files) and other unnecessary files in this directory.
Your submitted repo should now include the following files, plus any others you created for testing:
├── .gitignore
├── Makefile
├── README.md
├── common # added by you (Lab4)
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── index.c # added by you (Lab5)
│ ├── index.h # added by you (Lab5)
│ ├── pagedir.c
│ ├── pagedir.h
│ ├── word.c # added by you (Lab5)
│ └── word.h # added by you (Lab5)
├── crawler # added by you (Lab4) except REQUIREMENTS, DESIGN
│ ├── .gitignore
│ ├── REQUIREMENTS.md
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── README.md
│ ├── testing.sh
│ ├── testing.out
│ └── crawler.c
├── indexer # added by you (Lab5) except REQUIREMENTS
│ ├── .gitignore
│ ├── REQUIREMENTS.md
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── README.md
│ ├── indexer.c
│ ├── indextest.c
│ ├── testing.sh
│ └── testing.out
├── libcs50
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── bag.c
│ ├── bag.h
│ ├── counters.c # if you decided to add your Lab3 solution
│ ├── counters.h
│ ├── file.c
│ ├── file.h
│ ├── file.md
│ ├── hashtable.c # if you decided to add your Lab3 solution
│ ├── hashtable.h
│ ├── jhash.c
│ ├── jhash.h
│ ├── libcs50-given.a
│ ├── memory.c
│ ├── memory.h
│ ├── memory.md
│ ├── set.c # if you decided to add your Lab3 solution
│ ├── set.h
│ ├── webpage.c
│ ├── webpage.h
│ └── webpage.md
├── querier # added by you (Lab6) except REQUIREMENTS, fuzzquery
│ ├── .gitignore
│ ├── REQUIREMENTS.md
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── README.md
│ ├── fuzzquery.c
│ ├── querier.c
│ ├── testing.sh
│ └── testing.out
You shall NOT modify any of the code we provide in libcs50.
You shall NOT commit any data files produced by the crawler, any binary/object files produced by the compiler, backup files, core dumps, etc.
Submission - how
When you are ready for final submission,
- Add all required files using an appropriate `git add` commandline.
- Commit all changes using an appropriate `git commit` commandline.
- Tag: `git tag lab6submit`.
- Push: `git push --tags` to ensure the tags are pushed to the remote. See more about tags.
Use your web browser to view your remote on GitHub and make sure there is a tag lab6submit and the files associated with that tagged commit include everything we’ll need to grade your lab.
We grade the version tagged lab6submit, so it is safe for you to continue work (e.g., to start work on the next lab), even if you commit or push.
We will start grading when we first see a commit with tag lab6submit, and will judge it late if that tag was added after the deadline.
To avoid confusion, please email the graduate TA and the instructor if you want a late submission to be graded.
Your entire codebase must compile with make from the top-level directory and produce no compiler warnings.
We will run valgrind on your program to look for memory leaks; you will lose points for memory leaks or memory errors.
Grading
Lab 6 is scored on the basis of 100 points, with Delivery, Documentation, Style, and Testing comprising most of the points.
“Functionality” represents 40/100 points. In recognition that you might find it best to start simple and slowly enhance your solution as you get the simpler version working, you can earn points as follows:
- 10 points if your querier prints the set of documents that contain all the words in the query; you may treat operators (‘and’, ‘or’) as regular words.
- 20 points if your querier also supports ‘and’ and ‘or’ operators, but without precedence (in mathematical terms, it treats them as left associative, equal precedence operators).
- 30 points if your querier also supports ‘and’ precedence over ‘or’.
- 40 points if your querier also prints the document set in decreasing order by score, thus meeting the full specs.
Partial credit is available, of course, per the judgement of the grader, but above is the coarse-grain rubric.
Please indicate in your querier/README.md which of the above subsets of functionality you implemented.
Hints and tips
Many of the Lab4 hints and Lab5 hints are still relevant.
Processing a query and ranking the results are tricky. We encourage you to start with a simplified case, test it thoroughly, then enhance. A simpler version is easier to code, test, and debug; and when facing a deadline, it's nice to have a less-functional program that works than a fully functional program that doesn't. See the section on Grading above regarding the points allocated as you achieve higher levels of functionality.
Review the lecture notes
There are some examples and design tips in the lecture notes.
Hashtable
How big should your hashtable be? Well, you can know how many words it will store - because the index file has one word per line, and you can count the number of lines in the index file before creating an index data structure and before loading the file into the structure. Just think about how the hash-table size (in slots) might relate to the number of words it will need to store.
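One way to pick a size is to count the lines before loading. Here is a minimal sketch of such a line-counting helper; the name `count_lines` is just for illustration (libcs50's file module offers similar functionality).

```c
#include <stdio.h>

/* Count the number of newline-terminated lines in a file.
 * Returns the line count, or 0 if fp is NULL.
 * Leaves the file rewound, ready to be read again.
 */
int count_lines(FILE* fp)
{
  if (fp == NULL) {
    return 0;                    // defensive: bad parameter
  }
  int lines = 0;
  int c;
  while ((c = fgetc(fp)) != EOF) {
    if (c == '\n') {
      lines++;
    }
  }
  rewind(fp);
  return lines;
}
```

Because the index file has one word per line, this count tells you how many words the hashtable must hold before you size it.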
Parsing queries
We strongly recommend that your code read the entire query (a line of input) into a single string, then tokenize the query string. That is, you should write a function that takes a string and builds an array of words; it should use white space (space or tab) as the delimiter; each word can be normalized (lower-cased) and checked for illegal characters before being added to the array. See one approach at bottom.
Now that all the character-by-character parsing is behind you, and you have an array of words, you can step through the array to print a clean query, that is, with no extraneous spaces and all letters in lower case.
You can then step through the array according to the structure defined in the BNF. Two tips:
- Validate the basic structure: neither the first nor the last word may be an operator, and two operators may not be adjacent. If valid, proceed to the next step; otherwise print a suitable error message.
- Structure your code to follow the structure of the grammar, which has two non-terminals (`query` and `andsequence`), by writing a function that handles each:
  - One function returns the set of documents that satisfy an `andsequence`, looping over the words in the `andsequence`; accumulate an answer (like a running product) as you go.
  - Another function returns the set of documents that satisfy a `query`, looping over calls to the above function for each `andsequence`; accumulate an answer (like a running total) as you go.
See Lecture notes for more hints about how this might work.
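The structural check described in the first tip might be sketched as follows, assuming the query has already been tokenized into an array of words. The names `isOperator` and `validQuery` are illustrative, not required by the Specs.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// true if word is one of the two operators
static bool isOperator(const char* word)
{
  return strcmp(word, "and") == 0 || strcmp(word, "or") == 0;
}

/* Validate the basic structure of a tokenized query:
 * the first and last words may not be operators, and
 * no two operators may be adjacent.
 * Prints an error and returns false on the first violation.
 */
bool validQuery(char* words[], int nWords)
{
  if (words == NULL || nWords == 0) {
    return false;                          // empty query
  }
  if (isOperator(words[0])) {
    fprintf(stderr, "Error: '%s' cannot be first\n", words[0]);
    return false;
  }
  if (isOperator(words[nWords-1])) {
    fprintf(stderr, "Error: '%s' cannot be last\n", words[nWords-1]);
    return false;
  }
  for (int i = 1; i < nWords; i++) {
    if (isOperator(words[i-1]) && isOperator(words[i])) {
      fprintf(stderr, "Error: '%s' and '%s' cannot be adjacent\n",
              words[i-1], words[i]);
      return false;
    }
  }
  return true;
}
```

Running this check first means the two grammar functions can assume a well-formed word array.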
Combining results
Suppose you have one counters object representing the set of documents in which a given word appears, and another counters object representing the set of documents in which another word appears; each counter set is really a set of (docID, count) pairs.
How do you combine them?
Recall Lecture.
If you are processing wordA AND wordB, the set of documents that match both words is the intersection of the two sets, and the score for each document (per the specs) is the minimum of the count for each document.
So you need a way to intersect two counters; we recommend iterating over one set and, for each item, checking whether that document exists in the other set; update the first set according to what you find.
You can use counters_iterate, counters_get, and counters_set to do this, without modifying your counters module.
If you are processing wordA OR wordB, the set of documents that match either word is the union of the two sets, and the score for each document (per the definition above) is the sum of the count for each document.
So you need a way to union two counters; we recommend iterating over the second set and, for each item, checking whether that document exists in the first set; update the first set according to what you find.
Again, you can use counters_iterate, counters_get, and counters_set to do this, without modifying your counters module.
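To illustrate the "minimum" rule for AND without depending on the counters module, here is a standalone sketch over a plain array of (docID, count) pairs; in your querier the same logic would live in an itemfunc passed to counters_iterate, using counters_get and counters_set. The type names are hypothetical.

```c
// a tiny stand-in for one (docID, count) pair from a counters set
typedef struct { int docID; int count; } pair_t;

// return the count for docID in set[], or 0 if absent
// (counters_get likewise returns 0 for a missing key)
static int lookup(const pair_t set[], int n, int docID)
{
  for (int i = 0; i < n; i++) {
    if (set[i].docID == docID) {
      return set[i].count;
    }
  }
  return 0;
}

/* AND: intersect setB into setA, keeping the minimum count.
 * Iterate over setA; a docID absent from setB gets count 0,
 * dropping it from the intersection.
 */
void intersect(pair_t setA[], int nA, const pair_t setB[], int nB)
{
  for (int i = 0; i < nA; i++) {
    int countB = lookup(setB, nB, setA[i].docID);
    if (countB < setA[i].count) {
      setA[i].count = countB;    // minimum of the two counts
    }
  }
}
```

For OR, the analogous helper iterates over the second set and adds each count into the first (a sum rather than a minimum).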
While processing an andsequence you can compute a ‘running product’, that is, a counters object that represents the intersection of everything seen so far in the sequence.
When processing a query, that is, a disjunction of andsequence results, you can similarly compute a ‘running sum’, that is, a counters object that represents the union of everything seen so far in the sequence.
Ranking results
After parsing and interpreting a query, you will likely have a counters object representing the score for each satisfying document.
The counters module does not have a ‘sort’ method or a way to iterate over the items in sorted order.
We suggest you use counters_iterate() in which the itemfunc inserts the items into a new array of structs, each struct representing one satisfying document (docID, score).
Use counters_iterate twice - once to count the number of items in the set so you can allocate an appropriate-sized array, and again to stuff each item into the new array.
If your itemfunc uses an insertion-sort approach to drop the new item into the array so the array is sorted in decreasing-score order, you end the iteration with a nicely sorted array of structs.
Then you can simply loop over the array to print out the list of documents.
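Here is one possible shape for such an insertion-sort itemfunc; the struct names are illustrative, and the function signature matches what counters_iterate expects of an itemfunc.

```c
typedef struct { int docID; int score; } doc_t;

// arg carries the array being filled and how many slots are used so far
typedef struct { doc_t* docs; int nDocs; } docarray_t;

/* itemfunc for counters_iterate: insert (docID, score) into the array
 * so it stays sorted in decreasing-score order (insertion sort).
 * The array must have been allocated large enough; count the items
 * with a first pass of counters_iterate.
 */
void insertSorted(void* arg, const int docID, const int score)
{
  docarray_t* da = arg;
  int i = da->nDocs;
  while (i > 0 && da->docs[i-1].score < score) {
    da->docs[i] = da->docs[i-1];   // shift smaller scores down a slot
    i--;
  }
  da->docs[i].docID = docID;
  da->docs[i].score = score;
  da->nDocs++;
}
```

After the iteration completes, a simple loop over `docs[0..nDocs-1]` prints the ranked result list.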
Testing your querier
Your querier reads queries from stdin, one per line.
You can test it interactively, but to do thorough and repeated testing you can write a collection of little files, each of which contains one or more queries to your querier, and run commands like ./querier < testquery.
You might write a short bash script to run the querier through several such test files.
That script might even compare the output to known-good output, for regression testing.
We provide fuzzquery.c, a fuzz-testing program, for testing your querier.
Its use is optional.
If your indexer never quite worked, never fear. You do not need a working indexer to write or test your querier. Try your querier on the output of our crawler and indexer.
Chopping a string into array of strings
Here’s a nice trick for breaking a single string `char* line` into an array of strings `char* words[]`.
In the example below, the query has two words, and there are leading spaces, separating spaces, and trailing space, terminated by a null.
Because in C a “string” is just a pointer to an array of characters terminated by a null \0 character, we can chop this single string into multiple strings by replacing two of the spaces with nulls, and recording a pointer to the beginning of each word.
We don’t need to allocate any memory for these new strings, or copy these strings - they stay right where they are.

To make this happen, think about sliding two pointers along the array, starting at `line`.
One pointer might be `char* word`, which you slide over to the first non-space; the other pointer might be `char* rest`, which you slide from `word` to the first non-letter.
Squash that character `*rest` with a null, and you’ve created a null-terminated string starting at `word`.
Think carefully about the edge cases, as you construct the loops and slide the pointers.
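A minimal sketch of this two-pointer approach follows; the name `tokenize` and its error convention (return -1 on an illegal character) are illustrative choices, not mandated by the Specs.

```c
#include <ctype.h>

/* Chop line into words in place, recording a pointer to the start of
 * each word in words[] (capacity maxWords). Words are delimited by
 * white space; each word is terminated by overwriting the following
 * delimiter with '\0'. Words are lower-cased as they are scanned.
 * Returns the number of words, or -1 if a non-letter, non-space
 * character is found.
 */
int tokenize(char* line, char* words[], int maxWords)
{
  int nWords = 0;
  char* rest = line;
  while (*rest != '\0' && nWords < maxWords) {
    // slide 'word' over to the first non-space
    char* word = rest;
    while (isspace((unsigned char)*word)) {
      word++;
    }
    if (*word == '\0') {
      break;                     // only trailing spaces left
    }
    // slide 'rest' from 'word' to the first non-letter
    rest = word;
    while (isalpha((unsigned char)*rest)) {
      *rest = tolower((unsigned char)*rest);   // normalize as we go
      rest++;
    }
    if (*rest != '\0' && !isspace((unsigned char)*rest)) {
      return -1;                 // illegal character in query
    }
    if (*rest != '\0') {
      *rest++ = '\0';            // squash the delimiter; word is done
    }
    words[nWords++] = word;
  }
  return nWords;
}
```

Note that `words[]` points back into `line`, so `line` must stay allocated as long as the words are in use.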
ctype
You may find the functions isalpha() and isspace() useful; read the man page.
They are from the ctype.h package.
Turning off the prompt
I found it useful to print a prompt for an interactive user (stdin is a “tty”, aka keyboard), but not print a prompt when the stdin is not a keyboard (it makes my test-output look nicer). I wrapped it in a little function to abstract the details:
```c
#include <unistd.h>   // add this to your list of includes

/* The fileno() function is provided by stdio,
 * but for some reason not declared by stdio.h, so declare here.
 */
int fileno(FILE *stream);

static void
prompt(void)
{
  // print a prompt iff stdin is a tty (terminal)
  if (isatty(fileno(stdin))) {
    printf("Query? ");
  }
}
```
CS50 Spring 2022