Lab 4 - Crawler
Due Friday April 29, at 10pm
In this lab you’ll begin the Tiny Search Engine (TSE) by coding the Crawler according to the Requirements Spec and Design Spec in the starter kit we provided on GitHub.
Grading will focus on CS50 coding style - including consistent formatting, selection of identifier names, and use of meaningful comments - in addition to correctness, testing, and documentation.
Your C code must compile without producing any compiler warnings. You will lose points if the compiler produces warnings when using our CS50-standard compiler flags.
If your submitted code fails to compile, or triggers a segmentation fault, you will fail some or all of our correctness tests. Write defensive code: each function should check its pointer parameters for NULL, and take some appropriate (safe) action. Write solid unit tests and test drivers, and use regression testing as your development proceeds.
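For example, a minimal sketch of that kind of defensive check (the helper function and its parameters here are purely illustrative):

```c
#include <stdio.h>

/* Illustrative only: check pointer parameters for NULL and take a safe
 * action (return without doing anything) rather than dereference them. */
void
logFetch(const char *url, FILE *fp)
{
  if (url == NULL || fp == NULL) {
    return;   // nothing safe to do with a NULL pointer; just refuse
  }
  fprintf(fp, "Fetched: %s\n", url);
}
```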
If valgrind reports memory errors or memory leaks when crawler exits normally, you will lose points.
Preparation
- Accept the assignment, which includes a starter kit.
- (OPTIONAL) Copy your `counters.c`, `set.c`, and `hashtable.c` from your Lab 3 solution to the `libcs50` subdirectory in your brand-new repository; try `make` in that directory to ensure everything works.
Assignment
Write the first sub-system of the Tiny Search Engine: the crawler. Your implementation must follow the Crawler Requirements Spec and the Crawler Design Spec (aka “the Specs”, linked at the top), and make good use of our abstract data structures.
As you create files and directories, add them to your git repository.
Important: do not add any compiled files to git.
In short: your repo should not contain any data files, editor backup files, core dumps, or files built with make and removed by make clean.
Commit and push at least once a day.
In your tse-xxxx directory,
- Edit the `README.md` file to include your name; later, add any top-level comments you need to communicate to the grader.
- Review `Makefile`, which calls Make recursively in `crawler` and `common`; `make` should build everything and `make clean` should clean everything. You may need to comment out the lines for `indexer` and `querier`, for now.
In the crawler directory,
- Write a `README.md` file to provide any special information about how to compile or test your crawler, or any assumptions you made while writing the crawler.
- Write an `IMPLEMENTATION.md` file to describe the implementation of your crawler. See lecture notes about what should be in an implementation spec.
- Write a program `crawler.c` according to the Specs.
- Write a `Makefile` that will, by default, build the `crawler` executable program.
- Add a `make clean` target that removes files produced by Make or your tests.
- Add a `make test` target that tests your crawler.
- Write a `testing.sh` bash script that can be run by `make test`. Read about testing below. This script must include good comments describing your tests.
- Save the output of your tests with `make test &> testing.out`.
- Write a `.gitignore` file telling Git to ignore the binary files (object files) and other unnecessary files in this directory.
In the common directory, assemble code that will eventually be shared by the crawler, indexer, and querier.
For now, that comprises code for initializing the “pageDirectory” and saving webpages there.
(You will find reason to expand this file and this directory when you write the other subsystems.)
- Create files `pagedir.c` and `pagedir.h`; one possible shape for this interface is sketched after this list.
- Write a `Makefile` that will, by default, build the `common.a` library. It must also have a `clean` target that removes files produced by Make.
- Write a `README.md` file to provide any special information about how to compile or use this library. This file is likely to be short.
- Write a `.gitignore` file telling Git to ignore the binary files (object files) and other unnecessary files in this directory.
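As a starting point, here is one possible shape for the pagedir interface; the names `pagedir_init` and `pagedir_save` are our illustration, not something required by the Specs, so adapt them as you see fit.

```c
/* pagedir.h - illustrative sketch only; your interface may differ. */
#ifndef __PAGEDIR_H
#define __PAGEDIR_H

#include <stdbool.h>
#include "webpage.h"

/* Verify that pageDirectory exists and is writable, and mark it with a
 * .crawler file; return true on success, false otherwise. */
bool pagedir_init(const char *pageDirectory);

/* Save the given webpage to pageDirectory/docID, writing its URL, depth,
 * and HTML in the format described by the Requirements Spec. */
void pagedir_save(webpage_t *page, const char *pageDirectory, int docID);

#endif // __PAGEDIR_H
```

Keeping the details of the page-directory format behind a small interface like this is what lets the indexer and querier reuse it later.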
Submission - what
Your submitted repo should include the following files, plus any others you created for testing:
```
├── .gitignore
├── Makefile
├── README.md
├── common                     # added by you
│   ├── .gitignore
│   ├── Makefile
│   ├── README.md
│   ├── pagedir.c
│   └── pagedir.h
├── crawler                    # added by you
│   ├── .gitignore
│   ├── DESIGN.md
│   ├── IMPLEMENTATION.md
│   ├── Makefile
│   ├── README.md
│   ├── testing.sh
│   ├── testing.out
│   └── crawler.c
├── libcs50
│   ├── .gitignore
│   ├── Makefile
│   ├── README.md
│   ├── bag.c
│   ├── bag.h
│   ├── counters.c             # if you decided to add your Lab3 solution
│   ├── counters.h
│   ├── file.c
│   ├── file.h
│   ├── file.md
│   ├── hashtable.c            # if you decided to add your Lab3 solution
│   ├── hashtable.h
│   ├── jhash.c
│   ├── jhash.h
│   ├── libcs50-given.a
│   ├── memory.c
│   ├── memory.h
│   ├── memory.md
│   ├── set.c                  # if you decided to add your Lab3 solution
│   ├── set.h
│   ├── webpage.c
│   ├── webpage.h
│   └── webpage.md
```
You shall NOT modify any of the code we provide in libcs50.
You shall NOT commit any data files produced by the crawler, any binary/object files produced by the compiler, backup files, core dumps, etc.
Submission - how
When you are ready for final submission,
- Add all required files using an appropriate `git add` command line.
- Commit all changes using an appropriate `git commit` command line.
- Tag: `git tag lab4submit`.
- Push: `git push --tags` to ensure the tags are pushed to the remote. See more about tags.
Use your web browser to view your remote on GitHub and make sure there is a tag lab4submit and the files associated with that tagged commit include everything we’ll need to grade your lab.
We grade the version tagged lab4submit, so it is safe to continue working (e.g., to start work on Lab 5) and to commit and push further changes after you tag.
We will start grading when we first see a commit tagged lab4submit, and will judge it late if that tag was added after the deadline.
To avoid confusion, please email the professor and graduate TA if you want a late submission to be graded.
Your entire code base must compile with make from the top-level directory and produce no compiler warnings.
We will run valgrind on your program to look for memory leaks; you will lose points for memory leaks or memory errors.
Hints and tips
libcs50
We provide several modules in the libcs50 directory, which compiles to a library libcs50.a you can link with your crawler.
You should not change any of our code.
You may drop in your set.[ch], counters.[ch], hashtable.[ch] files from Lab 3.
If you’d prefer to use our version of set, counters, and hashtable, that’s totally fine; the starter kit includes a pre-compiled library you can use instead; see libcs50/README.md.
Several of the modules in this directory are documented with a corresponding Markdown file; all of their interfaces are documented by their corresponding .h file.
Testing Crawler
It is your responsibility to test your crawler thoroughly. Consider following some of the testing plan discussed in Lecture.
Some safe seeds you might test include
- http://cs50tse.cs.dartmouth.edu/tse/letters/index.html
- http://cs50tse.cs.dartmouth.edu/tse/wikipedia/index.html
- http://cs50tse.cs.dartmouth.edu/tse/toscrape/index.html
The first two are fine at any depth, but the third is likely too slow to run past maxDepth=2.
The output of our crawler can be found at this link. Keep in mind that our crawler may process URLs in a different order, so your directory may not be identical to ours.
You may pick other test sites - but your seed URL must begin with http://cs50tse.cs.dartmouth.edu/.
Use valgrind and gdb
We’ve provided information about gdb and valgrind; make use of them.
Memory allocation
You may find the memory.c module useful.
Its use is optional, but it is ready to use as part of library libcs50.a.
In Lab 3 we tried to recover gracefully from memory-allocation failures.
In the TSE application programs we'll be lazier: on a NULL return from malloc/calloc, you can print an error and exit with a non-zero exit status.
(You may find the assertp() function, defined by the memory module, to be useful here.)
In such cases, it is ok if valgrind reports your program as having leaks.
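For example, here is one way that lazy pattern might look with assertp(); the helper name `copyURL` and the message text are just illustrative.

```c
#include <stdlib.h>
#include <string.h>
#include "memory.h"   // libcs50 memory module; provides assertp()

/* Illustrative only: duplicate a URL string, printing an error and exiting
 * with non-zero status (via assertp) if memory allocation fails. */
char *
copyURL(const char *url)
{
  if (url == NULL) {
    return NULL;          // defensive: nothing to copy
  }
  char *copy = assertp(malloc(strlen(url) + 1), "copyURL malloc failed");
  strcpy(copy, url);
  return copy;
}
```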
Hashtable of URL
Our design calls for use of a hashtable to store the URLs already seen.
Our hashtable implementation stores a void* item with each key. But if you’re just storing URLs (as keys), what item do you store?
Our hashtable disallows NULL items.
Suggestion: just pass a constant string as the item; even "" will do.
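For example, assuming the libcs50 hashtable interface (where hashtable_insert returns false when the key is already present), a tiny illustrative helper might look like this:

```c
#include <stdbool.h>
#include "hashtable.h"

/* Illustrative only: record a URL in the 'seen' table, using the URL as the
 * key and a constant empty string as the never-used item.
 * Returns true if the URL had already been recorded. */
bool
seenBefore(hashtable_t *seen, const char *url)
{
  // hashtable_insert returns false if the key is already in the table
  // (or if any parameter is NULL), so a failed insert means "seen before"
  return !hashtable_insert(seen, url, "");
}
```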
Pseudocode
Your implementation does not need to follow our pseudocode; this is one reason why you need to write about your implementation choices in IMPLEMENTATION.md.
Our pseudocode refers to a pagefetcher; note this is effectively webpage_fetch().
Note that webpage_fetch() inserts a 1-second delay after every fetch attempt; your Crawler thus gets this behavior for free.
Our pseudocode refers to a pagescanner; yours should leverage webpage_getNextURL().
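Putting those pieces together, a rough sketch of a pagescanner might look like the following; it assumes the libcs50 `webpage`, `bag`, and `hashtable` interfaces, and the function name `pageScan` is ours, not mandated by the Specs.

```c
#include <stdlib.h>
#include "webpage.h"
#include "bag.h"
#include "hashtable.h"

/* Illustrative sketch: extract each URL embedded in the page and queue the
 * internal ones we have not seen before, at one greater depth. */
void
pageScan(webpage_t *page, bag_t *toCrawl, hashtable_t *seen)
{
  int pos = 0;
  char *url;
  while ((url = webpage_getNextURL(page, &pos)) != NULL) {
    // follow only URLs inside the CS50 playground, and only if new
    if (IsInternalURL(url) && hashtable_insert(seen, url, "")) {
      // webpage_new takes ownership of the url string
      webpage_t *next = webpage_new(url, webpage_getDepth(page) + 1, NULL);
      bag_insert(toCrawl, next);
    } else {
      free(url);   // webpage_getNextURL returns a malloc'd string
    }
  }
}
```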
pageDirectory
The Crawler is supposed to determine whether the pathname provided as pageDirectory is indeed a directory, and is writable (meaning, you can create files there).
How?
The simplest method is to create a file there.
If you construct a string filename that is the directory name, followed by a slash '/', followed by a file name like .crawler, you could call fopen(filename, "w"); if it returns NULL (failure), then either the directory does not exist or is not writable.
If it returns non-NULL (success), close the file.
It’s fine to leave .crawler in the page directory.
(Actually, the presence of this file will later help your indexer confirm the pageDirectory is indeed a Crawler-produced directory.)
This sort of detail is one reason we recommend writing a separate pagedir module in the common sub-directory, because its functions will need to be shared by crawler, indexer, and querier, and the details about the structure and format of that directory should not be scattered across those three programs, but rather contained within one clean module.
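For example, a minimal sketch of that check, as it might appear in `pagedir.c` (the name `pagedir_init` matches the illustrative interface suggested earlier; any equivalent approach is fine):

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: confirm pageDirectory is an existing, writable
 * directory by creating a ".crawler" file inside it. */
bool
pagedir_init(const char *pageDirectory)
{
  if (pageDirectory == NULL) {
    return false;                       // defensive: no directory given
  }
  // build the pathname "pageDirectory/.crawler"
  char *filename = malloc(strlen(pageDirectory) + strlen("/.crawler") + 1);
  if (filename == NULL) {
    return false;
  }
  sprintf(filename, "%s/.crawler", pageDirectory);

  FILE *fp = fopen(filename, "w");
  free(filename);
  if (fp == NULL) {
    return false;                       // directory missing or not writable
  }
  fclose(fp);   // success; the empty .crawler file remains as a marker
  return true;
}
```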
CS50 Spring 2022