COSC 8 -- Problem Solving with CS

CS 8, Fall 2009

Problem Set 6: Parsing and Manipulating HTML


General Instructions

Before you do anything, please read the entire assignment carefully. It contains many suggestions which can save you a lot of struggle later on.

You should print out hard copy of your code and any example outputs, and either turn them in at lecture or in the TA's mailbox at Sudikoff by class time on Wed. Nov. 18, 2009. Late assignments will be strongly penalized, so please start early.

For this assignment, you may work either alone, or with one (1) other partner. You may not share your code with anyone other than your partner, nor may you look at anyone else's code. See the Course Information for a review of the policies on joint work.

What to Turn In

For CS 8 programming assignments, such as this one, you must turn in a hard copy listing of any procedures you wrote yourself, or modified from the example code provided with the assignment. You must also turn in example output for each section of the problem set, as applicable.

You should not turn in printouts of any code we provided you which you did not make any changes to, and please be sure to label everything carefully, so that the graders will know what they are looking at.

You must also e-mail an electronic version of your code to cs8hw@cs.dartmouth.edu with the subject "CS 8 Problem Set 6", by no later than 2:00 am on the date it is due. Please organize your code in one or more plain text files and attach them as enclosures to your message. Do not include your code directly in the body of the e-mail message! If you are sending more than one file, please make sure the files have descriptive names, and that you include comments that will let the graders know what they are looking at.

Code submitted electronically must be sent as plain text, in one or more files with the filename extension ".hs" or ".lhs" (e.g., mycode.hs). Do not send your programs as Word documents, RTF, or other "formatted" types of files.

Output

For each part of any assignment, we require output which demonstrates that your code correctly solves the problem it is supposed to solve. If you turn in code with no output, you may receive only partial credit, or no credit, for that part of the assignment.

Some problems will give specific test cases you need to provide output for. However, even if specific test cases are not provided, and even if it is not listed under "What to Turn In" for a particular problem, you must provide some sample output that shows how your code behaves on typical inputs.

If you are unable to get your code working, please turn in whatever you do have, so that we can determine if you should receive partial credit. Please do not include output which suggests your code works if it doesn't -- however, if your code works for some of the test cases, and not for others, you should turn in whatever output you are able to generate, even if it is incorrect. You will get credit for the "Testing" part of the score, even if the tests indicate that the program is incorrect.

Output is not required to be submitted electronically, although you are welcome to do so if you wish. If you do choose to submit output electronically, put your output samples in a separate file from your code submissions. Output may be submitted in any reasonable file format, though plain text is easiest to read.


Background Information on HTML

HTML is short for "hypertext markup language". It is the primary language for specifying the content and appearance of web pages. In this assignment we will be looking at what is needed to parse HTML into a tree-like internal structure (called a Document Object Model, or DOM), and then some of the ways that we can manipulate the DOM after it is created.

In this section I will give an introduction to HTML. I will include details about the meanings of tag names and attributes that will not really matter for doing the assignment. However, I hope that learning about HTML and how web pages are constructed will be of interest and will help you appreciate why parsing a web page is useful.

The HTML language consists of text to be displayed and tags that give information about how it is to be displayed. Most tags come in pairs: an opening tag and a closing tag. An example is the "b" tag, which specifies that the text between the opening and closing tags is to be printed in a bold font. For example, <b>This text will be printed in a boldface font.</b> The opening tag consists of a '<' character, the tag name, and a '>' character. The tag itself is an alphanumeric name, in this case "b". The closing tag consists of the '<' character followed by the '/' character, the tag name, and the '>' character.

Tags can be nested. Consider:

  <ol>
     <li><b>First item.</b> Note we close list items.</li>
     <li><b>Second item.</b> Formerly this was not required, but it is required in 
            the new HTML 4.0 standard.</li>
     <li><b>Third item.</b> It simplifies the parsing!</li>
  </ol>

Here there are three different tags: "ol", which creates an ordered list, "li", which designates a list item within a list, and "b", the bold tag that we have seen before.

Some tags also have attributes, which give the browser additional information. An example is the "body" tag: <body bgcolor=white text=black link=green vlink=purple alink=red> Note that after the tag name there is a series of assignments of the form "attributeName=value". What the attributes control depends on the tag type. Here they control the colors used when displaying the document in a browser. The choices here are a white background color, black text, green coloring of links that have not yet been followed, purple coloring of links that have been followed, and red coloring of links when the mouse button is held down while selecting the link.

A particularly useful attribute is named "href", and it is used with the "a" tag. The sequence: <a href="http://www.cs.dartmouth.edu/~cs8/F2009/psets/sa8/sa8.shtml">SA 8</a> specifies that the text between the end of the opening tag and the beginning of the closing tag ("SA 8" in this case) is to be an "anchor", a piece of text that can be named and jumped to or used as a link. The "href" attribute establishes a link to the URL that follows it. Note that this URL is a quoted string.

There are tags, for instance "hr" (for "horizontal rule", a line across the page) that have no text associated with them. It is legal to say <hr></hr>, but it seems a little pointless! HTML allows a condensed form for "self-closing" tags. In this case, a single tag suffices. The form is: <hr/>. The "/>" at the end of the tag says that the tag is self-closing, and there will be no separate closing tag.

A more complicated example of a self-closing tag is the "img" tag for including an image in the web page. For example, <img src="http://www.cs.dartmouth.edu/~cs8/F2009/img/main-title.png" width=503 height=160 alt="COSC 8--Problem Solving with Computer Science"/> has four attributes. The "src" attribute is the URL of the image to be displayed. The "width" and "height" attributes define the size of the box to which the image will be scaled. Finally, the "alt" tag gives a text description of the image, for those browsers which do not download images or those people who cannot see images.

To see a small sample web page that I will ask you to parse and the html that causes it to be displayed, see testing.html. If you select "View Source" under the View menu (at least on some browsers) you will be able to see the source HTML.

Here is a table of some of the more useful tags. For more information look at one of the dozens of HTML tutorials on the web.

One thing that might occur to you is that it would be hard to include '<' or '>' symbols in the text, because they would be confused with tag indicators. There is an escape sequence to handle this: '&' followed by a name followed by ';'. The name for '<' is lt and the name for '>' is gt. There are dozens of names for special symbols (math, Greek letters, etc.). For characters not assigned names you can use #number, where number is the Unicode encoding of the character.

Programming Exercises

Problem 1: Parsing HTML

The BNF for describing an HTML document is:

doc         = open [doc]... close | text | selfClosing
open        = '<' tag [attr]... '>'
close       = '<' '/' tag '>'
selfClosing = '<' tag [attr]... '/' '>'
tag         = anyWord
attr        = attName '=' attValue
attName     = anyWord
attValue    = anyWord | '"' [nonQuote]... '"'
text        = nonLT [nonLT]...
anyWord     = alphanum [alphanum]...

where alphanum is any alphanumeric character (function isAlphaNum tests this), nonQuote is any character except the double quote character '"', and nonLT is any character except '<'. You may assume that there are no spaces between the '<' and the tag in "open" and "selfClosing", no spaces between the '<' and '/' in "close" and the '/' and '>' in "selfClosing", and no spaces around the "=" in "attr".

You are to implement a parser to parse HTML documents according to the grammar given above. Something that we cannot express in the grammar that you should verify is that when parsing a doc as "open [doc]... close" the tag in the open should be the same as the tag in the close. If the names do not match the parse fails.

You will save the parsed HTML in a tree-like structure. The following declarations will create this structure:

type Tag  = String
type Attr = (String,String)
data Node = Element Tag [Attr] [Node]    
          | Text String 
            deriving (Show, Eq)

There are basically two types of nodes. The first is an Element, which is created by either an "open [doc]... close" or a selfClosing. It has the tag name, a list of attributes, and a list of subnodes. These subnodes comprise everything between the open and close that are parsed by the "[doc]...". You may think of this list of subnodes as the children of the Element in the parse tree. For selfClosing this list is empty.

The other type of node is Text. This holds a string that actually appears on the web page. (Elements, on the other hand tell something about how to display things.) Note that Text includes everything between a '>' and the following '<'.

Consider the following simple example: "<html><h1>Introduction</h1><p><b>Hi</b> there, CS 8 student!</p> <h1>Conclusion</h1><p>Bye</p></html>" It will parse into the following structure (formatted for easier reading):


Just (
Element "html" [] [
    Element "h1" [] [
        Text "Introduction"
    ],
    Element "p" [] [
        Element "b" [] [
            Text "Hi"
        ],
        Text " there, CS 8 student!"
    ],
    Element "h1" [] [
        Text "Conclusion"
    ],
    Element "p" [] [
        Text "Bye"
    ]
],
"")

The root of the tree is the "html" tag. It has four children: an "h1" header, a "p" paragraph, a second "h1" header, and a second "p" paragraph. Each child except the second has a Text item as its only child. The second child has two children: a "b" bold tag (which has a Text child) and a Text child. The parse was successful, so there is a Just with the parse as the first element and an empty remaining string.

Implement a doc function that parses HTML. The BNF grammar should tell you what functions are needed and how to construct them. Import your solution to SA 8 or the one that I wrote (ParserX.hs) and use the parsing framework provided. (The sample solution won't be available until SA 8 is turned in!) I strongly suggest implementing the functions implied by the BNF grammar bottom up, and testing each one before proceeding to the next. You should eliminate newline characters immediately before open, after close, and around selfClosing items. (Don't eliminate spaces and tabs, though.) This prevents the '\n' characters that end lines from being turned into unnecessary Text nodes containing only newline characters.

What to Turn In

Turn in your code for the parser. Show tests on individual functions showing that they work correctly even in boundary cases (empty strings, length 1 strings, etc.). Include cases where the parser correctly returns Nothing as well as cases where it succeeds. Show what your parser does on the HTML above and on the file testing.html. Use the following pretty-printing code to print out the structures in a form that is readable. It will print in a form similar to the one above. Your module must import IO for this to work.

-- Pretty-print the parsed HTML
pp   :: Node -> IO ()
pp nd = ppHelp 0 nd where
  -- Compute amount to indent
  tab       :: Int -> String
  tab indent = take (4*indent) (repeat ' ')
  
  -- indent is the indentation level and a node to be pretty-printed follows.
  ppHelp          :: Int -> Node -> IO ()
  ppHelp indent (Text str) = putStrLn (tab indent ++ "Text " ++ show str)
  ppHelp indent (Element tag attrs ns) = 
    do putStrLn (tab indent  ++ "Element " ++ show tag ++ " " 
                  ++ show attrs ++ " [")
       mapM_ (ppHelp (indent+1)) ns
       putStrLn (tab indent ++ "]")

Problem 2: Operations on parsed HTML

The obvious thing to do with a parsed web page would be to display it. However, that is not very practical for a homework problem. Therefore we will ask you to gather information about the structure and to create modified structures. These functions will have form similar to the pp function above. They will each determine what to do at Text nodes. When processing an Element node they will recursively map over all of the children, somehow combining these results, and will figure out what to do with the Element node itself.

Count words

Write a function wordCount that counts the number of words of text in the HTML document. It should not count words such as tags and attributes, only words in Text nodes. To make life easy and to make the word count uniform, use the "words" function in Standard Prelude to break strings into words.

Make a list of all words

Write a function extractWords that returns a list of all words that appear in Text nodes in nodes in the HTML document (again, as determined by the "words" function in Standard Prelude). They should appear in the order that they would appear on the web page. Do not eliminate duplicates.

Make a list of all URLs that appear in the document

Write a function extractURLs that returns a list of all URLs that appear in the document. URLs are associated with "href" and "src" attribute names in attribute lists.

Go from the parsed form back to an HTML document

Write a function toHTML that restores the document to a string of valid HTML. Don't worry about spacing on the page. It is OK to have a self-closing tag like <hr/> converted into opening and closing tags <hr></hr>. The pp function above may give you some ideas.

Redact a document

Suppose that you are hired by an ISP that wants to do more sophisticated parental controls than simply banning web sites. It wants to analyze the content of web pages and to replace all objectionable text by the word "REDACTED" while displaying non-objectionable text.

Write a function redactHTML that takes two parameters: a parsed HTML document and a predicate isObjectionable that returns True if a string contains objectionable material. You should use this predicate to test each Text item in the data structure and replace those that are found objectionable by the string "REDACTED". Note that you either keep all of the text or replace all of the text in a given Text item. You will not have to worry about eliminating only a part of a Text string.

Test your redactHTML by writing a function fourLetter that returns True if any word in the string has four letters. Use the word function to break the string into words. (Note that in doing this punctuation marks like periods will be included in the lengths of words. That is OK.)

Center all h2 headers

Write a function centerH2s that encloses every "h2" element within a "center" element.

What to Turn In

Turn in your code and test runs that show that the functions work correctly. Run each of the functions on small test cases and on the file testing.html. For the centerH2s function pretty-print the results. You might try using toHTML and writeFile to write the file back out and open it in a browser, to see that you can use this representation to modify HTML in useful ways.

Extra Credit

  1. Extracting a common function: All of the functions described above do basically the same thing: they do something to the Text nodes, they recursively map over the children in an Element node and combine the results somehow, and they do some computation on the Element itself. Write a function walkNodes that takes a textf function that is applied to Text nodes, a childf function that is applied to the list derived from mapping the walkNodes function recursively over the children, and an elementf function that is applied to each Element using the newly processed children. Show how to re-write as many of the functions above as you can by supplying functions as parameters to the walkNodes function. If you do this, you can just turn in these versions of your functions.

  2. Remove comments: HTML allows comments beginning with "<!--" and ending with "-->". Right now comments would mess up your parser. Write a parser that will serve as a pre-processor to your parser that removes all comments from a string.

  3. De-bolding: Write a function that removes all "b" tags from a document, but keeps the text that was previously bold. Note that this function must return a list of nodes rather than a single node. A bold tag could have a sequence of items within it, so when it is removed there will be a list of nodes left rather than a single node.

  4. More discriminating ways of dealing with objectionable text: Replacing a text item by "REDACTED" is a rather blunt way of dealing with objectionable text. Find a more clever approach. For example, you might have a list of objectionable words with acceptable equivalents, and process the text by replacing each objectionable word by its equivalent.

  5. Creating a table of contents: Go through the parsed HTML and associate with the ith "h2" tag an anchor with name attribute "menu_i" (where i runs from 1 to the number of "h2" tags). Then create a separate node structure that is a table of contents. This should be a "ul" node. Each item within the list should be an anchor tag. The text in the ith list item is the same as the text in the ith "h2" tag. The "href" attribute in the ith list item is "#menu_i". Return a pair consisting of the modified parsed HTML (with anchors added) and the "ul" node where each list item is an anchor linking to a corresponding anchor near the appropriate h2 node.

  6. Nicely outputting HTML: I asked you to reconstruct the HTML that is represented in the parse tree, but didn't require you to make it neat. Use some of the ideas from the pp function to create a string containing nicely indented HTML. You might also want to find a way to restore a self-closing tag to self-closing form rather than two tags. You could keep a list of self-closing tags, or assume that a tag with no children is self-closing, or modify the data structure to keep track of which tags were self-closing.

What to Turn In

If you choose to do one or more of the extra credit problems, turn in your code and test data showing that the functions work. Show the results of your functions on testing.html. If you implemented the alternate grammar that I emailed to you that deals with whitespace in a better way, then show your results on testing2.html. This file has indented HTML.

The rubric for grading can be found here.

This cheat sheet for parsers may also be useful.



Department of Computer Science, Dartmouth College, Hanover, New Hampshire, USA

Last modified: 17-November-2009