Topic: Parsing
Date: Oct. 30, 2009
Number: 18
Examples: Parser.hs
Reading:

-- Parsing --

Parsing is a major topic in CS - it comes up all the time. Parsing means reading a text document and translating it into a form that is more usable to a computer (or occasionally doing something with the text directly).

Consider the relative utility of the string

    "(9*5 - 36/3)*(2 + 4)"

versus the expression tree:

    expr = ((C 9 :* C 5) :- (C 36 :/ C 3)) :* (C 2 :+ C 4)

They represent the same expression, but the latter can easily be evaluated using the "evaluate" function in Trees.hs. The former cannot. Parsing is about getting from the former (the string representation) to the latter (the expression tree representation).

Programs, web pages (HTML and kin), LaTeX documents, regular expressions in "grep", etc. all need to be read and interpreted in order to run them (programs), display them (web pages), format them (LaTeX and other document formatters), etc.

How do we go about doing this? The usual way is to use a grammar. A version of a grammar for a simple subset of English:

    sentence = NP VP
    NP       = Name | Det Adj Noun
    VP       = IVerb | TVerb NP
    Name     = "scot" | "chris" | "tom" | "linda" | "sue" | ...
    Det      = "the" | "a"
    Adj      = "happy" | "hungry" | "blue" | "fast" | ...
    Noun     = "person" | "cat" | "dog" | "chair" | ...
    IVerb    = "sits" | "jogs" | "thinks" | "sleeps" | ...
    TVerb    = "eats" | "watches" | "hits" | "kisses" | ...

Listing things one after another means that they occur in sequence. An "|" is an "or" or "choice" - choose one of them. (Sequencing has higher precedence than choice.) Terminal symbols that actually appear in the sentence (e.g. "tom" and "the") are written in quotes; non-terminals (things that need to be expanded, like Noun or IVerb) are not.

This type of grammar is called BNF (Backus-Naur Form). (In standard BNF the non-terminals would be surrounded by angle brackets, but we will not use this convention.)
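
One way to get a feel for what "sequence" and "choice" mean in the grammar above is to run the grammar forwards and list the sentences it generates. The following is only a sketch: the vocabulary is cut down to one or two words per category (an assumption made to keep the output small), and the names nounPhrases, verbPhrases and sentences are made up for this illustration. Choice becomes list concatenation (++) and sequencing becomes "pick one of each".

```haskell
-- Sketch: enumerate the sentences of the toy English grammar.
-- The word lists are deliberately tiny (an assumption, not the full grammar).
names, dets, adjs, nouns, iverbs, tverbs :: [String]
names  = ["tom", "sue"]
dets   = ["the", "a"]
adjs   = ["hungry"]
nouns  = ["cat"]
iverbs = ["sleeps"]
tverbs = ["watches"]

-- NP = Name | Det Adj Noun
nounPhrases :: [String]
nounPhrases = names ++ [unwords [d, a, n] | d <- dets, a <- adjs, n <- nouns]

-- VP = IVerb | TVerb NP
verbPhrases :: [String]
verbPhrases = iverbs ++ [unwords [v, np] | v <- tverbs, np <- nounPhrases]

-- sentence = NP VP
sentences :: [String]
sentences = [unwords [np, vp] | np <- nounPhrases, vp <- verbPhrases]
```

With this vocabulary there are 4 noun phrases and 5 verb phrases, so sentences has 20 entries, among them "tom sleeps" and "a hungry cat watches tom". Parsing goes the other way: given one of these strings, work out which rules produced it.
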
This sort of grammar is lousy for English, as any of you who ever diagrammed sentences probably learned. But it works really well for computer languages, etc.

Data type declarations in Haskell are basically grammars:

    data SimpleTree = Leaf | Branch SimpleTree SimpleTree
      deriving Show

is very similar to the equivalent BNF for describing a SimpleTree:

    SimpleTree = "Leaf" | "Branch" SimpleTree SimpleTree

They tell us how to build things up, which is basically how to parse them.

In PS 3 we printed hierarchical clusters using parentheses to group them:

    ((CLX FDC) (UPS (ASN HLT)))

But what if we want to build a tree back up from this sort of parenthesized output? This is a job for parsing. You will do this as part of a short assignment, but let's look at a simpler problem: parsing a SimpleTree from parenthesized output. Consider the function:

    parenthesizeTree :: SimpleTree -> String
    parenthesizeTree Leaf = "*"
    parenthesizeTree (Branch t1 t2) =
      "(" ++ parenthesizeTree t1 ++ parenthesizeTree t2 ++ ")"

Running it on:

    tree1 = Branch (Branch Leaf (Branch Leaf Leaf)) Leaf

will produce:

    "((*(**))*)"

(Like what we did with hierclusters, but printing a Leaf as "*" instead of as a stock name.)

We can have BNF for this parenthesized tree notation, also:

    pTree = "*" | "(" pTree pTree ")"

How can we go about converting the string back to tree1? We can do this with a function. The input to the function should be a String (obviously), but what is the output?

First thought: a SimpleTree. But what if the String is badly formed, and does not represent a SimpleTree? Sounds like a good place to use Maybe, and return "Just t" if the string represents the tree t and "Nothing" otherwise.

This would work, but it has a problem. The straightforward way to recognize a Branch is to look for an "(", a SimpleTree, a second SimpleTree, and a ")". How do we find those SimpleTrees? A recursive call seems in order. The recursive call will parse part of the input string into a SimpleTree.
We want to continue parsing the input string at the first character that was not "eaten up" by the recursive call in forming its SimpleTree. But how do we know where that is? We don't, unless we also return an updated input string. So we instead have it return a Maybe (SimpleTree, String), where the string is what is left of the input after a SimpleTree has been parsed. This leads to something like:

    parseTree :: String -> Maybe (SimpleTree, String)
    parseTree [] = Nothing
    parseTree (c:cs) =
      if c == '*' then Just (Leaf, cs)
      else if c == '(' then
        case parseTree cs of
          Nothing -> Nothing
          Just (t1, cs1) ->
            case parseTree cs1 of
              Nothing -> Nothing
              Just (t2, cs2) ->
                if null cs2 then Nothing
                else if head cs2 == ')' then Just (Branch t1 t2, tail cs2)
                else Nothing
      else Nothing

Demo this. Parse "(*(**))" by hand.

Note that this is a lot more complicated than printing the string from the tree! Most of the problem comes from two things:

1) Having to deal with success or failure at each step.
2) Having to pass and keep track of the remaining input string.

We will now look at a Parser module that makes building this kind of parser much easier, by providing functions that produce parsers. That is, they are functions that return functions, and those returned functions are then used to parse strings. Functions that take functions as inputs and modify or combine them to return new functions are called combinators.

We first define a type:

    type Parser a = String -> Maybe (a,String)

Idea - take a string, eat up some number of characters and produce something of type a, and then return that along with the remaining string. If the start of the string does not contain an a, return Nothing.
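
The hand trace of "(*(**))" suggested above can be checked against a small self-contained sketch. It uses the same logic as parseTree, rewritten with pattern matching so the `head`/`tail`/`null` tests disappear; the name parseTreeP is made up for this illustration, and `Eq` is derived only so that results can be compared. Note that its type is exactly what the Parser synonym abbreviates.

```haskell
type Parser a = String -> Maybe (a, String)

-- SimpleTree as in the notes; Eq is an addition for comparing results.
data SimpleTree = Leaf | Branch SimpleTree SimpleTree
  deriving (Show, Eq)

-- Same behaviour as parseTree, written with pattern matching.
-- (parseTreeP is a hypothetical name used only in this sketch.)
parseTreeP :: Parser SimpleTree
parseTreeP ('*':cs) = Just (Leaf, cs)
parseTreeP ('(':cs) =
  case parseTreeP cs of
    Nothing        -> Nothing
    Just (t1, cs1) ->
      case parseTreeP cs1 of
        -- second subtree must be followed by a closing parenthesis
        Just (t2, ')':cs2) -> Just (Branch t1 t2, cs2)
        _                  -> Nothing
parseTreeP _ = Nothing

-- parseTreeP "(*(**))"  gives  Just (Branch Leaf (Branch Leaf Leaf), "")
-- parseTreeP "*)"       gives  Just (Leaf, ")")   -- leftover input is returned
-- parseTreeP "(*"       gives  Nothing            -- second subtree is missing
```

The second example is the important one: the parser stops after the first complete tree and hands back ")" untouched, which is what lets the enclosing call carry on from the right place.
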
Simple examples:

    -- Parse a '('
    parseOpenParen :: Parser Char
    parseOpenParen [] = Nothing
    parseOpenParen (c:cs) = if c=='(' then Just (c,cs) else Nothing

    -- Parse a ')'
    parseCloseParen :: Parser Char
    parseCloseParen [] = Nothing
    parseCloseParen (c:cs) = if c==')' then Just (c,cs) else Nothing

Doing this over and over would get very boring. So we generalize this pattern of parsing some literal character c:

    lit :: Char -> Parser Char
    lit c [] = Nothing
    lit c (c2:cs) = if c2==c then Just (c,cs) else Nothing

    -- Example
    parseOpenParen2 = lit '('

Note the type signature: we have written a function that takes a character and returns a FUNCTION. A very powerful idea - combine functions not to perform a task, but to create OTHER FUNCTIONS.

Using this, we can modify the parseTree above to get rid of the "if" tests for "(" and ")".

------------------------------------------------------------------------

    parseBinary1 :: Parser SimpleTree
    parseBinary1 [] = Nothing
    parseBinary1 cs =
      case lit '*' cs of
        Just (_,cs1) -> Just (Leaf,cs1)
        Nothing ->
          case lit '(' cs of
            Nothing -> Nothing
            Just (_,cs1) ->
              case parseBinary1 cs1 of
                Nothing -> Nothing
                Just (l,cs2) ->
                  case parseBinary1 cs2 of
                    Nothing -> Nothing
                    Just (r,cs3) ->
                      case lit ')' cs3 of
                        Just (_,cs4) -> Just (Branch l r, cs4)
                        Nothing -> Nothing

------------------------------------------------------------------------

So far this does not look like an improvement, but wait!

Note that it takes a string and produces a tree (plus the remaining input string). It first tries to parse a Leaf '*'. If that fails, it tries to parse a Branch, which is of the form (SimpleTree SimpleTree). So it looks for '(', then a SimpleTree (recursively), then another SimpleTree, then a ')'. If it succeeds, it returns the tree with the remaining input string. If it fails, it returns Nothing.

It works, but it is a major pain to write. All these case statements, passing on remaining strings, etc. are tedious.
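
Before moving on, it may help to see lit on its own, since "a function that returns a function" is the idea everything else rests on. The sketch below repeats the definitions of Parser and lit so that it can be loaded by itself; the names openParen and starResults are made up for this illustration.

```haskell
type Parser a = String -> Maybe (a, String)

lit :: Char -> Parser Char
lit _ []      = Nothing
lit c (c2:cs) = if c2 == c then Just (c, cs) else Nothing

-- Giving lit only its Char argument yields a parser (a function on Strings).
openParen :: Parser Char
openParen = lit '('

-- openParen "(**)"  gives  Just ('(', "**)")
-- openParen "**"    gives  Nothing
-- openParen ""      gives  Nothing

-- Because (lit '*') is an ordinary function, it can be passed around,
-- for example mapped over several inputs.
starResults :: [Maybe (Char, String)]
starResults = map (lit '*') ["*)", "(*"]
-- starResults is [Just ('*', ")"), Nothing]
```
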
Build up functions that will take parsers and combine them or do other things with them to make new parsers. The first thing that we saw in the grammar above was sequencing, so build a function to do this.

    -- Sequence operation - Takes two parsers, recognizes whether a string
    -- starts with a substring recognized by the first followed by a string
    -- recognized by the second. Returns ordered pair if succeeds.
    infixl 6 #
    (#) :: Parser a -> Parser b -> Parser (a,b)
    (m # n) cs =
      case m cs of
        Nothing -> Nothing
        Just (p,cs1) ->
          case n cs1 of
            Nothing -> Nothing
            Just (q,cs2) -> Just ((p,q),cs2)

So the case statements get put into here. Note that it takes two Parsers and applies them sequentially (if the first succeeds, the second is applied to the remaining string passed on from the first). It returns an ordered pair of the results (along with the remaining string), or Nothing if either fails.

So for example, if we want '(' followed by ')':

    parseOC cs = (lit '(' # lit ')') cs
    parseOC2 = lit '(' # lit ')'

(parseOC2 does not need the parentheses: function application binds tighter than any operator, so lit '(' and lit ')' are grouped before # at precedence 6 combines them. parseOC needs them only because the combined parser is applied to cs.)

But so far we have only been dealing with Chars and Strings. How do we get Trees? Or do other things to process the output of a Parser?

    -- Transforms the output of a parse by applying a function
    infix 5 >->
    (>->) :: Parser a -> (a->b) -> Parser b
    (m >-> f) cs =
      case m cs of
        Nothing -> Nothing
        Just (a, cs1) -> Just (f a, cs1)

Note that this is sort of like the map function, but instead of mapping a function onto each item in a list we map the function onto the output of the parser.

    -- An example
    parseTreeDup = parseTree >-> (\t -> Branch t t)

Often we want to sequence things, but don't care about one part. For instance, skip white space and then find a word. The word is what we want; the white space was just an obstacle. Or '(' - once we find it we don't need to use it for anything.
So create operators for throwing away the first or second part (whichever side has the '-'):

    (-#) :: Parser a -> Parser b -> Parser b
    m -# n = m # n >-> snd

    (#-) :: Parser a -> Parser b -> Parser a
    m #- n = m # n >-> fst

    -- An example: only keeps ')'
    parseParen2 = lit '(' -# lit ')'

The second thing we did in grammars above was to make a choice.

    -- Alternatives - Finds if either succeeds. If both succeed,
    -- choose first.
    infixl 3 !
    (!) :: Parser a -> Parser a -> Parser a
    (m ! n) cs =
      case m cs of
        Nothing -> n cs
        mcs -> mcs

Here mcs will be a "Just" something, and we pass it on. If m fails, return whatever n would do on the ORIGINAL STRING.

    -- An example
    parseParen = lit '(' ! lit ')'

------------------------------------------------------------------------

    -- Now our nice clean parser, built with these components
    parseBinary2 :: Parser SimpleTree
    parseBinary2 =
        lit '*' >-> (\a -> Leaf)
      ! lit '(' -# parseBinary2 # parseBinary2 #- lit ')'
          >-> (\(a,b) -> Branch a b)

This IS a big win over our original parser. Why? The "case" statements and passing on the remaining string are done by the various parser combining functions. Note that we don't handle the string anywhere! It is passed on "under the hood".

------------------------------------------------------------------------

Note how much simpler this is to write (and read!) than the one above. We choose between the literal '*' and '(' SimpleTree SimpleTree ')'. If we find a '*', we create a Leaf using >->. If we find the other, we build a tree out of the two recursive calls and throw away the parentheses.

This can't handle things like:

    ((CLX FDC) (UPS (ASN HLT)))

The short assignment looks at that.
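
For trying things out, the pieces above can be collected into one file. This is a sketch assembled from the definitions in these notes, not the course's Parser.hs, and it makes a few assumptions that are flagged in the comments: `Eq` is derived on SimpleTree so results can be compared, -# and #- are left without fixity declarations (as in the text above), and parseWhole is a made-up helper that insists the entire input is consumed.

```haskell
type Parser a = String -> Maybe (a, String)

-- Eq is an addition for this sketch, so that results can be compared.
data SimpleTree = Leaf | Branch SimpleTree SimpleTree
  deriving (Show, Eq)

infixl 6 #
infix  5 >->
infixl 3 !
-- (-#) and (#-) have no fixity declaration here, so they default to
-- infixl 9 and bind tighter than (#). parseBinary2 still groups as
-- (lit '(' -# p) # (p #- lit ')'), which has the intended type.

lit :: Char -> Parser Char
lit _ []      = Nothing
lit c (c2:cs) = if c2 == c then Just (c, cs) else Nothing

(#) :: Parser a -> Parser b -> Parser (a, b)
(m # n) cs = case m cs of
  Nothing       -> Nothing
  Just (p, cs1) -> case n cs1 of
    Nothing       -> Nothing
    Just (q, cs2) -> Just ((p, q), cs2)

(>->) :: Parser a -> (a -> b) -> Parser b
(m >-> f) cs = case m cs of
  Nothing       -> Nothing
  Just (a, cs1) -> Just (f a, cs1)

(-#) :: Parser a -> Parser b -> Parser b
m -# n = m # n >-> snd

(#-) :: Parser a -> Parser b -> Parser a
m #- n = m # n >-> fst

(!) :: Parser a -> Parser a -> Parser a
(m ! n) cs = case m cs of
  Nothing -> n cs
  mcs     -> mcs

parseBinary2 :: Parser SimpleTree
parseBinary2 =
    lit '*' >-> (\_ -> Leaf)
  ! lit '(' -# parseBinary2 # parseBinary2 #- lit ')'
      >-> (\(a, b) -> Branch a b)

parenthesizeTree :: SimpleTree -> String
parenthesizeTree Leaf           = "*"
parenthesizeTree (Branch t1 t2) =
  "(" ++ parenthesizeTree t1 ++ parenthesizeTree t2 ++ ")"

-- Hypothetical helper: succeed only if the parser consumes all the input.
parseWhole :: Parser a -> String -> Maybe a
parseWhole p s = case p s of
  Just (x, "") -> Just x
  _            -> Nothing

tree1 :: SimpleTree
tree1 = Branch (Branch Leaf (Branch Leaf Leaf)) Leaf
```

In GHCi, `parseBinary2 "((*(**))*)"` gives `Just (tree1, "")` (printed in full), and `parseWhole parseBinary2 (parenthesizeTree tree1)` gives `Just tree1`, so printing and parsing undo each other on this example. `parseBinary2 "*)"` gives `Just (Leaf, ")")`, while `parseWhole parseBinary2 "*)"` gives `Nothing` because of the leftover ")".
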