Topic: Recursive Functions and Pattern Matching
Date: Jan. 7, 2009
Number: 3
Examples: recursiveExamples.hs, dna2proteins.hs, wordLength.hs

-- Built-in functions:

There are many built-in functions provided for you, e.g.,

    +    -- addition                  abs   -- absolute value
    -    -- subtraction               sqrt  -- square root
    *    -- multiplication            exp   -- exponential function (e to the x)
    /    -- division                  log   -- natural logarithm
    ^    -- raise to integral power

    on Ints and Integers: div, mod (also quot, rem)

        The difference shows up on negative numbers. div rounds down
        (so div 5 (-2) => -3) while quot rounds toward 0
        (so quot 5 (-2) => -2). mod and rem are then defined so that

            (div x y) * y + (mod x y) = x
        and
            (quot x y) * y + (rem x y) = x

    sin, cos, tan, asin, ...    -- Trig functions

See p. 342 for list of numerical functions.

-- Recursion in Haskell

Instead of using conditional expressions, we often use pattern matching. The natural recursion for a list is to handle the first item directly and the rest of the list recursively!

    -- Computes the length of a string
    strLength :: [Char] -> Int
    strLength [] = 0
    strLength (c:rest) = 1 + strLength rest

The recursive idea is simple: a string is either empty (length 0) or a list with a first character and the rest. So its length is 1 plus strLength rest.

    []        matches the empty string
    (c:rest)  matches a character (c) and the rest of the list (rest).

Sort of like saying:

    c = head lst
    rest = tail lst

where lst is a parameter. But all done at once!

This sort of matching can be done on constructors. Here the constructor is ":". The pattern match says, "Call the LHS of the last list construction operation c and the RHS of the last list construction operation rest."

Note - Function calls bind tighter than operators! So we don't need () around strLength rest.

Note - MUST have () around the (c:rest) pattern on the left.

    -- listSum implemented recursively (from SOE)
    listSum :: [Double] -> Double
    listSum [] = 0.0
    listSum (num:rest) = num + listSum rest

Same idea, except instead of 1 we add in the value of num, the first item in the list.
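As a quick check of the definitions in this section, the sketch below restates strLength and listSum and traces one evaluation by hand. The helpers divModHolds and quotRemHolds are hypothetical names introduced here (they are not in the course example files); they simply test the two identities given earlier for div/mod and quot/rem.

```haskell
-- Sketch only: strLength and listSum are copied from the notes;
-- divModHolds and quotRemHolds are hypothetical helpers added for illustration.

-- Computes the length of a string
strLength :: [Char] -> Int
strLength []       = 0
strLength (c:rest) = 1 + strLength rest

-- Sums a list of Doubles
listSum :: [Double] -> Double
listSum []         = 0.0
listSum (num:rest) = num + listSum rest

-- Checks (div x y) * y + (mod x y) == x.  Assumes y /= 0.
divModHolds :: Int -> Int -> Bool
divModHolds x y = (div x y) * y + (mod x y) == x

-- Checks (quot x y) * y + (rem x y) == x.  Assumes y /= 0.
quotRemHolds :: Int -> Int -> Bool
quotRemHolds x y = (quot x y) * y + (rem x y) == x

-- Evaluation of strLength "abc" by hand:
--   strLength "abc"
--     = 1 + strLength "bc"
--     = 1 + (1 + strLength "c")
--     = 1 + (1 + (1 + strLength ""))
--     = 1 + (1 + (1 + 0))
--     = 3
```

Loading this in the interpreter and trying mod 5 (-2) and rem 5 (-2) shows the difference directly: the first gives -1 (to go with div's -3) and the second gives 1 (to go with quot's -2).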
-- Reverse a string

Look at the reverse function. The natural way to think of this is to take the first item of the list, reverse what is left, and put the first item at the end. We saw this in CS 5. This leads to a straightforward definition:

    reverseString :: [Char] -> [Char]
    reverseString [] = []
    reverseString (x:xs) = reverseString xs ++ [x]

Note: The book likes to use x as the first item of a list and xs as the rest of the list. We will often follow this convention.

Note: ++ concatenates two lists, making one list out of them. We also say the second list is appended to the first. Note the difference from the ":" list constructor, which combines a single element with a list of elements.

Unfortunately, this takes O(n^2) time: we concatenate to the end of lists of length 0, 1, 2, ..., n-1. This is because the only way to get to the end of the linked list representing the list is to walk all the way to the end. (There is no tail pointer, and even if there were, lists are immutable, like Java Strings. Draw linked list to show what happens.)

How can we do better? Reverse is easy with a stack - pile stuff up on the top, and you end up with a stack in the reverse of the order that things came in. So use this - create an auxiliary function with a parameter to accumulate the reversed list.

    reverseStringFast :: [Char] -> [Char]
    reverseStringFast xs = rev [] xs
        where rev acc [] = acc
              rev acc (x:xs) = rev (x:acc) xs

Keep taking the first thing off the list and putting it on the front of acc, returning acc when done. O(n). Demo by hand.

NOTE: "where". Can be used when defining a function. Of form:

    functionName parm1 parm2 ... = expression
        where bindings

Alternate: "let"

    let bindings
    in expression

Can use anywhere (e.g. after an "="). Allows you to create local names and bind them to values (function definitions are also values). "rev" can't be seen outside of the function. Show not defined at top level.

Also, indentation is significant! Can write Haskell with ";" at the end of statements and "{}" surrounding things like the contents of a where or let.
But it is position-sensitive, with the rules:

1) If you have a statement, lines indented further are part of the same statement. Lines indented the same or less start a new statement.

2) If you have where, let, do, ... that allow multiple statements, the first word after the where, let, ... defines the indentation level for the list of statements. The first line indented less ends the list.

HINT: DO NOT use tabs in your Haskell programs. The number of spaces shown in the editor may be different from the number assumed by the compiler, so groupings can get messed up.

This idea of using an auxiliary helper function with an extra parameter as an accumulator is one we will see often.

-- Convert DNA sequence to amino acids

DNA has 4 bases: A, C, T, and G. These are used to encode proteins. Each possible triple of bases encodes an amino acid (except for "TAA", "TAG", and "TGA", which encode nothing). Most amino acids can be encoded in multiple ways.

We will use a lookup table, called an "association list" or "a-list". It is a list of (key, value) pairs. There is a built-in function lookup that will do this lookup, which we will see later.

So we need to take triples of bases, look them up in the list, and put the corresponding amino acid in the output list. How do we get a triple?
Pattern match the first three items:

    -- Converts a sequence of bases in DNA to a sequence of amino acids
    dna2proteins :: [Char] -> [Char]
    dna2proteins (b1:b2:b3:rest) = lookupSure [b1, b2, b3] codes : dna2proteins rest
        where
            codes :: [([Char], Char)]
            codes = [("ATA", 'I'), ("ATC", 'I'), ("ATT", 'I'), ("ATG", 'M'),
                     ("ACA", 'T'), ("ACC", 'T'), ("ACG", 'T'), ("ACT", 'T'),
                     ("AAC", 'N'), ("AAT", 'N'), ("AAA", 'K'), ("AAG", 'K'),
                     ("AGC", 'S'), ("AGT", 'S'), ("AGA", 'R'), ("AGG", 'R'),
                     ("CTA", 'L'), ("CTC", 'L'), ("CTG", 'L'), ("CTT", 'L'),
                     ("CCA", 'P'), ("CCC", 'P'), ("CCG", 'P'), ("CCT", 'P'),
                     ("CAC", 'H'), ("CAT", 'H'), ("CAA", 'Q'), ("CAG", 'Q'),
                     ("CGA", 'R'), ("CGC", 'R'), ("CGG", 'R'), ("CGT", 'R'),
                     ("GTA", 'V'), ("GTC", 'V'), ("GTG", 'V'), ("GTT", 'V'),
                     ("GCA", 'A'), ("GCC", 'A'), ("GCG", 'A'), ("GCT", 'A'),
                     ("GAC", 'D'), ("GAT", 'D'), ("GAA", 'E'), ("GAG", 'E'),
                     ("GGA", 'G'), ("GGC", 'G'), ("GGG", 'G'), ("GGT", 'G'),
                     ("TCA", 'S'), ("TCC", 'S'), ("TCG", 'S'), ("TCT", 'S'),
                     ("TTC", 'F'), ("TTT", 'F'), ("TTA", 'L'), ("TTG", 'L'),
                     ("TAC", 'Y'), ("TAT", 'Y'), ("TAA", '_'), ("TAG", '_'),
                     ("TGC", 'C'), ("TGT", 'C'), ("TGA", '_'), ("TGG", 'W')]
    dna2proteins _ = []     -- Note 1 or 2 bases at end not converted.

NOTE - "_" matches anything. So the order of the two definitions is important! Use "_" when you won't use the matched value.

    -- Looks up a key in a list of (key, datum) pairs.
    -- Throws an exception if the key is not found.
    lookupSure :: [Char] -> [([Char], Char)] -> Char
    lookupSure str ((key, datum) : rest) =
        if str == key then datum else lookupSure str rest
    lookupSure str _ = error (str ++ " not found")

Note: error throws an exception whose message is the string that follows it.

Also, ++ concatenates strings, as + does in Java. More generally it means "append these two lists into a single list". (Remember that a string is nothing but a list of characters.)

Haskell has "if predicate then value1 else value2" conditional expressions.
Note how pattern matching simplifies taking apart the list and the ordered pair!

-- Finding average word length in a piece of text.

(Will do more clever things later.)

What is needed? Want to

1) Break into words
2) Compute the length of each word
3) Sum the lengths
4) Compute the average length

To break into words when the text has punctuation, etc., we will:

a) Replace newline '\n', return '\r', and tab '\t' by spaces. whitespace2spaces does this.
b) Eliminate all non-letters (keep spaces): filter on isLetterOrSpace.
c) Accumulate a word until we get to a space; make a list of words.

    -- Replaces newlines, returns, and tabs by spaces
    whitespace2spaces :: [Char] -> [Char]
    whitespace2spaces ('\n' : rest) = ' ' : whitespace2spaces rest
    whitespace2spaces ('\t' : rest) = ' ' : whitespace2spaces rest
    whitespace2spaces ('\r' : rest) = ' ' : whitespace2spaces rest
    whitespace2spaces (ch : rest) = ch : whitespace2spaces rest
    whitespace2spaces [] = []

    -- Returns true if the character is a letter or a space
    isLetterOrSpace :: Char -> Bool
    isLetterOrSpace ch =
        elem ch "abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ"

NOTE: elem determines if an item appears in a list. Could have written our own, but...

Also, could have written whitespace2spaces more briefly as an if expression with an elem call as its predicate. How?

    -- Breaks the input string into a list of words
    breakWords :: [Char] -> [[Char]]
    breakWords str =
        breakHelper [] (filter isLetterOrSpace (whitespace2spaces str))

    -- Breaks the input string into words using an extra parameter.
    -- The first parameter is the word accumulated so far,
    -- the second is the remainder of the input text.
    -- Returns a list of words
    -- This version is not as efficient as it could be
    -- (word ++ [letter] walks to the end of word for every letter)
    breakHelper :: [Char] -> [Char] -> [[Char]]
    breakHelper [] [] = []
    breakHelper word [] = [word]
    breakHelper [] (' ' : rest) = breakHelper [] rest
    breakHelper word (' ' : rest) = word : breakHelper [] rest
    breakHelper word (letter : rest) = breakHelper (word ++ [letter]) rest

Somewhat tricky. Idea - if we see a letter, add it to the end of the current word. If we see a space, add the current word to the returned list and start over. But if there is no current word, we must have multiple spaces, so skip.

    -- Computes the average word length of a piece of text.
    averageWordLength :: [Char] -> Double
    averageWordLength str = totalLength / intToDouble (length wordList)
        where wordList = breakWords str
              totalLength = intToDouble (foldl (+) 0 (map length wordList))

NOTE: wordList is computed once and used in two places. Note that expressions in a where or let can refer to other expressions in the same where or let. Can even define mutually recursive functions this way!

    -- Converts an Int to a Double
    intToDouble :: Int -> Double
    intToDouble n = fromInteger (toInteger n)

Here is a case where you would like Haskell's type system to be more forgiving. Can't do / on a pair of Integers or Ints, so we have to convert. Will see later why this works, but each number type has a fromInteger which converts, and the type system is smart enough to figure out what is needed.
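The conversion above can also be written with the Prelude function fromIntegral, which the Haskell Report defines as fromInteger composed with toInteger. The following is a minimal sketch under some assumptions: averageLength is a hypothetical name introduced here, it takes a list of words that has already been split out, and returning 0 for an empty list is a choice made only for this sketch (with no words, the division in averageWordLength would be zero divided by zero).

```haskell
-- Sketch only: averageLength is a hypothetical variant of the
-- averageWordLength computation, not code from the course files.

-- Converts an Int to a Double using the Prelude's fromIntegral
intToDouble :: Int -> Double
intToDouble n = fromIntegral n

-- Average length of the words in a list.
-- Assumption for this sketch: an empty list has average 0.
averageLength :: [[Char]] -> Double
averageLength []       = 0
averageLength wordList = totalLength / intToDouble (length wordList)
    where -- sum adds up the list, playing the role of foldl (+) 0
          totalLength = intToDouble (sum (map length wordList))
```

For example, averageLength ["ab", "abcd"] adds the lengths 2 and 4 and divides by 2 words. The type annotation on intToDouble is what tells fromIntegral which target type to produce.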