Topic: Composition, Motifs Date: Oct. 7, 2009 Number: 8 Examples: moreHOF.hs, Motifs.hs Reading: Chap. 6 PS 2 due on Wed. SA 5 due on Friday -- Composition and Application Much of the power of functional programs comes from the ability to string a bunch of simple functions together to get something complicated. The "." operator simplifies the notation for this. It avoids lots and lots of nested parentheses, and lets you think about composition in a different way. Combines with currying in interesting ways! (.) :: (b ->c) -> (a -> b) -> a -> c (f.g) x = f (g x) Note that the two functions composed must have a single parameter each! Example of composition from PS 1: Snowflake.hs vs SnowflakeComp.hs Look at the two programs. They are statement-by-statement equivalent, but some statements look rather different! The most dramatic is: drawInWindow w (withColor color (polygon (takeEvens (tail pts)))) Note that this can be viewed as: ((drawInWindow w) . (withColor color) . polygon . takeEvens . tail) pts But binding functions to parameters is stronger than operator binding, so: (drawInWindow w . withColor color . polygon . takeEvens . tail) pts Finally, the "$" is the Apply operator, which says to apply the function to whatever comes after it. It is defined: ($) :: (a -> b) a b f $ x = f x It has very low priority and associates from the right. Using this, we can say: drawInWindow w . withColor color . polygon . takeEvens . tail $ pts See also: let vertices = getVertices center radius drawSmallerFlake vertex = drawSnowflake w (radius/3) colors vertex in do drawStar w vertices color sequence_ (map drawSmallerFlake vertices) This can be written: let vertices = getVertices center radius in do drawStar w vertices color sequence_ . map (drawSnowflake w (radius/3) colors) $ vertices Another example: Consider SA1 example of distance, where we did a long string of nested operations: distance :: [Double] -> [Double] -> Double distance p1 p2 = sqrt (foldl (+) 0.0 (map (^2) (zipWith (-) p1 p2))) This could be viewed as: distance' p1 p2 = (sqrt . (foldl (+) 0.0) . (map (^2))) (zipWith (-) p1 p2) Note (zipwith (-)) p1 p2 is a 2-parameter fuction, so CANNOT compose it. Function application binds stronger than operators, so don't need parentheses. Using "$", the Apply operator, we can say: distanceC p1 p2 = sqrt . foldl (+) 0.0 . map (^2) $ zipWith (-) p1 p2 Also possible to use multiple applications rather than composition: distanceA p1 p2 = sqrt $ foldl (+) 0.0 $ map (^2) $ zipWith (-) p1 p2 If you really want to go wild with composition, could write: distanceC' p1 p2 = sqrt . foldl (+) 0.0 . map (^2) . zipWith (-) p1 $ p2 but this strikes me as less clear, because it hides the symmetric role of p1 and p2 in the expression. Even worse would be to use currying simplification: distanceC'' p1 = sqrt . foldl (+) 0.0 . map (^2) . zipWith (-) p1 -- Perimeters of Shapes Book presents a module to compute the perimeters of shapes. For ellipse uses an iterative process. Look at it; we may discuss it later. But we will do a different application instead. -- Motifs Geneticists want to find genes in chromosomes, which can be viewed as strings of DNA bases a, c, g, and t. Suppose you have a bunch of chromosomes from different animals, all of which are known to perform a certain function. The suspicion is that they all have some gene (pattern of base elements) in common. So what you might want to do is to find a common substring in all the chromosomes, and to postulate that this is the common gene. Unfortunately, genes can vary. For example, we saw that different triples of DNA can encode the same amino acid. But the genes are still very similar. So what biologists want is to find a pattern with some variations allowed. These patterns are called motifs. For each position in the motif there is a score (which could be 0) for each possible letter (a, c, g, t). These scores add up to 1, so they can be viewed as the probability that the given letter occurs in that position. The "score" of a certain substring that is being tested against the motif is the product of the probabilities that the each letter appears with its corresponding value. For example, if m is a 2-character motif (we say its width is 2), then it might be: m = [[('a', 0.5),('c', 0.3),('g', 0.1),('t', 0.1)], [('a', 0.3),('c', 0.1),('g', 0.1),('t', 0.5)]] Each list within the motif is an association list. It has a list of keys, each of which is associated with a value. You look up the key and get back the value. A data structure that supports insertion and this sort of lookup operation is called a map (NOTHING to do with the map FUNCTION). We will see it a lot this term. (Note - the adjacency list in PS 2 is also an association list.) If we have a string s: s = "tatct" we can test the motif against each postion in the string to see where it scores the highest: position substring product score 0 ta 0.1 * 0.3 0.03 1 at 0.5 * 0.5 0.25 2 tc 0.1 * 0.1 0.01 3 ct 0.3 * 0.5 0.15 4 Won't fit 0 So position 1 best matches the motif for this string, but 3 not bad. (LEAVE ON BOARD.) How can we use Haskell to score motifs and find the best match? First, consider simpler problem of finding if an exact string match between a pattern and some substring. What do we do to find if the substring is a prefix of a second string? If it is, the heads of the arguments will be the same and the tail of the first argument will be a prefix of the tail of the second argument. -- Tests to see if first argument is a prefix of the second argument -- (There's actually already a version in List, but good to show.) isPrefix :: String -> String -> Bool isPrefix [] _ = True isPrefix _ [] = False isPrefix (x:xs) (y:ys) = x==y && isPrefix xs ys To see if substring occurs anywhere, test -- Test to see if first argument is a substring of the second argument isIn :: String -> String -> Bool isIn _ [] = False isIn m s = isPrefix m s || isIn m (tail s) Having warmed up, lets figure out how to calculate a score. type Score = Double type ScoreDistrib = [(Char, Score)] type Motif = [ScoreDistrib] m :: Motif m = [[('a', 0.5),('c', 0.3),('g', 0.1),('t', 0.1)], [('a', 0.3),('c', 0.1),('g', 0.1),('t', 0.5)]] s = "tatct" Look up a value in the association list. We require that every letter be defined in the Motif, so don't need to worry about failing to find it. scoreLetter :: Char -> ScoreDistrib -> Score scoreLetter l ((l2,v):lvs) = if l == l2 then v else scoreLetter l lvs scoreLetter l [] = error "lookup failed" BTW, that was really a special case of the more general function lookupSure that we saw in Lecture 3 when converting DNA to proteins. So look up each letter and multiply. First attempt at scoring: score each letter in s against distribution at matching position in m. Use zip to match characters to score distributions for the corresponding place. score1 :: Motif -> String -> Score score1 m s = product (zipWith scoreLetter s m) Note - works as long as the string is at least as long as the motif. -- First attempt at seeing how well the motif matches at different -- places in s: march down s and score each suffix scores1 :: Motif -> String -> [Score] scores1 m s = mapTail (score1 m) s -- Like map, but give the function the whole tail rather than a -- single element. -- I don't think anything like this is predefined, but it's generally -- useful. mapTail :: ([t] -> a) -> [t] -> [a] mapTail _ [] = [] mapTail f l@(x:xs) = f l : mapTail f xs Note the use of @. It lets you bind to an argument twice, once with the pattern before the @ and once with the pattern afterwards. So l is the whole list, x = head l, xs = tail l. Problem with this approach: the last few tails are shorter than the motif. Then zipWith will quit early, and not match the whole motif. Demo "scores1 m s" and show that the last entry is wrong. One fix is to give a 0 score if the string is too short to match the motif. score [] _ = 1 score _ [] = 0 score (sc:scs) (l:ls) = scoreLetter l sc * score scs ls scores m s = mapTail (score m) s Demonstrate that "scores m s" gives the list of values in the table that we computed above. ----------------------------------------------------------- -- Computing a motif from a set of strings If we know where the motif begins in each string, we can figure out the frequencies and thus the probabilities, based on the sample. Let's look at a list of strings and the motif starting at postion 0 in each. type MotifPosition = (String,Int) -- Simple demo ss = ["aactc","taagc","ggaaa","atcaa"] s2ps :: [MotifPosition] s2ps = zip ss [0,0,0,0] So what would the motif of width 3 be? The four substrings to match are: ["aac", "taa", "gga", "atc"] (Note - what if the first position were 1 instead of 0?) First position (location 0 in each string): a, t, g, a Second postion: a, a, g, t Third position: c, a, a, c Count number of appearances of each letter, divide by number of letters. So the correct motif should be: [[('a',0.5),('c',0.0),('g',0.25),('t',0.25)], [('a',0.5),('c',0.0),('g',0.25),('t',0.25)], [('a',0.5),('c',0.5),('g',0.0), ('t',0.0)]] So how did we do this? We took the appropriate length substrings out of each string (starting at the right place!) to get what is really a list of lists of characters [[Char]]. We matched them up position by position (called a transposition of a matrix - the columns become rows), and then counted the frequencies of each letter in the new string. How do we do a transpose? -- Transpose of a list of lists (lol) -- (There's actually already a version in List, but good to see this.) mytranspose ([]:_) = [] mytranspose lol = map head lol : (mytranspose . map tail $ lol) NOTE - the base condition is a bit strange. It says that it is dealing with a list, the first item of which is the empty list. (The rest will be empty lists, also, if all rows have the same length.) ---- GOT TO HERE - EXPLAINED IDEA ---- This function is a bit more complex than the description above - it computes the motif of a list of strings ss which may be a PROPER SUBSET of the strings in s2ps (strings to positions). We eventually want to find a list of MotifPositions: type MotifPosition = (String,Int) The string is a input string and the Int is where the motif begins in that string. -- Find the motif of width w in list of strings ss, starting the motif -- in a string at the position given in s2ps. -- Note that ss can be a proper subset of the strings whose -- MotifPositions are in s2ps. motifAt :: [String] -> [MotifPosition] -> Int -> Motif motifAt ss s2ps w = map computeScoreDistrib . mytranspose $ map (\s -> substring (lookupSure s s2ps) w s) ss Take and drop are define in Standard Prelude. "take n xs" returns the first n items in list xs, "drop n xs" returns what is left after the first n items in xs are removed. -- Substring of length l starting at position p substring :: Int -> Int -> String -> String substring p l = take l . drop p The counting is a bit tricky. We want counts for ALL letters in the alphabet, including those that don't actually appear in the given position. So we do this by appending a list of all letters in the alphabet to the actual items that appear in a given position. We then count each, but reduce the count by 1 for the letter we added. Sort is a built-in function on lists. -- Compute the distribution of letters in the string "instances" to -- get the probability that each appears. -- Note that "acgt" is appended to instances so that every letter in -- the alphabet will appear in the final score distribution (even if -- its count is 0) computeScoreDistrib :: String -> ScoreDistrib computeScoreDistrib instances = let counts = count . sort $ "acgt" ++ instances total = intToScore . length $ instances in map (\(l,c) -> (l, intToScore c / total)) counts Note the use of span below. Span takes a list and breaks it into 2 lists, based on a predicate. As long as the predicate is true, things are put into the first list. Once it is false, that item and the rest are put into the second list. -- Get the count of each letter. Note that it actually reports one less -- than the number of letters, to compensate for appending the alphabet -- to instance in computeScoreDistrib -- Combine a "run" of equal elements into (element, count) -- Based on problem 9 from 99 problems count :: (Eq t) => [t] -> [(t, Int)] count [] = [] count (x:xs) = let (first,rest) = span ((==) x) xs in (x, length first) : count rest -- modification of intToFloat from SOE intToScore n = fromInteger (toInteger n) ------------------------------------------------------------------------ What if we don't know the motif positions? Guess and improve. Start with motifs being at postion 0 in each string. For each string s, remove it from the list. Compute the motif using the rest of the strings (with their positions). (This is why motifAt needed to be able to have ss be a subset of s2ps.) Find the best score for that motif in s, and then update the s2ps with this new position value. Repeat until the positions don't change, or you have done it long enough (in case of loops, etc.). Not guaranteed to find the optimal solution (by whatever measure you use - one might be maximize the minimum score, another the product of the scores). Can get "stuck" in a local optimum position. We are "hill climbing" - always go uphill until any move takes you downhill. But if start on slope of a hill will find the top of that hill rather than the top of a nearby mountain! This function takes a lot of thought, but is not that bad once you figure it out. Note that it iterates via recursive calls. The foldl handles all of the updates in one round. Trace through what it does (high level). On each recursive call, the function first calls: > leaveOneOut ss For our example this gives: [("aactc",["taagc","ggaaa","atcaa"]), ("taagc",["aactc","ggaaa","atcaa"]), ("ggaaa",["aactc","taagc","atcaa"]), ("atcaa",["aactc","taagc","ggaaa"])] Then within the foldl it does: moveBest s ss s2ps w This starts by computing the motif without the first item: > let ma = motifAt ["taagc","ggaaa","atcaa"] s2ps 2 [[('a',0.3333333333333333),('c',0.0),('g',0.3333333333333333),('t',0.3333333333333333)], [('a',0.3333333333333333),('c',0.0),('g',0.3333333333333333),('t',0.3333333333333333)]] Scoring the first string with this motif gives: > scores ma "aactc" [0.1111111111111111,0.0,0.0,0.0,0.0] So position 0 is still the best position for the first string. It next leaves out the second item: > let ma2 = motifAt ["aactc","ggaaa","atcaa"] s2ps 2 giving [[('a',0.6666666666666666),('c',0.0),('g',0.3333333333333333),('t',0.0)], [('a',0.3333333333333333),('c',0.0),('g',0.3333333333333333),('t',0.3333333333333333)]] Scoring the second string with this motif gives: > scores ma2 "taagc" [0.0,0.2222222222222222,0.2222222222222222,0.0,0.0] Now position 1 is better than postion 0 (2 is equally good), so we > let s2ps2 = replace "taagc" 1 s2ps giving [("aactc",0),("taagc",1),("ggaaa",0),("atcaa",0)] Using this on the list without the third string computes: > let ma3 = motifAt ["aactc","taagc","atcaa"] s2ps2 2 giving [[('a',1.0),('c',0.0),('g',0.0),('t',0.0)], [('a',0.6666666666666666),('c',0.0),('g',0.0),('t',0.3333333333333333)]] Scoring the third string with this motif gives: > scores ma3 "ggaaa" [0.0,0.0,0.6666666666666666,0.6666666666666666,0.0] Position 2 is now best for the third string, so we > let s2ps3 = replace "ggaaa" 2 s2ps2 giving [("aactc",0),("taagc",1),("ggaaa",2),("atcaa",0)] Finally, leaving out the fourth string we compute: > let ma4 = motifAt ["aactc","taagc","ggaaa"] s2ps3 2 giving [[('a',1.0),('c',0.0),('g',0.0),('t',0.0)], [('a',1.0),('c',0.0),('g',0.0),('t',0.0)]] Scoring the fourth string using m4 gives: > scores ma4 "atcaa" [0.0,0.0,0.0,1.0,0.0] So we finally > let s2ps4 = replace "atcaa" 3 s2ps3 giving [("aactc",0),("taagc",1),("ggaaa",2),("atcaa",3)] It turns out that further computation will not change things, so we return s2ps4 and the motif: > motifAt ss s2ps4 2 [[('a',1.0),('c',0.0),('g',0.0),('t',0.0)], [('a',1.0),('c',0.0),('g',0.0),('t',0.0)]] Let's look at how the code does this. -- Returns where the motif is in each string, along with the motif definition. -- ss is the list of strings, w is the width of the motif, and maxiter is the -- maximum number of interations that the function will perform. findMotifDet :: [String] -> Int -> Int -> ([MotifPosition],Motif) findMotifDet ss w maxiter = let loos = leaveOneOut ss iter s2ps niter = let s2ps2 = foldl (\s2ps (s,ss) -> moveBest s ss s2ps w) s2ps loos in if niter == maxiter || sort s2ps == sort s2ps2 then (s2ps, motifAt ss s2ps w) else iter s2ps2 (niter+1) in iter (map (\s -> (s,0)) ss) 0 -- Returns a list of pairs of (element, list with element deleted) for each -- possible choice of element. leaveOneOut :: [a] -> [(a, [a])] leaveOneOut [] = [] leaveOneOut (x:xs) = (x,xs) : (map (\(y,ys) -> (y,x:ys)) (leaveOneOut xs)) -- Given the string to update, the remaining strings, the current positions, -- and the motif width, update the position of the motif in the string moveBest :: String -> [String] -> [MotifPosition] -> Int -> [MotifPosition] moveBest s ss s2ps w = replace s (bestPos (motifAt ss s2ps w) s) s2ps -- In list of (key,value) pairs, substitute v for k's current value replace k v [] = [(k,v)] replace k v ((k2,v2):kvs) = if k == k2 then (k,v):kvs else (k2,v2) : replace k v kvs -- Find the best position for motif m in string s bestPos :: Motif -> String -> Int bestPos m s = snd . fmaximum fst $ zip (scores m s) [0..] -- Maximum of a list, according to a function applied to the members fmaximum f l = snd . foldr1 (\x y -> if fst x > fst y then x else y) $ zip (map f l) l -- or (less efficient, since recomputes f) fmaximum2 f l = foldr1 (\x y -> if (f x) > (f y) then x else y) l ------------------------------------------------------------------------ Chris's code also gives an algorithm that is not deterministic. Problem with deterministic algorithm - if one choice scores 51% and another 49%, always choose the 51%. May lead you into to a local optimum which is not the global optimum. Would like the possibility of looking at other choices that also look good. (So in our first example, want to definitely look at the 0.25 case, but 0.15 also worth considering, the other two shouldn't be high on list of considerations.) The new position for the string is a randomly chosen position, based on probablilities according to the score. Idea - compute score for each position, add them up, and choose a position with probability (postion's score)/(total score). Go for a fixed number of steps. Gives you a chance to "get out" of local optima. Can run it a number of times and take the best. We won't look at the details today, because using random numbers gets us into issues that we won't look at until later in the term. May come back then. --------- -- The actual algorithm updates by moving not to the best position, -- but rather to a position chosen randomly with probability according -- to the score. findMotifRand :: [String] -> Int -> Int -> IO ([MotifPosition],Motif) findMotifRand ss w maxiter = let loos = leaveOneOut ss s2ps = map (\s -> (s,0)) ss iter s2ps niter = if niter == maxiter then return (s2ps, motifAt ss s2ps w) else do s2ps2 <- foldM (\s2ps (s,ss) -> moveRandom s ss s2ps w) s2ps loos iter s2ps2 (niter+1) in iter s2ps 0 moveRandom :: String -> [String] -> [MotifPosition] -> Int -> IO [MotifPosition] moveRandom s ss s2ps w = do p <- randomPos (motifAt ss s2ps w) s return (replace s p s2ps) randomPos :: Motif -> String -> IO Int randomPos m s = let sc = scores m s (tot,tots) = mapAccumL (\tot e -> (tot+e, tot+e)) 0 sc in do p <- randomRIO (0::Score, tot) return (snd $ head $ dropWhile (\(t,i) -> t < p) $ zip tots [0..])