Topic: Spelling Corrector Date: Nov. 20, 2009 Number: 26 Examples: spelling.hs, EditDistance.hs, big.txt In PS 5 we asked you to do "near-matches" on names. This is a similar problem to writing a spelling corrector. In this class we will solve the problem of writing a spelling corrector. What does it mean for two words or two names to be "close" to one another? There are many possible definitions, but an easy one to implement is called "edit distance". It is the number of editing operations needed to convert one word or name into the other. The operations are insert a character, delete a character, replace a character, or swap two adjacent characters (so "thier" can be converted to "their" in one operation). Here is a program that solves this recursively. It looks at the first character of s1 and s2. If they match, then finds edit distance between remaining strings. If not, try all of the possible operations. (Note that match and replace have the same sub-problem, but their contributions to the edit distance differ by one.) ----- import qualified Data.Map as Map -- Straightforward recursive implementation of edit distance. naiveEditDistance :: String -> String -> Int naiveEditDistance s1 [] = length s1 -- Delete remainder of s1 naiveEditDistance [] s2 = length s2 -- Insert remainder of s2 naiveEditDistance s1@(x:xs) s2@(y:ys) = minimum [ if x == y then matchDist else matchDist + 1, -- Match or replace naiveEditDistance xs s2 + 1, -- delete x naiveEditDistance s1 ys + 1, -- insert y swapDist + 1 -- swap ] where matchDist = naiveEditDistance xs ys swapDist = if not (null xs) && not (null ys) && x == head ys && y == head xs then naiveEditDistance (tail xs) (tail ys) else matchDist + 5 -- Won't be smallest ----- Works great, on small strings. But exponential run time growth - look at tree for "catz" and "cots". Works fine. But "Kate Blanchet" and "Cate Blanchett" takes many minutes. So what can we do? Avoid re-computing subproblems. Save in a map, look up first. If find, use. If not, solve problem and add solution to map. Concept is easy, and it is easy to implement in Scheme, where you can save a map and update its values. Psuedocode: memoize f = Create a local variable m to hold an empty map. Return the following function: fmemo params = if params is a key in the map m, return corresponding value else ans = f params update map by inserting (params, ans) return ans When computing f params use fmemo in any recursive calls. Awkward thing in Haskell - since map is immutable, must pass it in and out of every call, being updated on the way. ------ -- Memoized implementation of edit distance; quadratic number of Map -- inserts and lookups. memoEditDistance :: String -> String -> Int memoEditDistance s1 s2 = dist where (dist, mp) = eDist s1 s2 Map.empty -- Helper function -- Note that the map is passed from call to call, being updated on -- the way. eDist :: String -> String -> Map.Map (String, String) Int -> (Int, Map.Map (String, String) Int) eDist s1 [] mp = (length s1, mp) -- Delete remainder of s1 eDist [] s2 mp = (length s2, mp) -- Insert remainder of s2 eDist s1@(x:xs) s2@(y:ys) mp = case Map.lookup (s1, s2) mp of (Just dist) -> (dist, mp) -- If already computed this problem, -- used saved answer Nothing -> let (matchDist, m1) = eDist xs ys mp -- match and replace cases (delDist, m2) = eDist xs s2 m1 (insDist, m3) = eDist s1 ys m2 (swapDist, m4) = if not (null xs) && not (null ys) && x == head ys && y == head xs then eDist (tail xs) (tail ys) m3 -- swap else (matchDist + 5, m3) minDist = minimum [matchDist + (if x == y then 0 else 1), delDist + 1, insDist + 1, swapDist + 1] in (minDist, Map.insert (s1, s2) minDist m4 ) ----- Much faster - all pairs of final segments of strings get tested, but only m*n of those (where m and n are the lengths of input strings). Lookup takes O(lg n + lg m). So can use this to find the nearest name in the actor database to the input actor, or alternately all the ones with the minimum edit distance. Not always what you want - if you use a nickname for the first name the distance might be large (e.g. Bill vs William have edit distance 4). But works fairly well, but is slow going through the whole database, even with memoEditDistance. The general idea is called "dynamic programming". Will see ways to do it without using a Map in CS 25. -- Spelling Corrector Norvig worked up a quick spell corrector while on a plane ride! It is in ML, and the description is at http://www.norvig.com/spell-correct.html. Grzegorz converted it to Haskell. First, some theory. There are two things to be considered. First, how far away are you from a correctly spelled word? Edit distance is a reasonable choice here, although other measures that took into account common typos and mis-spellings would give a more accurate measure. Norvig just looks at words with minimum edit distance. Then how to pick one? Pick the one that is most common in English. How does he determine this? He creates big.txt, a file of a number of books from the Project Gutenberg (128,000 lines, about a million words) and runs statistics! Lots of Sherlock Holmes, a Russian novel, etc. So his algorithm: 1) Find all words of minimum edit distance. 2) Return the one that occurs most frequently in big.txt. Here is the implementation: import Prelude hiding (words) import Data.Char import Data.Ord import Data.Maybe import qualified Data.Map as Map import qualified Data.Set as Set import qualified Data.List as List words = List.words . map (toLower . (\c -> if isAlpha c then c else ' ')) train = List.foldl' (\dict f -> Map.insertWith' (+) f 1 dict) Map.empty edits1 word = let n = length word in Set.fromList $ [take i word ++ drop (i+1) word | i <- [0..n-1]] -- deletion ++ [take i word ++ [word!!(i+1)] ++ [word!!i] ++ drop (i+2) word | i <- [0..n-2]] -- transposition ++ [take i word ++ [c] ++ drop (i+1) word | i <- [0..n-1] , c <- ['a'..'z'] ] -- alteration ++ [take i word ++ [c] ++ drop i word | i <- [0..n-1] , c <- ['a'..'z'] ] -- insertion known_edits2 nwords word = Set.fromList [e2 | e1 <- Set.elems (edits1 word) , e2 <- Set.elems (edits1 e1) , e2 `Map.member` nwords ] known nwords = Set.intersection (Map.keysSet nwords) correct nwords word = let candidates = fromJust $ List.find (not . Set.null) [ known nwords (Set.singleton word), known nwords (edits1 word), known_edits2 nwords word, Set.singleton word ] in List.maximumBy (comparing (\c -> Map.findWithDefault 1 c nwords)) (Set.elems candidates) getCorrector = do nWORDS <- fmap (train . words) (readFile "big.txt") return (putStrLn . correct nWORDS) So about 20 lines of code, not including import statements. Let's see what it does. First, words is a list of words, mapped to lower case, all non-Alpha characters replaced by ' '. Like what we did the first week or two. He then creates a dictionary with frequency counts. Note Map.insertWith'. Also foldl'. What is the ' for? These are strict versions of the functions. Strict is the opposite of lazy. You evaluate all of your arguments, whether you need them or not. Need here to prevent blowing the stack. (Keeping around computations in progress can be larger than the resulting answers.) Map.insertWith' adds 1 each time a word is found. So train computes our frequency count. (Note currying - the list parameter not included.) Instead of computing edit distance to each word, reversed the process. Computes all things with edit distance 1, then computes all things that are edit distance 1 from them to get all things of edit distance 2. Filters that to be only words in the dictionary provided (otherwise would be lots of them.) Did some experiments that showed 99% of the time mis-spellings are within edit distance 2 of the correct word. Note way he generated all the options using list comprehensions and take and drop. He then intersects this with his list of words. (Well, actually with the keyset of the Map, so duplicates are eliminated.) Finally, he looks for the first set (edit distance 0, 1, 2, then just take the word) that contains a word. "known" just does set intersection. fromJust, List.find operations new, but easy to figure out (or look up). Finally, gets the word that appears most often. Map.findWithDefault returns 1 for the count of any word not in the dictionary. The getCorrector is interesting. Creates a method that is used: *Main> sc <- getCorrector *Main> sc "speling" spelling So use <- to assign sc to the results of getCorrector, which is a function in the IO monad. Principal type: getCorrector :: IO (String -> IO ()) So sc will be String -> IO (). (The <- strips off the outermost IO.) Note that fmap is a "Functor map" - it is a generalized version of maps within modad. We will see the type class Functor later. Final questions - how to improve the spell checker. 1) Use a better "distance model" that takes into account types of errors and their relative frequencies, common mis-spellings. 2) Take word in context. "ther cat" vs "ther is". First case likely to be "the" or "their", second "there".