Topic: Databases Date: Oct. 26, 2009 Number: 16 Examples: Database.hs Reading: PS 3 due Wednesday x-hour (optional) review Thurs. Midterm 7:00 Thurs. -- Databases -- Originally databases were basically maps - a key with a lot of related data. They were heirarchically organized. If the question that you wanted to ask had been anticipated and the database organized so that you could easily get the information things were great. If not, there was no good way to get out the specific information that you wanted. In 1970 Codd wrote a paper about relational databases. Data bases are broken into Tables, each of which encodes information about some particular aspect of the data. Each entry is a row, and each row has a column for each type of data. There is a column header to identify the meaning of the data in the column. For example: putTable people personId name gender bday city state 1 Alice F 2000-01-01 Wonderland ?? 2 Elvis M 1935-01-08 Memphis TN 3 Delilah F 1988-04-03 Hanover NH 4 Charlie M 1900-07-14 Strafford VT 5 Bob M 1999-12-31 WRJ VT 6 Flora F 1905-05-01 Stowe VT Adding new types of data is easy - just add a new Table, which expresses some new relationship between parts of the data. putTable interests interestId personId interest 1 1 rabbit holes 2 1 tea 3 1 playing cards 4 2 singing 5 2 donuts 6 4 dairy 7 4 fiddle These tables could have one or more keys (just like maps), and keys must be unique identifiers and never appear in more than one row. In the two tables above, personId is the key to the people table and interestId is key to the interests table. (These were identified as such when the tables were created.) But personID could NOT be a key for the interests table. But how do you tie this information together and find relationships not expressed in a single table? The main new idea was the "join" function - you can find a common field in two different Tables and then create an new table whose rows append a row from the first table and a row from the second table. So if one table has Dartmouth ID numbers and names and another has Dartmouth ID numbers and term grades and a third has Dartmouth ID numbers and addresses, by joining the three tables (well, the first two and then the resulting table with the third) on Dartmouth ID number you could create a table with name, mailing address, and term grades. This is the information needed to mail out grades. We will consider two ways to do this JOIN operation. INNERJOIN innerJoin is where a column is specified in a first table which corresponds to a key in the second table. (It might be something like Dartmouth ID number or SS number or ...) For each row in the first table the value in the designated column is looked up as a key in the second table, and that row is appended to the row from table 1. Note that the final table size is the size of table 1 (unless some value from table 1 does not appear as a key in table 2, in which case no row is created and the new table is smaller than table 1. A row from table 2 may be appended to a number of different rows in table 1, but no row in table 1 is appended to more than 1 row in table 2. Example: interestsPeopleJoin = innerJoin "personId" interests people putTable interestPeopleJoin interestId personId interest personId name gender bday city state 1 1 rabbit holes 1 Alice F 2000-01-01 Wonderland ?? 2 1 tea 1 Alice F 2000-01-01 Wonderland ?? 3 1 playing cards 1 Alice F 2000-01-01 Wonderland ?? 4 2 singing 2 Elvis M 1935-01-08 Memphis TN 5 2 donuts 2 Elvis M 1935-01-08 Memphis TN 6 4 dairy 4 Charlie M 1900-07-14 Strafford VT 7 4 fiddle 4 Charlie M 1900-07-14 Strafford VT (Sorry for the long lines.) This is probably more information than you wanted! Particularly don't want "personId" appearing twice! So what can we do? Can "project" data down to certain columns. PROJECTION Joining tables can lead to long rows, especially if it is necessary to combine a large number of tables to get the necessary relational information. We would like to be able to eliminate duplicated or unneeded columns and create a table that contains only the columns that we want. Projection allows you to specify a list of column names, and it builds a table with all rows but only those columns. Example: peopleInterests = project ["name","interest"] interestsPeopleJoin putTable peopleInterests name interest Alice rabbit holes Alice tea Alice playing cards Elvis singing Elvis donuts Charlie dairy Charlie fiddle Note if pick a column that occurs twice, only copied once: putTable (project ["name", "interest", "personId"] interestsPeopleJoin) name interest personId Alice rabbit holes 1 Alice tea 1 Alice playing cards 1 Elvis singing 2 Elvis donuts 2 Charlie dairy 4 Charlie fiddle 4 Can also use project to re-arrange the order of the columns! JOIN The more general join does a cartesian product. A column is specified in each table, and all possible pairs that share a value are created. This sort of join can result in nothing (no values match) up through the product of the table sizes (every row in both tables has the same value in the specified columns - rather uninteresting, but possible). interestsInterestsJoin = join "personId" "personId" interests interests interestId personId interest interestId personId interest 1 1 rabbit holes 1 1 rabbit holes 1 1 rabbit holes 2 1 tea 1 1 rabbit holes 3 1 playing cards 2 1 tea 1 1 rabbit holes 2 1 tea 2 1 tea 2 1 tea 3 1 playing cards 3 1 playing cards 1 1 rabbit holes 3 1 playing cards 2 1 tea 3 1 playing cards 3 1 playing cards 4 2 singing 4 2 singing 4 2 singing 5 2 donuts 5 2 donuts 4 2 singing 5 2 donuts 5 2 donuts 6 4 dairy 6 4 dairy 6 4 dairy 7 4 fiddle 7 4 fiddle 6 4 dairy 7 4 fiddle 7 4 fiddle There is a problem here, though - projection is difficult, because I have two columns labeled "interest", so can't get both of them! If I project ["interest", "interest", "personId"] interestsInterestsJoin I get: interest interest personId rabbit holes rabbit holes 1 rabbit holes rabbit holes 1 rabbit holes rabbit holes 1 tea tea 1 tea tea 1 tea tea 1 playing cards playing cards 1 playing cards playing cards 1 playing cards playing cards 1 singing singing 2 singing singing 2 donuts donuts 2 donuts donuts 2 dairy dairy 4 dairy dairy 4 fiddle fiddle 4 fiddle fiddle 4 How to get around this? Create new table with the same rows, but different column headers (or at least "interest" changed). interests2 = project ["personId", "interest"] interests interests3 = makeTable ["personId2", "interest2"] (getRows interests2) Nothing commonInterests = join "personId" "personId2" interests2 interests3 putTable commonInterests personId interest personId2 interest2 1 rabbit holes 1 rabbit holes 1 rabbit holes 1 tea 1 rabbit holes 1 playing cards 1 tea 1 rabbit holes 1 tea 1 tea 1 tea 1 playing cards 1 playing cards 1 rabbit holes 1 playing cards 1 tea 1 playing cards 1 playing cards 2 singing 2 singing 2 singing 2 donuts 2 donuts 2 singing 2 donuts 2 donuts 4 dairy 4 dairy 4 dairy 4 fiddle 4 fiddle 4 dairy 4 fiddle 4 fiddle Now have different column headers, so can project if you want. Question - If given a table that pairs an actor with a movie that the actor appeared in, how could you get a table of all pairs of actors that appeared in the same movie (along with the common movie)? SELECTION It is often the case that you don't want the whole data base, only seniors or women or senior women who had WISP internships as first-year students. Relational databases let you express a predicate on the data in certain columns and to filter the table to keep only the rows for which the predicate is true. Question - how do you tell it what to select? Supply a predicate. Keep the row if the predicate is true. Problem - the predicate you want to write will probably want to refer to fields in the row. That is, you want to be able to select values corresponding to column names. But how can you do that? The data base knows about the column names. You shouldn't have to go mucking around figuring out which column position has the data you want. Solution - the select function provides the predicate with two parameters: 1) A row to be tested (keep it return true, ignore it return false). 2) A selector function, which given the row and the column header returns the value in the selected column of the row. So supply a predicate that takes 2 arguments as specified (row, selector function) and returns a value. So the type of select is: select :: SelectPred -> Table -> Table where type SelectPred = Row -> (Row -> ColumnName -> Value) -> Bool An easy example: aliceInterests = select (\r (!) -> r ! "personId" == "1") interests2 personId interest 1 rabbit holes 1 tea 1 playing cards Then can do something like: commonIntAlice = join "personId" "personId2" aliceInterests interests3 personId interest personId2 interest2 1 rabbit holes 1 rabbit holes 1 rabbit holes 1 tea 1 rabbit holes 1 playing cards 1 tea 1 rabbit holes 1 tea 1 tea 1 tea 1 playing cards 1 playing cards 1 rabbit holes 1 playing cards 1 tea 1 playing cards 1 playing cards to get only Alice's interests. (Question - why do the extra rows in interests3 not lead to any lines? The personIds only match when the person is Alice, so others are ignored.) More complicated: maleVermonters = select (\r (!) -> (r ! "state") == "VT" && (r ! "gender") == "M") people personId name gender bday city state 4 Charlie M 1900-07-14 Strafford VT 5 Bob M 1999-12-31 WRJ VT maleVermontContacts = project ["name","city"] maleVermonters me city Charlie Strafford Bob WRJ SQL We have a mini-database implementation, using names similar to the ones in SQL. Most modern databases are relational databases, so learning about them will prepare you to understand them when you come across them. Other fuctions in the database: LOOKUP in keyed table elvis = lookupKey "2" people -- Gets whole row elvisCity = lookupKeyCol "2" "city" people -- selects a column -- from the row *** Got to here in class *** -- More Join examples networks = makeTable ["networkId","network"] [["1","musicians"],["2","food lovers"]] (Just "networkId") membership = makeTable ["personId","networkId"] [["2","1"],["4","1"],["1","2"],["2","2"]] Nothing musicians = project ["name"] $ innerJoin "personId" (innerJoin "networkId" membership $ select (\r (!) -> r ! "network" == "musicians") networks) people elvisNetworks = innerJoin "networkId" (innerJoin "personId" membership $ select (\r (.) -> r . "name" == "Elvis") people) networks elvisNetworkNames = project ["network"] elvisNetworks elvisNetworkIds = map head $ getRows $ project ["networkId"] elvisNetworks elvisId = head $ head $ getRows $ project ["personId"] elvisNetworks friendPred r (!) = (elem (r ! "networkId") elvisNetworkIds) && ((r ! "personId") /= elvisId) elvisFriends = project ["name","network"] $ innerJoin "networkId" (innerJoin "personId" (select friendPred membership) people) networks ------------------------ Now look at the code. Skim most of it; concentrate on joins. module Database (Table, ColumnName, Row, Value, Key, GetCol, SelectPred, makeTable, putTable, lookupKey, lookupKeyCol, select, project, innerJoin, join, sortOnCol, getColNames, getRows, getKeyCol, getKeys) where import qualified Data.Map as Map import Data.List import Data.Ord import Data.Maybe (mapMaybe) import Control.Monad (mapM_) -------------------- -- Table definition data Table = Table {colNames ::[ColumnName], rows ::[Row], key :: (Maybe Key)} deriving Show type ColumnName = String type Row = [Value] type Value = String type Key = (String, Map.Map Value Row) -- Create a table from list of column headers, list of rows, and maybe a key makeTable :: [ColumnName] -> [Row] -> (Maybe String) -> Table makeTable cols rows Nothing = Table cols rows Nothing makeTable cols rows (Just keyCol) = case elemIndex keyCol cols of Nothing -> Table cols rows Nothing Just p -> Table cols rows (Just (keyCol, Map.fromList $ zip (map (!! p) rows) rows)) -------------------- -- Accessor functions -- Get all the column headers from a table getColNames :: Table -> [ColumnName] getColNames = colNames -- Get all data, row by row, from a table getRows :: Table -> [Row] getRows = rows -- Given a Key, get the column name that the table is keyed on getKeyCol :: Maybe Key -> Maybe ColumnName getKeyCol Nothing = Nothing getKeyCol (Just (kc,_)) = Just kc NOTE - you want the NAME, not the actual map saved in the table, because joint, projections, selections, etc. require changing what is saved. -- Given a Table, get a list of all keys in the Table getKeys :: Table -> Maybe [Value] getKeys (Table _ _ Nothing) = Nothing getKeys (Table _ _ (Just (_,mp))) = Just (Map.keys mp) -------------------- -- Printing -- Print out, with columns lined up -- (Could instead set up a Show instance, but that's a bit more -- complicated.) putTable :: Table -> IO () putTable (Table cols rows _) = let lens = map (maximum . map length) (transpose (cols:rows)) spaces = repeat ' ' tabify row = do mapM_ (\(x,l) -> putStr $ take (l+2) (x ++ spaces)) (zip row lens) putChar '\n' in mapM_ tabify (cols:rows) NOTE - lots of complicated stuff here. mapM_ allows you to map IO operations. Most of the work is figuring out column width and a way to "tab" over by adding spaces. -------------------- -- lookup in keyed Table -- Use the key to get a specific row lookupKey :: Value -> Table -> Maybe Row lookupKey key (Table _ _ (Just (keyCol,mp))) = Map.lookup key mp lookupKey _ _ = Nothing -- Use key and column name to get specific value lookupKeyCol :: Value -> ColumnName -> Table -> Maybe Value lookupKeyCol key col table@(Table cols _ _) = case lookupKey key table of Nothing -> Nothing (Just row) -> Just (row !! (posIn col cols)) -------------------- -- Select rows satisfying a predicate -- The predicate takes a row and a column accessor -- a function that -- takes a row and column name and returns the value type GetCol = Row -> ColumnName -> Value type SelectPred = Row -> GetCol -> Bool select :: SelectPred -> Table -> Table select pred (Table cols rows key) = makeTable cols srows (getKeyCol key) where getCol r c = r !! (posIn c cols) srows = filter (\r -> pred r getCol) rows -- Position of an column name in a list of names. -- Function is general, but error message is specific. posIn :: ColumnName -> [ColumnName] -> Int posIn col cols = case elemIndex col cols of Just p -> p Nothing -> error ("Column name " ++ col ++ " does not exist in this table") -------------------- -- Project rows onto particular columns project :: [ColumnName] -> Table -> Table project pcols (Table cols rows key) = makeTable pcols prows (getKeyCol key) where prows = map (getElemsAt (mapMaybe (\c -> elemIndex c cols) pcols)) rows -- Members of xs at positions is getElemsAt :: [Int] -> [a] -> [a] getElemsAt is xs = map (xs !!) is NOTE - Goes through each row, pulling out the things wanted in order. -------------------- -- Sorts the rows of the table according to the values in a given column sortOnCol :: ColumnName -> Table -> Table sortOnCol col (Table cols rows key) = Table cols (sortBy (comparing (!! p)) rows) key where p = posIn col cols NOTE - sortBy takes a "comparator" that returns LT, GT, or EQ. "comparing" applies the same function to both inputs, compares them, and returns LT, GT, or EQ. Have to build a new table, and here is an exception where I can reuse key info. (I have merely re-ordered the rows, and that does not change the contents or how the Map is set up.) -------------------- -- Two varieties of join -- Join the first table with the second, using the specified column -- name from the first column to access the key in the second innerJoin :: ColumnName -> Table -> Table -> Table innerJoin colName1 (Table cols1 rows1 key1) t2@(Table cols2 _ _) = let cols = cols1 ++ cols2 getForeignKey r = r !! (posIn colName1 cols1) rows = mapMaybe (\r -> case lookupKey (getForeignKey r) t2 of Nothing -> Nothing Just r2 -> Just (r ++ r2)) rows1 in makeTable cols rows (getKeyCol key1) -- Join on any two columns. Takes cartesian product of all lines with matching -- values in the two specified columns in the two tables. -- The function first sorts the tables to speed up the processing by merging join :: ColumnName -> ColumnName -> Table -> Table -> Table join colName1 colName2 t1 t2 = let (Table cols1 rows1 _) = sortOnCol colName1 t1 (Table cols2 rows2 _) = sortOnCol colName2 t2 pos1 = posIn colName1 cols1 pos2 = posIn colName2 cols2 joinRows :: [Row] -> [Row] -> [Row] joinRows rw1@(r1:rs1) rw2@(r2:rs2) | r1 !! pos1 < r2 !! pos2 = joinRows (dropWhile (\r -> r1 !! pos1 == r !! pos1) rs1) rw2 | r1 !! pos1 > r2 !! pos2 = joinRows rw1 (dropWhile (\r -> r2 !! pos2 == r !! pos2) rs2) | otherwise = let (eq1, rest1) = span (\r -> r1 !! pos1 == r !! pos1) rw1 (eq2, rest2) = span (\r -> r2 !! pos2 == r !! pos2) rw2 in [x ++ y | x <- eq1, y <- eq2] ++ joinRows rest1 rest2 joinRows _ _ = [] -- When either list empty, join is also in makeTable (cols1 ++ cols2) (joinRows rows1 rows2) Nothing NOTE - sort and then merge. Throw out smaller if not equal, if are equal break each row list into the equal parts and the rest, cartesian product on equal parts, append a recursive call on the rest.