COSC 8 -- Problem Solving with CS

CS 8, Fall 2009

Problem Set 3: Cluster Analysis


General Instructions

Before you do anything, please read the entire assignment carefully. It contains many suggestions which can save you a lot of struggle later on.

You should print out hard copy of your code and any example outputs, and either turn them in at lecture or in the TA's mailbox at Sudikoff by class time on Wed. Oct. 28, 2009. Late assignments will be strongly penalized, so please start early.

For this assignment, you may work either alone, or with one (1) other partner. You may not share your code with anyone other than your partner, nor may you look at anyone else's code. See the Course Information for a review of the policies on joint work.

What to Turn In

For CS 8 programming assignments, such as this one, you must turn in a hard copy listing of any procedures you wrote yourself, or modified from the example code provided with the assignment. You must also turn in example output for each section of the problem set, as applicable.

You should not turn in printouts of any code we provided you which you did not make any changes to, and please be sure to label everything carefully, so that the graders will know what they are looking at.

You must also e-mail an electronic version of your code to cs8hw@cs.dartmouth.edu with the subject "CS 8 Problem Set 3", by no later than 2:00 am on the date it is due. Please organize your code in one or more plain text files and attach them as enclosures to your message. Do not include your code directly in the body of the e-mail message! If you are sending more than one file, please make sure the files have descriptive names, and that you include comments that will let the graders know what they are looking at.

Code submitted electronically must be sent as plain text, in one or more files with the filename extension ".hs" or ".lhs" (e.g., mycode.hs). Do not send your programs as Word documents, RTF, or other "formatted" types of files.

Output

For each part of any assignment, we require output which demonstrates that your code correctly solves the problem it is supposed to solve. If you turn in code with no output, you may receive only partial credit, or no credit, for that part of the assignment.

Some problems will give specific test cases you need to provide output for. However, even if specific test cases are not provided, and even if it is not listed under "What to Turn In" for a particular problem, you must provide some sample output that shows how your code behaves on typical inputs.

If you are unable to get your code working, please turn in whatever you do have, so that we can determine if you should receive partial credit. Please do not include output which suggests your code works if it doesn't -- however, if your code works for some of the test cases, and not for others, you should turn in whatever output you are able to generate, even if it is incorrect. You will get credit for the "Testing" part of the score, even if the tests indicate that the program is incorrect.

Output is not required to be submitted electronically, although you are welcome to do so if you wish. If you do choose to submit output electronically, put your output samples in a separate file from your code submissions. Output may be submitted in any reasonable file format, though plain text is easiest to read.


Background Information on Cluster Analysis

"Cluster analysis" seeks to separate objects into groups (or "clusters"), such that the objects within a group are more similar to each other than they are to objects in other groups. For example, we could cluster students by locations of their hometowns, or by which courses they've taken.

This lab studies cluster analysis of "time series". Here, each object has a value for each point in time, over a discrete set of time points. For example, a time series could represent the closing price of a stock each day during a month or the crime rate in a city each week during a year. Cluster analysis of time series provides insights into common (and different) trends, which may in turn be used to target studies or make predictions. For example, we could see which stocks rise and fall together (dot-coms, energy companies, investment banks) or which cities have had similar increases or decreases in crime rates.

We will will view each sample as a dimension, and will think of a time series of n samples as an n-dimensional point. This point will be represented as a Series, which is a list of Float values. We will label each Series with a unique name (e.g. the stock's name or the experiment's name) and call this labeled pair a NamedSeries. A Dataset is then a list of NamedSeries.

An example of nine time series, each consisting of five samples, is:

type Series = [Float]
type NamedSeries = (String, Series)
type Dataset = [NamedSeries]

ds :: Dataset
ds = [("A",[1, 1.5, 1.25, 2.0, 1.74]), ("B",[10, 9, 5, 7, 8]), 
      ("C",[2, 1, 1.5, 3, 1.5]), ("D",[5, 6, 3, 5, 4.5]), 
      ("E",[9, 10, 4, 6, 9]), ("F",[1.4, 2.2, 0.4, 1.1, 0.5]),
      ("G",[9, 10, 4, 6.5, 8.5]), ("H",[4, 7, 5, 3.5, 5]), 
      ("I",[8, 10, 5.5, 6.6, 10])]

It is hard to tell by looking at these series which ones are similar. It is easier if we plot them. Plotting points in five dimensions on a two dimensional screen is difficult, so we represent each time series as a polyline connecting five points, where the x-axis is the dimension (position of the sample in the list) and the y-axis is the value. The Dataset given above then appears:

Time series

We will measure similarity of two Series by their Euclidean distance. (Computing this distance was part of SA 1.) The dataset given above can be split into three clusters using the k-means method described below. The three clusters, plotted in different colors, then appear:

Three clusters

Stock Market Data

We have supplied data on the closing prices of every stock in the SP 500 for the month from Sept. 11, 2008 to Oct. 10, 2008, taken from http://biz.swcp.com/stocks/. (This month was chosen because the market was rather active then - it was the big market meltdown.) About 10 stocks had closing prices for only part of the month, so I did not include them. Unfortunately these include some stocks like Fannie Mae and Lehman Brothers that would have been interesting to have followed, but which stopped trading or went bankrupt during the month. Therefore there are fewer than 500 stocks in the data set.

The data consists of (tickerName, Series) pairs, where the tickername is the abbreviation for the stock name that appears on a ticker tape machine. The data is in the file stockData.txt. (The original file from which this data was extracted is spmohst.txt. I extracted it by using a text editor to replace commas by spaces and then running the program ConvertStocks.hs. Finally I filtered out stocks with fewer than 22 trading days.) Note that this is what we want for the Dataset type defined above.


Programming Exercises

Problem 1: Implementing k-means Clustering

One way to cluster objects is called k-means clustering. The goal is to find k different clusters. Each cluster has a "prototype", which is defined to be the centroid of cluster. (The centroid is computed as follows: the jth position in the centroid is the mean (average) of the jth positions of all the series in the cluster.) We want every series in a cluster to be closer to that cluster's prototype than to any of the other prototypes. Thus a prototype is a good representative for the items in a cluster.

We start the process with k unique time series as the initial prototypes. We then alternate between two steps:

  1. For each time series, find its nearest prototype. This gives a new clustering of the time series.
  2. Compute a new prototype for each new cluster by computing the centroid of the objects in the cluster.

We stop when the cluster assignment doesn't change, or when we've reached a maximum number of iterations. (In June of 2009 a paper was published that showed that in the worst case the time for k-means to converge can be exponential.)

Implement the k-means clustering algorithm. Use this datatype to represent clusters:

type ProtoCluster = (Series,Dataset)

A ProtoCluster pairs a prototype Series with the list of NamedSeries that it represents.

You might find it useful to write functions to perform the following operations:

This approach is similar to the approach for calculating motifs in Motifs.hs . You are not required to use this exact decompostition into functions. If you come up with a different approach feel free to use it.

In order to allow you to compare two lists of prototypes to see if they are the same, you should keep the NamedSeries within a DataSet sorted and keep the ProtoClusters in the list of ProtoClusters sorted. Fortunately lists can be compared to lists, pairs can be compared to pairs, lists of lists can be compared to lists of lists, etc. using the usual equality and inequality operators (e.g. ==), assuming that the types of the things compared are the same.

The lists are compared lexicographically by their items. That means that the first item in the list or tuple is compared to the first item of the second list of tuple (recursively if necessary). If the first items are unequal, the smaller first item is the smaller list or tuple. Otherwise we compare second items. This is how we do alphabetical order. This is even extended to compare trees!

You may find it useful to import the List module, because it has a sort function that will return a sorted version of the list passed to it as a parmeter.

The result of running kmeans on the data given above (after some formatting to make it easier to read) is:

[([1.4666667,1.5666666,1.0500001,2.0333333,1.2466667],
 [("A",[1.0,1.5,1.25,2.0,1.74]),("C",[2.0,1.0,1.5,3.0,1.5]),
  ("F",[1.4,2.2,0.4,1.1,0.5])]),
([4.5,6.5,4.0,4.25,4.75],
 [("D",[5.0,6.0,3.0,5.0,4.5]),("H",[4.0,7.0,5.0,3.5,5.0])]),
([9.0,9.75,4.625,6.525,8.875],
 [("B",[10.0,9.0,5.0,7.0,8.0]), ("E",[9.0,10.0,4.0,6.0,9.0]),
  ("G",[9.0,10.0,4.0,6.5,8.5]), ("I",[8.0,10.0,5.5,6.6,10.0])])]

Plotting this clustering with prototypes in white gives:

Three ProtoClusters

Plotting modules

We have supplied you with the modules that were used to plot the figures above. You may download a zip file here. There is a new version of Shape called ShapeX.hs that adds a Polyline shape. A new version of Draw called DrawX allows the user to specify arbitrary ranges on the x and y axes. (Creating this kind of a drawing program was suggested in class.) It also draws the new PolyLine shape. Finally, a module called DrawClusters has functions that take lists of Datasets or lists of ProtoClusters and draw them, with each Dataset or ProtoCluster in its own color. For ProtoClusters the set of prototypes is also plotted, as above. These functions are:

-- Draws a list of [[Series]], with each [[Series]] drawn in a 
-- different color.
-- Must be at least as many colors as [Series]
-- The last three parameters are the window size and title
drawSeriesScreen :: [[Series]] -> [Color] -> Int -> Int -> 
                      String -> IO () 

-- Draw a list of protoclusters, with the protoypes grouped and 
-- colored with the last color and the clusters colored with the rest.  
-- The length of colors must be one greater than the number of clusters.
-- The last three parameters are the window size and title
drawProtoClustersScreen :: [ProtoCluster] -> [Color] -> Int -> Int -> 
                           String -> IO () 

You should write a Clusters module that implements the algorithms in this part and the next part. Then this module and the DrawClusters module can both be imported into a driver program to test your code and create your output. (It would be wise to test the Clusters module independently when you debug it.)

Your Clusters module should export two utility functions (along with the other functions described in this assignment and whatever else you want to provide):

-- Extracts the Series data from a DataSet
extractSeries :: Dataset -> [Series]

-- Extracts name data from a DataSet
extractNames :: Dataset -> [String]

The first is used by DrawClusters, and the second will prove useful for some of the things that you are asked to do later. Defining such utility functions will make your code easier to read and debug, even though both functions can be quite short.

What to Turn In

Turn in your code that implements k-means clustering. Turn in test runs (including both text and graphical output) that demonstrate that your program works. Turn in the driver program that generates the test runs. Demonstrate that your program works even if two Series are identical (but the names are different). If there are fewer than k unique series then the final number of clusters should be the number of unique series. What other special cases should you consider?

Problem 2: Hierarchical Clustering

A problem with k-means clustering is that we may not know what k is. Another way to cluster, avoiding such a parameter, is called "hierarchical" clustering. A hierarchical clustering is a binary tree. The idea is that each object starts in its own cluster, and we repeatedly find the closest pair of clusters and make them the children of a newly formed cluster. The distance between two clusters can be computed in a number of ways. For the regular assignment we define the distance between two clusters to be the distance between the centroids of the two clusters. We define the centroid of a cluster to be the centroid of all NamedSeries in the cluster. We will explore another option in the extra credit part.

We can define an appropriate datatype as follows:

data HierCluster = Leaf NamedSeries
                 | Branch HierCluster HierCluster
                   deriving (Show, Eq, Ord)

Note that this is just the Tree data type from p. 82 of SOE, with a being NamedSeries. So an alternative is to copy the Tree code and then define:

type HierCluster = Tree NamedSeries

We have supplied you with a module ClosestPairPQ.hs that does much of the work needed to build hierarchical clusters for NamedSeries. A ClosestPairPQ is a Priority Queue that allows insertion of individual items and deletion and return of the closest pair of items, using a distance measure supplied when the ClosestPairPQ is created via a call to empty.

The following is pseudocode for the function hierCluster :: Dataset -> HierCluster that computes a hierarchical cluster from a DataSet:

The ClosestPairPQ module has an internal priority queue. It currently imports the module ListPriorityQueue which we have supplied. As a short assignment you will make a much faster MapPriorityQueue built on Data.Map. You should modify ClosestPairPQ to import this improved priority queue, but you can use the list priority queue meanwhile.

The tree constructed for the example above is:

Branch (Branch (Leaf ("F",[1.4,2.2,0.4,1.1,0.5])) 
(Branch (Leaf ("A",[1.0,1.5,1.25,2.0,1.74])) (Leaf ("C",[2.0,1.0,1.5,3.0,1.5])))) 
(Branch (Branch (Leaf ("D",[5.0,6.0,3.0,5.0,4.5])) (Leaf ("H",[4.0,7.0,5.0,3.5,5.0]))) 
(Branch (Leaf ("I",[8.0,10.0,5.5,6.6,10.0])) (Branch (Leaf ("B",[10.0,9.0,5.0,7.0,8.0])) 
(Branch (Leaf ("E",[9.0,10.0,4.0,6.0,9.0])) (Leaf ("G",[9.0,10.0,4.0,6.5,8.5]))))))

The same text can be laid out to make the tree structure a bit clearer:

Branch 
  (Branch 
    (Leaf ("F",[1.4,2.2,0.4,1.1,0.5])) 
    (Branch 
      (Leaf ("A",[1.0,1.5,1.25,2.0,1.74])) 
      (Leaf ("C",[2.0,1.0,1.5,3.0,1.5]))
    )
  )
  (Branch 
    (Branch 
      (Leaf ("D",[5.0,6.0,3.0,5.0,4.5])) 
      (Leaf ("H",[4.0,7.0,5.0,3.5,5.0]))
    ) 
    (Branch 
      (Leaf ("I",[8.0,10.0,5.5,6.6,10.0])) 
      (Branch 
        (Leaf ("B",[10.0,9.0,5.0,7.0,8.0])) 
        (Branch 
          (Leaf ("E",[9.0,10.0,4.0,6.0,9.0])) 
          (Leaf ("G",[9.0,10.0,4.0,6.5,8.5]))
        )
      )
    )
  )

This is a rather messy thing to print out. We also want to to print a parenthesized version of the tree:

((("F",[1.4,2.2,0.4,1.1,0.5]) (("A",[1.0,1.5,1.25,2.0,1.74]) 
("C",[2.0,1.0,1.5,3.0,1.5]))) ((("D",[5.0,6.0,3.0,5.0,4.5]) 
("H",[4.0,7.0,5.0,3.5,5.0])) (("I",[8.0,10.0,5.5,6.6,10.0]) 
(("B",[10.0,9.0,5.0,7.0,8.0]) (("E",[9.0,10.0,4.0,6.0,9.0]) 
("G",[9.0,10.0,4.0,6.5,8.5]))))))

We also want a parenthesized display with series information removed:

(("F" ("A" "C")) (("D" "H") ("I" ("B" ("E" "G")))))

You should write utility functions for printing trees in three forms: the full tree (with Branch, Leaf, names, and series - this is what calling "print" on the tree gives), parenthesized with names and series, and parenthesized with names only. If you wish to print out a nicely formatted tree, that is the first part of the extra credit.

What to Turn In

Turn in your code that implements hierarchical clustering. Turn in test runs that demonstrate that your program works and the driver program that generates this output. Demonstrate that your program works even if two Series are identical (but the names are different). What other special cases should you consider?

Finding Clusters in Stock Data

In this part you will cluster the stocks by the similarities of their movements during month of the crash in 2008. To load this data set into a program, do the following:

-- Converts the input string into list of (ticker, Series) pairs
readStockData :: String -> [(String, Series)]
readStockData str = read str

main = do
  stockData <- readFile "stockData.txt"
  let stockSeries = readStockData stockData
  -- Process the data here --

To process the stock data, we must first normalize it. We are interested in the relative changes in the stock prices, not the absolute prices. To normalize the data, for each Series divide all values in the series by the first one. This makes all series begin with the value 1.0.

Run kmeans with 6 clusters. Print the result, but instead of printing all of the series information (which is many, many pages long) only print each prototype series paired with a list of names associated with it. (This will require writing a utility print function, or just appropriately processing the data in the command to print it.) Draw the clusters, using the routines supplied. Do the clusters seem to be reasonable?

Run hierCluster and print the structure, but not everything. Printing the series leads to very long, hard-to-read output. When printing the hierCluster just print the names in the parantheses, as shown above. (If you want to also print the tree with Branch and Leaf and names, but no series, feel free to do so.)

Compiling Code

The hierarchical clustering on stock data takes a while. You can speed it up by compiling your modules. Compiled modules run about 10 times as fast as interpreted ones. If you want to do so, the command is ghc --make DriverName.hs where DriverName is the name of your driver program (the one that calls all of the others). Or if you want it to be even faster, ghc --make -O DriverName.hs optimizes the code, speeding it up by another 40 percent or so. After doing this, when you run ghci and load the code it will load the compiled versions.

What to Turn In

Turn in your code that reads, normalizes, plots, and prints the stock data. Turn in the output (both graphical and printed) that you generate. Include the k-means clusters graph and printout. The printout should have prototypes and lists of names only (no series). Print the parentheized, name-only version of the hierarchical tree.

Also answer the following question. You were supplied with a ListPriorityQueue and wrote a MapPriorityQueue. How much difference does the choice of priority queue make in computing the hierarchical clustering of stock data? (Use the sample solution for MapPriorityQueue if yours does not work correctly.

Extra Credit

  1. Extra credit: The parenthesized display is not aesthetically pleasing. Invent and implement a better one. You might want to consider a layout similar to the formatted tree layout shown above.

  2. Extra credit: Another way to define the distance between two clusters is to use the average distance between all pairs (s1, s2), where s1 is a series in the first cluster and s2 is a series in the second cluster. Implement this alternate distance function. Does it change the clustering? What does it do to the run time required for large data sets?

  3. Extra credit: The hierarchical divisions do give rise to k-clusters with k from 1 to the number of datasets. The original set of clusters is an n-clustering, after one step we have an (n-1)-clustering, after another step we have an (n-2) clustering, etc. until we have a 1-clustering. Write a method which takes a parameter k and finds the k clusters determined by the last k clusters in the set. This may require either adding more information to the tree data structure or modifying the algorithm to allow you to recover the intermediate steps as well as the final result.

  4. Extra credit: In practice, the hierarchical divisions derived in the previous part may appear a bit arbitrary. We might want to adjust the clusters by moving series from one cluster to another cluster. One way to do this is to use the final k clusters in the set of clusters as the starting clusters for k-means. Implement this change. Do the k-clusters derived this way seem better that the ones using the first k unique items?

  5. Extra credit: Map the tickertape names into the sectors that they represent (energy, utilities, manufacturing, service, etc.). Do stocks in the same sector seem to cluster together? (This takes some research as well as some programming.)

  6. Extra credit: Find a dataset for some other domain and analyze it with your clustering algorithms.

What to Turn In

If you choose to do the extra credit problems, turn in:

  1. Your code for the new display and output showing what it looks like.
  2. Your code for the distance function, and output showing how it compares to the original distance function, and a comment about how it affects run time. Which do you think is better and why?
  3. Your code for determining and drawing the k-clusters and output that shows that it works.
  4. Your code for using the hierarchical clusters as starting values for the kmeans algorithm, the results, and your comments about whether this seems to change the clusters.
  5. Your code and data for mapping stock names into sectors, and the clusters with stock names replaced by or paired with their sectors.
  6. A description of the domain, the data set that you found, the results of the clustering algorithms, and what you concluded.

The rubric for grading can be found here.



Department of Computer Science, Dartmouth College, Hanover, New Hampshire, USA

Last modified: 21-October-2009