COSC 8 -- Problem Solving with CS

CS 8, Winter 2009

Problem Set 5: Social Networks, BFS, and the Kevin Bacon Game


General Instructions

Before you do anything, please read the entire assignment carefully. It contains many suggestions which can save you a lot of struggle later on.

You should print out hard copy of your code and any example outputs, and either turn them in at lecture or in the TA's mailbox at Sudikoff by class time on Wed. Feb. 18, 2009. Late assignments will be strongly penalized, so please start early.

For this assignment, you may work either alone, or with one (1) other partner. You may not share your code with anyone other than your partner, nor may you look at anyone else's code. See the Course Information for a review of the policies on joint work.

What to Turn In

For CS 8 programming assignments, such as this one, you must turn in a hard copy listing of any procedures you wrote yourself, or modified from the example code provided with the assignment. You must also turn in example output for each section of the problem set, as applicable.

You should not turn in printouts of any code we provided you which you did not make any changes to, and please be sure to label everything carefully, so that the graders will know what they are looking at.

You must also e-mail an electronic version of your code to cs8hw@cs.dartmouth.edu with the subject "CS 8 Problem Set 5", by no later than 2:00 am on the date it is due. Please organize your code in one or more plain text files and attach them as enclosures to your message. Do not include your code directly in the body of the e-mail message! If you are sending more than one file, please make sure the files have descriptive names, and that you include comments that will let the graders know what they are looking at.

Code submitted electronically must be sent as plain text, in one or more files with the filename extension ".hs" or ".lhs" (e.g., mycode.hs). Do not send your programs as Word documents, RTF, or other "formatted" types of files.

Output

For each part of any assignment, we require output which demonstrates that your code correctly solves the problem it is supposed to solve. If you turn in code with no output, you may receive only partial credit, or no credit, for that part of the assignment.

Some problems will give specific test cases you need to provide output for. However, even if specific test cases are not provided, and even if it is not listed under "What to Turn In" for a particular problem, you must provide some sample output that shows how your code behaves on typical inputs.

If you are unable to get your code working, please turn in whatever you do have, so that we can determine if you should receive partial credit. Please do not include output which suggests your code works if it doesn't -- however, if your code works for some of the test cases, and not for others, you should turn in whatever output you are able to generate, even if it is incorrect. You will get credit for the "Testing" part of the score, even if the tests indicate that the program is incorrect.

Output is not required to be submitted electronically, although you are welcome to do so if you wish. If you do choose to submit output electronically, put your output samples in a separate file from your code submissions. Output may be submitted in any reasonable file format, though plain text is easiest to read.


Background Information on Social Networks and the Kevin Bacon Game

A social network is a set of people with some rules for determining pairs who are "related" to one another. Social networks can be represented as graphs, with the people as vertices and the "related to" relationship defining the edges. Examples are Facebook and MySpace, where the vertices are people who have created web pages in the system and the edges are friend relationships. One can ask the question, "What is the shortest path from me to a friend, to a friend of that friend, and so on until I get to a certain person?"

A geekier example is Erdos Numbers. Paul Erdos was a Hungarian mathematican who published over 1500 papers, most of which were co-authored. He spent his life traveling from university to university, living on speaking honoraria and the hospitality of those he visited. He would give talks and collaborate on research. The vertices of this graph are mathematicians and scientists, and the edge relationship is, "were co-authors on a scholarly paper". One's "Erdos number" is the number of papers (edges) one must pass through to get to Paul Erdos. For more on Erdos numbers, see http://www.oakland.edu/enp/.

In this problem you will write a program to play the Kevin Bacon game. The vertices in this network are actors and the edge relationship is "appeared together in a movie". The goal is to find the shortest path between two actors. Traditionally the goal is to find the shortest path to Kevin Bacon. The following output from the sample solution shows how the game is played:

Enter a destination actor (empty line to quit): Kevin Bacon
Enter a source actor (empty line to choose a new destination): Diane Keaton
Diane Keaton
  appeared in Hanging Up (2000) with
Meg Ryan
  appeared in In the Cut (2003) with
Kevin Bacon

Enter a source actor (empty line to choose a new destination): Buster Keaton
Buster Keaton
  appeared in Limelight (1952) with
Nigel Bruce
  appeared in Suspicion (1941) with
Cary Grant
  appeared in Charade (1963) with
Walter Matthau
  appeared in Hanging Up (2000) with
Meg Ryan
  appeared in In the Cut (2003) with
Kevin Bacon

Enter a source actor (empty line to choose a new destination): 

So based on the data set we supply for this problem, Diane Keaton's Bacon Number is two, and Buster Keaton's Bacon Number is five.

Programming Exercises

Problem 1: Implementing Breadth-First Search

The easiest way to play the Kevin Bacon game is to do what is called breadth-first search (BFS) in the movie data graph. This builds a tree of shortest paths from every actor who can reach Kevin Bacon back to Kevin Bacon. Or more generally, given a root BFS builds a shortest-path tree from every vertex that can reach the root back to the root. It is a tree where every vertex points to its parent, and the parent is the next vertex in a shortest path to the root.

To implement BFS we use a queue, which can be found in Queue.hs, which in turn requires Stack.hs. We also use a Digraph implementation, which can be found in DigraphMap.hs. I recommend downloading all of the code needed from ps5Code.zip and using this folder as your starting place for developing this assignment.

The psuedocode describing BFS of a graph G is:

  Insert root into an empty queue Q and into a new graph T

  Until Q is empty
      dequeue Q to get next vertex v to process
          for each (v', e') in v's adjacency list in G
              if v' not in T
                  add v' to T and add an edge labeled e' from v' to v
                  enqueue v' Q
  return T

When you are done, T holds a shortest-path or BFS tree. To find the Bacon number of an actor, look the actor up in T. If there is no vertex for that actor in T, then the actor is not connected to the root. If the actor is there, follow edges of T back to the root, printing movies (edge labels) and actors (vertices) along the way.

Create a BFS module that imports Queue and DigraphMap and exports DigraphMap and the following function, whose parameters are a Digraph and the root node for BFS:

bfs           :: (Ord v) => Digraph v e -> v -> Digraph v e
bfs graph root =
Test your function on a graph with the following vertices and edges:
vertices = ["Kevin Bacon", "actor1", "actor2", "actor3", "actor4", "actor5", "actor6"]
edges = [("Kevin Bacon", "actor1", "movie1"), ("actor1", "Kevin Bacon", "movie 1"), 
("Kevin Bacon", "actor2", "movie1"), ("actor2", "Kevin Bacon", "movie1"),
("actor1", "actor2", "movie1"), ("actor2", "actor1", "movie1"),
("actor1", "actor3", "movie2"), ("actor3", "actor1", "movie2"),
("actor3", "actor2", "movie3"), ("actor2", "actor3", "movie3"),
("actor3", "actor4", "movie4"), ("actor4", "actor3", "movie4"),
("actor5", "actor6", "movie5"), ("actor6", "actor5", "movie5")]

What to Turn In

Turn in your code that implements BFS, along with testing code that creates the digraph specified above and runs BFS on it. Print out the digraph created and the digraph returned by BFS.

Problem 2: Reading Actor and Movie Information and Creating Tables

You are now ready to create a module that plays the Kevin Bacon game. The first thing to do is to read and create tables of actor and movie data. This data was download from: http://www.cs.luther.edu/~bmiller/CS151/Spring05/index.html You will read three files: actors.txt, movies.txt, and movie-actors.txt. You can hard-wire these names into your code and use readFile to read them in. But they are large files - actors.txt contains 9,235 actors, movies.txt contains 7,067 movies, and movies-actor.txt contains 21,370 movie-actor pairs, resulting in 64,674 edges in the final graph. So while you are developing your program use the files actorsTest.txt, moviesTest.txt, and movie-actorsTest.txt, whose data represent the graph specified above. These six files can be downloaded in a single zip file: ps5data.zip.

The files are all formatted the same way. Each line has two quantities separated by a "|". In the actors file the quantities are actorID and actorName. In the movies file they are movieID and movieName. In the movies-actors file they are movieID and actorID, indicating that the actor associated with actorID appeared in the movie associated with movieID.

You should create a function that takes a string consisting of a number of lines, each of which is of the form "val1|val2". It should return a list of [val1, val2] lists. Look in Chapter 23 for useful functions that make this easy to do.

Given this function, convert the files into three database Tables, one for movie information, one for actor information, and one for movie-actor information. The database module that you can use to create and manipulate these tables is Database.hs. Use the various database functions (join, project, etc.) to create a list of vertices (one for each actor) and an edge for each pair of actors who appear in a movie together. Build this graph using the actual actor names and movie names, not the IDs. You should assume that no movie appears twice in the movies file and that no actor appears twice in the actors file. It is OK for there to be multiple edges between a pair of actors if they appeared together in multiple movies. However, you should eliminate loops (edges from an actor to himself or herself). If you don't you will probably run into stack overflow on the big data set.

What to Turn In

Turn in the code that reads the files and produces the graph. Show the digraph produced from the files moviesTest.txt, actorsTest.txt, and movies-actorsTest.txt.

Implement the Bacon Game

You now have what you need to implement the Bacon game. Ask for the name of the destination actor (traditionally Kevin Bacon). If the user enters an empty line, quit. Perform BFS on the graph with the source actor as root and save the BFS tree returned. Then ask for a series of source actors. For each one, print out the path between that actor and the source, or say that none exists. This will require following a path from the chosen actor in the BFS tree back to the root. (See above for an example of how this might be formatted.) If the user gives a name that is not in the graph say so and prompt again. If the user enters an empty line, ask for a new destination vertex.

Test your program on the movieTest.txt, actorTest.txt, and movie-actorTest.txt files. Make sure to demonstrate that you program works for boundary conditions (actor not connected to the root, actor is the root, etc.).

When you are sure that your program works on the test data, change it to use the movie.txt, actor.txt, and movie-actor.txt files. Demonstrate that your program works for these as well.

Experiment by importing Digraph.hs instead of DigraphMap. Remember that this implementation of Digraph uses a list instead of a map to keep track of the actors. What does this do to the run time for the game, particularly for the BFS?

What to Turn In

Turn in your code that plays the game. (This should be in the same file as the code from the last part.) Turn in test runs for both the Test files and runs for the large files.

Extra Credit

  1. Extra credit: It is annoying that if you don't get the exact form of the name you will not find the actor. For example, if you type "Timothy Allen" instead of "Tim Allen", "Kate Blanchett" instead of "Cate Blanchett" or "Katherine Hepburn" instead of "Katharine Hepburn" you will not find the actor. Devise some sort of scheme for finding "near matches" and reporting near matches to the name entered, so the user can re-enter the correct form of the name.

  2. Extra credit: Statistical analysis: Compute some interesting satistics about this graph. Some possible examples: Find an actor with the largest finite Bacon number for this data set. Find the average Bacon number for those actors with finite Bacon numbers. Find the destination actor that minimizes either longest path length or average path length from all other actors.

What to Turn In

If you choose to do the extra credit problems, turn in:

  1. Your code for finding "near matches" and a demostration that it works.
  2. Your code for computing the various statistical measures, along with a demonstration that it works on the Test graph and your results for the real graph.

The rubric for grading can be found here.



Department of Computer Science, Dartmouth College, Hanover, New Hampshire, USA

Last modified: 16-February-2009