Assignment: Sorting it all out

In this assignment, you will read a text file in which each line describes a world city, store the information about each city in its own object, and create a list of references of all objects, sort the list using quicksort, write the results out to another file, and display part of the output file in a visualization on a world map.

Computer science is about a lot more than handling data, but computers are still useful for working with large amounts of data. Scientific experiments in biology, chemistry, physics, and psychology can generate huge numbers of measurements. In economics, business, or politics, data can be generated by everything from stock prices, to poll results, to voting records of congresspeople.

Checkpoint: File input and output, writing the City class, and creating a list of objects

For the checkpoint, you will need to define a City class, read the file of world city information, store the information about each city into its own City object, make a list of city objects, write out each object (converted to a string) to a file, and draw the first 10 cities in the list to test your City class draw method on the provided world map.

You can focus on the sorting(quicksort) after the checkpoint.

Step 1: Define a City class

Your first job is to define a City class in a file named city.py. Each City object will need instance variables to store the following information:

Make sure you convert the values to their expected type before you store them in a City object. You need to start with three methods in the City class.

  1. The __init__ method does the usual: it takes six parameters (plus self) and stores each in the appropriate instance variable.
  2. The __str__ method returns a string consisting of the city’s name, population, latitude, and longitude, separated by commas and with no spaces around the commas. The string returned by your __str__ method should be exactly as specified here, because your section leader will compare your resulting file with one that I’ve prepared, and the two files need to be identical in order to give you full credit for the checkpoint. For example, here’s the string that __str__ should return for a particular town in New Hampshire:

    Hanover,11222,43.7022222,-72.29

    Notice that the country code and region do not appear in the string returned by __str__.

  3. The draw method takes the latitude and longitude of the city and draws it on world map provided to you.

Step 2: Read and parse the input file

The input file is in world_cities.txt. Download this file, and drag it into your PyCharm project. This file has 47914 lines (the last line will be a blank line). Each line has information about a world city, specifically the six attributes you’ll store in the City object, but of course , it is represented textually. These attributes are separated by commas without surrounding spaces. We call such a file a CSV file. (CSV is short for “comma-separated values.”) Here’s the line of the world_cities.txt file for that town in New Hampshire:

us,Hanover,NH,11222,43.7022222,-72.2900000

Here, the country code is us, the city name is Hanover, the region is NH, the population is 11222 (Dartmouth students living on campus don’t count in Hanover’s population), the latitude is 43.7022222, and the longitude is –72.2900000. You already know how to read in a file, line by line. But if you forgot, see the notes from Chapter 8. But how can you process this line? Python makes it pretty easy. To separate text at a particular delimiter character, call the split method on the string. It returns a list of the strings that, concatenated with the delimiter added in, gives you back the string. An example best shows this idea:

s = "a man, a plan, a canal,"
clauses = s.split(",")
print(clauses)

prints [‘a man’, ‘a plan’, ‘a canal’, ’’]

Note that every character other than the commas becomes part of one of the strings in the list. Not only do the spaces after the first two commas become part of strings in the list, but we also get an empty string at the end. Why? Because the string s ends with a comma, and so the split method gives back the empty string after the comma.

You can use any character as the delimiter, and the default is a space character. More examples:

s = "Roll out the barrel"
words = s.split()
print words

prints [‘Roll’, ‘out’, ‘the’, ‘barrel’] And

s = "commander--in-chief"
parts = s.split("-")
print parts

(there are two hyphens between commander and in) prints [‘commander’, ’‘, ’in’, ‘chief’] Again, note the empty string caused by two delimiters in a row.

To strip away whitespace from the beginning and end of a string, call the strip method on the string:

s = "  The thousand injuries of Fortunato I had borne as best I could, \n"
t = "|" + s.strip() + "|"
print(t)

prints |The thousand injuries of Fortunato I had borne as best I could,|

I put the vertical bar characters in so that you could see where the string returned by the call s.strip() begins and ends. Although each line of world_cities.txt begins with a non-whitespace character, each line ends with a newline, and newline is a whitespace character. Therefore, you will want to call strip on each line of the file and then call split to separate the information by the commas. Once you have a list of information in a particular line of the file, you can send that information into the City constructor. Since split gives you a list, you can just index into that list for each item. Remember that some of the instance variables in a City object are not strings, and so you will have to convert these strings to the appropriate types.

The City constructor will give you back, as you undoubtedly know, a reference to a City object. You should append that reference to a list that you’re building up. When you’re done, the list should comprise 47913 references to City objects, one for each line in world_cities.txt.

Step 3: Write the output file

Once you have the list of references to City objects and you have the __str__ method of the City class defined, all you need to do now is create a file named cities_out.txt and write one line to it for each City object. The line for each city should contain the string returned by calling str on the corresponding City object. Because the __str__ method inserts commas, cities_out.txt will be a CSV file.

You’ll need to add the newline for each city. When you’re done, your cities_out.txt file should have 47914 lines. The first five and last five lines of cities_out.txt should be

Andorra La Vella,20430,42.5,1.5166667
Canillo,3292,42.5666667,1.6
Encamp,11224,42.5333333,1.5833333
La Massana,7211,42.55,1.5166667
Les Escaldes,15854,42.5,1.5333333
...
Redcliffe,38231,-19.0333333,29.7833333
Rusape,23761,-18.5333333,32.1166667
Shurugwi,17107,-19.6666667,30.0
Victoria Falls,36702,-17.9333333,25.8333333
Zvishavane,79876,-20.3333333,30.0333333

Because the format of this file is tightly constrained, and because you’ll be creating the file from world_cities.txt, it should match a file your section leader has in every single byte.

Step 4: Visualize cities

CSV files are just fine for computers, but they are not wetware-friendly. You will write a program that visualizes the cities in our parsed list of cities.

Download this image file, and put it into your project in PyCharm. Your program will overlay the image with marks for the first n cities in the list of parsed cities. For example, here’s the first 30 cities of the parsed file, drawn on the image:

It’s easy to display this image file in a graphics window. After the window has been opened (calling start_graphics() and clear()), all you need to do is

img = load_image("world.png")
draw_image(img, 0, 0)

The first parameter to draw_image refers to an object for the image returned by load_image. (I don’t know what kind of object this is, nor do I care. Yay, abstraction!) The second and third parameters give the x- and y-coordinates of the upper left corner of the image in the graphics window. If you are going to draw the image repeatedly, you may call load_image just once and call draw_image multiple times, as long as you have saved the reference returned by load_image. To do so, call the load_image outside your main draw function.

To draw cities onto our map, let’s remind ourselves about the instance variables in our City object: - The city’s country code, which is a two-letter string. - The city’s name, which is a string. - The city’s region, a two-character string. - The city’s population, which is an int. - The city’s latitude, which is a float. - The city’s longitude, which is also a float.

To draw the city’s position on our image, we’re going to make use of the latitude and longitude in each city. These values must be stored as floating point numbers in City objects.

The map is an equirectangular projection that graphs all meridians and circles of latitude as evenly-spaced parallel lines. Simply put, the center of this image is latitude 0, longitude 0. The image is 720 x 360 (720 pixels wide by 360 pixels tall), latitudes (horizontal lines) range from -90 to +90, longitudes (vertical lines) range from -180 to +180 (treating the center of the projection as latitude 0, longitude 0).

Similar to the draw() method in your Body class in Lab2, you’ll write a third method in your City class of the form: def draw(self, cx, cy):

    # scale self.lat and self.long to the size of the image,
    
    # let's say px and py are the scaled values
    # call draw_circle(px, py, RAD) to draw that city in its
    # real-life location on the map

Notice that the values of cx and cy are going to be constants in your program, and they give you the pixel coordinates of the point that is at latitude 0 and longitude 0. Make sure your graphics window is large enough to display the full image. The User’s Guide contains everything you need to know about how to draw the image and how to change the window size.

To test your draw() method, draw the first 10 cities in the list of references to City objects that you created.

I’ve shown you the first 30 above, though they overlap because the latitude and longitude are similar.

What to submit for the checkpoint

For this assignment submit a zip file containting the following:

  1. your City class file
  2. a drive file with the following code:
  3. your cities_out.txt output file
  4. screeshot of the visualization after 10 cities are drawn

After the checkpoint

You saw the selection sort algorithm and merge sort algorithm in class. In this assignment, you will implement yet another sorting algorithm: quicksort.

We haven’t discussed the runtime of algorithms yet, but when we do, you can come back and reread this runtime analysis. Quicksort has a worst-case running time of O(n2) (worse than merge sort), but the worst-case for quicksort occurs rarely—rarely enough that, on average, it runs in O(nlog n) time. Quicksort has two advantages over merge sort: the constants are better, and quicksort runs “in place”: no temporary lists have to be created, and so very little additional memory is required to quicksort.

Now, you might ask why you need to implement quicksort when Python has a sorting method built in and when you’ve already learned three other sorting algorithms. That’s a fair question, but if the professors who taught the Python designers had let them skip sorting, we wouldn’t have such a nice built-in sort function in Python. I won’t let you skip learning how to sort, either!

Quicksort

Before you can sort the list of Cities, you will need to implement the quicksort algorithm and test it on a list of integers and a list of strings.

Like merge sort, quicksort is recursive. As in merge sort, the problem is to sort a sublist the_list[p : r+1], which consists of the items starting at the_list[p] and going through the_list[r]. (Recall again that in Python’s slice notation, the number after the colon is the index of the first item not in the sublist.) Initially, p equals 0 (zero) and r equals len(the_list)-1, but these values will change through the recursive calls.

Important note: In several places in this lab writeup, we use Python’s sublist notation, such as the_list[p : r+1]. We use this notation only to indicate the range of indices you should be working with. Do not use this sublist notation in your program. Why not?

Because it will make a copy of the sublist of the_list, and if you change values in a sublist you create using sublist notation, the values in the_list itself will not change.

The base case is the same as for merge sort: a sublist with fewer than two items are already sorted.

Partitioning

It doesn’t take much insight to see that all the work in quicksort occurs in the partitioning step, so let’s see how to do that. You will write a function with the following header:

def partition(the_list, p, r, compare_func):

This function should partition the sublist the_list[p : r+1], and it should return an index q into the list, which is where it places the item chosen as the pivot. As a running example, we’ll suppose that the current sublist to be sorted is [2, 8, 7, 1, 3, 5, 6, 4], so that the_list[p] contains 2 and the_list[r] contains 4. First, we need to select an item as pivot. Let’s always select the last item in the sublist, that is, the_list[r]. In our example, we select 4 as the value of pivot. When we’re done partitioning, we could have any of several results. Here are three of them:

[3, 2, 1, 4, 7, 8, 6, 5]
[1, 3, 2, 4, 8, 7, 5, 6]
[2, 1, 3, 4, 6, 8, 5, 7]

Several more are possible, but what matters is that 4 is in the position it would be in if the sublist were sorted, all items before 4 are less than or equal to 4, and all items after 4 are greater than 4. We want to partition the sublist efficiently. (Remember that we’re in the part of the course where not only do we want right answers, but we want to use resources—especially time—parsimoniously.) If the sublist contains n items, we might have to move each one, and so O(n) time for partitioning an n-item sublist is the best we can hope for. Indeed, we can achieve O(n) time. This picture gives you an idea of how:

You can also look at this PowerPoint.

The idea is to loop through the items in the_list[p:r] (which does not include the_list[r], the value of pivot.) Maintain two indices, i and j, into the sublist:

At the start of each loop iteration, for any index k into the sublist:

  1. If p ≤ k ≤ i, then the_list[k]  ≤  pivot.
  2. If i + 1 ≤ k ≤ j − 1, then the_list[k] > pivot.
  3. If j ≤ k ≤ r − 1, then we have yet to compare the_list[k] with pivot, and so we don’t yet know which is greater.
  4. If k = r, then the_list[k] equals pivot.
Start with i = p − 1 and j = p. There are no indices between p and i, and there are no indices between i + 1 and j − 1, and so the above conditions must hold: all indices k between j and r − 1 have yet to be compared with pivot. Set pivot = the_list[r] to satisfy the last condition. Each iteration of the loop finds one of two possibilities: - the_list[j] > pivot:
![](partition-one-step-gt.png)
</div>
All you need to do is increment <span class="math">*j*</span>.
Afterward, condition 2 holds for `the_list[j-1]`, and all other
items remain unchanged.

The loop should terminate immediately upon j equaling r. To be clear: the last time that the loop body should execute is for j = r − 1. The loop body should not execute when j = r.

Once j equals r, every sublist item before the_list[r] has been compared with pivot and is in the right place. All that remains is to put the pivot, currently in the_list[r] into the right place. Do so by swapping it with the item in the_list[i+1], which is the leftmost item in the partition known to be greater than pivot. (If the greater-than partition happens to be empty, then this swap should just swap the_list[r] with itself, which is fine in this case.) Your partition function should return to its caller the index where it has placed pivot. Within partition, that’s index i + 1.

The function compare_func will be one that you write and pass in. It takes two parameters, say a and b, and it returns True if a compares as less than or equal to b, and it returns False if a compares as greater than b. For example, the call compare_func(3, 5) should return True, the call compare_func(7, 5) should return False, the call compare_func(5, 5), should return True, the call compare_func("Tallahassee", "Tokyo") should return True, and the call compare_func("Boise", "Bismarck") should return False. Your partition function should call compare_func to compare each sublist item with pivot.

You can write your own functions that take a function as a parameter. Here’s an example:

def do_it(compare_func, x, y):
    if compare_func(x, y):
        print x, "is less than", y
    else:
        print x, "is not less than", y
def compare_ints(a, b):
    return a <= b
def compare_strings(a, b):
    return a.lower() <= b.lower()
do_it(compare_ints, 5, 7)
do_it(compare_strings, "Dartmouth", "Cornell")

The call do_it(compare_ints, 5, 7) passes the function compare_ints into do_it as the first parameter, and so in do_it, compare_func is really compare_ints. The other formal parameters of do_it are x and y, and they get the values 5 and 7, respectively. Therefore, the call compare_func(x, y) is really the call compare_ints(5, 7). This call returns True (the parameter a gets the value 5, and the parameter b gets the value 7), and so the line 5 is less than 7 is printed to the console. Next, the call do_it(compare_strings, "Dartmouth", "Cornell") passes the function compare_strings as the first parameter, and now in do_it, compare_func is really compare_strings. The formal parameter x references the string "Dartmouth", and the formal parameter y references the string "Cornell", so now when we call compare_func(x, y), we’re really calling compare_strings("Dartmouth", "Cornell"). The function compare_strings first makes copies of its parameters but converted to lowercase (that’s what the lower method of the string class, does), and then it compares them lexically, using the ASCII character codes. Because "dartmouth" comes after "cornell" lexically, compare_strings returns False, and so the console output is now Dartmouth is not less than Cornell (which is true, because as we all know, Dartmouth is greater than Cornell).

You should implement and test your partition function before writing any other code. Make sure that it works on an arbitrary sublist of the_list, and not just on the sublist where p = 0 and $r = $len(the_list)-1. You’ll have to write the compare_func function as well, and design it so that it works on whatever data type the_list contains. You should certainly test partition on lists containing numbers and on lists containing strings. I was able to write the body of the partition function in 10 lines of Python code (not counting blank lines and comment lines).

The quicksort function

Once you have the partition function written and tested, you should write a function with the header

def quicksort(the_list, p, r, compare_func):

which sorts the sublist the_list[p : r+1], whose first item is in the_list[p] and whose last item is in the_list[r].

It should work as described above:

The body of my quicksort function takes only 4 lines of Python, not counting comment lines.

The sort function

Unlike the examples from the lecture, you’ll write a separate function named sort in this lab assignment. It has this header:

def sort(the_list, compare_func):

It simply makes a call to the quicksort function to sort the entire list referenced by the_list.

Place the partition, quicksort, and sort functions in a single file named quicksort.py. Test your sort function on a list of integers and a list of strings by passing compare_ints and compare_strings described above. Print the lists before and after they are sorted to check if your quicksort implementation is correct. Once you are sure that your quicksort implementatin is correct, remove the test code from the driver.

What to do after impelmenting the quicksort algorithm

You will sort the information contained in world_cities.txt.

Produce three output files:

  1. cities_alpha.txt contains the list of cities sorted alphabetically.
  2. cities_population.txt contains the list of cities sorted by population, from most to least populous.
  3. cities_latitude.txt contains the list of cities sorted by latitude, from south to north.

Here are the first and last few lines of each of the files, when I ran my program:

  1. For cities_alpha.txt:

    A,1145,63.966667,10.2
    A,1145,63.966667,10.216667
    A Coruna,236010,43.366667,-8.383333
    A Dos Cunhados,6594,39.15,-9.3
    Aabenraa,16344,55.033333,9.433333
    Aabybro,4849,57.15,9.75
    Aachen,251104,50.770833,6.105278
    Aadorf,7100,47.483333,8.9
    Aakirkeby,2195,55.066667,14.933333
    Aakre,295,58.1013889,26.1944444
    ...
    Zychlin,8844,52.25,19.616667
    Zykovo,1059,54.0666667,45.1
    Zykovo,5365,55.9519444,93.1461111
    Zyrardow,41179,52.066667,20.433333
    Zyryanka,3627,65.75,150.85
    Zyryanovsk,44939,49.738611,84.271944
    Zyryanovskiy,896,57.7333333,61.7
    Zyryanskoye,6285,56.8333333,86.6222222
    Zyukayka,4556,58.2025,54.7002778
    Zyuzelskiy,1311,56.4852778,60.1316667
  2. For cities_population.txt:

    Tokyo,31480498,35.685,139.7513889
    Shanghai,14608512,31.005,121.4086111
    Bombay,12692717,18.975,72.825833
    Karachi,11627378,24.866667,67.05
    New Delhi,10928270,28.6,77.2
    Delhi,10928270,28.666667,77.216667
    Manila,10443877,14.604167,120.982222
    Moscow,10381288,55.7522222,37.6155556
    Seoul,10323448,37.5663889,126.9997222
    Sao Paulo,10021437,-23.533333,-46.616667
    ...
    Girsterklaus,12,49.7777778,6.4994444
    Vstrechnyy,12,67.95,165.6
    Qallimiut,12,60.7,-45.3333333
    Skaelingur,11,62.1,-7.0
    Ivittuut,11,61.2,-48.1666667
    Crendal,10,50.0577778,5.8980556
    El Porvenir,10,9.5652778,-78.9533333
    Schleif,8,49.9905556,5.8575
    Aliskerovo,7,67.7666667,167.5833333
    Neriunaq,7,64.4666667,-50.3166667
    Tasiusaq,7,73.3666667,-56.05
    Timerliit,7,65.8333333,-53.25
  3. For cities_latitude.txt:

    Ushuaia,58045,-54.8,-68.3
    Punta Arenas,117432,-53.15,-70.9166667
    Rio Gallegos,93234,-51.6333333,-69.2166667
    Port-Aux-Francais,45,-49.35,70.2166667
    Bluff,1938,-46.6,168.333333
    Owaka,395,-46.45,169.666667
    Invercargill,47287,-46.4,168.35
    Woodlands,285,-46.366667,168.55
    Riverton,1651,-46.35,168.016667
    Wyndham,586,-46.333333,168.85
    Wallacetown,638,-46.333333,168.266667
    ...
    Kullorsuaq,408,74.5833333,-57.2
    Savissivik,82,76.0233333,-65.0813889
    Moriusaq,21,76.7561111,-69.8863889
    Narsaq,1709,77.3025,-68.8425
    Qaanaaq,616,77.4894444,-69.3322222
    Qeqertat,18,77.5097222,-66.6477778
    Siorapaluk,75,77.7952778,-70.7558333
    Barentsburg,576,78.0666667,14.2333333
    Longyearbyen,1232,78.2166667,15.6333333
    Ny-Alesund,40,78.9333333,11.95

How to do it

To sort the list of cities (a list of references to City objects), you must create three comparison functions to compare City objects by three criteria. You will define the following functions:

  1. compare_alpha(city1, city2): to compare the City objects by name.
  2. compare_population(city1, city2): to compare the City objects by population.
  3. compare_latitude(city1, city2): to compare the City objects by latitude.

Note that the parameters city1 and city2 in all these three functions will be references to City objects. You will pass these functions as parameters to your sort function to sort by the respective criteria.

Some things you should note:

Final sorting

In the driver file you created for the checkpoint, add code to sort the list of cities by calling the sort function from quicksort.py. You will call the sort function thrice to sort according to the three criteria. After every call to the sort function, write the contents of the sorted list to a file. You will create three output files: cities_alpha.txt, cities_population.txt, and cities_latitude.txt. It is a good idea to define a function that writes the contents of the list to a given file.

Then, run start_graphics to visualize the 25 most populous cities. [Hint: the last call to the sort function will sort by the list of cities in non-increasing order of population.]

Here is a image of the map with 30 most populous cities marked:

What to submit

Submit a zip file containing the following:

  1. city.py file with the definition of City class.
  2. quicksort.py file with definitions of partition, quicksort, sort functions and any other functions these three functions might call. Remove all the code to test the sort function from this file.
  3. Driver file with the following:

     - Code to read from the `world_cities.txt` file and create a list of `City` objects.
     - Code to sort the cities by three criteria using the `sort` function defined in the quicksort.py file.
     - Code to write the list of cities to corresponding output files after sorting based on each criterion. You should include your compare functions in the driver code.
     - Code to visualize the top 25 cities in the list of cities after they are sorted by

    population.

  4. A screenshot of the world map with a visualization of the 25 most populous cities.
  5. The three output files cities_alpha.txt, cities_population.txt, and cities_latitude.txt.

Extra credit:

Any extra credit work should be submitted in separate Python files. Don’t add extra credit work into your main solution code. These are just a few ideas; you are more than welcome to make up your own.

Honor Code

The consequences of violating the Honor Code can be severe. Please always keep in mind the word and spirit of the Honor Code.