Due Monday, September 20 at 10pm

This first lab should get you up to speed working with the command line, basic shell commands, an editor, and a small bash program.

Preparation

Log in to the Thayer plank server (plank.thayer.dartmouth.edu) with your NetID and set up lab assignments, if you have not already:

[MacBook ~]$ ssh cs50
[plank ~]$ mkdir -p cs50-dev/labs
[plank ~]$ chmod go-rwx cs50-dev
[plank ~]$ cd cs50-dev/labs

These commands create the directory ~/cs50-dev/labs, remove read, write, and execute permissions on cs50-dev from the group and other users (i.e., prevent others from peeking at your work), and change the working directory to labs so you’re ready to start.
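
To confirm that a permission change like this took effect, a quick check along the following lines can help. This is only a sketch: it works in a temporary scratch directory (an assumption made here so your real home directory is untouched) and prints the permission string reported by ls -ld.

```bash
# Work in a scratch directory so nothing in your real home directory changes.
scratch=$(mktemp -d)
mkdir -p "$scratch/cs50-dev/labs"   # -p also creates the parent cs50-dev
chmod go-rwx "$scratch/cs50-dev"    # drop read, write, execute for group and others

# Keep the first 10 characters of the listing: file type plus nine permission bits.
perms=$(ls -ld "$scratch/cs50-dev" | cut -c1-10)
echo "$perms"
```

With a typical umask the output is drwx------: the owner keeps full access, and the six trailing dashes show that group and others have none.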

Clone the starter kit: visit GitHub Classroom, accept the assignment, and clone the repository to your labs directory. It will look something like this, assuming your GitHub username is XXXXX:

$ git clone https://github.com/Dartmouth-CS50-Fall2021-Prioleau/lab-1-XXXXX
Cloning into 'lab-1-XXXXX'...

The clone step will create a new directory ~/cs50-dev/labs/lab-1-XXXXX.

If you would prefer to work out the initial solutions on your laptop, run the above git clone command on your laptop (without logging in to the CS servers via ssh). Later, use scp to copy your solutions to your Linux account, test them there, and then submit them from there.

Assignment

First download a spreadsheet from:

https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-County/8xkx-amqh/data

and save it as vaccine.csv. You can use the following command to do both in one step:

wget -O vaccine.csv "https://data.cdc.gov/api/views/8xkx-amqh/rows.csv?accessType=DOWNLOAD"

In the above line, the wget command fetches the file at the given URL. The -O option (an uppercase letter O) specifies the file name to save it as. The URL is quoted so that the shell does not treat the ? as a wildcard character.

The file vaccine.csv is a comma-separated values (CSV) file from the Centers for Disease Control and Prevention (CDC) that reports COVID-19 vaccine administration at the county level. The CDC database is updated daily. Further description of the dataset can be found here.

A. Write a single bash command or pipeline to print only the lines for the state of New Hampshire in the month of August. The output should not contain the current first line, which lists the names of data fields. (5 points)

B. Write a single bash command or pipeline to print only the county (Recip_County), state (Recip_State), and percentage of fully vaccinated people (Series_Complete_Pop_Pct) columns. The output should be comma separated and should not contain the current first line, which lists the names of data fields. (5 points)

C. Write a single bash command or pipeline to print only the lines from August 11 to August 13, including the data on August 11. (5 points)

D. Write a single bash command or pipeline to print the counties with zero percent of fully vaccinated people in the state of California. Note that the latest date will have the cumulative data. (10 points)

E. Write a single bash command or pipeline to print the number of counties with zero percent of fully vaccinated people in each state. Present this in decreasing order based on the number of counties. Each line of the output should contain the number of counties with zero percent of fully vaccinated people and the state name. Note that the latest date will have the cumulative data. (10 points)

F. Write a single bash command or pipeline to print the counties with the top-10 highest percentage of fully vaccinated people based on the latest data. Present this in decreasing order based on the fully vaccinated percent. Each line of the output should contain the county name, the state, and percent of fully vaccinated people, each separated with a comma. Note that the latest date will have the cumulative data. (10 points)

G. Extend the previous command line to edit each output line, adding a pipe (|) symbol at the beginning and the end, and replacing each comma with a pipe symbol. Copy and paste that output into your solution.md markdown file. Prepend two lines to it to create a nice table like the one below (created with the data on August 23, 2021). You should not have to edit the output of your command line; you should just add the header rows. (10 points)

You can read about Markdown tables here.

| County | State | Fully-Vaccinated (%) |
| --- | --- | --- |
| Chattahoochee County | GA | 99.9 |
| Arecibo Municipio | PR | 93.6 |
| McKinley County | NM | 93.4 |
| Bristol Bay Borough | AK | 87.3 |
| San Juan County | CO | 82.7 |
| Santa Cruz County | AZ | 81.5 |
| Martin County | NC | 77.7 |
| Hamilton County | NY | 76.2 |
| Teton County | WY | 75.3 |
| Marin County | CA | 74.8 |

H. Write a bash script called query.sh that takes the name of a state and outputs the number of fully vaccinated people for this state based on the latest cumulative data. It can also take date as an additional parameter, in which case it will output the number of fully vaccinated people on that date for the specified state. (40 points)

As in questions D, E, and F, you should think about how to determine the latest date.

Here are some example outputs from running the script on August 23, 2021:

$ ./query.sh
Incorrect number of arguments. Usage: ./query.sh state [date]
$ ./query.sh Hanover
Hanover state does not exist
$ ./query.sh NH 
NH: 805909
$ ./query.sh NH 06/01/2021
NH: 763898
$ ./query.sh CA
CA: 25731391
$ ./query.sh CA 20-3123
This date (20-3123) does not exist for CA

Things to note:

  • Your script should have a brief header comment giving the script name, your name, the date, and a short summary of how someone can/should use the script.
  • Your script should print an error message and exit non-zero if the number of arguments is less than 1 or greater than 2.
  • Your script should print an error message and exit non-zero if vaccine.csv is not an existing, readable file.
  • Your script should print an error message and exit non-zero if it does not find the state specified by the first parameter.
  • Your script should print an error message and exit non-zero if it does not find the date specified by the second parameter.
  • Otherwise, your script should exit with a zero status.
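
The checks in the list above follow a common shell pattern: test a condition, print a message, and stop with a non-zero status. The sketch below shows that pattern for the argument count and the data file only. The function name check_inputs, the wording of the unreadable-file message, the status values 1 and 2, and sending messages to stderr are all choices made for this illustration, not requirements of the lab. A script would typically call exit where this function uses return.

```bash
# check_inputs DATAFILE [ARGS...]: hypothetical helper, for illustration only.
# Returns 0 if DATAFILE is readable and one or two ARGS follow it.
check_inputs() {
    local datafile="$1"
    shift                       # what remains are the script's own arguments

    # Reject fewer than 1 or more than 2 arguments.
    if [ "$#" -lt 1 ] || [ "$#" -gt 2 ]; then
        echo "Incorrect number of arguments. Usage: ./query.sh state [date]" >&2
        return 1
    fi

    # -f: exists and is a regular file; -r: readable by the current user.
    if [ ! -f "$datafile" ] || [ ! -r "$datafile" ]; then
        echo "$datafile is not a readable file" >&2
        return 2
    fi

    return 0
}
```

Inside a script, a call such as check_inputs vaccine.csv "$@" passes the script's arguments through unchanged, and the status can then be inspected with $?.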

Other items, such as following delivery-related instructions. (5 points)

What to hand in, and how

You should have three files in your lab-1-XXXXX directory:

  • edit README.md to remove instructions, add your name, add your username.

  • create solution.md with the answers to items A-G. For each, include a subsection header and show the command line but do not include the command output. This is a “Markdown” file and you should use Markdown formatting. Notably, use code blocks to format the commands, like those you see below. You can preview it with various Markdown-rendering tools (see: Markdown resources) but we will read it on GitHub.com, so make sure it looks good there.

  • write query.sh with the script for item H.

You should add only these three files to your repo:

git add README.md solution.md query.sh

Please do not add your .csv file; it is large and, of course, we can download our own copy.

Commit your changes:

git commit -m "your commit message"

Push your changes to GitHub:

git push

If this is your first push, git will remind you to set the upstream branch:

git push --set-upstream origin master

Make sure you left nothing unexpected behind:

git status

If you need to make updates, repeat the add, commit, push sequence.

You can verify that it safely uploaded by visiting your private lab repo on GitHub.

If you need to submit after the deadline …

Your commit message should say “PLEASE GRADE THIS COMMIT.” Our graders will grade the last commit made before the deadline unless they see that message on a late commit, in which case they will grade the latest such commit made less than 72 hours after the deadline. Late commits without that message will be ignored.

Hints

You will find some of the following commands useful; use man [cmd] to read about any command. It’s best to run man inside Linux so you are sure to get the manual for the Linux version of the command (the macOS version can differ).

  • less
  • cut
  • head
  • tail
  • grep (note -n)
  • wget
  • sort
  • uniq
  • tr
  • sed
  • wc (note -l)
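
Several of these tools are most useful in combination. As a small illustration on made-up data (the list of state codes below is invented for this sketch and is not taken from vaccine.csv), the following pipeline counts how often each value occurs and lists the most frequent first.

```bash
# Six made-up lines, one state code per line.
states=$(printf 'NH\nVT\nNH\nME\nNH\nVT\n')

# sort groups identical lines together, uniq -c collapses each group and
# prefixes its count, and sort -rn orders the counts from largest to smallest.
counts=$(echo "$states" | sort | uniq -c | sort -rn)
echo "$counts"
```

Each output line holds a count followed by the value: 3 for NH, 2 for VT, and 1 for ME. The amount of leading whitespace depends on the uniq implementation.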

grep and sed depend on regular expressions. It is helpful to remember that ^ anchors a pattern to the start of a line and $ anchors to the end of the line.
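
A short sketch of the two anchors, using three made-up lines rather than the real data file:

```bash
# Three made-up lines; NH appears at a different position in each.
sample=$(printf 'NH,Grafton\nGrafton,NH\nVT,NH,Windsor\n')

starts=$(echo "$sample" | grep '^NH')    # only lines that begin with NH
ends=$(echo "$sample" | grep 'NH$')      # only lines that end with NH
echo "$starts"
echo "$ends"

# In sed, ^ and $ match the empty positions at the two ends of a line,
# so substituting them inserts text there.
wrapped=$(echo 'a,b' | sed -e 's/^/</' -e 's/$/>/')
echo "$wrapped"
```

The first grep prints NH,Grafton, the second prints Grafton,NH, and the line with NH in the middle matches neither. The sed command prints <a,b>.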

Most Unix tools work line by line. For some problems, I found it helpful to translate the CSV header line into a sequence of lines, on which I could operate with other tools.
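
One way to do that translation is sketched below. It uses a tiny stand-in file with a made-up data row and only three columns; the real vaccine.csv has many more.

```bash
# A tiny stand-in for vaccine.csv, written to a temporary file.
sample=$(mktemp)
printf 'Date,Recip_County,Recip_State\n08/23/2021,Grafton County,NH\n' > "$sample"

# Take the header line, turn each comma into a newline, and number the results.
# The pattern '.' matches every non-empty line, so grep -n simply numbers them.
fields=$(head -n 1 "$sample" | tr ',' '\n' | grep -n '.')
echo "$fields"
```

This prints 1:Date, 2:Recip_County, and 3:Recip_State on separate lines. The number before each colon is the field's position, which is the value a tool such as cut expects for its -f option.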

Read about Markdown, and about Markdown tables.