Don’t forget this week’s reading.

In this lecture, we continue our discussion of the Unix shell and its commands.

Goals

We plan to learn the following today:

  • Redirection and pipes
  • Special characters and quoting
  • Standard input, output, and error

We’ll do this activity in today’s class:

  • Sit with your group and experiment with shell pipelines in this activity.

Redirection and pipes

To date we have seen Unix programs using their default input and output - called standard input (stdin) and standard output (stdout). By default, the keyboard is the standard input and the display is the standard output. The Unix shell can redirect both the input and the output of programs. As an example of output redirection, consider the following.

f00xxxx@plank:~/cs50-dev/demo$ ls
students  test/
f00xxxx@plank:~/cs50-dev/demo$ date
Mon Aug 16 11:29:04 EDT 2021
f00xxxx@plank:~/cs50-dev/demo$ date > today
f00xxxx@plank:~/cs50-dev/demo$ ls
students  test/  today
f00xxxx@plank:~/cs50-dev/demo$ cat today 
Mon Aug 16 11:29:15 EDT 2021
f00xxxx@plank:~/cs50-dev/demo$ ls -lR test/ >> today 
f00xxxx@plank:~/cs50-dev/demo$ cat today 
Mon Aug 16 11:29:15 EDT 2021
test/:
total 34
-rw-r--r-- 1 f00xxxx thayerusers 6 Aug 14 07:04 somefile

The output redirection > writes the output of date to the file called today; that is, the ‘standard output’ of the date process has been directed to the file instead of the default, the display. Note that the > operation created a file that did not exist before the output redirection command was executed. Next, we append a recursive, long-format directory listing to the same file; by using the >> (double >) we tell the shell to append to the file rather than overwriting the file.

Note that the > or >> and their target filenames are not arguments to the command - the command simply writes to stdout, as it always does, but the shell has arranged for stdout to be directed to a file instead of the terminal.
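
Because the shell, not the command, performs the redirection, the redirection may in fact appear anywhere on the command line. As a quick sketch (behavior you can verify in bash), these two commands are equivalent:

f00xxxx@plank:~/cs50-dev/demo$ date > today
f00xxxx@plank:~/cs50-dev/demo$ > today date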

The shell also supports input redirection, which provides input to a program from a file (rather than the keyboard). Let’s create a file of prime numbers using output redirection. The input to the cat command can come from the standard input (i.e., the keyboard), and we can instruct the shell to redirect the cat command’s output (stdout) to a file named primes. The numbers are typed at the keyboard and CTRL-D is used to signal the end of the file (EOF).

f00xxxx@plank:~/cs50-dev/demo$ cat > primes
61
53
41
2
3
11
13
f00xxxx@plank:~/cs50-dev/demo$ ls
primes  students  test/  today
f00xxxx@plank:~/cs50-dev/demo$ cat primes 
61
53
41
2
3
11
13
f00xxxx@plank:~/cs50-dev/demo$ 

Input redirection < tells the shell to use a file as input to the command rather than the keyboard. In the input redirection example below, primes is used as input to cat, which sends its standard output to the screen.

f00xxxx@plank:~/cs50-dev/demo$ cat < primes
61
53
41
2
3
11
13

Many Unix commands (e.g., cat, sort) read from stdin if you do not specify a file on the command line. For example, if you type cat followed by a carriage return (CR), the command expects input from the standard input.
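
For example, run sort with no filename and type a few numbers, ending with CTRL-D; in this sketch the first three numbers are typed and the last three are sort’s output:

f00xxxx@plank:~$ sort -n
5
2
9
2
5
9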

Unix also supports a powerful ‘pipe’ operator for passing data between commands, written | (a vertical bar, usually located above the \ key on your keyboard). Pipes connect commands that run as separate processes; the processes are scheduled as data becomes available.

Pipes were invented by Doug McIlroy while he was working with Ken Thompson and Dennis Ritchie at AT&T Bell Labs. (As I mentioned earlier, Doug has been an adjunct professor here at Dartmouth College for several years now.) In this two-page interview, at the middle of the third column, Doug tells how pipes were invented and the | character selected as the operator. Pay special attention to the next paragraph: the Dartmouth Time Sharing System had something similar, even earlier!

Pipes are a clever invention indeed, since they remove the need for separate temporary files to share data between processes. Because commands are implemented as processes, a program reading from an empty pipe is suspended until there is data ready for it to read. There is no limit to the number of programs or commands in a pipeline. In the example below, all the programs in the pipeline run simultaneously, each waiting for input from the previous stage:

f00xxxx@plank:~$ cat >> primes
61
53
41
3
f00xxxx@plank:~$ more primes
61
53
41
2
3
11
13
61
53
41
3
f00xxxx@plank:~$ sort -n primes | uniq | grep -v 13
2
3
11
41
53
61

What is the difference between pipes and redirection? Basically, redirection (>, >>, <) is used to direct the stdout of a command to a file, or a file to the stdin of a command. Pipes (|) are used to connect the stdout of one command to the stdin of another. This operator allows us to ‘glue’ together programs as ‘filters’ that process the plain text passed between them (plain text between the processes - a nice design decision). This supports the notion of reuse and allows us to build sophisticated programs quickly and simply. It’s another cool feature of Unix.

Notice three new commands above: sort, uniq, and grep.

  • sort reads lines from stdin and outputs the lines in sorted order; here -n tells sort to use numeric order (rather than alphabetical order);
  • uniq removes duplicates, printing only one of a run of identical lines;
  • grep prints lines matching a pattern (more generally, a regular expression); here, -v inverts this behavior: print lines that do not match the pattern. In this case, the pattern is simply 13 and grep does not print that number as it comes through.

And, as we saw last time, more pauses the output when it would scroll off the screen.

Note that the original file - primes - is not changed by executing the command line above. Rather, the file is read by the sort command and the data is manipulated as it passes through each stage of the command pipeline. Because sort and cat are happy to read their input data from stdin, or from a file given as an argument, the following pipelines all achieve the same result:

f00xxxx@plank:~$ sort -n < primes       | uniq | grep -v 13 | more
f00xxxx@plank:~$ cat   primes | sort -n | uniq | grep -v 13 | more
f00xxxx@plank:~$ cat < primes | sort -n | uniq | grep -v 13 | more

Which do you think would be most efficient?

Another pipeline: how can we get the list of existing usernames on a machine?

f00xxxx@plank:~$ cut -d : -f 1 /etc/passwd | sort > usernames.txt

See man cut to understand what the first command does.
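
In brief, cut -d : -f 1 splits each line at every : delimiter and keeps only field 1, which in /etc/passwd is the username. A minimal sketch with a made-up passwd-style line:

f00xxxx@plank:~$ echo "root:x:0:0:root:/root:/bin/bash" | cut -d : -f 1
root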

Another example: what is the most popular shell? Try each of these in turn, and notice how each added stage refines the output:

f00xxxx@plank:~$ cut -d : -f 7 /etc/passwd
f00xxxx@plank:~$ cut -d : -f 7 /etc/passwd | less
f00xxxx@plank:~$ cut -d : -f 7 /etc/passwd | sort
f00xxxx@plank:~$ cut -d : -f 7 /etc/passwd | sort | uniq -c
f00xxxx@plank:~$ cut -d : -f 7 /etc/passwd | sort | uniq -c | sort -n
f00xxxx@plank:~$ cut -d : -f 7 /etc/passwd | sort | uniq -c | sort -nr
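
Breaking the final pipeline into stages (the comments are my reading of each step):

# cut -d : -f 7 /etc/passwd   extract field 7, each account's login shell
# sort                        bring identical shells together on adjacent lines
# uniq -c                     collapse each run into one line prefixed by a count
# sort -nr                    order by count, most popular shell first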

Another example: the command wc prints a count of the number of newlines, words, or bytes in a file.

f00xxxx@plank:~/cs50-dev/demo$ cat primes 
61
53
41
2
3
11
13
61
53
41
3
f00xxxx@plank:~/cs50-dev/demo$ wc -l primes 
11 primes

See man wc for more details.

Standard input, output and error

As we learned above, every process (a running program) has a standard input (abbreviated to stdin) and a standard output (stdout). The shell sets stdin to the keyboard by default, but the command line can tell the shell to redirect stdin using < or a pipe. The shell sets stdout to the display by default, but the command line can tell the shell to redirect stdout using > or >>, or to a pipe.

Each process also has a standard error (stderr), which most programs use for printing error messages. The separation of stdout and stderr is important when stdout is redirected to a file or pipe, because normal output can flow into the file or pipe while error messages still reach the user on the screen.
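
To see the separation in action, redirect stdout while provoking an error; in this sketch nosuchfile is a made-up name, and the exact wording of the error message may vary:

f00xxxx@plank:~/cs50-dev/demo$ ls primes nosuchfile > out
ls: cannot access 'nosuchfile': No such file or directory
f00xxxx@plank:~/cs50-dev/demo$ cat out
primes

The error message reached the screen even though stdout was sent to the file.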

Inside the running process these three streams are represented with numeric file descriptors:

  • 0: stdin
  • 1: stdout
  • 2: stderr

You can tell the shell to redirect using these numbers; > is shorthand for 1> and < is shorthand for 0<. You can thus redirect the standard error (file descriptor 2) with the symbol 2>. Suppose I am logged in and want to learn what the find command does, but I accidentally type msn instead of man:

f00xxxx@plank:~/cs50-dev/demo$ msn find

Command 'msn' not found, did you mean:

  command 'mon' from deb mon
  command 'sn' from deb mono-devel
  command 'mn' from deb mininet
  command 'man' from deb man-db
  command 'mvn' from deb maven
  command 'mln' from deb mmv
  command 'msb' from deb mysql-sandbox
  command 'mhn' from deb mailutils-mh
  command 'mhn' from deb nmh
  command 'mtn' from deb monotone

Try: apt install <deb name>

Suppose I wanted to capture those errors in a new file called errors:

f00xxxx@plank:~/cs50-dev/demo$ msn find 2> errors
f00xxxx@plank:~/cs50-dev/demo$ cat errors

Command 'msn' not found, did you mean:

  command 'mn' from deb mininet
  command 'man' from deb man-db
  command 'mhn' from deb mailutils-mh
  command 'mhn' from deb nmh
  command 'mon' from deb mon
  command 'sn' from deb mono-devel
  command 'mvn' from deb maven
  command 'msb' from deb mysql-sandbox
  command 'mln' from deb mmv
  command 'mtn' from deb monotone

Try: apt install <deb name>

The file errors contains the error messages we saw earlier.

Alternatively, we could ignore the error output entirely by redirecting it to a place where all characters go and never return!

f00xxxx@plank:~/cs50-dev/demo$ ls
errors  primes  students  test/  today
f00xxxx@plank:~/cs50-dev/demo$ rm errors
rm: remove regular file 'errors'? y
f00xxxx@plank:~/cs50-dev/demo$ msn find 2> /dev/null

The file called /dev/null is a special kind of file - it’s not a file at all, actually, it’s a ‘device’ that simply discards anything written to it. (If you read from it, it appears to be an empty file.)
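
Bash can also send both streams to the same place: 2>&1 tells the shell to point file descriptor 2 wherever descriptor 1 currently points. A minimal sketch (all.out is just an illustrative filename):

f00xxxx@plank:~/cs50-dev/demo$ msn find > all.out 2>&1

all.out now holds both the (empty) standard output and the error messages. Note the order matters: the shell processes redirections left to right, so > all.out must come before 2>&1.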

Special characters

There are a number of special characters interpreted by the shell - spaces, tabs, wildcard (‘globbing’) characters for filename expansion, redirection symbols, and so forth. Because the shell interprets them in a special manner, these characters cannot be used as regular characters without quoting. The special characters include:

& ; | * ? ` " ' [ ] ( ) $ < > { } # / \ ! ~

We have already used several of these special characters. Don’t try to memorize them at this stage. Through use, they will become second nature. We will just give some examples of the ones we have not discussed so far.

Quoting

If you need to use one of these special characters as a regular character, you can tell the shell not to interpret it by escaping or quoting it. To escape a single special character, precede it with a backslash \; earlier we saw how to escape the character * with \*. To escape multiple special characters (as in **), escape each: \*\*. You can also quote using single quotation marks such as '**' or double quotation marks such as "**" - but these have subtly different behavior. You might use this form when quoting a filename with embedded spaces: "Homework assignment".

You will often need to pass special characters as part of arguments to commands and other programs - for example, an argument that represents a pattern to be interpreted by the command, as happens often with find and grep.
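
For instance, the characters [ and ] are special to the shell (filename matching), so it is safest to quote a grep pattern that uses them. A small sketch with the primes file from earlier:

f00xxxx@plank:~/cs50-dev/demo$ grep '1[13]' primes
11
13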

There is a situation where single quotes work differently than double quotes. If you use a pair of single quotes around a shell variable substitution (like $USER), the variable’s value will not be substituted, whereas it would be substituted within double quotes:

f00xxxx@plank:~/cs50-dev/demo$ echo "$LOGNAME uses $SHELL shell and the home directory is $HOME"
f00xxxx uses /bin/bash shell and the home directory is /thayerfs/home/f00xxxx
f00xxxx@plank:~/cs50-dev/demo$ echo '$LOGNAME uses $SHELL shell and the home directory is $HOME'
$LOGNAME uses $SHELL shell and the home directory is $HOME

Example 1. Double-quotes are especially important in shell scripts, because the variables involved might hold user input (a command-line argument or keyboard input), a file name, or the output of a command; such variables should always be quoted when substituted, because spaces (and other special characters) embedded in the value of the variable can cause confusion. Thus:

f00xxxx@plank:~/cs50-dev/demo$ dirName="Homework three"
f00xxxx@plank:~/cs50-dev/demo$ mkdir "dirName"
f00xxxx@plank:~/cs50-dev/demo$ ls
dirName/  primes  students  test/  today
f00xxxx@plank:~/cs50-dev/demo$ mkdir "$dirName"
f00xxxx@plank:~/cs50-dev/demo$ ls
 dirName/  'Homework three'/   primes   students   test/   today
f00xxxx@plank:~/cs50-dev/demo$ mkdir $dirName
f00xxxx@plank:~/cs50-dev/demo$ ls
 dirName/   'Homework three'/   students   three/
 Homework/   primes             test/      today

Try it! Notice that mkdir "dirName" created a directory literally named dirName, while the unquoted mkdir $dirName split the value into two words and created two directories, Homework and three.

Example 2. Escapes and quoting let us pass special characters and patterns to commands.

Suppose I have a list of email addresses in email.txt, one per line.

Apple Banana <AppleBanana@dartmouth.edu>
Cherry Drink <CherryDrink@dartmouth.edu>
Extra Sauce <ExtraSauce@dartmouth.edu>

I can put all the addresses on one line, separated by commas, using the tr command:

f00xxxx@plank:~/cs50-dev/demo$ tr "\n" , < email.txt
Apple Banana <AppleBanana@dartmouth.edu>,Cherry Drink <CherryDrink@dartmouth.edu>,Extra Sauce <ExtraSauce@dartmouth.edu>,

The tr command translates each instance of the character given in the first argument (\n) to the character given in the second argument (,). \n is a single special character in Unix called ‘newline’; it marks the end of one line and the beginning of the next. We used double-quotes so that the \ reaches tr intact, since \ is a special character in bash. Without the quotes, we would need to write \\n.

Note that the tr command does not take file names as arguments, so we have to use input redirection for the command to read the contents of email.txt.
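
To confirm, here is the unquoted form; the shell turns \\n into \n before tr sees it, so the output is the same:

f00xxxx@plank:~/cs50-dev/demo$ tr \\n , < email.txt
Apple Banana <AppleBanana@dartmouth.edu>,Cherry Drink <CherryDrink@dartmouth.edu>,Extra Sauce <ExtraSauce@dartmouth.edu>,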

Example 3. An even more powerful filtering tool - the stream editor called sed - allows you to transform occurrences of one or more patterns in the input file(s):

	sed pattern [file]...  

Here are some examples with the email.txt file:

f00xxxx@plank:~/cs50-dev/demo$ ls
email.txt  primes  students  test/  today
f00xxxx@plank:~/cs50-dev/demo$ cat email.txt 
Apple Banana <AppleBanana@dartmouth.edu>
Cherry Drink <CherryDrink@dartmouth.edu>
Extra Sauce <ExtraSauce@dartmouth.edu>

# remove Cherry
f00xxxx@plank:~/cs50-dev/demo$ sed '/Cherry/d' email.txt
Apple Banana <AppleBanana@dartmouth.edu>
Extra Sauce <ExtraSauce@dartmouth.edu>

# remove Cherry, remove names,
f00xxxx@plank:~/cs50-dev/demo$ sed -e '/Cherry/d' -e 's/.*<//' email.txt
AppleBanana@dartmouth.edu>
ExtraSauce@dartmouth.edu>

# remove Cherry, remove names, make comma-sep list of addresses
f00xxxx@plank:~/cs50-dev/demo$ sed -e '/Cherry/d' -e 's/.*<//' -e 's/>.*/,/' email.txt
AppleBanana@dartmouth.edu,
ExtraSauce@dartmouth.edu,

The above uses the -e switch to sed, which allows one to list more than one pattern on the same command line.

A few quick notes about sed’s patterns:

  • d deletes lines matching the pattern
  • p prints lines matching the pattern (useful with -n; see the sketch after this list)
  • s substitutes text for matches to the pattern.
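
Since p was not demonstrated above, here is a minimal sketch; with -n, sed prints only the lines that p selects:

f00xxxx@plank:~/cs50-dev/demo$ sed -n '/Cherry/p' email.txt
Cherry Drink <CherryDrink@dartmouth.edu>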

The above uses regular expressions to specify patterns: any string before the < character (.*<), and any string after the > character (>.*). Please refer to this summary on regular expressions so that you know how to use them too!

Example 4. I saved a list of made-up students enrolled in CS50 in the file students_email.txt. Each line is of the form First.M.Last.xx@Dartmouth.edu;.

Let’s suppose they all decide to move to Harvard.

# Replace Dartmouth with Harvard
f00xxxx@plank:~/cs50-dev/demo$ sed s/Dartmouth/Harvard/ students_email.txt 
Ak.R.Ak.21@Harvard.edu;
Me.Y.An.21@Harvard.edu;
Al.U.Ar.22@Harvard.edu;
Ar.T.At.23@Harvard.edu;
De.G.Br.23@Harvard.edu;
Mi.B.Ce.23@Harvard.edu;

# Replace Dartmouth with Harvard, remove the two-digit class number, remove the semicolon
f00xxxx@plank:~/cs50-dev/demo$ sed -e 's/Dartmouth/Harvard/' -e 's/\.[0-9][0-9]//' -e 's/;$//' students_email.txt
Ak.R.Ak@Harvard.edu
Me.Y.An@Harvard.edu
Al.U.Ar@Harvard.edu
Ar.T.At@Harvard.edu
De.G.Br@Harvard.edu
Mi.B.Ce@Harvard.edu

Notice how I quoted those patterns from the shell, and even escaped the dot from sed’s normal meaning (dot matches any character) so sed would look for a literal dot in that position.

Here’s another fun pipe: count the number of students from each class (leveraging the class numbers in the email addresses). Here ^$ is a regular-expression pattern: ^ matches the beginning of a line and $ matches the end, so ^$ matches an empty line.

f00xxxx@plank:~/cs50-dev/demo$ tr -c -d 0-9\\n < students_email.txt | sed 's/^$/other/'  | sort  | uniq -c | sort -nr
      3 23
      2 21
      1 22
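
Breaking that pipeline into stages (the comments are my reading of each step):

# tr -c -d 0-9\\n     delete every character that is not a digit or a newline
# sed 's/^$/other/'   any line left empty (an address with no digits) becomes the word "other"
# sort                group identical class numbers on adjacent lines
# uniq -c             collapse each run, prefixing it with a count
# sort -nr            order by count, largest class first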

See man sed or the sed FAQ for more info. You’ll want to learn a bit about regular expressions, which are used to describe patterns in sed’s commands; see sed regexp info.

Copying with scp

If you want to copy files (or directories) from your laptop to the Linux server, or from the Linux server to your laptop, use the scp command. For our purposes the syntax is

scp [-r] [username@hostname:]filename [username@hostname:]filename

which specifies the source and destination of the copy. For an argument including a hostname, the filename is relative to the user’s home directory. If you want to specify a file in another directory, write out its pathname (relative or absolute). Note the : - this is how scp distinguishes between local and remote locations.

In one common case, you might copy a file program.c from your laptop to plank:

scp program.c netID@plank.thayer.dartmouth.edu:~/cs50-dev/demo

Or in the opposite direction:

scp netID@plank.thayer.dartmouth.edu:~/cs50-dev/demo/program.c program_copy.c

You can copy a directory, recursively, with the -r flag:

scp -r netID@plank.thayer.dartmouth.edu:~/cs50-dev/demo demo_copy

Historical note

Another important moment in computing history that happened at Dartmouth!

Dartmouth AI conference

Can you find the building?

Other things

There’s never enough time in class to show you everything.

Compressing and archiving files

It is often useful to bundle several files into a compressed archive file. You may have encountered files like this on the Internet - such as files.zip, something.tar, or something-else.tgz. Each packs together several files - including directories, creation dates, access permissions, as well as file contents - into one single file. This is a convenient way to transfer large collections of files.

On Unix it is most common to use the tar utility (short for tape archive - from back when we used tapes) to create an archive of all the files and directories listed in the arguments, and to name it something appropriate. We often ask tar to compress the resulting archive too.

Given a directory stuff, you can create (c) a compressed tar archive (aka, a “tarball”), and then list (t) its contents. The f refers to a file and the z refers to compression.

$ mkdir stuff
$ echo > stuff/x
$ tar cfz stuff.tgz stuff
 98.8%
$ tar tfz stuff.tgz
stuff/
stuff/x

The command leaves the original directory and files intact. Notice that tar has an unconventional syntax for its switches - there is no dash (-) before them.

To unpack the archive,

$ tar xfz stuff.tgz

In short, c for create, t for table of contents (list), x for extract. The f indicates that the next argument is the tarball file name. The z indicates that the tarball should be compressed. By convention, a tarball filename ends in .tar if not compressed, .tgz if compressed.
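
For an uncompressed archive, simply drop the z (a quick sketch, reusing the stuff directory from above):

$ tar cf stuff.tar stuff
$ tar tf stuff.tar
stuff/
stuff/x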