Chapter 12. Shell scripts

Many programming languages are promoted as being suited for rapid development. In a sense, I agree with each such claim. For simpler errands, I’ll use a shell script, giving me access to all languages and tools at once: sed, awk, bash, Python, Perl, bc, grep, tr, cut, uniq, etc. Each instrument works in concert with others to perform complex symphonies.

Long ago, Doug McIlroy wrote: “This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.”

Years later, Rob Pike observed: “Those days are dead and gone and the eulogy was delivered by Perl.” His statement mostly stands, except Python is the new Perl. In “UNIX Style, or cat -v Considered Harmful”, he also laments that the original sharp focus of classic tools has been irrevocably blurred by a series of unfortunate design choices.

Fortunately, the reports of the demise of shell scripts are greatly exaggerated. Firstly, their most profound contribution, pipelines, lives on in modern languages in incarnations such as Go’s channels and Haskell’s lazy evaluation. Secondly, shell scripts appear to have many other die-hard fans. See Greg’s Wiki; Bash cures cancer; the Advanced Bash-Scripting Guide; shell-fu.

Bash is my first refuge when a task seems too bothersome for C. I have little experience with other shells. There may be better choices, but Bash is widespread and good enough.

In defence of pipes

The highly enjoyable UNIX-HATERS Handbook dedicates a section to mocking Unix pipes.

They open with a jibe that it takes 4 commands worth of line noise to rename all .pas files to .p files, and also poke fun at somebody who had an “epiphany” while writing such a script.

I agree that renaming files via a script is nothing to write home about, unless one is accustomed to renaming via a GUI file manager (seeing as the former can be done with O(1) effort, while the latter requires O(N) clicks). However, I disagree it requires 4 noisy commands. On the contrary:

$ rename 's/as$//' *.pas

The “rename” command is actually a Perl script, but who cares?
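Even without the rename script, ordinary Bash parameter expansion keeps the job to one line; here, ${f%as} is $f with a trailing “as” stripped:

```shell
# Rename all .pas files to .p: ${f%as} removes a trailing "as".
for f in *.pas; do mv "$f" "${f%as}"; done
```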

Later, they rebut a criticism leveled at classic Macs by stating that despite the lack of pipes, it is easy to paste artwork into a memo and have text flow around it.

Firstly, building user-friendly tools for image manipulation is orthogonal to pipes, which are by no means the only way for applications to share data. A modern Mac is powered by a form of Unix, complete with pipes, yet it also exhibits a slick polished GUI that makes it easy to paste artwork into a memo and have text flow around it.

Secondly, even with images, some tasks are easier with scripts. For example, I once took many photos with a digital camera. Thanks to the netpbm tools, in no time I whipped up a script to copy the latest photos to subdirectories according to the day they were taken, shrink each one to a certain size, then generate an HTML page showing thumbnails linked to the full-size pictures. I have no idea how to do all this in one step with a typical GUI.
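The exact script depends on one’s camera and netpbm version, but the sort-by-day step alone gives the flavour. A sketch, assuming the photos sit in a hypothetical photos/ directory and that GNU date (whose -r option prints a file’s modification time) is available:

```shell
# Copy each photo into a subdirectory named after the day it was taken,
# using the file's modification time as a stand-in for the shot date.
# Assumes GNU date; the photos/ layout is made up for illustration.
for f in photos/*.jpg; do
  day=$(date -r "$f" +%F)          # e.g. 2009-07-04
  mkdir -p "photos/by-day/$day"
  cp "$f" "photos/by-day/$day/"
done
```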

They claim the “correct model” is to implement procedure calls with first-class structures. Though they stress the importance of making things easier for the user (“Unix was designed to please programmers, whereas the Mac was designed to please users.”), they seem to forget programmers are users too. Indeed, programmers probably use computers more frequently and for longer durations than average users. I would rather string together a handful of commands than deal with the boilerplate of yet another language.

Programmers are wowed when they first learn coroutines or lazy lists. Rightly so, as their expressiveness is impressive. As we’ll see, pipes are equivalent constructs. Pipes supply simple syntax for advanced techniques. How can this be bad?

Substitution ciphers

An easy way to permute the alphabet is to pick a keyword, such as “keyword”, and append the remaining letters of the alphabet in order. In other words:

$ echo keyword$(echo {a..z} | tr -d 'keyword ')
keywordabcfghijlmnpqstuvxz

We can encipher with tr:

$ perm=$(echo keyword$(echo {a..z} | tr -d 'keyword '))
$ echo the quick brown fox jumps over the lazy dog | tr a-z $perm
qao msbyf enjui rjv cshlp jton qao gkzx wjd
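Deciphering with the key in hand is just as easy: swap tr’s arguments, so that each ciphertext letter maps back to its plaintext counterpart (with perm defined as above):

```shell
$ echo qao msbyf enjui rjv cshlp jton qao gkzx wjd | tr $perm a-z
the quick brown fox jumps over the lazy dog
```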

More interesting is the reverse: given the ciphertext how do we recover the message without the key?

Let’s try frequency analysis. Start by printing the letter histogram:

$ c="qao msbyf enjui rjv cshlp jton qao gkzx wjd"
$ echo $c | fold -w 1 | sort | uniq -c | sort -n

Strange: in the first column, there are many 1s, four 2s, one 3, one 4 and one 8; the 8 counts the spaces. Is every letter present? Filtering the output through wc -l yields 27. So yes, each letter of the alphabet occurs, along with the space. There are few short English sentences which use the entire alphabet, and only one famous one, which turns out to be the solution.
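Dropping the space and counting distinct letters confirms the hunch directly:

```shell
$ echo $c | tr -d ' ' | fold -w 1 | sort -u | wc -l
26
```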

Suppose, however, that I only managed to figure out a few letters, say, that “o” is “e”, “n” is “r”, “j” is “o”. I could run:

$ echo $c | tr onj ERO
qaE msbyf eROui rOv cshlp OtER qaE gkzx wOd

which may lead me to guess “OtER” is “OVER”. Of course, I can translate away the lowercase letters if I find them distracting:

$ echo $c | tr onj ERO | tr a-z .
..E ..... .RO.. .O. ..... O.ER ..E .... .O.

Gray code

Suppose we have a script named “gray”:

#!/bin/bash

f() {
  echo $1 | tr ' ' '\n' | sed 's/^/0/'
  echo $1 | tr ' ' '\n' | tac | sed 's/^/1/'
}

a=
for i in $(seq $1); do a=$(f "$a"); done
echo "$a"

Running “./gray 5” will print a 5-bit Gray code. I needed the “tr” commands because Bash unfortunately turns the newlines into spaces when the result of a command substitution is later expanded unquoted and word-split (this, not pipes, is a Unix wart that deserves derision; thankfully such warts tend to be relatively insignificant). Try generating Gray code via pointing and clicking!
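We can also check the defining property, that consecutive codewords differ in exactly one bit, with a few more lines of Bash; the generator is repeated here so the check stands alone:

```shell
#!/bin/bash
# Generate a 5-bit Gray code, then verify that consecutive
# codewords differ in exactly one bit position.
f() {
  echo $1 | tr ' ' '\n' | sed 's/^/0/'
  echo $1 | tr ' ' '\n' | tac | sed 's/^/1/'
}
a=
for i in $(seq 5); do a=$(f "$a"); done

prev=
for code in $a; do
  if [ -n "$prev" ]; then
    diff=0
    for ((j=0; j<${#code}; j++)); do
      if [ "${prev:j:1}" != "${code:j:1}" ]; then diff=$((diff+1)); fi
    done
    if [ $diff -ne 1 ]; then echo "not a Gray code"; exit 1; fi
  fi
  prev=$code
done
echo "OK: $(echo $a | wc -w) codewords"
```

It should report 32 codewords, one per 5-bit string.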

Lazy evaluation

Consider the following script, which we’ll call “up”:

#!/bin/bash
i=1
while true; do
  echo $i
  i=$((i+1))
done

This loops infinitely, printing the positive integers in ascending order. We used while for clarity; one might prefer a one-liner with a similar effect:

$ yes '' | nl -ba

Why write a program that never terminates? If we run:

$ ./up | sed 16q | while read n; do echo $((n*n)); done

then we see only the first 16 squares. The system maintains a buffer between the up script and the sed command, both of which execute in parallel. If the buffer fills, up is suspended until a process further down the pipeline reads from it. If the downstream process terminates, the upstream ones are no longer needed and are also terminated: their next write to the broken pipe raises SIGPIPE, which kills them.
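The termination mechanism is observable from the shell. When head quits after one line, the next write by yes fails with SIGPIPE, and Bash reports the death as exit status 128 + 13 = 141:

```shell
$ yes | head -n 1 > /dev/null
$ echo ${PIPESTATUS[0]}
141
```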

Go’s channels are neater because of the presence of types: in a sense, each Go channel holds at most one item while waiting for it to be read. Nonetheless, despite the slop, the Unix pipeline buffering mechanism works well in practice.

Not all systems handle pipes in this manner. MS-DOS, being a single-tasking operating system, waits for one program to terminate before feeding its entire output to the next program. This example would fail, as the first command in the pipeline never terminates.

Above, we used an infinite loop to illustrate a point. If we really did want the first 16 squares, it would be easier to run:

$ for n in $(seq 16); do echo $((n*n)); done

Sieve of Eratosthenes

So far, we’ve been using standard input and standard output as a text channel between two programs. We can solve surprisingly many problems, such as producing a letter frequency histogram. However, some problems require more plumbing. How do we add more channels?

One answer is mkfifo, which creates named pipes. We can then move data in more complex ways. For instance, we can mimic the Sieve of Eratosthenes example from the official Go tutorial, a series of filters running in parallel, one per prime:

#!/bin/bash
generate() {
  i=2
  while true; do
    echo $i
    i=$((i+1))
  done
}

filter() {
  while read a; do
    if [ $((a%$1)) -ne 0 ]; then
      echo $a
    fi
  done
}

dir=$(mktemp -d)
ch=1
mkfifo $dir/$ch
generate > $dir/$ch &
while true; do
  exec 3<$dir/$ch
  read p <&3
  echo $p
  ch1=$((ch+1))
  mkfifo $dir/$ch1
  filter $p <&3 > $dir/$ch1 &
  ch=$ch1
done

Like their Go counterparts, the generate and filter functions are infinite loops that can block when they attempt to send output.
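Saved as, say, “sieve” (the name is my choice), the script churns out primes forever, so truncate its output downstream:

```shell
$ ./sieve | sed 8q
2
3
5
7
11
13
17
19
```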

Above, we cannot write something like:

cat $dir/$ch | read p

because such a command automatically closes the file descriptor after it has finished, with undesirable consequences for processes attempting to send us output. Instead, we use file descriptor 3 to keep the named pipe open even after we have read its first output, so we can later redirect it.
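Here is the trick in miniature: exec keeps descriptor 3 on the named pipe across both reads, so the background writer sees one continuous reader.

```shell
dir=$(mktemp -d)
mkfifo $dir/ch
seq 3 > $dir/ch &        # background writer
exec 3<$dir/ch           # open the pipe once, on descriptor 3
read a <&3               # first read does not close the pipe...
read b <&3               # ...so a second read still works
exec 3<&-                # done; close it
echo $a $b               # prints: 1 2
```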

As in the tutorial, we can also rewrite the program so each function creates its output channel:

#!/bin/bash
dir=$(mktemp -d)

makechan() {
  ch=$(mktemp -d --tmpdir=$dir)/fifo
  mkfifo $ch
}

generate() {
  makechan
  anon() {
    i=2
    while true; do
      echo $i
      i=$((i+1))
    done
  }
  anon > $ch &
}

filter() {
  makechan
  anon() {
    while read a; do
      if [ $((a%$1)) -ne 0 ]; then
        echo $a
      fi
    done
  }
  anon $1 <&0 > $ch &
}

sieve() {
  makechan
  anon() {
    generate
    while true; do
      exec 3<$ch
      read p <&3
      echo $p
      filter $p <&3
    done
  }
  anon > $ch &
}

sieve
cat $ch
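
This version behaves identically from the outside; saving it as, say, “sieve2” (again, my name for it):

```shell
$ ./sieve2 | sed 5q
2
3
5
7
11
```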