Chapter 4 Going Further with the Unix Shell
There isn’t a way things should be. There’s just what happens, and what we do.
— Terry Pratchett
The previous chapters explained how we can use the command line to do all of the things we can do with a GUI, and how to combine commands in new ways using pipes and redirection. This chapter extends those ideas to show how we can create new tools by saving commands in files and how to use a more powerful version of wildcards to extract data from files.
We’ll be continuing to work in the zipf project, which after the previous chapter should contain the following files:
zipf/
└── data
    ├── README.md
    ├── dracula.txt
    ├── frankenstein.txt
    ├── jane_eyre.txt
    ├── moby_dick.txt
    ├── sense_and_sensibility.txt
    ├── sherlock_holmes.txt
    └── time_machine.txt
Deleting Extra Files
You may have additional files if you worked through all of the exercises in the previous chapter. Feel free to delete them or move them to a separate directory. If you have accidentally deleted files you need, you can download them again by following the instructions in Section 1.2.
4.1 Creating New Commands
Loops let us run the same commands many times, but we can go further and save commands in files so that we can repeat complex operations with a few keystrokes. For historical reasons a file full of shell commands is usually called a shell script, but it is really just another kind of program.
Let’s start by creating a new directory for our runnable programs called bin, consistent with the project structure described in Section 1.1.2. Working inside bin, edit a new file called book_summary.sh to hold our shell script (for example, by running nano book_summary.sh) and insert this line:
head -n 17 ../data/moby_dick.txt | tail -n 8
Note that we do not put the
$ prompt at the front of the line.
We have been showing that to highlight interactive commands,
but in this case we are putting the command in a file rather than running it immediately.
Empty Line at the End of a File?
You’ll often see scripts from many languages that end in an empty line. What you are seeing, though, is the last line of code ending in a newline character. This indicates to the computer that the code has ended. While this newline character is not required for shell scripts to work, and sometimes isn’t shown by coding tools, it does make it easier to view and modify scripts. When you are copying code from this book, remember to add an empty line at the end!
Once we have added this line, we can save the file with Ctrl+O and exit with Ctrl+X. ls shows that our file now exists, and we can check the contents of the file using cat book_summary.sh. Since the file holds an ordinary command, we can now ask the shell to run it with bash book_summary.sh:
Title: Moby Dick
or The Whale
Author: Herman Melville
Editor:
Release Date: December 25, 2008 [EBook #2701]
Posting Date:
Last Updated: December 3, 2017
Language: English
Sure enough, our script’s output is exactly the same text we would get if we ran the command directly. If we want, we can pipe the output of our shell script to another command to count how many lines it contains: bash book_summary.sh | wc -l reports 8.
What if we want our script to print the name of the book’s author? The command grep finds and prints lines that match a pattern. We’ll learn more about grep in Section 4.4, but for now we can edit the script and add a search for the word “Author”:
head -n 17 ../data/moby_dick.txt | tail -n 8 | grep Author
Sure enough, when we run our modified script with bash book_summary.sh, we get the line we want:
Author: Herman Melville
And once again we can pipe the output of our script into other commands just as we would pipe the output from any other program. Here, we count the number of words in the author line: bash book_summary.sh | wc -w reports 3.
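The whole workflow of this section can be sketched end to end as follows. This is a minimal sketch under stated assumptions: it builds a small stand-in file instead of using moby_dick.txt, and the names demo_dir, sample.txt, and first_lines.sh are hypothetical.

```shell
# Work in a throwaway directory (all names here are hypothetical).
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# A stand-in for a book: 20 numbered lines.
awk 'BEGIN { for (i = 1; i <= 20; i++) print "line " i }' > sample.txt

# Save a pipeline in a file. Note that there is no "$ " prompt in the script.
echo 'head -n 17 sample.txt | tail -n 8' > first_lines.sh

# Run the script, then pipe its output into another command.
bash first_lines.sh
line_count=$(bash first_lines.sh | wc -l | tr -d ' ')
echo "lines printed: $line_count"
```

The script behaves like any other program: its output can be displayed, piped, or redirected.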
4.2 Making Scripts More Versatile
Getting the name of the author for only one of the books isn’t particularly useful. What we really want is a way to get the name of the author from any of our files. To do this, we edit book_summary.sh again and replace ../data/moby_dick.txt with a special variable, $1. Once our change is made, book_summary.sh should contain:
head -n 17 $1 | tail -n 8 | grep Author
Inside a shell script, $1 means “the first argument on the command line”. If we now run our script as bash book_summary.sh ../data/moby_dick.txt, then $1 is assigned ../data/moby_dick.txt and we get exactly the same output as before. If we give the script a different filename, such as ../data/frankenstein.txt, we get the name of the author of that book instead:
Author: Mary Wollstonecraft (Godwin) Shelley
Our small script is now doing something useful, but it may take the next person who reads it a moment to figure out exactly what. We can improve our script by adding comments at the top:
# Get author information from a Project Gutenberg eBook.
# Usage: bash book_summary.sh /path/to/file.txt
head -n 17 $1 | tail -n 8 | grep Author
As in R and Python,
a comment starts with a
# character and runs to the end of the line.
The computer ignores comments,
but they help people (including our future self) understand and use what we’ve created.
Let’s make one more change to our script. Instead of always extracting the author name, let’s have it select whatever information the user specified:
# Get desired information from a Project Gutenberg eBook.
# Usage: bash book_summary.sh /path/to/file.txt what_to_look_for
head -n 17 $1 | tail -n 8 | grep $2
The change is very small: we have replaced the fixed string ‘Author’ with a reference to the special variable $2, which is assigned the value of the second command-line argument we give the script when we run it.
Update Your Comments
As you update the code in your script, don’t forget to update the comments that describe the code. A description that sends readers in the wrong direction is worse than none at all, so do your best to avoid this common oversight.
Let’s check that it works by asking for Frankenstein’s release date with bash book_summary.sh ../data/frankenstein.txt Release:
Release Date: June 17, 2008 [EBook #84]
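The way $1 and $2 pick up command-line arguments can be tried without the real books. The sketch below assumes a made-up header file, a hypothetical script name (summary_demo.sh), and one deliberate variation: the arguments are wrapped in double quotes so that a filename containing a space still works.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# A made-up header in the style of a Project Gutenberg file.
cat > "sample book.txt" <<'EOF'
Title: A Sample Book
Author: An Example Writer
Release Date: January 1, 2000
Language: English
EOF

# Same idea as book_summary.sh, but with quoted arguments (a variation,
# not the script shown in the text).
cat > summary_demo.sh <<'EOF'
# Usage: bash summary_demo.sh /path/to/file.txt what_to_look_for
head -n 17 "$1" | tail -n 8 | grep "$2"
EOF

# $1 receives the filename, $2 receives the word to search for.
author_line=$(bash summary_demo.sh "sample book.txt" Author)
release_line=$(bash summary_demo.sh "sample book.txt" Release)
echo "$author_line"
echo "$release_line"
```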
4.3 Turning Interactive Work into a Script
Suppose we have just run a series of commands that did something useful, such as summarizing all books in a given directory:
authors.txt book_summary.sh releases.txt
Instead of typing those commands into a file in an editor (and potentially getting them wrong), we can use history and redirection to save recent commands to a file. For example, running history 6 > summarize_all_books.sh saves the last six commands to summarize_all_books.sh, which then contains:
297 for x in ../data/*.txt; do echo $x; bash book_summary.sh $x Author; done > authors.txt
298 for x in ../data/*.txt; do echo $x; bash book_summary.sh $x Release; done > releases.txt
299 ls
300 mkdir ../results
301 mv authors.txt releases.txt ../results
302 history 6 > summarize_all_books.sh
We can now open the file in an editor, remove the serial numbers at the start of each line, and delete the lines we don’t want to create a script that captures exactly what we actually did. This is how we usually develop shell scripts: run commands interactively a few times to make sure they are doing the right thing, then save our recent history to a file and turn that into a reusable script.
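Removing the serial numbers by hand works, but it can also be done with sed. The following is a sketch rather than part of the original workflow: saved.sh and cleaned.sh are hypothetical names, and the file contents imitate what history might have written.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Stand-in for the kind of file that "history 3 > saved.sh" leaves behind.
cat > saved.sh <<'EOF'
  299  ls
  300  mkdir ../results
  301  mv authors.txt releases.txt ../results
EOF

# Delete leading spaces, the serial number, and the spaces that follow it.
sed 's/^ *[0-9][0-9]* *//' saved.sh > cleaned.sh
cat cleaned.sh
```

It is still worth reading the cleaned file before running it, since the history may include commands we do not want to repeat.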
4.4 Finding Things in Files
We can use head and tail to select lines from a file by position, but we also often want to select lines that contain certain values. This is called filtering, and we usually do it in the shell with the command grep that we briefly met in Section 4.1. Its name is an acronym of “global regular expression print”, which was a common sequence of operations in early Unix text editors.
To show how grep works, we will use our sleuthing skills to explore sherlock_holmes.txt. First, let’s find lines that contain the word “Sherlock”. Since there are likely to be hundreds of matches, we will pipe grep’s output to head to show only the first few. Here, Sherlock is our (very simple) pattern. grep searches the file line by line and shows those lines that contain matches, so the output is:
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
Title: The Adventures of Sherlock Holmes
To Sherlock Holmes she is always THE woman. I have seldom heard
as I had pictured it from Sherlock Holmes' succinct description,
"Good-night, Mister Sherlock Holmes."
If we run grep sherlock instead we get no output, because grep patterns are case-sensitive. If we want to make the search case-insensitive, we can add the option -i:
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
Title: The Adventures of Sherlock Holmes
*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***
THE ADVENTURES OF SHERLOCK HOLMES
To Sherlock Holmes she is always THE woman. I have seldom heard
This output is different from our previous output because of the lines containing “SHERLOCK” near the top of the file.
Next, let’s search for the pattern on:
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
with this eBook or online at www.gutenberg.net
Author: Arthur Conan Doyle
In each of these lines,
our pattern (“on”) is part of a larger word such as “Conan”.
To restrict matching to lines containing on by itself, we can give grep the -w option (for “match words”):
One night--it was on the twentieth of March, 1888--I was
put on seven and a half pounds since I saw you."
that I had a country walk on Thursday and came home in a dreadful
"It is simplicity itself," said he; "my eyes tell me that on the
on the right side of his top-hat to show where he has secreted
What if we want to search for a phrase rather than a single word? If we type the words on the as two separate arguments ahead of data/sherlock_holmes.txt, we get:
grep: the: No such file or directory
data/sherlock_holmes.txt:Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
data/sherlock_holmes.txt:This eBook is for the use of anyone anywhere at no cost and with
data/sherlock_holmes.txt:almost no restrictions whatsoever. You may copy it, give it away or
data/sherlock_holmes.txt:with this eBook or online at www.gutenberg.net
data/sherlock_holmes.txt:Author: Arthur Conan Doyle
In this case, grep uses on as the pattern and tries to find it in files called the and data/sherlock_holmes.txt. It then tells us that the file the cannot be found, but prints data/sherlock_holmes.txt as a prefix to each other line of output to tell us which file those lines came from.
If we want to give
grep both words as a single argument,
we must wrap them in quotation marks as before:
One night--it was on the twentieth of March, 1888--I was
drug-created dreams and was hot upon the scent of some new
"It is simplicity itself," said he; "my eyes tell me that on the
on the right side of his top-hat to show where he has secreted
pink-tinted note-paper which had been lying open upon the table.
Quotation marks aren’t specific to grep: the shell interprets them before running commands, just as it expands wildcards to create filenames no matter what command those filenames are being passed to. This allows us to do things like head -n 5 "My Thesis.txt" to get lines from a file that has a space in its name. It is also why many programmers write "$variable" instead of just $variable when creating loops or shell scripts: if there’s any chance at all that the variable’s value will contain spaces, it’s safest to put it in quotes.
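Both uses of quotation marks can be seen in one short sketch. The file contents and the variable names (phrase_hits, filename, first_line) are invented for illustration; only the filename "My Thesis.txt" comes from the text.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

printf 'on the table\nupon reflection\non a shelf\n' > "My Thesis.txt"

# The quotes turn the two-word phrase into a single pattern argument,
# and keep the filename with a space in one piece.
phrase_hits=$(grep -c "on the" "My Thesis.txt")

# Quoting the loop variable keeps each filename intact as well.
for filename in *.txt
do
    first_line=$(head -n 1 "$filename")
done
echo "$phrase_hits matching line(s); first line: $first_line"
```

Without the quotes around $filename, head would be asked to open two files, My and Thesis.txt, neither of which exists.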
One of the most useful options for grep is -n, which numbers the lines that match the search:
105:One night--it was on the twentieth of March, 1888--I was
118:drug-created dreams and was hot upon the scent of some new
155:"It is simplicity itself," said he; "my eyes tell me that on the
165:on the right side of his top-hat to show where he has secreted
198:pink-tinted note-paper which had been lying open upon the table.
grep has many options—so many,
that almost every letter of the alphabet means something to it:
GREP(1)                   BSD General Commands Manual                  GREP(1)

NAME
     grep, egrep, fgrep, zgrep, zegrep, zfgrep -- file pattern searcher

SYNOPSIS
     grep [-abcdDEFGHhIiJLlmnOopqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
          [-e pattern] [-f file] [--binary-files=value] [--color[=when]]
          [--colour[=when]] [--context[=num]] [--label] [--line-buffered]
          [--null] [pattern] [file ...]
...more...
We can combine options to grep as we do with other Unix commands. For example, we can combine two options we’ve covered previously with -v to invert the match, i.e., to print lines that don’t match the pattern:
2:
4:almost no restrictions whatsoever. You may copy it, give it away or
6:with this eBook or online at www.gutenberg.net
7:
8:
As we learned in Section 2.2, we can write this command with its options merged into a single group after one dash, but probably shouldn’t for the sake of readability.
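To see that separate and merged options behave identically, here is a small sketch. The choice of -i, -n, and -v and the contents of sample.txt are assumptions made for illustration, not necessarily the command used to produce the output above.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

printf 'The first line\nno match here\nAnother the\nplain text\n' > sample.txt

# -i ignores case, -n numbers the lines, -v inverts the match.
separate=$(grep -i -n -v the sample.txt)

# The same three options merged into one group.
combined=$(grep -inv the sample.txt)

echo "$separate"
```

Note that "Another" is excluded too, because it contains the letters "the".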
If we want to search several files at once, all we have to do is give grep all of their names. The easiest way to do this is usually to use wildcards. For example, giving grep the pattern pain with a wildcard such as *.txt and piping the result to wc -l counts how many lines contain “pain” in all of our books. Alternatively, the -r option (for “recursive”) tells grep to search all of the files in or below a directory.
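The difference between a wildcard search and a recursive one shows up as soon as there is a subdirectory. This sketch uses an invented books directory; the -w option is added here as an assumption so that words like "painting" are not counted.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

mkdir -p books/extra
printf 'pain\nno\n' > books/a.txt
printf 'a pain\npainting\n' > books/b.txt
printf 'pain again\n' > books/extra/c.txt

# The wildcard only names the files directly inside books.
wildcard_count=$(grep -w pain books/*.txt | wc -l | tr -d ' ')

# -r also descends into books/extra.
recursive_count=$(grep -r -w pain books | wc -l | tr -d ' ')

echo "wildcard: $wildcard_count, recursive: $recursive_count"
```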
grep becomes even more powerful when we start using regular expressions, which are sets of letters, numbers, and symbols that define complex patterns. For example, running grep with -E "^T" finds lines that start with the letter ‘T’:
This eBook is for the use of anyone anywhere at no cost and with
Title: The Adventures of Sherlock Holmes
THE ADVENTURES OF SHERLOCK HOLMES
To Sherlock Holmes she is always THE woman. I have seldom heard
The distinction is clear. For example, you have frequently seen
The -E option tells grep to interpret the pattern as an extended regular expression. Without -E, grep uses basic regular expressions, in which ^ is still an anchor but characters such as +, ?, and | are matched literally. The quotation marks prevent the shell from treating special characters in the pattern as wildcards, and the ^ means that a line only matches if it begins with the search term, in this case T.
Many tools support regular expressions: we can use them in programming languages, database queries, online search engines, and most text editors (though not Nano—its creators wanted to keep it as small as possible). A detailed guide of regular expressions is outside the scope of this book, but a wide range of tutorials are available online, and Goyvaerts and Levithan (2012) is a useful companion if you need to go further.
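A short sketch of anchoring and alternation with grep -E follows. The file sample.txt and its contents are invented for illustration.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

printf 'Title: X\nthe end\nThe start\nsee T here\nAuthor: Y\n' > sample.txt

# ^ anchors the match to the start of the line, so "see T here" is skipped.
starts_with_t=$(grep -E "^T" sample.txt)
echo "$starts_with_t"

# With -E, | means "or" and parentheses group the alternatives.
header_count=$(grep -E -c "^(Title|Author):" sample.txt)
echo "header lines: $header_count"
```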
4.5 Finding Files
While grep finds things in files, the find command finds files themselves. It also has a lot of options, but unlike most Unix commands they are written as full words rather than single-letter abbreviations. To show how it works, we will use the entire contents of our zipf directory, including files we created earlier in this chapter:
zipf/
├── bin
│   ├── book_summary.sh
│   └── summarize_all_books.sh
├── data
│   ├── README.md
│   ├── dracula.txt
│   ├── frankenstein.txt
│   ├── jane_eyre.txt
│   ├── moby_dick.txt
│   ├── sense_and_sensibility.txt
│   ├── sherlock_holmes.txt
│   └── time_machine.txt
└── results
    ├── authors.txt
    └── releases.txt
For our first command, let’s run find . to find and list everything in this directory. As always, . on its own means the current working directory, which is where we want our search to start.
.
./bin
./bin/summarize_all_books.sh
./bin/book_summary.sh
./results
./results/releases.txt
./results/authors.txt
./data
./data/moby_dick.txt
./data/sense_and_sensibility.txt
./data/sherlock_holmes.txt
./data/time_machine.txt
./data/frankenstein.txt
./data/README.md
./data/dracula.txt
./data/jane_eyre.txt
If we only want to find directories, we can tell find to show us things of type d by adding -type d:
.
./bin
./results
./data
If we change -type d to -type f, we get a listing of all the files instead:
./bin/summarize_all_books.sh
./bin/book_summary.sh
./results/releases.txt
./results/authors.txt
./data/moby_dick.txt
./data/sense_and_sensibility.txt
./data/sherlock_holmes.txt
./data/time_machine.txt
./data/frankenstein.txt
./data/README.md
./data/dracula.txt
./data/jane_eyre.txt
Now let’s try matching by name:
./results/releases.txt
./results/authors.txt
./data/moby_dick.txt
./data/sense_and_sensibility.txt
./data/sherlock_holmes.txt
./data/time_machine.txt
./data/frankenstein.txt
./data/dracula.txt
./data/jane_eyre.txt
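Notice the quotes around "*.txt". If we omit them and type find . -name *.txt, then the shell tries to expand the * wildcard in *.txt before running find. Since there aren’t any text files in the current directory, nothing matches. When the shell is configured to discard patterns that match nothing (in Bash, the nullglob option), the expanded list is empty, so the shell tries to run the equivalent of find . -name. By default, Bash instead passes an unmatched pattern through unchanged, which hides the mistake until a matching file appears in the current directory. With the pattern discarded, find gives us the error message: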
find: -name: requires additional arguments
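The effect of the missing quotes is easiest to see when the current directory does contain a matching file, because the shell then silently replaces the pattern with that filename. The directory layout below is invented for illustration.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

mkdir data
touch data/a.txt data/b.txt notes.txt

# Quoted: find receives the pattern itself and matches at every level.
quoted=$(find . -name "*.txt" | sort)

# Unquoted: the shell expands *.txt to notes.txt before find runs,
# so this is really "find . -name notes.txt".
unquoted=$(find . -name *.txt | sort)

echo "quoted:"
echo "$quoted"
echo "unquoted:"
echo "$unquoted"
```

If two or more files in the current directory matched, the unquoted form would hand find several names after -name and fail with an error instead.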
We have seen before how to combine commands using pipes. Let’s use another technique, placing a find command inside $(...) as the argument to wc -l, to see how large our books are:
   14 ./results/releases.txt
   14 ./results/authors.txt
22331 ./data/moby_dick.txt
13028 ./data/sense_and_sensibility.txt
13053 ./data/sherlock_holmes.txt
 3582 ./data/time_machine.txt
 7832 ./data/frankenstein.txt
15975 ./data/dracula.txt
21054 ./data/jane_eyre.txt
96883 total
When the shell executes our command, it runs whatever is inside the $(...) and then replaces $(...) with that command’s output. Since the output of find is the paths to our text files, the shell constructs the command:
wc -l ./results/releases.txt ... ./data/jane_eyre.txt
(We are using ... in place of seven files’ names in order to fit things neatly on the printed page.) This results in the output as seen above. It is exactly like expanding the wildcard in *.txt, but more flexible.
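Command substitution can be tried on a tiny directory tree. This is a sketch with invented files; total_lines is a hypothetical variable name.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

mkdir data results
printf 'a\nb\nc\n' > data/one.txt
printf 'a\n' > results/two.txt
printf 'x\n' > data/skip.md

# The shell runs find first, then pastes its output into the wc command.
# Caution: this form breaks if any filename contains spaces.
wc -l $(find . -name "*.txt")

# Keep only the number from the final "total" line.
total_lines=$(wc -l $(find . -name "*.txt") | tail -n 1 | awk '{ print $1 }')
echo "total: $total_lines"
```

Unlike a wildcard, find reaches files in both data and results with a single pattern.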
We will often use find and grep together. The first command finds files whose names match a pattern, while the second looks for lines inside those files that match another pattern. For example, we can look for Author in all our text files:
./results/authors.txt:Author: Bram Stoker
./results/authors.txt:Author: Mary Wollstonecraft (Godwin) Shelley
./results/authors.txt:Author: Charlotte Bronte
./results/authors.txt:Author: Herman Melville
./results/authors.txt:Author: Jane Austen
./results/authors.txt:Author: Arthur Conan Doyle
./results/authors.txt:Author: H. G. Wells
./data/moby_dick.txt:Author: Herman Melville
./data/sense_and_sensibility.txt:Author: Jane Austen
./data/sherlock_holmes.txt:Author: Arthur Conan Doyle
./data/time_machine.txt:Author: H. G. Wells
./data/frankenstein.txt:Author: Mary Wollstonecraft (Godwin) Shelley
./data/dracula.txt:Author: Bram Stoker
./data/jane_eyre.txt:Author: Charlotte Bronte
We can also use
$(...) expansion to create a list of filenames to use in a loop:
./results/releases.txt.bak
./results/authors.txt.bak
./data/frankenstein.txt.bak
./data/sense_and_sensibility.txt.bak
./data/dracula.txt.bak
./data/time_machine.txt.bak
./data/moby_dick.txt.bak
./data/jane_eyre.txt.bak
./data/sherlock_holmes.txt.bak
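A loop of this kind might look like the following sketch, which copies each text file to a backup with .bak appended. The files are invented, and the loop body is an assumption consistent with the .bak names listed above.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

mkdir data
printf 'x\n' > data/one.txt
printf 'y\n' > data/two.txt

# find runs once, before the loop starts, so the new .bak files
# are never picked up as loop items.
for file in $(find . -name "*.txt")
do
    cp "$file" "$file.bak"
done

backups=$(find . -name "*.bak" | sort)
echo "$backups"
```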
4.6 Configuring the Shell
As Section 3.3 explained, the shell is a program, and like any other program it has variables. Some of those variables control the shell’s operations; by changing their values we can change how the shell and other programs behave.
Let’s run the command set and look at some of the variables the shell defines:
COMPUTERNAME=TURING
HOME=/Users/amira
HOMEDRIVE=C:
HOSTNAME=TURING
HOSTTYPE=i686
NUMBER_OF_PROCESSORS=4
OS=Windows_NT
PATH=/Users/amira/anaconda3/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin
PWD=/Users/amira
UID=1000
USERNAME=amira
...
There are many more than are shown here: roughly a hundred in our current shell session. And yes, using set to show things might seem a little strange, even for Unix, but if we don’t give it any arguments, the command might as well show us things we could set. By convention, shell variables that are always present have upper-case names.
All shell variables’ values are strings,
even those (such as
UID) that look like numbers.
It’s up to programs to convert these strings to other types when necessary.
For example, if a program wanted to find out how many processors the computer had,
it would convert the value of
NUMBER_OF_PROCESSORS from a string to an integer.
Similarly, some variables (like
PATH) store lists of values.
In this case, the convention is to use a colon ‘:’ as a separator.
If a program wants the individual elements of such a list,
it must split the variable’s value into pieces.
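Splitting a colon-separated value can be sketched with tr, which swaps each colon for a newline. The variable name search_path is hypothetical; its value is copied from the PATH shown earlier so that the result does not depend on the machine the example runs on.

```shell
# A fixed copy of the example PATH value (not the live PATH).
search_path="/Users/amira/anaconda3/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin"

# Replace every colon with a newline to get one directory per line.
echo "$search_path" | tr ':' '\n'

# Count the pieces.
component_count=$(echo "$search_path" | tr ':' '\n' | wc -l | tr -d ' ')
echo "components: $component_count"
```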
Let’s have a closer look at PATH. Its value defines the shell’s search path, which is the list of directories that the shell looks in for programs when we type in a command name without specifying exactly where it is. For example, when we type a command like analyze, the shell needs to decide whether to run ./analyze (in our current directory) or /bin/analyze (in a system directory). To do this, the shell checks each directory in the PATH variable in turn. As soon as it finds a program with the right name, it stops searching and runs what it has found.
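To show how this works, here are the components of PATH listed one per line:
/Users/amira/anaconda3/bin
/usr/bin
/bin
/usr/sbin
/sbin
/usr/local/bin
Suppose that our computer has three programs called analyze, including /bin/analyze and /Users/amira/analyze. Since the shell searches the directories in the order they’re listed in PATH, it finds /bin/analyze first and runs that. Since /Users/amira is not in our path, Bash will never find the program /Users/amira/analyze unless we type the path in explicitly (for example, as ./analyze if we are in /Users/amira).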
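The search order can be observed safely with two throwaway directories. In this sketch the directories first and second and the two tiny analyze programs are all hypothetical; they are placed at the front of PATH for the current shell only.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/first" "$demo_dir/second"

# Two different programs with the same name.
printf '#!/bin/sh\necho "from first"\n' > "$demo_dir/first/analyze"
printf '#!/bin/sh\necho "from second"\n' > "$demo_dir/second/analyze"
chmod +x "$demo_dir/first/analyze" "$demo_dir/second/analyze"

# Put both directories at the front of the search path, first before second.
PATH="$demo_dir/first:$demo_dir/second:$PATH"

# command -v reports which file the shell would run.
found_at=$(command -v analyze)
result=$(analyze)
echo "$found_at -> $result"
```

Swapping the order of the two directories in PATH would make the shell run the other program instead.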
If we want to see a variable’s value, we can print it using the echo command introduced at the end of Section 3.5. Let’s look at the value of the variable HOME, which keeps track of our home directory. If we run echo HOME, whoops: this just prints “HOME”, which isn’t what we wanted. Instead, we need to run echo $HOME, which prints /Users/amira. As with loop variables (Section 3.3), the dollar sign before the variable names tells the shell that we want the variable’s value. This works just like wildcard expansion: the shell replaces the variable’s name with its value before running the command we’ve asked for. Thus, echo $HOME becomes echo /Users/amira, which displays the right thing.
Creating a variable is easy: we assign a value to a name using “=”, putting quotes around the value if it contains spaces or special characters:
To change the value, we simply assign a new one:
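Creating, reading, and changing a variable can be sketched as follows. The name DEPARTMENT and the value "Library Science" appear later in this section; "Information Science" and the helper variables are invented for illustration.

```shell
# Create a variable; the quotes are needed because the value contains a space.
DEPARTMENT="Library Science"
first_value=$DEPARTMENT
echo "$DEPARTMENT"

# Assigning again simply replaces the old value.
DEPARTMENT="Information Science"
second_value=$DEPARTMENT
echo "$DEPARTMENT"

# Without the dollar sign, echo prints the name rather than the value.
literal=$(echo DEPARTMENT)
echo "$literal"
```

Note that there must be no spaces around the = sign; otherwise the shell treats the variable name as a command to run.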
If we want to set some variables automatically every time we run a shell, we can put commands to do this in a file called .bashrc in our home directory. For example, here are three lines in /Users/amira/.bashrc:
export DEPARTMENT="Library Science"
export TEMP_DIR=/tmp
export BACKUP_DIR=$TEMP_DIR/backup
These three lines create the variables DEPARTMENT, TEMP_DIR, and BACKUP_DIR, and export them so that any programs the shell runs can see them as well. Notice that BACKUP_DIR’s definition relies on the value of TEMP_DIR, so that if we change where we put temporary files, our backups will be relocated automatically. However, this will only happen once we restart the shell, because .bashrc is only executed when the shell starts up.
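What export changes, and why BACKUP_DIR does not follow later changes to TEMP_DIR, can both be checked in a few lines. This is a sketch: it starts a child sh process to stand in for "any programs the shell runs", and the helper variable names are hypothetical.

```shell
# Start from a clean slate in case these names are already set.
unset TEMP_DIR BACKUP_DIR

TEMP_DIR=/tmp
BACKUP_DIR=$TEMP_DIR/backup

# Not exported yet: a child process cannot see the variable.
before_export=$(sh -c 'echo "${BACKUP_DIR:-unset}"')

# After export, child processes inherit it.
export BACKUP_DIR
after_export=$(sh -c 'echo "${BACKUP_DIR:-unset}"')

# BACKUP_DIR was computed once; changing TEMP_DIR later does not update it.
TEMP_DIR=/var/tmp
echo "$before_export / $after_export / $BACKUP_DIR"
```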
What’s in a Name?
The ‘.’ character at the front of the name .bashrc prevents ls from listing this file unless we specifically ask it to using -a. The “rc” at the end is an abbreviation for “run commands”, which meant something really important decades ago, and is now just a convention everyone follows without understanding why.
While we’re here, it’s also common to use the alias command to create shortcuts for things we frequently type. For example, we can define an alias that runs /bin/zback with a specific set of arguments:
Aliases can save us a lot of typing, and hence a lot of typing mistakes. The name of an alias can be the same as an existing command, so we can use them to change the behavior of a familiar command:
We can find interesting suggestions for other aliases by searching online for “sample bashrc”.
While searching for additional aliases,
you’re likely to encounter references to other common shell features to customize,
such as the color of your shell’s background and text.
As mentioned in Chapter 2, another important feature to consider customizing is your shell prompt. In addition to a standard symbol (like $), your computer may include other information as well, such as the working directory, username, and/or date/time. If your shell does not include that information and you would like to see it, or if your current prompt is too long and you’d like to shorten it, you can include a line in your .bashrc file that defines the PS1 variable. For example, PS1 can be set so that the prompt includes your username and current working directory.
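One possible definition is sketched below. It assumes Bash, where \u stands for the username and \w for the current working directory; the exact layout of the prompt is a matter of taste and is only an example.

```shell
# A line for ~/.bashrc: show "username working-directory $ " as the prompt.
PS1="\u \w $ "
```

With this setting, a user called amira working in her zipf directory would see a prompt like amira ~/zipf $ before each command.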
As powerful as the Unix shell is, it does have its shortcomings: dollar signs, quotes, and other punctuation can make a complex shell script look as though it was created by a cat dancing on a keyboard. However, it is the glue that holds data science together: shell scripts are used to create pipelines from miscellaneous sets of programs, while shell variables are used to do everything from specifying package installation directories to managing database login credentials. And while grep and find may take some getting used to, they and their cousins can handle enormous datasets very efficiently.
If you would like to go further,
Ray and Ray (2014) is an excellent general introduction,
while Janssens (2014) looks specifically at how to process data on the command line.
As with the previous chapter, extra files and directories created during these exercises may need to be removed when you are done.
4.8.1 Cleaning up
As we have gone through this chapter, we have created several files that we won’t need again. We can clean them up with the following commands; briefly explain what each line does.
4.8.2 Variables in shell scripts
Imagine you have a shell script called
script.sh that contains:
With this script in your
data directory, you type the following command:
Which of the following outputs would you expect to see?
- All of the lines between the first and the last lines of each file ending in
- The first and the last line of each file ending in
- The first and the last line of each file in the
- An error because of the quotes around
4.8.3 Find the longest file with a given extension
Write a shell script called
longest.sh that takes the name of a
directory and a filename extension as its arguments, and prints
out the name of the file with the most lines in that directory
with that extension. For example:
would print the name of the
.txt file in
data that has
the most lines.
4.8.4 Script reading comprehension
For this question, consider your
data directory once again.
Explain what each of the following three scripts would do when run as
bash script1.sh *.txt,
bash script2.sh *.txt, and
bash script3.sh *.txt respectively.
(You may need to search online to find the meaning of
Assume the following text from The Adventures of Sherlock Holmes is contained in a file called excerpt.txt:
To Sherlock Holmes she is always THE woman. I have seldom heard
him mention her under any other name. In his eyes she eclipses
and predominates the whole of her sex. It was not that he felt
any emotion akin to love for Irene Adler.
Which of the following commands would provide the following output:
and predominates the whole of her sex. It was not that he felt
- grep "he" excerpt.txt
- grep -E "he" excerpt.txt
- grep -w "he" excerpt.txt
- grep -i "he" excerpt.txt
4.8.6 Tracking publication years
In Exercise 3.8.6
you examined code that extracted the publication year from a list of book titles.
Write a shell script called
year.sh that takes any number of
filenames as command-line arguments,
and uses a variation of the code you used earlier to print a list
of the unique publication years appearing in each of those files separately.
4.8.7 Counting names
You and your friend have just finished reading Sense and Sensibility and are now having an argument. Your friend thinks that the elder of the two Dashwood sisters, Elinor, was mentioned more frequently in the book, but you are certain it was the younger sister, Marianne. Given that sense_and_sensibility.txt contains the full text of the novel, how would you tabulate the number of times each of the sisters is mentioned?
Hint: one solution might employ
wc and a
while another might utilize
There is often more than one way to solve a problem with the shell;
people choose solutions based on readability,
and what commands they are most familiar with.
4.8.8 Matching and subtracting
Assume you are in the root directory of the zipf project. Which of the following commands will find all files in data whose names end in e.txt, but do not contain the word machine?
- find data -name '*e.txt' | grep -v machine
- find data -name *e.txt | grep -v machine
- grep -v "machine" $(find data -name '*e.txt')
- None of the above.
4.8.9 find pipeline reading comprehension
Write a short explanatory comment for the following shell script:
4.8.10 Finding files with different properties
The find command can be given criteria called “tests” to locate files with specific attributes, such as creation time, size, or ownership. Use man find to explore these, then write a single command using find to find all files in or below your Desktop directory that are owned by you and were modified in the last 24 hours. Explain why the value for -mtime needs to be negative.
4.9 Key Points
- Save commands in files (usually called shell scripts) for re-use.
- bash filename runs the commands saved in a file.
- $@ refers to all of a shell script’s command-line arguments.
- $1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.
- Place variables in quotes if the values might have spaces or other special characters in them.
- find lists files with specific properties or whose names match patterns.
- $(command) inserts a command’s output in place.
- grep selects lines in files that match patterns.
- Use the .bashrc file in your home directory to set shell variables each time the shell runs.
- Use alias to create shortcuts for things you type frequently.