Chapter 8 Automating Analyses

The three rules of the Librarians of Time and Space are: 1) Silence; 2) Books must be returned no later than the last date shown; and 3) Do not interfere with the nature of causality.

— Terry Pratchett

It’s easy to run one program to process a single data file, but what happens when our analysis depends on many files, or when we need to re-do the analysis every time new data arrives? What should we do if the analysis has several steps that we have to do in a particular order?

If we try to keep track of this ourselves we will inevitably forget some crucial steps and it will be hard for other people to pick up our work. Instead, we should use a build manager to keep track of what depends on what and run our analysis programs automatically. These tools were invented to help programmers compile complex software, but can be used to automate any workflow.

Our Zipf’s Law project currently includes these files:

zipf/
├── .gitignore
├── CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE.md
├── README.md
├── bin
│   ├── book_summary.sh
│   ├── collate.py
│   ├── countwords.py
│   ├── plotcounts.py
│   ├── script_template.py
│   └── utilities.py
├── data
│   ├── README.md
│   ├── dracula.txt
│   ├── frankenstein.txt
│   └── ...
└── results
    ├── dracula.csv
    ├── dracula.png
    └── ...

Now that the project’s main building blocks are in place, we’re ready to atomate our analysis using a build manager. We will use a program called Make to do this so that every time we add a new book to our data, we can create new plots and update our fits with a single command. Make works as follows:

  1. Every time the operating system creates, reads, or changes a file, it updates a timestamp on the file to show when the operation took place. Make can compare these timestamps to figure out whether files are newer or older than one another.

  2. A user can describe which files depend on each other by writing rules in a Makefile. For example, one rule could say that results/moby_dick.csv depends on data/moby_dick.txt, while another could say that the plot results/comparison.png depends on all of the CSV files in the results directory.

  3. Each rule also tells Make how to update an out-of-date file. For example, the rule for Moby Dick could tell Make to run bin/countwords.py if the result file is older than either the raw data file or the program.

  4. When the user runs Make, the program checks all of the rules in the Makefile and runs the commands needed to update any that are out of date. If there are transitive dependencies—i.e., if A depends on B and B depends on C—then Make will trace them through and run all of the commands it needs to in the right order.

This chapter uses a version of Make called GNU Make. It comes with macOS and Linux; please see Appendix E for Windows installation instructions.

Keep Tracking With Version Control

We learned about tracking our project’s version history using Git in Chapters 5 and 6. We encourage you to continue apply the Git workflow throughout the rest of our code development, though we won’t continue to remind you.

8.1 Updating a Single File

To start, let’s create a file called Makefile in the root of our project:

# Regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt
    python bin/countwords.py data/moby_dick.txt > results/moby_dick.csv

As in the shell and many other programming languages, # indicates that the first line is a comment. The second and third lines form a build rule: the target of the rule is results/moby_dick.csv, its single prerequisite is the file data/moby_dick.txt, and the two are separated by a single colon :.

The target and prerequisite tell Make what depends on what. The line below them describes the recipe that will update the target if it is out of date. The recipe consists of one or more shell commands, each of which must be prefixed by a single tab character. Spaces cannot be used instead of tabs here, which can be confusing as they are interchangeable in most other programming languages. In the rule above, the recipe is “run bin/countwords.py on the raw data file and put the output in a CSV file in the results directory”.

To test our rule, run this command in the shell:

$ make

Make automatically looks for a file called Makefile, follows the rules it contains, and prints the commands that were executed. In this case it displays:

python bin/countwords.py data/moby_dick.txt > results/moby_dick.csv

When Make follows the rules in our Makefile, one of three things will happen:

  1. If results/moby_dick.csv doesn’t exist, Make runs the recipe to create it.
  2. If data/moby_dick.txt is newer than results/moby_dick.csv, Make runs the recipe to update the results.
  3. If results/moby_dick.csv is newer than its prerequisite, Make does nothing.

In the first two cases, Make prints the commands it runs along with anything those command prints to the screen via standard output or standard error. There is no screen output in this case, so we only see the command.

Indentation Errors

If a Makefile indents a rule with spaces rather than tabs, Make produces an error message like this:

Makefile:3: *** missing separator.  Stop.

No matter what happened the first time we ran make, if we run it again right away it does nothing because our rule’s target is now up to date. It tells us this by displaying the message:

make: `results/moby_dick.csv' is up to date.

We can check that it is telling the truth by listing the files with their timestamps, ordered by how recently they have been updated:

$ ls -l -t data/moby_dick.txt results/moby_dick.csv
-rw-r--r--  1 amira  staff   219107 31 Dec 08:58 results/moby_dick.csv
-rw-r--r--  1 amira  staff  1276201 31 Dec 08:58 data/moby_dick.txt

As a further test:

  1. Delete results/moby_dick.csv and run make again. This is case #1, so Make runs the recipe.
  2. Use touch data/moby_dick.txt to update the timestamp on the data file, then run make. This is case #2, so again, Make runs the recipe.

Managing Makefiles

We don’t have to call our file Makefile: if we prefer something like workflows.mk, we can tell Make to read recipes from that file using make -f workflows.mk.

8.2 Managing Multiple Files

Our Makefile isn’t particularly helpful so far, though it does already document exactly how to reproduce one specific result. Let’s add another rule to it:

# Regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt
    python bin/countwords.py data/moby_dick.txt > results/moby_dick.csv

# Regenerate results for "Jane Eyre"
results/jane_eyre.csv : data/jane_eyre.txt
    python bin/countwords.py data/jane_eyre.txt > results/jane_eyre.csv

When we run make it tells us:

make: `results/moby_dick.csv' is up to date.

By default Make only attempts to update the first target it finds in the Makefile, which is called the default target. In this case, the first target is results/moby_dick.csv, which is already up to date. To update something else, we need to tell Make specifically what we want:

$ make results/jane_eyre.csv
python bin/countwords.py data/jane_eyre.txt > results/jane_eyre.csv

If we have to run make once for each result we’re right back where we started. However, we can add a rule to our Makefile to update all of our results at once. We do this by creating a phony target that doesn’t correspond to an actual file. Let’s add this line to the top of our Makefile:

all : results/moby_dick.csv results/jane_eyre.csv

There is no file called all, and this rule doesn’t have any recipes of its own, but when we run make all, Make finds everything that all depends on, then brings each of those prerequisites up to date (Figure 8.1).

Making Everything

Figure 8.1: Making Everything

The order in which rules appear in the Makefile does not necessarily determine the order in which recipes are run. Make is free to run commands in any order so long as nothing is updated before its prerequisites are up to date.

We can use phony targets to automate and document other steps in our workflow. For example, let’s add another target to our Makefile to delete all of the result files we have generated so that we can start afresh. By convention this target is called clean, and ours looks like this:

# Remove all generated files.
clean :
    rm -f results/*

The -f flag to rm means “force removal”: if it is present, rm won’t complain if the files we have told it to remove are already gone. If we now run:

$ make clean

Make will delete any results files we have. This is a lot safer than typing rm -f results/* at the command-line each time, because if we mistakenly put a space before the * we would delete all of the files in the project’s root directory.

Phony targets are very useful, but there is a catch. Try doing this:

$ mkdir clean
$ make clean
make: `clean' is up to date.

Since there is a directory called clean, Make thinks the target clean in the Makefile refers to this directory. Since the rule has no prerequisites, it can’t be out of date, so no recipes are executed.

We can unconfuse Make by putting this line at the top of Makefile to explicitly state which targets are phony:

.PHONY : all clean

8.3 Updating Files When Programs Change

Our current Makefile says that each result file depends only on the corresponding data file. That’s not actually true: each result also depends on the program used to generate it. If we change our program, we should regenerate our results. To get Make to do that, we can add the program to the prerequisites for each result:

# ...phony targets...

# Regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt bin/countwords.py
    python bin/countwords.py data/moby_dick.txt > results/moby_dick.csv

# Regenerate results for "Jane Eyre"
results/jane_eyre.csv : data/jane_eyre.txt bin/countwords.py
    python bin/countwords.py data/jane_eyre.txt > results/jane_eyre.csv

To run both of these rules, we can type make all. Alternatively, since all is the first target in our Makefile, Make will use it if we just type make on its own:

$ touch bin/countwords.py
$ make
python bin/countwords.py data/moby_dick.txt > results/moby_dick.csv
python bin/countwords.py data/jane_eyre.txt > results/jane_eyre.csv

The exercises will explore how we can write a rule to tell us whether our results will be different after a change to a program without actually updating them. Rules like this can help us test our programs: if we don’t think an addition or modification ought to affect the results, but it would, we may have some debugging to do.

8.4 Reducing Repetition in a Makefile

Our Makefile now mentions bin/countwords.py four times. If we ever change the name of the program or move it to a different location, we will have to find and replace each of those occurrences. More importantly, this redundancy makes our Makefile harder to understand, just as scattering magic numbers through programs makes them harder to understand.

The solution is the same one we use in programs: define and use variables. Let’s create names for the word-counting script and the command used to run it:

# ...phony targets...

COUNT=bin/countwords.py
RUN_COUNT=python $(COUNT)

# Regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
    $(RUN_COUNT) data/moby_dick.txt > results/moby_dick.csv

# Regenerate results for "Jane Eyre"
results/jane_eyre.csv : data/jane_eyre.txt $(COUNT)
    $(RUN_COUNT) data/jane_eyre.txt > results/jane_eyre.csv

Each definition takes the form NAME=value. Variables are written in upper case by convention so that they’ll stand out from filenames (which are usually in lower case), but Make doesn’t require this. What is required is using parentheses to refer to the variable, i.e., to use $(NAME) and not $NAME.

Why the Parentheses?

For historical reasons, Make interprets $NAME to be a “variable called N followed by the three characters ‘AME’”, If no variable called N exists, $NAME becomes AME, which is almost certainly not what we want.

As in programs, variables don’t just cut down on typing. They also tell readers that several things are always and exactly the same, which reduces cognitive load.

8.5 Automatic Variables

We could add a third rule to analyze a third novel and a fourth to analyze a fourth, but that won’t scale to hundreds or thousands of novels. Instead, we can write a generic rule that does what we want for every one of our data files.

To do this, we need to understand Make’s automatic variables. The first step is to use the very cryptic expression $@ in the rule’s recipe to mean “the target of the rule”. It lets us turn this:

# Regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
    $(RUN_COUNT) data/moby_dick.txt > results/moby_dick.csv

into this:

# Regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
    $(RUN_COUNT) data/moby_dick.txt > $@

Make defines a value of $@ separately for each rule, so it always refers to that rule’s target. And yes, $@ is an unfortunate name: something like $TARGET would have been easier to understand, but we’re stuck with it now.

The next step is to replace the explicit list of prerequisites in the recipe with the automatic variable $^, which means “all the prerequisites in the rule”:

# Regenerate results for "Jane Eyre"
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
    $(RUN_COUNT) $^ > $@

However, this doesn’t work. The rule’s prerequisites are the novel and the word-counting program. When Make expands the recipe, the resulting command tries to process the program bin/countwords.py as if it was a data file:

python bin/countwords.py data/moby_dick.txt bin/countwords.py > results/moby_dick.csv

Make solves this problem with another automatic variable $<, which mean “only the first prerequisite”. Using it lets us rewrite our rule as:

# Regenerate results for "Jane Eyre"
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
    $(RUN_COUNT) $< > $@

8.6 Generic Rules

$< > $@ is even harder to read than $@ on its own, but we can now replace all the rules for generating results files with one pattern rule using the wildcard %, which matches zero or more characters in a filename. Whatever matches % in the target also matches in the prerequisites, so the rule:

results/%.csv : data/%.txt $(COUNT)
    $(RUN_COUNT) $< > $@

will handle Jane Eyre, Moby Dick, The Time Machine, and every other novel in the data directory. %cannot be used in rules' recipes, which is why\(<` and `\)@` are needed.

With this rule in place, our entire Makefile is reduced to:

.PHONY: all clean

COUNT=bin/countwords.py
RUN_COUNT=python $(COUNT)

# Regenerate all results.
all : results/moby_dick.csv results/jane_eyre.csv results/time_machine.csv

# Regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
    $(RUN_COUNT) $< > $@

# Remove all generated files.

clean :
    rm -f results/*

To test our shortened Makefile, let’s delete all of the results files:

$ make clean
rm -f results/*

and then recreate them:

$ make  # Same as `make all` since "all" is the first target in the Makefile
python bin/countwords.py data/moby_dick.txt > results/moby_dick.csv
python bin/countwords.py data/jane_eyre.txt > results/jane_eyre.csv
python bin/countwords.py data/time_machine.txt > results/time_machine.csv

We can still rebuild individual files if we want, since Make will take the target filename we give on the command line and see if a pattern rule matches it:

$ # The touch command updates a file's timestamp
$ touch data/jane_eyre.txt
$ make results/jane_eyre.csv
python bin/countwords.py data/jane_eyre.txt > results/jane_eyre.csv

8.7 Defining Sets of Files

Our analysis is still not fully automated: if we add another book to data, we have to remember to add its name to the all target in the Makefile as well. Once again we will fix this in steps.

To start, imagine that all the results files already exist and we just want to update them. We can define a variable called RESULTS to be a list of all the results files using the same wildcards we would use in the shell:

RESULTS=results/*.csv

We can then rewrite all to depend on that list:

# Regenerate all results.
all : $(RESULTS)

However, this only works if the results files already exist. If one doesn’t, its name won’t be included in RESULTS and Make won’t realize that we want to generate it.

What we really want is to generate the list of results files based on the list of books in the data/ directory. We can create that list using Make’s wildcard function:

DATA=$(wildcard data/*.txt)

This calls the function wildcard with the argument data/*.txt. The result is a list of all the text files in the data directory, just as we would get with data/*.txt in the shell. The syntax is odd because functions were added to Make long after it was first written, but at least they have readable names.

To check that this line does the right thing, we can add another phony target called settings that uses the shell command echo to print the names and values of our variables:

.PHONY: all clean settings

# ...everything else...

# Show variables' values.
settings :
    echo COUNT: $(COUNT)
    echo DATA: $(DATA)

Let’s run this:

$ make settings
echo COUNT: bin/countwords.py
COUNT: bin/countwords.py
echo DATA: data/dracula.txt data/frankenstein.txt data/jane_eyre.txt data/moby_dick.txt data/sense_and_sensibility.txt data/sherlock_holmes.txt data/time_machine.txt
DATA: data/dracula.txt data/frankenstein.txt data/jane_eyre.txt data/moby_dick.txt data/sense_and_sensibility.txt data/sherlock_holmes.txt data/time_machine.txt

The output appears twice because Make shows us the command it’s going to run before running it. Putting @ before the command in the recipe prevents this, which makes the output easier to read:

settings :
    @echo COUNT: $(COUNT)
    @echo DATA: $(DATA)
$ make settings
COUNT: bin/countwords.py
DATA: data/dracula.txt data/frankenstein.txt data/jane_eyre.txt data/moby_dick.txt data/sense_and_sensibility.txt data/sherlock_holmes.txt data/time_machine.txt

We now have the names of our input files. To create a list of corresponding output files, we use Make’s patsubst function (short for pattern substitution):

RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))

The first argument to patsubst is the pattern to look for, which in this case is a text file in the data directory. We use % to match the stem of the file’s name, which is the part we want to keep.

The second argument is the replacement we want. As in a pattern rule, Make replaces % in this argument with whatever matched % in the pattern, which creates the name of the result file we want. Finally, the third argument is what to do the substitution in, which is our list of books’ names.

Let’s check the RESULTS variable by adding another command to the settings target:

settings :
    @echo COUNT: $(COUNT)
    @echo DATA: $(DATA)
    @echo RESULTS: $(RESULTS)
$ make settings
COUNT: bin/countwords.py
DATA: data/dracula.txt data/frankenstein.txt data/jane_eyre.txt data/moby_dick.txt data/sense_and_sensibility.txt data/sherlock_holmes.txt data/time_machine.txt
RESULTS: results/dracula.csv results/frankenstein.csv results/jane_eyre.csv results/moby_dick.csv results/sense_and_sensibility.csv results/sherlock_holmes.csv results/time_machine.csv

Excellent: DATA has the names of the files we want to process and RESULTS automatically has the names of the corresponding result files. Since the phony target all depends on $(RESULTS) (i.e., all the files whose names appear in the variable RESULTS) we can regenerate all the results in one step:

$ make clean
rm -f results/*.csv
$ make  # Same as `make all` since "all" is the first target in the Makefile
python bin/countwords.py data/dracula.txt > results/dracula.csv
python bin/countwords.py data/frankenstein.txt > results/frankenstein.csv
python bin/countwords.py data/jane_eyre.txt > results/jane_eyre.csv
python bin/countwords.py data/moby_dick.txt > results/moby_dick.csv
python bin/countwords.py data/sense_and_sensibility.txt > results/sense_and_sensibility.csv
python bin/countwords.py data/sherlock_holmes.txt > results/sherlock_holmes.csv
python bin/countwords.py data/time_machine.txt > results/time_machine.csv

Our workflow is now just two steps: add a data file and run Make. This is a big improvement over running things manually, particularly as we start to add more steps like merging data files and generating plots.

8.8 Documenting a Makefile

Every well-behaved program should tell people how to use it Taschuk and Wilson (2017). If we run make --help, we get a (very) long list of options that Make understands, but nothing about our specific workflow. We could create another phony target called help that prints a list of available commands:

.PHONY: all clean help settings

# ...other definitions...

# Show help.
help :
    @echo "all : regenerate all out-of-date results files."
    @echo "results/*.csv : regenerate a particular results file."
    @echo "clean : remove all generated files."
    @echo "settings : show the values of all variables."
    @echo "help : show this message."

but sooner or later we will add a target or rule and forget to update this list.

A better approach is to format some comments in a special way and then extract and display those comments when asked to. We’ll use ## (a double comment marker) to indicate the lines we want displayed and grep (Section 3.5) to pull these lines out of the file:

.PHONY: all clean help settings

COUNT=bin/countwords.py
DATA=$(wildcard data/*.txt)
RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))

## all : regenerate all results.
all : $(RESULTS)

## results/%.csv : regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
    python $(COUNT) $< > $@

## clean : remove all generated files.
clean :
    rm -f results/*.csv

## settings : show variables' values.
settings :
    @echo COUNT: $(COUNT)
    @echo DATA: $(DATA)
    @echo RESULTS: $(RESULTS)

## help : show this message.
help :
    @grep '^##' ./Makefile

Let’s test:

$ make help
## all : regenerate all results.
## results/%.csv : regenerate result for any book.
## clean : remove all generated files.
## settings : show variables' values.
## help : show this message.

The exercises will explore how to format this more readably.

8.9 Automating Entire Analyses

To finish our example, we will automatically generate a collated list of word frequencies. The target is a file called results/collated.csv that depends on the results generated by countwords.py. To create it, we add or change these lines in our Makefile:

# ...phony targets and previous variable definitions...

COLLATE=bin/collate.py

## all : regenerate all results.
all : results/collated.csv

## results/collated.csv : collate all results.
results/collated.csv : $(RESULTS) $(COLLATE)
    mkdir -p results  # Create dir if it doesn't exist
    python $(COLLATE) $(RESULTS) > $@

## settings : show variables' values.
settings :
    @echo COUNT: $(COUNT)
    @echo DATA: $(DATA)
    @echo RESULTS: $(RESULTS)
    @echo COLLATE: $(COLLATE)
    ...

The first two lines tell Make about the collation program, while the change to all tells it what the final target of our pipeline is. Since this target depends on the results files for single novels, make all will regenerate all of those automatically.

The rule to regenerate results/collated.csv should look familiar by now: it tells Make that all of the individual results have to be up-to-date and that the final result should be regenerated if the program used to create it has changed. One difference between the recipe in this rule and the recipes we’ve seen before is that this recipe uses $(RESULTS) directly instead of an automatic variable. We have written the rule this way because there isn’t an automatic variable that means “all but the last prerequisite”, so there’s no way to use automatic variables that wouldn’t result in us trying to process our program.

Likewise, we can also add the plotcounts.py script to this workflow and update the all and settings rules accordingly. Note that there is no > needed before the $@ because plotcounts.py default is to write to a file rather than to stdout.

# ...phony targets and previous variable definitions...

PLOT=bin/plotcounts.py

## all : regenerate all results.
all : results/collated.png

## results/collated.png: plot the collated results.
results/collated.png : results/collated.csv
    python $(PLOT) $< --outfile $@

## settings : show variables' values.
settings :
    @echo COUNT: $(COUNT)
    @echo DATA: $(DATA)
    @echo RESULTS: $(RESULTS)
    @echo COLLATE: $(COLLATE)
    @echo PLOT: $(PLOT)
    ...

Running make all should now generate the new collated.png plot (Figure 8.2):

$ make all
python bin/collate.py results/time_machine.csv results/moby_dick.csv results/jane_eyre.csv results/dracula.csv results/sense_and_sensibility.csv results/sherlock_holmes.csv results/frankenstein.csv > results/collated.csv
python bin/plotcounts.py results/collated.csv --outfile results/collated.png
alpha: 1.1712445413685917
Word count distribution for all the books combined.

Figure 8.2: Word count distribution for all the books combined.

Finally, we can update the clean target to only remove files created by the Makefile. It is a good habit to do this rather than using the asterisk wildcard to remove all files, since you might manually place files in the results directory and forget that these will be cleaned up when you run make clean.

# ...phony targets and previous variable definitions...

## clean : remove all generated files.
clean :
    rm $(RESULTS) results/collated.csv results/collated.png

8.10 Summary

The first version of Make was written in 1976. Its reliance on shell commands instead of direct calls to functions in Python or R sometimes makes it clumsy to use. However, that also makes it very flexible: a single Makefile can run shell commands and programs written in a variety of languages, which makes it a great way to assemble pipelines out of whatever is lying around.

Programmers have created many replacements for it in the decades since then—so many, in fact, that none have attracted enough users to displace it. If you would like to explore them, check out Snakemake (for Python) and drake (for R). If you want to go deeper, Smith (2011) describes the design and implementation of several build managers.

8.11 Exercises

Our Makefile currently reads as follows:

.PHONY: all clean help settings

COUNT=bin/countwords.py
COLLATE=bin/collate.py
PLOT=bin/plotcounts.py
DATA=$(wildcard data/*.txt)
RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))

## all : regenerate all results.
all : results/collated.png

## results/collated.png: plot the collated results.
results/collated.png : results/collated.csv
    python $(PLOT) $< --outfile $@

## results/collated.csv : collate all results.
results/collated.csv : $(RESULTS) $(COLLATE)
    @mkdir -p results
    python $(COLLATE) $(RESULTS) > $@

## results/%.csv : regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
    python $(COUNT) $< > $@

## clean : remove all generated files.
clean :
    rm $(RESULTS) results/collated.csv results/collated.png

## settings : show variables' values.
settings :
    @echo COUNT: $(COUNT)
    @echo DATA: $(DATA)
    @echo RESULTS: $(RESULTS)
    @echo COLLATE: $(COLLATE)
    @echo PLOT: $(PLOT)

## help : show this message.
help :
    @grep '^##' ./Makefile

A number of the exercises below ask you to make further edits to Makefile.

8.11.1 Report results that would change

How can you get make to show the commands it would run without actually running them? (Hint: look at the manual page.)

8.11.2 Useful options

  1. What does Make’s -B option do and when is it useful?
  2. What about the -C option?
  3. What about the -f option?

8.11.3 Make sure the output directory exists

One of our build recipes includes mkdir -p. What does this do and why is it useful?

8.11.4 Print the title and author

The build rule for regenerating the result for any book is currently:

## results/%.csv : regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
    python $(COUNT) $< > $@

Add an extra line to the recipe that uses the book_summary.sh script to print the title and author of the book to the screen. Use @bash so that the command itself isn’t printed to the screen and don’t forget to update the settings build rule to include the book_summary.sh script.

If you’ve successfully made those changes, you should get the following output for Dracula:

$ make -B results/dracula.csv
Title: Dracula
Author: Bram Stoker
python bin/countwords.py data/dracula.txt > results/dracula.csv

8.11.5 Create all results

The default target of our final Makefile re-creates results/collated.csv. Add a target to Makefile so that make results creates or updates any result files that are missing or out of date, but does not regenerate results/collated.csv.

8.11.6 The perils of shell wildcards

What is wrong with writing the rule for results/collated.csv like this:

results/collated.csv : results/*.csv
    python $(COLLATE) $^ > $@

(The fact that the result no longer depends on the program used to create it isn’t the biggest problem.)

8.11.7 Making documentation more readable

We can format the documentation in our Makefile more readably using this command:

## help : show all commands.
help :
    @grep -h -E '^##' ${MAKEFILE_LIST} | sed -e 's/## //g' | column -t -s ':'

Using man and online search, explain what every part of this recipe does.

8.12 Configuration

Let’s say that in future we intend to write a number of different Makefiles that all use the countwords.py, collate.py and plotcounts.py scripts.

To avoid duplication, cut and paste the definitions of the COUNT, COLLATE and PLOT variables into a separate file called config.mk. Use the include command to access those definitions in the existing Makefile.

(In Chapter 9 we discuss configuration strategies in more detail.)

8.13 Key Points

  • Make is a widely-used build manager.
  • A build manager re-runs commands to update files that are out of date.
  • A build rule has targets, prerequisites, and a recipe.
  • A target can be a file or a phony target that simply triggers an action.
  • When a target is out of date with respect to its prerequisites, Make executes the recipe associated with its rule.
  • Make executes as many rules as it needs to when updating files, but always respects prerequisite order.
  • Make defines automatic variables such as $@ (target), $^ (all prerequisites), and $< (first prerequisite).
  • Pattern rules can use % as a placeholder for parts of filenames.
  • Makefiles can define variables using NAME=value.
  • Makefiles can also use functions such as $(wildcard ...) and $(patsubst ...).
  • Use specially-formatted comments to create self-documenting Makefiles.