Chapter 9 Program Configuration

Always be wary of any helpful item that weighs less than its operating manual.

— Terry Pratchett

In previous chapters we used command-line options to control our scripts and programs. If they are more complex, we may want to use up to four layers of configuration:

  1. A system-wide configuration file for general settings.
  2. A user-specific configuration file for personal preferences.
  3. A job-specific file with settings for a particular run.
  4. Command-line options to change things that commonly change.

This is sometimes called overlay configuration because each level overrides the ones above it: the user’s configuration file overrides the system settings, the job configuration overrides the user’s defaults, and the command-line options overrides that. This is more complex than most research software needs initially (Xu et al. 2015), but being able to read a complete set of options from a file is a big boost to reproducibility.

In this chapter, we’ll explore approaches for configuring our project, and apply one approach to our Zipf’s Law project. That project should now contain:

zipf/
├── .gitignore
├── CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE.md
├── Makefile
├── README.md
├── bin
│   ├── book_summary.sh
│   ├── collate.py
│   ├── countwords.py
│   ├── plotcounts.py
│   ├── script_template.py
│   └── utilities.py
├── data
│   ├── README.md
│   ├── dracula.txt
│   └── ...
└── results
    ├── dracula.csv
    ├── dracula.png
    └── ...

Be Careful When Applying Settings Outside Your Project

This chapter’s examples modify files outside of the Zipf’s Law project in order to illustrate some concepts. If you alter these files while following along, remember to change them back later.

9.1 Configuration File Formats

Programmers have invented far too many formats for configuration files, so please do not create one of your own. One possibility is to write the configuration as a Python module and load it as if it was a library. This is clever, but means that tools in other languages can’t process it.

A second option is Windows INI format, which is laid out like this:

[section_1]
key_1=value_1
key_2=value_2

[section_2]
key_3=value_3
key_4=value_4

INI files are simple to read and write, but the format is slowly falling out of use in favor of YAML. A simple YAML configuration file looks like this:

# Standard settings for thesis.
logfile: "/tmp/log.txt"
quiet: false
overwrite: false
fonts:
- Verdana
- Serif

Here, the keys logfile, quiet, and overwrite have the values /tmp/log.txt, false, and false respectively, while the value associated with the key fonts is a list containing Verdana and Serif. For more discussion of YAML, see Appendix L.

9.2 Matplotlib Configuration

To see overlay configuration in action, let’s consider a common task in data science: changing the size of the labels on a plot. The labels on our Jane Eyre word frequency plot are fine for viewing on screen (Figure 9.1), but they will need to be bigger if we want to include the figure in a slideshow or report.

Word frequency distribution for the book Jane Eyre with default label sizes.

Figure 9.1: Word frequency distribution for the book Jane Eyre with default label sizes.

We could use any of the overlay options described above to change the size of the labels:

  • Edit the system-wide Matplotlib configuration file (which would affect everyone using this computer).
  • Create a user-specific Matplotlib style sheet.
  • Create a job-specific configuration file to set plotting options in plotcounts.py.
  • Add some new command-line options to plotcounts.py .

Let’s consider these options one by one.

9.3 The Global Configuration File

Our first option is to edit the system-wide Matplotlib runtime configuration file, which is called matplotlibrc. When we import Matplotlib, it uses this file to set the default characteristics of the plot. We can find it on our system by running this command:

import matplotlib as mpl
mpl.matplotlib_fname()
/Users/amira/opt/anaconda3/lib/python3.7/site-packages/matplotlib/mpl-data/matplotlibrc

In this case the file is located in the Python installation directory (anaconda3). All the different Python packages installed with Anaconda live in a python3.7/site-packages directory, including Matplotlib.

matplotlibrc lists all the default settings as comments. The default size of the X and Y axis labels is “medium”, as is the size of the tick labels:

#axes.labelsize       : medium  ## fontsize of the x any y labels
#xtick.labelsize      : medium  ## fontsize of the tick labels
#ytick.labelsize      : medium  ## fontsize of the tick labels

We can uncomment those lines and change the sizes to “large” and “extra large”:

axes.labelsize       : x-large  ## fontsize of the x any y labels
xtick.labelsize      : large    ## fontsize of the tick labels
ytick.labelsize      : large    ## fontsize of the tick labels

and then re-generate the Jane Eyre plot with bigger labels (Figure 9.2):

$ python bin/plotcounts.py results/jane_eyre.csv --outfile results/jane_eyre.png
Word frequency distribution for the book Jane Eyre with larger label sizes.

Figure 9.2: Word frequency distribution for the book Jane Eyre with larger label sizes.

This does what we want, but is usually the wrong approach. Since matplotlibrc file sets system-wide defaults, we will now have big labels by default for all plotting we do in future, which we may not want. Secondly, we want to package our Zipf’s Law code and make it available to other people (Chapter 13). That package won’t include our matplotlibrc file, and we don’t have access to the one on their computer, so this solution isn’t as reproducible as others.

A global options file is useful, though. If we are using Matplotlib with LaTeX to generate reports and the latter is installed in an unusual place on our computing cluster, a one-line change in matplotlibrc can prevent a lot of failed jobs.

9.4 The User Configuration File

Matplotlib defines several carefully-designed styles for plots:

import matplotlib.pyplot as plt
print(plt.style.available)
['seaborn-dark', 'seaborn-darkgrid', 'seaborn-ticks', 'fivethirtyeight', 'seaborn-whitegrid', 'classic', '_classic_test', 'fast', 'seaborn-talk', 'seaborn-dark-palette', 'seaborn-bright', 'seaborn-pastel', 'grayscale', 'seaborn-notebook', 'ggplot', 'seaborn-colorblind', 'seaborn-muted', 'seaborn', 'Solarize_Light2', 'seaborn-paper', 'bmh', 'tableau-colorblind10', 'seaborn-white', 'dark_background', 'seaborn-poster', 'seaborn-deep']

In order to make the labels bigger in all of our Zipf’s Law plots, we could create a custom Matplotlib style sheet. The convention is to store custom style sheets in a stylelib sub-directory in the Matplotlib configuration directory. That directory can be located by running the following command:

mpl.get_configdir()
/Users/amira/.matplotlib

Once we’ve created the new sub-directory,

$ mkdir /Users/amira/.matplotlib/stylelib

we can add a new file called /Users/amira/.matplotlib/stylelib/big-labels.mplstyle that has the same YAML format as the matplotlibrc file:

axes.labelsize   : x-large  ## fontsize of the x any y labels
xtick.labelsize  : large    ## fontsize of the tick labels
ytick.labelsize  : large    ## fontsize of the tick labels

To use this new style, we would just need to add one line to plotcounts.py:

plt.style.use('big-labels')

Using a custom style sheet leaves the system-wide defaults unchanged, and it’s a good way to achieve a consistent look across our personal data visualization projects. However, since each user has their own stylelib directory, it doesn’t solve the problem of ensuring that other people can reproduce our plots.

9.5 Adding Command-Line Options

The third way to change the plot’s properties is to add some new command-line arguments to plotcounts.py. The choices parameter of add_argument lets us tell argparse that the user is only allowed to specify a value from a predefined list:

mpl_sizes = ['xx-small', 'x-small', 'small', 'medium', 'large', 'x-large', 'xx-large']
parser.add_argument('--labelsize', type=str, default='x-large', choices=mpl_sizes,
                    help='fontsize of the x any y labels')
parser.add_argument('--xticksize', type=str, default='large', choices=mpl_sizes,
                    help='fontsize of the x tick labels')
parser.add_argument('--yticksize', type=str, default='large', choices=mpl_sizes,
                    help='fontsize of the y tick labels')

We can then add a few lines after the ax variable is defined in plotcounts.py to update the label sizes according to the user input:

ax.xaxis.label.set_fontsize(args.labelsize)
ax.yaxis.label.set_fontsize(args.labelsize)
ax.xaxis.set_tick_params(labelsize=args.xticksize)
ax.yaxis.set_tick_params(labelsize=args.yticksize)

Alternatively, we can change the default runtime configuration settings before the plot is created. These are stored in a variable called matplotlib.rcParams:

mpl.rcParams['axes.labelsize'] = args.labelsize
mpl.rcParams['xtick.labelsize'] = args.xticksize
mpl.rcParams['ytick.labelsize'] = args.yticksize

Adding extra command line arguments is a good solution if we only want to change a small number of plot characteristics. It also makes our work more reproducible: if we use a Makefile to regenerate our plots (Chapter 8), the settings will all be saved in one place. However, matplotlibrc has hundreds of parameters we could change, so the number of new arguments can quickly get out of hand if we want to tweak other aspects of the plot.

9.6 A Job Control File

The final option—the one we will actually adopt in this case— is to pass a YAML file full of Matplotlib parameters to plotcounts.py. First, we save the parameters we want to change in a file inside our project directory. We can call it anything, but plotparams.yml seems like it will be easy to remember:

axes.labelsize   : x-large  ## fontsize of the x any y labels
xtick.labelsize  : large    ## fontsize of the tick labels
ytick.labelsize  : large    ## fontsize of the tick labels

Because this file is located in our project directory instead of the user-specific style sheet directory, we need to add one new option to plotcounts.py to load it:

parser.add_argument('--plotparams', type=str, default=None,
                    help='YAML file containing matplotlib parameters')

We can use Python’s yaml library to read that file:

with open('plotparams.yml', 'r') as reader:
    plot_params = yaml.load(reader, Loader=yaml.BaseLoader)
print(plot_params)
{'axes.labelsize': 'x-large',
 'xtick.labelsize': 'large',
 'ytick.labelsize': 'large'}

and then loop over each item in plot_params to update Matplotlib’s mpl.rcParams:

for (param, value) in param_dict.items():
    mpl.rcParams[param] = value

plotcounts.py now looks like this:

"""Plot word counts."""
import argparse
import yaml
import numpy as np
import pandas as pd
import matplotlib as mpl
from scipy.optimize import minimize_scalar


def nlog_likelihood(beta, counts):
    # ...as before...


def get_power_law_params(word_counts):
    # ...as before...


def set_plot_params(param_file):
    """Set the matplotlib parameters."""
    if param_file:
        with open(param_file, 'r') as reader:
            param_dict = yaml.load(reader, Loader=yaml.BaseLoader)
    else:
        param_dict = {}
    for param, value in param_dict.items():
        mpl.rcParams[param] = value


def plot_fit(curve_xmin, curve_xmax, max_rank, beta, ax):
    # ...as before...


def main(args):
    """Run the command line program."""
    set_plot_params(args.plotparams)
    df = pd.read_csv(args.infile, header=None, names=('word', 'word_frequency'))
    df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
    ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,
                         figsize=[12, 6], grid=True, xlim=args.xlim)

    alpha, beta = get_power_law_params(df['word_frequency'].to_numpy())
    print('alpha:', alpha)

    # Since the ranks are already sorted, we can take the last one instead of
    # computing which row has the highest rank
    max_rank = df['rank'].to_numpy()[-1]

    # Use the range of the data as the boundaries when drawing the power law curve
    curve_xmin = df['word_frequency'].min()
    curve_xmax = df['word_frequency'].max()

    # Plot the result
    plot_fit(curve_xmin, curve_xmax, max_rank, alpha, ax)
    ax.figure.savefig(args.outfile)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
                        default='-', help='Word count csv file name')
    parser.add_argument('--outfile', type=str, default='plotcounts.png',
                        help='Output image file name')
    parser.add_argument('--xlim', type=float, nargs=2, metavar=('XMIN', 'XMAX'),
                        default=None, help='X-axis limits')
    parser.add_argument('--plotparams', type=str, default=None,
                        help='YAML file containing matplotlib parameters')
    args = parser.parse_args()
    main(args)

9.7 Summary

Programs are only useful if they can be controlled, and work is only reproducible if those controls are explicit and shareable. If the number of controls needed is small, adding command-line options to programs and setting those options in Makefiles is usually the best solution. As the number of options grows, so too does the value of putting options in files of their own. And if we are installing the software on large systems that are used by many people, such as a research cluster, system-wide configuration files let us hide the details from people who just want to get their science done.

More generally, the problem of configuring a program illustrates the difference between “works for me on my machine” and “works for everyone, everywhere”. From reproducible workflows (Chapter 8) to logging (Section 10.4), this difference influences every aspect of a research software engineer’s work. We don’t always have to design for large-scale re-use, but knowing what it entails allows us to make a conscious, thoughtful choice.

9.8 Exercises

9.8.1 Building with plotting parameters

In the Makefile created in Chapter 8, the build rule involving plotcounts.py was defined as:

## results/collated.png: plot the collated results.
results/collated.png : results/collated.csv
    python $(PLOT) $< --outfile $@

Update that build rule to include the new --plotparams option. Make sure plotparams.yml is added as a second prerequisite in the updated build rule so that the appropriate commands will be re-run if the plotting parameters change.

Hint: We use the automatic variable $< to access the first prerequisite, but you’ll need $(word 2,$^) to access the second. Read about automatic variables (Section 8.5) and functions for string substitution and analysis to understand what that command is doing.

9.8.2 Using different plot styles

There are a wide variety of pre-defined matplotlib styles (Section 9.4), as illustrated at the Python Graph Gallery.

  1. Add a new option --style to plotcounts.py that allows the user to pick a style from the list of pre-defined matplotlib styles.

Hint: Use the choices parameter discussed in Section 9.5 to define the valid choices for the new --style option.

  1. Re-generate the plot of the Jane Eyre word count distribution using a bunch of different styles to decide which you like best.

  2. Matplotlib style sheets are designed to be composed together. (See the style sheets tutorial for details.) Use the nargs parameter to allow the user to pass any number of styles when using the --style option.

9.8.3 Saving configurations

  1. Add an option --saveconfig filename to plotcounts.py that writes all of the configuration for plotcounts.py to a file (or to standard output if the filename given is -). Make sure this option saves all of the configuration, including any defaults that the user hasn’t changed.
  2. Add an option --loadconfig filename to plotcounts.py that reads settings from a file like the one created in the previous step.
  3. How would you use these two options when debugging?

9.8.4 Tracing configuration

One drawback of overlay configuration is the difficulty of figuring out where particular parameters are set. Modify the functions written in this chapter so that they keep track of where parameters’ values come from. (Hint: create a second dictionary whose keys match those in the dictionary holding configuration values, but whose values are a list of the places where that configuration parameter was set.)

9.8.5 Using INI syntax

If we used Windows INI format instead of YAML for our plot parameters configuration file (i.e. plotparams.ini instead of plotparams.yml) that file would read as follows:

[AXES]
axes.labelsize=x-large

[TICKS]
xtick.labelsize=large
ytick.labelsize=large

The configparser library can be used to read and write INI files. Install that library by running pip install configparser at the command line.

Using configparser, rewrite the set_plot_params function in plotcounts.py to handle a configuration file in INI rather than YAML format.

Which file format do you find easier to work with? What other factors should influence your choice of a configuration file syntax?

9.8.6 Configuration consistency

In order for a data processing pipeline to work correctly, some of the configuration parameters for Program A and Program B must be the same. However, the programs were written by different teams, and each has its own configuration file. What steps could you take to ensure the required consistency?

9.9 Key Points

  • Overlay configuration specifies settings for a program in layers, each of which overrides previous layers.
  • Use a system-wide configuration file for general settings.
  • Use a user-specific configuration file for personal preferences.
  • Use a job-specific configuration file with settings for a particular run.
  • Use command-line options to change things that commonly change.
  • Use YAML or some other standard syntax to write configuration files.
  • Save configuration information to make your research reproducible.