Chapter 12 Testing Software

Opera happens because a large number of things amazingly fail to go wrong.

— Terry Pratchett

We have written software to count and analyze the words in classic texts, but how can we be sure it’s producing reliable results? The short answer is that we can’t—not completely—but we can test its behavior against our expectations to decide if we are sure enough. This chapter therefore explores ways to do this, including assertions, unit tests, integration tests, and regression tests.

A Scientist’s Nightmare

A successful early career researcher in protein crystallography, Geoffrey Chang, had to retract five published papers—three from the journal Science—because his code had inadvertently flipped two columns of data (G. Miller 2006). More recently, a simple calculation mistake in a paper by Reinhart and Rogoff contributed to making the financial crash of 2008 even worse for millions of people (Borwein and Bailey 2013).

Our Zipf’s Law project files are structured as they were at the end of the previous chapter:

zipf/
├── .gitignore
├── CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE.md
├── Makefile
├── README.md
├── bin
│   ├── book_summary.sh
│   ├── collate.py
│   ├── countwords.py
│   ├── plotcounts.py
│   ├── plotparams.yml
│   ├── script_template.py
│   └── utilities.py
├── data
│   ├── README.md
│   ├── dracula.txt
│   └── ...
└── results
    ├── dracula.csv
    ├── dracula.png
    └── ...

12.1 Assertions

The first step in building confidence in our programs is to assume that mistakes will happen and guard against them. This is called defensive programming, and the most common way to do it is to add assertions to our code so that it checks itself as it runs. An assertion is a statement that something must be true at a certain point in a program. When Python sees an assertion, it checks the assertion’s condition. If it’s true, Python does nothing; if it’s false, Python halts the program immediately and prints a user-defined error message. For example, this code halts as soon as the loop encounters an impossible word frequency:

total = 0.0
for freq in frequencies[:10]:
    assert freq >= 0.0, 'Word frequencies must be non-negative'
    total += freq
print('total frequency of first 10 words:', total)
-----------------------------------------------------------------
AssertionError                  Traceback (most recent call last)
<ipython-input-19-33d87ea29ae4> in <module>()
      2 total = 0.0
      3 for freq in frequencies[:10]:
----> 4     assert freq >= 0.0, 'Word frequencies must be non-negative'
      5     total += freq
      6 print('total frequency of first 10 words:', total)

AssertionError: Word frequencies must be non-negative

Programs intended for widespread use are full of assertions: 10-20% of the code they contain is there to check that the other 80-90% is working correctly. Broadly speaking, assertions fall into three categories:

  • A precondition is something that must be true at the start of a function in order for it to work correctly. For example, a function might check that the list it has been given has at least two elements and that all of its elements are integers.

  • A postcondition is something that the function guarantees is true when it finishes. For example, a function could check that the value being returned is an integer that is greater than zero, but less than the length of the input list.

  • An invariant is something that is true for every iteration of a loop. The invariant might be a property of the data (as in the example above), or it might be something like, “the value of highest is less than or equal to the current loop index.” The sketch below shows all three kinds of assertion in a single function.
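
For illustration, here is a small made-up function (it is not part of the Zipf’s Law project) that uses all three kinds of assertion:

def index_of_largest(values):
    """Return the index of the largest value in a non-empty list of numbers."""
    # Precondition: the input must contain at least one element.
    assert len(values) > 0, 'Cannot find the largest value of an empty list'
    highest = 0
    for i, value in enumerate(values):
        if value > values[highest]:
            highest = i
        # Invariant: highest never gets ahead of the current loop index.
        assert highest <= i, 'Invariant violated'
    # Postcondition: the result is a valid index into the input list.
    assert 0 <= highest < len(values), 'Result must be a valid index'
    return highest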

The function get_power_law_params in our plotcounts.py script is a good example of the need for a precondition. Its docstring does not say that its word_counts parameter must be a list of numeric word counts; even if we add that, a user might easily pass in a list of the words themselves instead. Adding an assertion makes the requirement clearer, and also guarantees that the function will fail as soon as it is called rather than failing later inside scipy.optimize.minimize_scalar with an error that would be more difficult to interpret and debug.

def get_power_law_params(word_counts):
    """
    Get the power law parameters.

    References
    ----------
    Moreno-Sanchez et al (2016) define alpha (Eq. 1),
      beta (Eq. 2) and the maximum likelihood estimation (mle)
      of beta (Eq. 6).

    Moreno-Sanchez I, Font-Clos F, Corral A (2016)
      Large-Scale Analysis of Zipf's Law in English Texts.
      PLoS ONE 11(1): e0147073.
      https://doi.org/10.1371/journal.pone.0147073
    """
    assert type(word_counts) == np.ndarray, \
           'Input must be a numerical (numpy) array of word counts'
    mle = minimize_scalar(nlog_likelihood, bracket=(1 + 1e-10, 4),
                          args=word_counts, method='brent')
    beta = mle.x
    alpha = 1 / (beta - 1)
    return alpha

12.2 Unit Testing

As the name suggests, a unit test checks the correctness of a single unit of software. Exactly what constitutes a “unit” is subjective, but it typically means the behavior of a single function in one situation. In our Zipf’s Law software, the count_words function in countwords.py is a good candidate for unit testing:

def count_words(reader):
    """Count the occurrence of each word in a string."""
    text = reader.read()
    findwords = re.compile(r"\w+", re.IGNORECASE)
    word_list = re.findall(findwords, text)
    word_counts = Counter(word_list)
    return word_counts

A single unit test will typically have:

  • a fixture, which is the thing being tested (e.g., an array of numbers);
  • an actual result, which is what the code produces when given the fixture; and
  • an expected result that the actual result is compared to.

The fixture is typically a subset or smaller version of the data the function will normally process. For instance, in order to write a unit test for the count_words function, we could use a piece of text small enough for us to count word frequencies by hand. Let’s add the poem “Risk” by Anaïs Nin to our data:

$ mkdir test_data
$ cat test_data/risk.txt
And then the day came,
when the risk
to remain tight
in a bud
was more painful
than the risk
it took
to blossom.

We can then count the words by hand to construct the expected result:

from collections import Counter

risk_poem_counts = {'the': 3, 'risk': 2, 'to': 2, 'and': 1,
  'then': 1, 'day': 1, 'came': 1, 'when': 1, 'remain': 1, 'tight':
  1, 'in': 1, 'a': 1, 'bud': 1, 'was': 1, 'more': 1, 'painful': 1,
  'than': 1, 'it': 1, 'took': 1, 'blossom': 1}
expected_result = Counter(risk_poem_counts)

We then generate the actual result by calling count_words, and use an assertion to check if it is what we expected:

import countwords

with open('test_data/risk.txt', 'r') as reader:
    actual_result = countwords.count_words(reader)
assert actual_result == expected_result

There’s no output, which means the assertion (and test) passed. (Remember, assertions only do something if the condition is false.)

12.3 Testing Frameworks

Writing one unit test is easy enough, but we should check other cases as well. To manage them, we can use a test framework (also called a test runner). The most widely-used test framework for Python is called pytest, which structures tests as follows:

  1. Tests are put in files whose names begin with test_.
  2. Each test is a function whose name also begins with test_.
  3. These functions use assert to check results.

Following these rules, we can create a test_zipfs.py script that contains the test we just developed:

from collections import Counter
import countwords

def test_word_count():
    """Test the counting of words.
    
    The example poem is Risk, by Anaïs Nin.
    """
    risk_poem_counts = {'the': 3, 'risk': 2, 'to': 2, 'and': 1, 
      'then': 1, 'day': 1, 'came': 1, 'when': 1, 'remain': 1,
      'tight': 1, 'in': 1, 'a': 1, 'bud': 1, 'was': 1, 'more': 1,
      'painful': 1, 'than': 1, 'it': 1, 'took': 1, 'blossom': 1}
    expected_result = Counter(risk_poem_counts)
    with open('test_data/risk.txt', 'r') as reader:
        actual_result = countwords.count_words(reader)
    assert actual_result == expected_result

The pytest library comes with a command-line tool that is also called pytest. When we run it with no options, it searches for all files in or below the working directory whose names match the pattern test_*.py. It then runs the tests in these files and summarizes their results. (If we only want to run the tests in a particular file, we can use the command pytest path/to/test_file.py.)

$ pytest
====================== test session starts =======================
platform darwin -- Python 3.7.6, pytest-5.4.1, py-1.8.1, pluggy-0.12.0
rootdir: /Users/amira
collected 1 item                                                               

bin/test_zipfs.py .                                        [100%]

======================= 1 passed in 0.02s ========================

To add more tests, we simply write more test_ functions in test_zipfs.py. For instance, besides counting words, the other critical part of our code is the calculation of the \(\alpha\) parameter. Earlier we defined a power law relating \(\alpha\) to the word frequency \(f\), the word rank \(r\), and a constant of proportionality \(c\) (Section 7.3):

\[ r = c f^{\frac{-1}{\alpha}} \]

We also noted that Zipf’s Law holds exactly when \(\alpha\) is equal to one. Setting \(\alpha\) to one and re-arranging the power law gives us

\[ c = f \cdot r \]

We can use this formula to generate synthetic word count data (i.e., our test fixture) with the constant of proportionality set to a hypothetical maximum word frequency of 600 (and thus \(r\) ranges from 1 to 600):

import numpy as np

max_freq = 600
word_counts = np.floor(max_freq / np.arange(1, max_freq + 1)) 

(We use np.floor to round down to the nearest whole number, because we can’t have fractional word counts.) Passing this test fixture to get_power_law_params in plotcounts.py:

def get_power_law_params(word_counts):
    """
    Get the power law parameters.
    References
    ----------
    Moreno-Sanchez et al (2016) define alpha (Eq. 1),
      beta (Eq. 2) and the maximum likelihood estimation (mle)
      of beta (Eq. 6).
    Moreno-Sanchez I, Font-Clos F, Corral A (2016)
      Large-Scale Analysis of Zipf's Law in English Texts.
      PLoS ONE 11(1): e0147073.
      https://doi.org/10.1371/journal.pone.0147073
    """
    assert type(word_counts) == np.ndarray, \
           'Input must be a numerical (numpy) array of word counts'
    mle = minimize_scalar(nlog_likelihood, bracket=(1 + 1e-10, 4),
                          args=(word_counts), method='brent')
    beta = mle.x
    alpha = 1 / (beta - 1)
    return alpha

should give us a value of 1.0. To test this, we can add a second test to test_zipfs.py:

import numpy as np
from collections import Counter

import plotcounts
import countwords

def test_alpha():
    """Test the calculation of the alpha parameter.
    
    The test word counts satisfy the relationship,
      r = cf**(-1/alpha), where
      r is the rank,
      f the word count, and
      c is a constant of proportionality.

    To generate test word counts for an expected alpha value of 
      1.0, a maximum word frequency of 600 is used
      (i.e. c = 600 and r ranges from 1 to 600)
    """    
    max_freq = 600
    word_counts = np.floor(max_freq / np.arange(1, max_freq + 1))
    actual_alpha = plotcounts.get_power_law_params(word_counts)
    expected_alpha = 1.0
    assert actual_alpha == expected_alpha

def test_word_count():
    ...as before...

Let’s re-run both of our tests:

$ pytest
====================== test session starts =======================
platform darwin -- Python 3.7.6, pytest-5.4.1, py-1.8.1, pluggy-0.12.0
rootdir: /Users/amira
collected 2 items                                                              

bin/test_zipfs.py F.                                       [100%]

============================ FAILURES ============================
___________________________ test_alpha ___________________________

    def test_alpha():
        """Test the calculation of the alpha parameter.
    
        The test word counts satisfy the relationship,
          r = cf**(-1/alpha), where
          r is the rank,
          f the word count, and
          c is a constant of proportionality.
    
        To generate test word counts for an expected alpha value of 
          1.0, a maximum word frequency of 600 is used
          (i.e. c = 600 and r ranges from 1 to 600)
        """
        max_freq = 600
        word_counts = np.floor(max_freq / np.arange(1, max_freq + 1))
        actual_alpha = plotcounts.get_power_law_params(word_counts)
        expected_alpha = 1.0
>       assert actual_alpha == expected_alpha
E       assert 0.9951524579316625 == 1.0

bin/test_zipfs.py:24: AssertionError
==================== short test summary info =====================
FAILED bin/test_zipfs.py::test_alpha - assert 0.99515245793 == 1.0
================== 1 failed, 1 passed in 0.85s ===================

The output tells us that one test failed but the other test passed. This is a very useful feature of test runners like pytest: they continue on and complete all the tests rather than stopping at the first assertion failure as a regular Python script would.

12.4 Testing Floating-Point Values

Looking at the output, we can see that while test_alpha failed, the actual_alpha value of 0.9951524579316625 was very close to the expected value of 1.0. After a bit of thought, we decide that this isn’t actually a failure: the value produced by get_power_law_params is an estimate, and being off by half a percent is good enough.

This example shows that testing scientific software almost always requires us to make the same kind of judgment calls that scientists have to make when doing any other sort of experimental work. If we are measuring the mass of a proton, we might expect ten decimal places of accuracy. If we are measuring the weight of a baby penguin, on the other hand, we’ll probably be satisfied if we’re within five grams. What matters most is that we are explicit about the bounds we used so that other people can tell what we actually did.

Degrees of Difficulty

There’s an old joke that physicists worry about decimal places, astronomers worry about powers of ten, and economists are happy if they’ve got the sign right.

So how should we write tests when we don’t know precisely what the right answer is? The best approach is to write tests that check whether the actual value is within some tolerance of the expected value. The tolerance can be expressed as the absolute error, which is the absolute value of the difference between the two values, or as the relative error, which is the ratio of the absolute error to the value we’re approximating (Goldberg 1991). For example, if we add 9+1 and get 11, the absolute error is 1 (i.e., 11-10), and the relative error is 10%. If we add 99+1 and get 101, on the other hand, the absolute error is still 1, but the relative error is only 1%.
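
If we are using pytest, both kinds of tolerance can be expressed with pytest.approx, which accepts a rel and/or an abs keyword argument. A minimal sketch using the 99+1 example above:

import pytest

# 101 differs from 100 by an absolute error of 1 and a relative error of 1%.
assert 101 == pytest.approx(100, rel=0.02)       # within a 2% relative tolerance
assert 101 == pytest.approx(100, abs=2)          # within an absolute tolerance of 2
assert not (101 == pytest.approx(100, abs=0.5))  # but not within 0.5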

For test_alpha, we might decide that an absolute error of 0.01 in the estimation of \(\alpha\) is acceptable, and check that the actual value lies within this tolerance using pytest.approx:

import pytest
import numpy as np
from collections import Counter

import plotcounts
import countwords

def test_alpha():
    """Test the calculation of the alpha parameter.
    
    The test word counts satisfy the relationship,
      r = cf**(-1/alpha), where
      r is the rank,
      f the word count, and
      c is a constant of proportionality.

    To generate test word counts for an expected alpha value of 
      1.0, a maximum word frequency of 600 is used
      (i.e. c = 600 and r ranges from 1 to 600)
    """    
    max_freq = 600
    word_counts = np.floor(max_freq / np.arange(1, max_freq + 1))
    actual_alpha = plotcounts.get_power_law_params(word_counts)
    expected_alpha = pytest.approx(1.0, abs=0.01)
    assert actual_alpha == expected_alpha

def test_word_count():
    ...as before...

When we re-run pytest, both tests now pass:

$ pytest
====================== test session starts =======================
platform darwin -- Python 3.7.6, pytest-5.4.1, py-1.8.1, pluggy-0.12.0
rootdir: /Users/amira
collected 2 items                                                              

bin/test_zipfs.py ..                                       [100%]

======================= 2 passed in 0.69s ========================

Testing Visualizations

Testing visualizations is hard: any change to the dimension of the plot, however small, can change many pixels in a raster image, and cosmetic changes such as moving the legend up a couple of pixels will cause all of our tests to fail.

The simplest solution is therefore to test the data used to produce the image rather than the image itself. Unless we suspect that the plotting library contains bugs, the correct data should always produce the correct plot.
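
For example, instead of comparing dracula.png pixel by pixel with a saved reference image, we could check properties of the word counts it is drawn from. The sketch below assumes that results/dracula.csv has two unlabeled columns (the word and its count) sorted from most to least frequent; adjust it if the real file is laid out differently:

import pandas as pd

def test_plot_data():
    """Check the data behind dracula.png rather than the image itself."""
    counts = pd.read_csv('results/dracula.csv', header=None,
                         names=('word', 'count'))
    # Every count should be a positive number...
    assert (counts['count'] > 0).all()
    # ...and the counts should already be sorted for plotting.
    assert counts['count'].is_monotonic_decreasing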

12.5 Testing Error Handling

An alarm isn’t much use if it doesn’t go off when it’s supposed to. Equally, if a function doesn’t raise an exception when it should (Section 11.1), errors can easily slip past us. If we want to check that a function called func raises an ExpectedError exception, we can use the following template:

...set up fixture...
try:
    actual = func(fixture)
    assert False, 'Expected function to raise exception'
except ExpectedError as error:
    pass
except Exception as error:
    assert False, 'Function raised the wrong exception'

This template has three cases:

  1. If the call to func returns a value without throwing an exception then something has gone wrong, so we assert False (which always fails).

  2. If func raises the error it’s supposed to then we go into the first except branch without triggering the assert immediately below the function call. The code in this except branch could check that the exception contains the right error message, but in this case it does nothing (which in Python is written pass).

  3. Finally, if the function raises the wrong kind of exception we also assert False. Checking this case might seem overly cautious, but if the function raises the wrong kind of exception, users could easily fail to catch it.

This pattern is so common that pytest provides support for it. Instead of the eight lines in our original example, we can write:

import pytest

...set up fixture...
with pytest.raises(ExpectedError):
    actual = func(fixture)

The argument to pytest.raises is the type of exception we expect. The call to the function then goes in the body of the with statement. We will explore pytest.raises further in the exercises.
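
For example, here is a sketch (the test name is ours) of how we might check that the precondition we added to get_power_law_params in Section 12.1 fires when the function is given a plain list instead of a NumPy array:

import pytest

import plotcounts

def test_alpha_rejects_list():
    """A plain list should trip the function's type assertion."""
    with pytest.raises(AssertionError):
        plotcounts.get_power_law_params([1, 2, 3])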

12.6 Integration Testing

Our Zipf’s Law analysis has two steps: counting the words in a text and estimating the \(\alpha\) parameter from the word count. Our unit tests give us some confidence that these components work in isolation, but do they work correctly together? Checking that is called integration testing.

Integration tests are structured the same way as unit tests: a fixture is used to produce an actual result that is compared against the expected result. However, creating the fixture and running the code can be considerably more complicated. For example, in the case of our Zipf’s Law software an appropriate integration test fixture might be a text file with a word frequency distribution that has a known \(\alpha\) value. In order to create this test fixture, we need a way to generate random words.

Fortunately, a Python library called randomwordgenerator exists to do just that. We can install it and the pypandoc library it depends on using pip, the Python Package Installer:

$ pip install pypandoc
$ pip install randomwordgenerator

Borrowing from the word count distribution we created for test_alpha, we can then create a text file full of random words with a frequency distribution that corresponds to an \(\alpha\) of approximately 1.0:

import numpy as np
from randomwordgenerator import randomwordgenerator

max_freq = 600
word_counts = np.floor(max_freq / np.arange(1, max_freq + 1))
random_words = randomwordgenerator.generate_random_words(n=max_freq)
with open('test_data/random_words.txt', 'w') as writer:
    for index in range(max_freq):
        word_sequence = f"{random_words[index]} " * int(word_counts[index])
        writer.write(word_sequence + '\n')

We can then add this integration test to test_zipfs.py:

def test_integration():
    """Test the full word count to alpha parameter workflow."""    

    with open('test_data/random_words.txt', 'r') as reader:
        word_counts_dict = countwords.count_words(reader)
    word_counts_array = np.array(list(word_counts_dict.values()))
    actual_alpha = plotcounts.get_power_law_params(word_counts_array)
    expected_alpha = pytest.approx(1.0, abs=0.01)
    assert actual_alpha == expected_alpha

Finally, we re-run pytest to check that the integration test passes:

$ pytest
====================== test session starts =======================
platform darwin -- Python 3.7.6, pytest-5.4.1, py-1.8.1, pluggy-0.12.0
rootdir: /Users/amira
collected 3 items                                                                                         

bin/test_zipfs.py ...                                      [100%]

======================== 3 passed in 0.48s =======================

12.7 Regression Testing

So far we have tested our code on two simplified texts: a short poem and a collection of random words with a known frequency distribution. The next step is to test with real data, i.e., an actual book. The problem is, we don’t know the expected result: it’s not practical to count the words in Dracula by hand, and even if we tried, the odds are good that we’d make a mistake. For this kind of situation we can use regression testing. Rather than assuming that the test’s author knows what the expected result should be, a regression test compares today’s answer with a previous one. This doesn’t guarantee that the answer is right—if the original answer is wrong, we could carry that mistake forward indefinitely—but it does draw attention to any changes (or “regressions”).

In Section 7.4 we calculated an \(\alpha\) of 1.0866646252515038 for Dracula. Let’s use that value to add a regression test to test_zipfs.py:

def test_regression():
    """Regression test for Dracula."""    

    with open('data/dracula.txt', 'r') as reader:
        word_counts_dict = countwords.count_words(reader)
    word_counts_array = np.array(list(word_counts_dict.values()))
    actual_alpha = plotcounts.get_power_law_params(word_counts_array)
    expected_alpha = pytest.approx(1.087, abs=0.001)
    assert actual_alpha == expected_alpha

Re-running pytest shows that all four tests pass:

$ pytest
====================== test session starts =======================
platform darwin -- Python 3.7.6, pytest-5.4.1, py-1.8.1, pluggy-0.12.0
rootdir: /Users/amira
collected 4 items                                                                                

bin/test_zipfs.py ....                                     [100%]

======================= 4 passed in 0.56s ========================

12.8 Test Coverage

How much of our code do the tests we have written so far actually check? To find out, we can use a tool to measure their code coverage, i.e., which lines of our code the tests actually execute. Most Python programmers use the coverage library, which we can once again install using pip:

$ pip install coverage

Once we have it, we can use it to run pytest on our behalf:

$ coverage run -m pytest
====================================== test session starts =======================================
platform darwin -- Python 3.7.6, pytest-5.4.1, py-1.8.1, pluggy-0.12.0
rootdir: /Users/amira
collected 4 items                                                                                

bin/test_zipfs.py ....                                                                     [100%]

======================================= 4 passed in 0.72s ========================================

The coverage command doesn’t display any information of its own, since mixing that in with our program’s output would be confusing. Instead, it puts coverage data in a file called .coverage (with a leading .) in the current directory. To display that data, we run:

$ coverage report -m
Name            Stmts   Miss  Cover   Missing
---------------------------------------------
countwords.py      20      8    60%   19-21, 25-31
plotcounts.py      46     27    41%   41-47, 67-69, 74-90, 94-104
test_zipfs.py      32      0   100%
utilities.py        7      4    43%   24-27
---------------------------------------------
TOTAL             105     39    63%

This summary shows us that some lines of countwords.py and plotcounts.py were not executed when we ran the tests: in fact, only 60% and 41% of their lines were run, respectively. This makes sense, since much of the code in those scripts is devoted to handling command line arguments or file I/O rather than the word counting and parameter estimation functionality that our unit, integration, and regression tests focus on.

To make sure that’s the case, we can get a more complete report by running coverage html at the command line and opening htmlcov/index.html. Clicking on the name of our countwords.py script, for instance, produces the colorized line-by-line display shown in Figure 12.1.

Figure 12.1: Example of Python code coverage report.

This output confirms that all of the lines relating to word counting were tested, but that none of the lines related to argument handling or I/O were.

Is this good enough? The answer depends on what the software is being used for and by whom. If it is for a safety-critical application such as a medical device, we should aim for 100% code coverage, i.e., every single line in the application should be tested. In fact, we should probably go further and aim for 100% path coverage to ensure that every possible path through the code has been checked. Similarly, if the software has become popular and is being used by thousands of researchers all over the world, we should probably check that it’s not going to embarrass us.
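
To see the difference between line coverage and path coverage, consider this small made-up function with two independent branches. Two tests, clip(-5, 0, 10) and clip(15, 0, 10), execute every line (100% line coverage), yet only two of the four possible paths through the function have been exercised:

def clip(value, low, high):
    """Limit value to the range [low, high]."""
    if value < low:    # taken by the first test only
        value = low
    if value > high:   # taken by the second test only
        value = high
    return value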

But most of us don’t write software that people’s lives depend on, or that is in a “top 100” list, so requiring 100% code coverage is like asking for ten decimal places of accuracy when checking the voltage of a household electrical outlet. We always need to balance the effort required to create tests against the likelihood that those tests will uncover useful information. We also have to accept that no amount of testing can prove a piece of software is completely correct. A function with only two numeric arguments has \(2^{128}\) possible inputs (each 64-bit argument can take any of \(2^{64}\) values). Even if we could write the tests, how could we be sure we were checking the result of each one correctly?

Luckily, we can usually put test cases into groups. For example, when testing a function that summarizes a table full of data, it’s probably enough to check that it handles tables with:

  • no rows
  • only one row
  • many identical rows
  • rows having keys that are supposed to be unique, but aren’t
  • rows that contain nothing but missing values

Some projects develop checklists like this one to remind programmers what they ought to test. These checklists can be a bit daunting for newcomers, but they are a great way to pass on hard-earned experience.
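
In pytest, a checklist like this can often be written down directly as a parametrized test. Here is a sketch for count_words (the cases and the test name are ours), using io.StringIO so that each case can be fed to the function as if it were an open file:

import io
from collections import Counter

import pytest

import countwords

@pytest.mark.parametrize('text, expected', [
    ('', Counter()),                           # no words at all
    ('word', Counter({'word': 1})),            # a single word
    ('word word word', Counter({'word': 3})),  # many identical words
])
def test_count_words_cases(text, expected):
    assert countwords.count_words(io.StringIO(text)) == expected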

12.9 Continuous Integration

Now that we have a set of tests, we could run pytest every now and again to check our code. This is probably sufficient for short-lived projects, but if several people are involved, or if we are making changes over weeks or months, we might forget to run the tests or it might be difficult to identify which change is responsible for a test failure.

The solution is continuous integration (CI), which runs tests automatically whenever a change is made. CI tells developers immediately if changes have caused problems, which makes them much easier to fix. CI can also be set up to run tests with several different configurations of the software or on several different operating systems, so that a programmer using Windows can be warned that a change breaks things for Mac users and vice versa.

One popular CI tool is Travis CI, which integrates well with GitHub. If Travis CI has been set up, then every time a change is committed to a GitHub repository, Travis CI creates a fresh environment, makes a fresh clone of the repository (Section 7.8), and runs whatever commands the project’s managers have set up.

To set up CI for a project, we must:

  1. Create an account on Travis CI (if we don’t already have one).
  2. Link our Travis CI account to our GitHub account (if we haven’t done so already).
  3. Tell Travis CI to watch the repository that contains our project.

Creating an account with an online service is probably a familiar process, but linking our Travis CI account to our GitHub account may be something new. We only have to do this once to allow Travis CI to access all our GitHub repositories, but we should always be careful when giving sites access to other sites, and only trust well-established and widely-used services.

Once we have created an account, we can tell Travis CI which repository we want it to watch by clicking the “+” next to the “My Repositories” link on the left-hand side of the Travis CI homepage (Figure 12.2).

Figure 12.2: Click to add a new GitHub repository to Travis CI.

To add the GitHub repository we have been using throughout the course, find it in the repository list and toggle the switch so that it turns green (Figure 12.3). If the repository doesn’t show up, re-synchronize the list using the green “Sync account” button on the left sidebar. If it still doesn’t appear, the repository may belong to someone else or be private.

Figure 12.3: Find Zipf’s Law repository and switch it on.

The next step is to tell Travis CI what we want it to do by creating a file called .travis.yml. (The leading . in the name hides the file from casual listings on Mac or Linux, but not on Windows.) This file must be in the root directory of the repository, and is written in YAML (Section 10.1 and Appendix H). For our project, we add the following lines:

language: python

python:
- "3.6"

script:
- pytest

The language key tells Travis CI which programming language to use, so that it knows which of its standard virtual machines to use as a starting point for the project. The python key specifies the version or versions of Python to use, while the script key lists the commands to run—in this case, pytest. We can now go ahead and push the .travis.yml file to GitHub.

$ git add .travis.yml 
$ git commit -m "Initial commit of travis configuration file"
[master 71084f7] Initial commit of travis configuration file
 1 file changed, 4 insertions(+)
 create mode 100644 .travis.yml
$ git push origin master
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 344 bytes | 344.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/amira-khan/zipf.git
   1f0590b..71084f7  master -> master

When this commit reaches GitHub, that site notifies Travis CI that the repository has changed. Travis CI then follows the instructions in .travis.yml and reports whether the build passed (shown in green) or produced warnings or errors (shown in red). To create this report, Travis CI has:

  1. Created a new Linux virtual machine.
  2. Installed the desired version of Python.
  3. Run the commands below the script key.
  4. Reported the results at https://travis-ci.org/USER/REPO, where USER/REPO identifies the repository for a given user.

In this case, we can see that the build failed (Figure 12.4).

Figure 12.4: Travis build overview (build failed).

Scrolling down to read the job log in detail, it says that it “could not locate requirements.txt.” This happens because the Python scripts that are run when pytest is executed (i.e. test_zipfs.py, plotcounts.py, countwords.py and utilities.py) import a number of packages that don’t come with the Python Standard Library. To fix this problem, we need to do two things. The first is to add an install key to .travis.yml:

language: python

python:
- "3.6"

install:
- pip install -r requirements.txt

script:
- pytest

The second is to create requirements.txt, which lists the libraries that need to be installed:

numpy
pandas
matplotlib
scipy
pytest
pyyaml

We commit these changes to GitHub:

$ git add .travis.yml requirements.txt
$ git commit -m "Adding requirements"
[master d96593f] Adding requirements
 2 files changed, 16 insertions(+), 1 deletion(-)
 create mode 100644 requirements.txt
$ git push origin master
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 344 bytes | 344.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/amira-khan/zipf.git
   71084f7..d96593f  master -> master

Travis CI automatically runs again. This time our tests pass and the build completes successfully (Figure 12.5).

Figure 12.5: Travis build overview (build succeeded).

This example shows one of the other benefits of CI: it forces us to be explicit about what we are doing and how we do it, just as writing a Makefile forces us to be explicit about exactly how we produce results (Zampetti et al. 2020).

12.10 When to Write Tests

We have now met the three major types of test: unit, integration and regression. At what point in the code development process should we write these? The answer depends on who you ask.

Many programmers are passionate advocates of a practice called test-driven development (TDD). Rather than writing code and then writing tests, they write the tests first and then write just enough code to make those tests pass. Once the code is working, they clean it up (Section G.4) and then move on to the next task. TDD’s advocates claim that this leads to better code because:

  1. Writing tests clarifies what the code is actually supposed to do.

  2. It eliminates confirmation bias. If someone has just written a function, they are predisposed to want it to be right, so they will bias their tests towards proving that it is correct instead of trying to uncover errors.

  3. Writing tests first ensures that they actually get written.

These arguments are plausible. However, studies such as Fucci et al. (2016) and Fucci et al. (2017) don’t support them: in practice, writing tests first or last doesn’t appear to affect productivity. What does have an impact is working in small, interleaved increments, i.e., writing just a few lines of code and testing it before moving on rather than writing several pages of code and then spending hours on testing.

So how do most data scientists figure out if their software is doing the right thing? The answer is spot checks: each time they produce an intermediate or final result, they scan a table, create a chart, or inspect some summary statistics to see if everything looks OK. Their heuristics are usually easy to state, like “there shouldn’t be NAs at this point” or “the age range should be reasonable,” but applying those heuristics to a particular analysis always depends on their evolving insight into the data in question.

By analogy with test-driven development, we could call this process “checking-driven development.” Each time we add a step to our pipeline and look at its output, we can also add a check of some kind to the pipeline to ensure that what we are checking for remains true as the pipeline evolves or is run on other data. Doing this helps reusability—it’s amazing how often a one-off analysis winds up being used many times—but the real goal is comprehensibility. If someone can get our code and data, then runs the code on the data, and gets the same result that we did, then our computation is reproducible, but that doesn’t mean they can understand it. Comments help (either in the code or as blocks of prose in a computational notebook), but they won’t check that assumptions and invariants hold. And unlike comments, runnable assertions can’t fall out of step with what the code is actually doing.
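
As a sketch of what such a check might look like (the file name and column name here are made up), the heuristics quoted above can be turned into a small function that is called at the relevant step of the pipeline:

import pandas as pd

def check_ages(data):
    """Runnable versions of two common spot checks."""
    assert not data['age'].isna().any(), 'There should not be NAs at this point'
    assert data['age'].between(0, 120).all(), 'The age range should be reasonable'
    return data

# The checks now run every time the pipeline runs,
# not just the first time we happen to look at the output.
surveys = check_ages(pd.read_csv('surveys.csv'))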

12.11 Summary

Testing data analysis pipelines is often harder than testing mainstream software applications, since data analysts often don’t know what the right answer is (Braiek and Khomh 2018). (If we did, we would have submitted our report and moved on to the next problem already.) The key distinction is the difference between validation, which asks whether the specification is correct, and verification, which asks whether we have met that specification. The difference between them is the difference between building the right thing and building something right; the practices introduced in this chapter will help with both.

12.12 Exercises

12.12.1 Explaining assertions

Given a list of numbers, the function total returns the total:

total([1, 2, 3, 4])
10

total only works on numbers:

total(['a', 'b', 'c'])
ValueError: invalid literal for int() with base 10: 'a'

Explain in words what the assertions in this function check, and for each one, give an example of input that will make that assertion fail.

def total(values):
    assert len(values) > 0
    for element in values:
        assert int(element)
    values = [int(element) for element in values]
    total = sum(values)
    assert total > 0
    return total

12.12.2 Rectangle normalization

A rectangle can be described using a tuple of four Cartesian coordinates (x0, y0, x1, y1), where (x0, y0) represents the lower left corner and (x1, y1) the upper right. In order to do some calculations, suppose we need to be able to normalize rectangles so that the lower left corner is at the origin (i.e. (x0, y0) = (0, 0)) and the longest side is 1.0 units long. This function does that:

def normalize_rectangle(rect):
    """Normalizes a rectangle so that it is at the origin
    and 1.0 units long on its longest axis.
    Input should be of the format (x0, y0, x1, y1).
    (x0, y0) and (x1, y1) define the lower left and
    upper right corners of the rectangle, respectively."""
    
    x0, y0, x1, y1 = rect  # insert preconditions before and after
    
    dx = x1 - x0
    dy = y1 - y0
    if dx > dy:
        scaled = float(dx) / dy
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    # insert postconditions here

    return (0, 0, upper_x, upper_y)

In order to answer the following questions, cut and paste the normalize_rectangle function into a new file called geometry.py and save that file in a new directory called exercises.

  1. To ensure that the inputs to normalize_rectangle are valid, add preconditions to check that
     1. rect contains 4 coordinates,
     2. the width of the rectangle is a positive, non-zero value (i.e. x0 < x1), and
     3. the height of the rectangle is a positive, non-zero value (i.e. y0 < y1).

  2. If the normalization calculation has worked correctly, the new x1 coordinate will lie between 0 and 1 (i.e. 0 < upper_x <= 1.0). Add a postcondition to check that this is true. Do the same for the new y1 coordinate, upper_y.

Running normalize_rectangle for a tall, skinny rectangle should pass your new preconditions and postconditions,

import geometry

geometry.normalize_rectangle([2, 5, 3, 10])                                                             
(0, 0, 0.2, 1.0)

but will fail for a short, wide rectangle:

import geometry

geometry.normalize_rectangle([20, 15, 30, 20])
AssertionError                  Traceback (most recent call last)
<ipython-input-3-f4e8cdf7f69d> in <module>
----> 1 geometry.normalize_rectangle([20, 15, 30, 20])

~/Desktop/exercises/geometry.py in normalize_rectangle(rect)
     19 
     20     assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
---> 21     assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'
     22 
     23     return (0, 0, upper_x, upper_y)

AssertionError: Calculated upper Y coordinate invalid

  3. Find and correct the source of the error in normalize_rectangle. Once fixed, you should be able to successfully run geometry.normalize_rectangle([20, 15, 30, 20]).

  4. Write a unit test for tall, skinny rectangles and save it in a new file called test_geometry.py. Run pytest to make sure the test passes.

  5. Add a couple more unit tests to test_geometry.py. Explain the rationale behind each test.

12.12.3 Test error handling

In Chapter 11, we modified collate.py to handle different types of errors associated with reading input files. The relevant code appears in main:

"""
Combine multiple word count CSV-files into a single cumulative count.
"""
import csv
import argparse
from collections import Counter
import logging
import utilities


def update_counts(reader, word_counts):
    """Update word counts with data from another reader/file."""
    for word, count in csv.reader(reader):
        word_counts[word] += int(count)

def main(args):
    """Run the command line program."""
    log_level = logging.DEBUG if args.verbose else logging.WARNING
    logging.basicConfig(level=log_level, filename=args.logfile)
    word_counts = Counter()
    logging.info('Processing files...')
    for file_name in args.infiles:
        try:
            logging.debug(f'Reading in {file_name}...')
            if file_name[-4:] != '.csv':
                raise OSError(utilities.ERROR_MESSAGES['not_csv_file_suffix'].format(file_name=file_name))
            with open(file_name, 'r') as reader:
                logging.debug('Computing word counts...')
                update_counts(reader, word_counts)
        except FileNotFoundError:
            logging.warning(f'{file_name} not processed: File does not exist')
        except PermissionError:
            logging.warning(f'{file_name} not processed: No permission to read file')
        except Exception as error:
            logging.warning(f'{file_name} not processed: {error}')
    utilities.collection_to_csv(word_counts, num=args.num)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infiles', type=str, nargs='*', 
                        help='Input file names')
    parser.add_argument('-n', '--num', type=int, default=None,
                        help='Limit output to N most frequent words')
    parser.add_argument(
      '-v', '--verbose', action="store_true", 
      default=False,
      help="Change logging threshold from WARNING to DEBUG"
    )
    parser.add_argument('-l', '--logfile', type=str, 
                        default='collate.log',
                        help='Name of the log file')
    args = parser.parse_args()
    main(args)

In this chapter we discussed using pytest.raises to test whether the error handling in a program is working as expected (Section 12.5).

  1. It is difficult to write a simple unit test for the lines of code dedicated to reading input files, because main is a long function that requires command line arguments as input. Edit collate.py so that the six lines of code responsible for processing an input file appear in their own function that reads as follows (i.e. once you are done, main should call process_file in place of the existing code):
def process_file(file_name, word_counts):
    """Read file and update word counts"""
    logging.debug(f'Reading in {file_name}...')
    if file_name[-4:] != '.csv':
        raise OSError(utilities.ERROR_MESSAGES['not_csv_file_suffix'].format(file_name=file_name))
    with open(file_name, 'r') as reader:
        logging.debug('Computing word counts...')
        update_counts(reader, word_counts)

  2. Add a unit test to test_zipfs.py that uses pytest.raises to check that the new collate.process_file function raises an OSError if the input file does not end in .csv. Run pytest to check that the new test passes.

  3. Add a unit test to test_zipfs.py that uses pytest.raises to check that the new collate.process_file function raises a FileNotFoundError if the input file does not exist. Run pytest to check that the new test passes.

  4. Use the coverage library to check that the relevant commands in process_file (specifically raise OSError and open(file_name, 'r')) were indeed tested.

12.12.4 Testing with randomness

Programs that rely on random numbers are impossible to test because there’s (deliberately) no way to predict their output. Luckily, computer programs don’t actually use random numbers: they use a pseudo-random number generator (PRNG) that produces values in a repeatable but unpredictable way. Given the same initial seed, a PRNG will always produce the same sequence of values. How can we use this fact when testing programs that rely on pseudo-random numbers?

12.12.5 Testing with relative error

If E is the expected result of a function and A is the actual value it produces, the relative error is abs((A-E)/E). This means that if we expect the results of tests to be 2, 1, and 0, and we actually get 2.1, 1.1, and 0.1, the relative errors are 5%, 10%, and infinity. Why does this seem counter-intuitive, and what might be a better way to measure error in this case?

12.13 Key Points

  • Test software to convince people (including yourself) that it is correct enough, and to make tolerances on “enough” explicit.
  • Add assertions to code so that it checks itself as it runs.
  • Write unit tests to check individual pieces of code.
  • Write integration tests to check that those pieces work together correctly.
  • Write regression tests to check if things that used to work no longer do.
  • A test framework finds and runs tests written in a prescribed fashion and reports their results.
  • Test coverage is the fraction of lines of code that are executed by a set of tests.
  • Continuous integration re-builds and/or re-tests software every time something changes.