Chapter 13 Python Packaging

Another response of the wizards, when faced with a new and unique situation, was to look through their libraries to see if it had ever happened before. This was…a good survival trait. It meant that in times of danger you spent the day sitting very quietly in a building with very thick walls.

— Terry Pratchett

The more software we write, the more we think of a programming language as a way to build and combine libraries. Every widely-used language now has an online repository from which people can download and install those libraries. This lesson shows you how to use Python’s tools to create and share libraries of your own.

We will continue with our Zipf’s Law project, which should include the following files:

zipf/
├── .gitignore
├── CONDUCT.md
├── CONTRIBUTING.md
├── KhanVirtanen2020.md
├── LICENSE.md
├── Makefile
├── README.md
├── environment.yml
├── requirements.txt
├── bin
│   ├── book_summary.sh
│   ├── collate.py
│   ├── countwords.py
│   ├── plotcounts.py
│   ├── plotparams.yml
│   ├── script_template.py
│   ├── test_zipfs.py
│   └── utilities.py
├── data
│   ├── README.md
│   ├── dracula.txt
│   └── ...
├── results
│   ├── dracula.csv
│   ├── dracula.png
│   └── ...
└── test_data
    ├── random_words.txt
    └── risk.txt

13.1 Creating a Python Package

A package consists of one or more Python source files in a specific directory structure combined with installation instructions for the computer. Python packages can come from various sources: some are distributed with Python itself, but anyone can create one, and there are thousands that can be downloaded and installed from online repositories.

Terminology

People sometimes refer to packages as modules. Strictly speaking, a module is a single source file, while a package is a directory structure that contains one or more modules.

A generic package folder hierarchy looks like this:

pkg_name
├── pkg_name
│   ├── module1.py
│   └── module2.py
├── README.md
└── setup.py

The top-level directory is named after the package. It contains a directory that is also named after the package, which in turn contains the package’s source files. Having two directories with the same name is a little confusing at first, but most Python projects follow this convention because it makes the project easier to set up for installation. We can get this structure in our Zipf’s Law project by renaming zipf/bin to zipf/zipf.

__init__.py

Python packages often contain a file with a special name: __init__.py (two underscores before and after init). Just as importing a module file executes the code in the module, importing a package executes the code in __init__.py. Packages had to have this file before Python 3.3, even if it was empty, but since Python 3.3 it is only needed if we want to run some code as the package is being imported.
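As a rough illustration of import-time execution, the sketch below builds a throwaway package in a temporary directory and imports it. The package name demo_pkg and its contents are hypothetical and are not part of the Zipf’s Law project.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_demo_package():
    """Create a throwaway package and import it to show __init__.py running."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / 'demo_pkg'    # hypothetical package name
        pkg_dir.mkdir()
        # Code in __init__.py runs once, when the package is first imported.
        (pkg_dir / '__init__.py').write_text("GREETING = 'set during import'\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            package = importlib.import_module('demo_pkg')
            return package.GREETING
        finally:
            # Clean up so the demonstration leaves no trace behind.
            sys.path.remove(tmp)
            sys.modules.pop('demo_pkg', None)


print(import_demo_package())
```

The value of GREETING only exists because Python executed __init__.py while importing the package.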

To make the Zipf’s Law project work as a Python package, we only need to make one important change to the code itself: changing the syntax for how we import our own modules. Currently, both collate.py and countwords.py contain this line:

import utilities

while test_zipfs.py contains:

import plotcounts
import countwords

These are called implicit relative imports, because it is not clear whether we mean “import a Python package called utilities” or “import a file in our local directory called utilities.py” (which is what we want). To remove this ambiguity we need to be explicit and write,

from zipf import utilities

and

from zipf import plotcounts
from zipf import countwords

These are absolute imports since we are specifying the full location of utilities, plotcounts and countwords inside the zipf package. Absolute imports are the preferred way for parts of a package to import other parts, but we can also use explicit relative imports, which require a little less typing and can sometimes make it easier to restructure very large projects:

from . import utilities

Here, the . signals that utilities exists in the current directory.

Python has several ways to build an installable package. We will show how to use setuptools, which is the lowest common denominator and will allow everyone, regardless of what Python distribution they have, to use our package. To use setuptools, we must create a file called setup.py in the directory above the root directory of the package. (This is why we require the two-level directory structure described earlier.) setup.py must have exactly that name, and must contain lines like these:

from setuptools import setup


setup(
    name='zipf',
    version='0.1.0',
    author='Amira Khan',
    packages=['zipf'])

The name and author parameters are self-explanatory. Most software projects use semantic versioning for software releases. A version number consists of three integers X.Y.Z, where X is the major version, Y is the minor version, and Z is the patch version. Major version zero (0.Y.Z) is for initial development, so we have started with 0.1.0. The first stable public release would be version 1.0.0, and in general, the version number is incremented as follows:

  • Increment major every time there’s an incompatible externally-visible change
  • Increment minor when adding new functionality in a backwards-compatible manner (i.e. without breaking any existing code)
  • Increment patch for backwards-compatible bug fixes that don’t add any new features
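The three rules above can be sketched as a small function. The name bump_version is a hypothetical helper written for this illustration, not part of setuptools or of our package; it assumes the version is always written as three integers separated by dots.

```python
def bump_version(version, part):
    """Return version with the given part ('major', 'minor', 'patch') incremented."""
    major, minor, patch = (int(piece) for piece in version.split('.'))
    if part == 'major':
        # An incompatible change resets the minor and patch numbers.
        return f'{major + 1}.0.0'
    if part == 'minor':
        # New backwards-compatible functionality resets the patch number.
        return f'{major}.{minor + 1}.0'
    if part == 'patch':
        return f'{major}.{minor}.{patch + 1}'
    raise ValueError(f'unknown version part: {part}')


print(bump_version('0.1.0', 'patch'))   # 0.1.1
print(bump_version('0.1.1', 'minor'))   # 0.2.0
print(bump_version('0.2.0', 'major'))   # 1.0.0
```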

Finally, we specify the name of the directory containing the code to be packaged with the packages parameter. This is straightforward in our case because we only have a single package directory. For more complex projects, the find_packages function from setuptools can automatically find all packages by recursively searching the current directory.

13.2 Virtual Environments

We can add additional information to our package later, but this is enough to be able to build it for testing purposes. Before we do that, though, we should create a virtual environment to test how our package installs without breaking anything in our main Python installation.

A virtual environment is a layer on top of an existing Python installation. Whenever Python needs to find a package, it looks in the virtual environment before checking the main Python installation. This gives us a place to install packages that only some projects need without affecting other projects.

Virtual environments also help with package development:

  • We want to be able to easily test installing and uninstalling our package without affecting the entire Python environment.
  • We want to answer problems people have with our package with something more helpful than “I don’t know, it works for me”. By installing and running our package in a completely empty environment, we can ensure that we’re not accidentally relying on other packages being installed.
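When debugging installation problems, it can help to ask Python itself which installation and environment it is running in. The function below is a minimal sketch using only the standard library; describe_environment is a hypothetical name, and the check of the CONDA_DEFAULT_ENV variable assumes the environment was activated through conda.

```python
import os
import sys


def describe_environment():
    """Report which Python installation and environment are in use."""
    return {
        # Directory holding this interpreter's libraries and site-packages.
        'prefix': sys.prefix,
        # True inside an environment created with the standard venv module.
        'in_venv': sys.prefix != sys.base_prefix,
        # conda sets this variable when one of its environments is active.
        'conda_env': os.environ.get('CONDA_DEFAULT_ENV'),
    }


for key, value in describe_environment().items():
    print(f'{key}: {value}')
```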

We can manage virtual environments using conda (Appendix I). To create a new virtual environment called zipf we run conda create, specifying the environment’s name with the -n or --name flag and listing python as the base to build on:

$ conda create -n zipf python
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/amira/anaconda3/envs/zipf

...

Proceed ([y]/n)? y

...

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate zipf
#
# To deactivate an active environment, use
#
#     $ conda deactivate

conda creates the directory ~/anaconda3/envs/zipf, which contains the subdirectories needed for a minimal Python installation, such as bin and lib. It also creates ~/anaconda3/envs/zipf/bin/python, which checks for packages in these directories before checking the main installation.

We can switch to the zipf environment by running:

$ conda activate zipf

Once we have done this, the python command runs the interpreter in zipf/bin:

(zipf)$ which python
/home/amira/anaconda3/envs/zipf/bin/python

Notice that the shell prompt displays (zipf) while that virtual environment is active. Between Git branches and virtual environments, it can be very easy to lose track of what exactly we are working on and with. Prompts like this can make it a little less confusing; using virtual environment names that match the names of your projects (and branches, if you’re testing different environments on different branches) quickly becomes essential.

We can now install packages safely. Everything we install will go into the zipf virtual environment without affecting the underlying Python installation. When we are done, we can switch back to the default environment using conda deactivate:

(zipf)$ conda deactivate
$ which python
/usr/bin/python

13.3 Installing a Development Package

Let’s install our package inside this virtual environment. First we re-activate it:

$ conda activate zipf

Next, we go into the upper zipf directory that contains our setup.py file and install our package using pip install -e .. The -e option indicates that we want to install the package in “editable” mode, which means that any changes we make in the package code are directly available to use without having to reinstall the package; the . means “install from the current directory”:

(zipf)$ cd zipf
(zipf)$ pip install -e .
Processing /home/amira/proj/py-rse/zipf/zipf
Building wheels for collected packages: zipf
  Building wheel for zipf (setup.py) ... done
  Created wheel for zipf: filename=zipf-0.1.0-py3-none-any.whl size=4574 sha256=b7d645f1d07775714855a83d0cc62911c8502eb917fcb3fe2d9fe46206c84656
  Stored in directory: /tmp/pip-ephem-wheel-cache-19cuetii/wheels/a8/a6/0e/8b2a5cbf87d4a33551e65bc911bfdee49a4c117b0c5c834a47
Successfully built zipf
Installing collected packages: zipf
Successfully installed zipf-0.1.0

If we look in ~/anaconda3/envs/zipf/lib/python3.8/site-packages/, we can see the zipf package beside all the other locally-installed packages. If we try to use the package at this stage, though, Python will complain that some of the packages it depends on, such as pandas, are not installed. We could install these manually, but it is more reliable to automate this process by listing everything that our package depends on using the install_requires parameter in setup.py:

from setuptools import setup


setup(
    name='zipf',
    version='0.1.0',
    author='Amira Khan',
    packages=['zipf'],
    install_requires=[
        'matplotlib',
        'pandas',
        'scipy',
        'pyyaml',
        'pytest'])

We don’t have to list numpy explicitly because it will be installed as a dependency for pandas and scipy.

Versioning Dependencies

It is good practice to specify the versions of our dependencies and even better to specify version ranges. For example, if we have only tested our package on pandas version 1.0.1, we could put pandas==1.0.1 or pandas>=1.0.1 instead of just pandas in the list argument passed to the install_requires parameter.
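As an illustration of the syntax, a list with version constraints might look like the following. The specific bounds are assumptions chosen for the example, not limits that have been tested against our package.

```python
# Illustrative only: these version bounds are assumptions, not tested limits.
install_requires = [
    'pandas>=1.0.1,<2.0',   # from 1.0.1 up to, but not including, 2.0
    'scipy~=1.4',           # "compatible release": at least 1.4, below 2.0
    'pyyaml',               # no constraint: any version is accepted
]
```

Pinning an exact version with == makes installs reproducible but is more likely to conflict with other packages in the same environment, which is why ranges are usually the better choice for a library.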

We can now install our package and all its dependencies in a single command:

(zipf)$ pip install -e .
Obtaining file:///home/amira/zipf
Collecting matplotlib
  Downloading matplotlib-3.2.1-cp37-cp37m-manylinux1_x86_64.whl (12.4 MB)
     |████████████████████████████████| 12.4 MB 1.9 MB/s
Collecting pandas
  Downloading pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0 MB)
     |████████████████████████████████| 10.0 MB 16.1 MB/s
Collecting scipy
  Downloading scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1 MB)
     |████████████████████████████████| 26.1 MB 11.4 MB/s
Requirement already satisfied: pyyaml in /home/amira/anaconda3/envs/zipf/lib/python3.7/site-packages (from zipf==0.1) (5.3.1)
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1
  Using cached pyparsing-2.4.6-py2.py3-none-any.whl (67 kB)
Collecting python-dateutil>=2.1
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting kiwisolver>=1.0.1
  Downloading kiwisolver-1.2.0-cp37-cp37m-manylinux1_x86_64.whl (88 kB)
     |████████████████████████████████| 88 kB 8.6 MB/s
Collecting cycler>=0.10
  Using cached cycler-0.10.0-py2.py3-none-any.whl (6.5 kB)
Collecting numpy>=1.11
  Downloading numpy-1.18.2-cp37-cp37m-manylinux1_x86_64.whl (20.2 MB)
     |████████████████████████████████| 20.2 MB 16.3 MB/s
Requirement already satisfied: pytz>=2017.2 in /home/amira/anaconda3/envs/zipf/lib/python3.7/site-packages (from pandas->zipf==0.1) (2019.3)
Requirement already satisfied: six>=1.5 in /home/amira/anaconda3/envs/zipf/lib/python3.7/site-packages (from python-dateutil>=2.1->matplotlib->zipf==0.1) (1.14.0)
Installing collected packages: pyparsing, python-dateutil, kiwisolver, cycler, numpy, matplotlib, pandas, scipy, zipf
  Running setup.py develop for zipf
Successfully installed cycler-0.10.0 kiwisolver-1.2.0 matplotlib-3.2.1 numpy-1.18.2 pandas-1.0.3 pyparsing-2.4.6 python-dateutil-2.8.1 scipy-1.4.1 zipf

(The precise output of this command will change depending on which versions of our dependencies get installed.)

We can now import our package in a script or a Jupyter notebook just as we would any other package. For example, to use a function in utilities, we would write:

from zipf import utilities


utilities.collection_to_csv(...)

However, the useful command-line scripts that we used to count and plot word counts are no longer accessible directly from the terminal. Fortunately, the setuptools package allows us to install programs along with the package. These programs are placed beside those of other packages. We tell setuptools to do this by defining entry points:

from setuptools import setup


setup(
    name='zipf',
    version='0.1.0',
    author='Amira Khan',
    packages=['zipf'],
    install_requires=[
        'matplotlib',
        'pandas',
        'scipy',
        'pyyaml',
        'pytest'],
    entry_points={
        'console_scripts': [
            'countwords = zipf.countwords:main',
            'collate = zipf.collate:main',
            'plotcounts = zipf.plotcounts:main']})

The right side of the = operator is the location of a function, written as package.module:function; the left side is the name we want to use to call this function from the command line. In this case we want to call each module’s main, which as it stands requires an input argument args containing the command-line arguments given by the user (Section 4.2). For example, the relevant section of our countwords.py program is:

def main(args):
    """Run the command line program."""
    with args.infile as reader:
        word_counts = count_words(reader)
    utilities.collection_to_csv(word_counts, num=args.num)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
                        default='-', help='Input file name')
    parser.add_argument('-n', '--num', type=int, default=None,
                        help='Limit output to N most frequent words')
    args = parser.parse_args()
    main(args)

We can’t pass any arguments to main when we define entry points in our setup.py file, so we need to change this slightly:

def parse_command_line():
    """Parse the command line for input arguments."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
                        default='-', help='Input file name')
    parser.add_argument('-n', '--num', type=int, default=None,
                        help='Limit output to N most frequent words')
    args = parser.parse_args()
    return args

def main():
    """Run the command line program."""
    args = parse_command_line()
    with args.infile as reader:
        word_counts = count_words(reader)
    utilities.collection_to_csv(word_counts, num=args.num)

if __name__ == '__main__':
    main()
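To see why main can no longer take arguments, it may help to picture what a console-script entry point does when the command is run. The following is a simplified sketch of the idea, not the code that setuptools actually generates; the helper names resolve_target and run_target are hypothetical.

```python
import importlib
import sys


def resolve_target(spec):
    """Return the function named by a 'package.module:function' string."""
    module_name, function_name = spec.split(':')
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


def run_target(spec):
    """Call the function with no arguments and exit with its return value."""
    function = resolve_target(spec)
    sys.exit(function())


# Stand-in target from the standard library; for our package the string
# would be something like 'zipf.countwords:main'.
print(resolve_target('os.path:basename')('/tmp/dracula.txt'))
```

Because the function is called with no arguments, main has to collect the command-line arguments itself, which is what parse_command_line does.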

Once we have made the corresponding change in collate.py and plotcounts.py, we can re-install our package:

(zipf)$ pip install -e .
Defaulting to user installation because normal site-packages is not writeable
Obtaining file:///home/amira/zipf
Requirement already satisfied: matplotlib in /usr/lib/python3.8/site-packages (from zipf==0.1) (3.2.1)
Requirement already satisfied: pandas in /home/amira/.local/lib/python3.8/site-packages (from zipf==0.1) (1.0.3)
Requirement already satisfied: scipy in /usr/lib/python3.8/site-packages (from zipf==0.1) (1.4.1)
Requirement already satisfied: pyyaml in /usr/lib/python3.8/site-packages (from zipf==0.1) (5.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/lib/python3.8/site-packages (from matplotlib->zipf==0.1) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/lib/python3.8/site-packages (from matplotlib->zipf==0.1) (1.1.0)
Requirement already satisfied: numpy>=1.11 in /usr/lib/python3.8/site-packages (from matplotlib->zipf==0.1) (1.18.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/lib/python3.8/site-packages (from matplotlib->zipf==0.1) (2.4.6)
Requirement already satisfied: python-dateutil>=2.1 in /usr/lib/python3.8/site-packages (from matplotlib->zipf==0.1) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/lib/python3.8/site-packages (from pandas->zipf==0.1) (2019.3)
Requirement already satisfied: six in /usr/lib/python3.8/site-packages (from cycler>=0.10->matplotlib->zipf==0.1) (1.14.0)
Requirement already satisfied: setuptools in /usr/lib/python3.8/site-packages (from kiwisolver>=1.0.1->matplotlib->zipf==0.1) (46.1.3)
Installing collected packages: zipf
  Running setup.py develop for zipf
Successfully installed zipf

The output looks slightly different than the first run because pip could re-use some packages saved locally by the previous install rather than re-fetching them from online repositories. (If we hadn’t used the -e option to make the package immediately editable, we would have to uninstall it before reinstalling it during development.)

We can now use our commands directly from the terminal without writing the full path to the file and without prefixing it with python:

(zipf)$ countwords data/dracula.txt -n 5
the,8036
and,5896
i,4712
to,4540
of,3738

13.4 What Installation Does

Now that we have created and installed a Python package, let’s explore what actually happens during installation. The short version is that the contents of the package are copied into a directory that Python will search when it imports things. In theory we can “install” packages by manually copying source code into the right places, but it’s much more efficient and safer to use a tool specifically made for this purpose, such as conda or pip.

Most of the time, these tools copy packages into the Python installation’s site-packages directory, but this is not the only place Python searches. Just as the PATH environment variable in the shell contains a list of directories that the shell searches for programs it can execute (Section 3.6), the Python variable sys.path contains a list of the directories it searches. We can look at this list inside the interpreter:

import sys
sys.path
['',
'/home/amira/anaconda3/envs/zipf/lib/python37.zip',
'/home/amira/anaconda3/envs/zipf/lib/python3.7',
'/home/amira/anaconda3/envs/zipf/lib/python3.7/lib-dynload',
'/home/amira/.local/lib/python3.7/site-packages',
'/home/amira/anaconda3/envs/zipf/lib/python3.7/site-packages',
'/home/amira/zipf']

The empty string at the start of the list means “the current directory”. The rest are system paths for our Python installation, and will vary from computer to computer.
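If we want to know which of these directories a particular import would be satisfied from, we can ask the import system directly. This is a minimal sketch using the standard library’s importlib.util.find_spec; the function name where_is is hypothetical.

```python
import importlib.util


def where_is(module_name):
    """Return where a module or package would be loaded from, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        # Nothing with this name was found on sys.path.
        return None
    if spec.submodule_search_locations is not None:
        # A package: report the directory (or directories) holding its modules.
        return list(spec.submodule_search_locations)
    # A single module: report the file it would be loaded from.
    return spec.origin


print(where_is('json'))
print(where_is('zipf'))
```

The second call prints None unless the zipf package has been installed in the active environment.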

13.5 Distributing Packages

Now that our package can be installed, we should distribute it so that anyone can run pip install zipf and start using it. To do this, we need to use setuptools to create a source distribution (known as an sdist in Python packaging jargon):

python setup.py sdist
running sdist
running egg_info
writing zipf.egg-info/PKG-INFO
writing dependency_links to zipf.egg-info/dependency_links.txt
writing entry points to zipf.egg-info/entry_points.txt
writing requirements to zipf.egg-info/requires.txt
writing top-level names to zipf.egg-info/top_level.txt
package init file 'zipf/__init__.py' not found (or not a regular file)
reading manifest file 'zipf.egg-info/SOURCES.txt'
writing manifest file 'zipf.egg-info/SOURCES.txt'
running check
warning: check: missing required meta-data: url

warning: check: missing meta-data: if 'author' supplied, 'author_email' must be supplied too

creating zipf-0.1.0
creating zipf-0.1.0/zipf
creating zipf-0.1.0/zipf.egg-info
copying files to zipf-0.1.0...
copying README.md -> zipf-0.1.0
copying setup.py -> zipf-0.1.0
copying zipf/collate.py -> zipf-0.1.0/zipf
copying zipf/countwords.py -> zipf-0.1.0/zipf
copying zipf/plotcounts.py -> zipf-0.1.0/zipf
copying zipf/utilities.py -> zipf-0.1.0/zipf
copying zipf.egg-info/PKG-INFO -> zipf-0.1.0/zipf.egg-info
copying zipf.egg-info/SOURCES.txt -> zipf-0.1.0/zipf.egg-info
copying zipf.egg-info/dependency_links.txt -> zipf-0.1.0/zipf.egg-info
copying zipf.egg-info/entry_points.txt -> zipf-0.1.0/zipf.egg-info
copying zipf.egg-info/requires.txt -> zipf-0.1.0/zipf.egg-info
copying zipf.egg-info/top_level.txt -> zipf-0.1.0/zipf.egg-info
Writing zipf-0.1.0/setup.cfg
creating dist
Creating tar archive
removing 'zipf-0.1.0' (and everything under it)
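Before uploading anything, it can be worth checking what actually went into the archive. The sketch below lists the files in a source distribution using the standard tarfile module; list_sdist is a hypothetical helper, and the path assumes the archive was written to dist/zipf-0.1.0.tar.gz as the output above suggests.

```python
import os
import tarfile


def list_sdist(archive_path):
    """Return the sorted names of the files stored in a .tar.gz sdist."""
    with tarfile.open(archive_path, 'r:gz') as archive:
        return sorted(member.name for member in archive.getmembers()
                      if member.isfile())


# Assumed location of the archive; nothing is printed if it does not exist.
sdist_path = os.path.join('dist', 'zipf-0.1.0.tar.gz')
if os.path.exists(sdist_path):
    for name in list_sdist(sdist_path):
        print(name)
```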

These distribution files can now be distributed via PyPI, the standard repository for Python packages. Before doing that, though, we can put zipf on TestPyPI, which lets us test distribution of our package without having things appear in the main PyPI repository. We need an account to do this, but accounts are free to create.

The preferred tool for uploading packages to PyPI is called twine, which we can install with:

$ pip install twine

Following the Python Packaging User Guide, we can now upload our distributions from the dist/ folder using the --repository option to specify the TestPyPI repository:

$ twine upload --repository testpypi dist/*
Enter your username: amirakhan
Enter your password: *********
Uploading distributions to https://test.pypi.org/legacy/
Uploading zipf-0.1.0.tar.gz
100%|█████████████████| 5.59k/5.59k [00:01<00:00, 3.27kB/s]

View at:
https://test.pypi.org/project/zipf/0.1/

Figure 13.1: Our new project at https://test.pypi.org/project/zipf/0.1/

We have now uploaded our source distribution. We can test that this has worked by creating a virtual environment and installing our package from TestPyPI:

$ conda create -n zipf-test python
$ conda activate zipf-test
(zipf-test)$ pip install --index-url https://test.pypi.org/simple zipf
Looking in indexes: https://test.pypi.org/simple
Collecting zipf
  Downloading https://test-files.pythonhosted.org/packages/aa/fb/352af20b6f4bb13c3f06e7c2f1e1b7ec8a11e771533d5ad05407d48059a9/zipf-0.1.0.tar.gz (3.1 kB)
Requirement already satisfied: matplotlib in /usr/lib/python3.8/site-packages (from zipf) (3.2.1)
Requirement already satisfied: pandas in ./.local/lib/python3.8/site-packages (from zipf) (1.0.3)
Requirement already satisfied: scipy in /usr/lib/python3.8/site-packages (from zipf) (1.4.1)
Requirement already satisfied: pyyaml in /usr/lib/python3.8/site-packages (from zipf) (5.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/lib/python3.8/site-packages (from matplotlib->zipf) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/lib/python3.8/site-packages (from matplotlib->zipf) (1.1.0)
Requirement already satisfied: numpy>=1.11 in /usr/lib/python3.8/site-packages (from matplotlib->zipf) (1.18.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/lib/python3.8/site-packages (from matplotlib->zipf) (2.4.7)
Requirement already satisfied: python-dateutil>=2.1 in /usr/lib/python3.8/site-packages (from matplotlib->zipf) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/lib/python3.8/site-packages (from pandas->zipf) (2019.3)
Requirement already satisfied: six in /usr/lib/python3.8/site-packages (from cycler>=0.10->matplotlib->zipf) (1.14.0)
Requirement already satisfied: setuptools in /usr/lib/python3.8/site-packages (from kiwisolver>=1.0.1->matplotlib->zipf) (46.1.3)
Installing collected packages: zipf
    Running setup.py install for zipf ... done
Successfully installed zipf-0.1.0

Once again, pip takes advantage of the fact that some packages already exist on our system and doesn’t download them again. Once we are happy with how our package appears in TestPyPI (including its project page), we can go through the same process to put it on the main PyPI repository.

conda installation packages

Given the widespread use of conda for package management, it can be a good idea to post a conda installation package to Anaconda Cloud. The conda documentation has instructions for quickly building a conda package for a Python module that is already available on PyPI. See Appendix I for more information about conda and Anaconda Cloud.

13.6 Documenting Packages

An old proverb says, “Trust, but verify.” The equivalent in programming is, “Be clear, but document.” No matter how well software is written, it always embodies decisions that aren’t explicit in the final code or accommodates complications that aren’t going to be obvious to the next reader. Putting it another way, the best function names in the world aren’t going to answer the questions “Why does the software do this?” and “Why doesn’t it do this in a simpler way?”

It’s important to consider who documentation is for. There are three kinds of people in any domain: novices, competent practitioners, and experts (G. Wilson 2019a). A novice doesn’t yet have a mental model of the domain: they don’t know what the key terms are, how they relate, what the causes of their problems are, or how to tell whether a solution to their problem is appropriate or not.

Competent practitioners know enough to accomplish routine tasks with routine effort: they may need to check Stack Overflow every few minutes, but they know what to search for and what “done” looks like. Finally, experts have such a deep and broad understanding of the domain that they can solve routine problems at a glance and are able to handle the one-in-a-thousand cases that would baffle the merely competent.

Each of these three groups needs a different kind of documentation:

  • A novice needs a tutorial that introduces her to key ideas one by one and shows how they fit together.

  • A competent practitioner needs reference guides, cookbooks, and Q&A sites; these give her solutions close enough to what she needs that she can tweak them the rest of the way.

  • Experts need this material as well—nobody’s memory is perfect—but they may also paradoxically want tutorials. The difference between them and novices is that experts want tutorials on how things work and why they were designed that way.

The first thing to do when writing documentation is therefore to decide which of these needs we are trying to meet. Tutorials like this one should be long-form prose that contains code samples and diagrams. They should use authentic tasks to motivate ideas, i.e., show people things they actually want to do rather than printing the numbers from 1 to 10, and should include regular check-ins so that learners and instructors alike can tell if they’re making progress.

Tutorials help novices build a mental model, but competent practitioners and experts will be frustrated by their slow pace and low information density. They will want single-point solutions to specific problems like how to find cells in a spreadsheet that contain a certain string or how to configure the web server to load an access control module. They can make use of an alphabetical list of the functions in a library, but are much happier if they can search by keyword to find what they need; one of the signs that someone is no longer a novice is that they’re able to compose useful queries and tell if the results are on the right track or not.

False Beginners

A false beginner is someone who appears not to know anything, but who has enough prior experience in other domains to be able to piece things together much more quickly than a genuine novice. Someone who is proficient with MATLAB, for example, will speed through a tutorial on Python’s numerical libraries much more quickly than someone who has never programmed before. Creating documentation for false beginners is especially challenging; if resources permit, the best option is often a translation guide that shows them how they would do a task with the system they know well and then how to do the equivalent task with the new system.

In an ideal world, we would satisfy these needs with a chorus of explanations, some long and detailed, others short and to the point. In our world, though, time and resources are limited, so all but the most popular packages must make do with single explanations. The next sections of this chapter will therefore look at:

  • Writing good docstrings
  • Using the README to provide an overview of the package
  • Automatically generating a reference guide as a webpage
  • Hosting documentation online
  • Leveraging existing solutions to provide an FAQ

13.6.1 Writing Good Docstrings

If we are doing exploratory programming, a short docstring to remind ourselves of each function’s purpose is probably as much documentation as we need. (In fact, it’s probably better than what most people do.) That one- or two-liner should begin with an active verb and describe either how inputs are turned into outputs, or what side effects the function has; as we discuss below, if we need to describe both, we should probably rewrite our function.

An active verb is something like “extract”, “normalize”, or “find”. For example, these are all good one-line docstrings:

  • “Create a list of current ages from a list of birth dates.”
  • “Clip signals to lie in [0…1].”
  • “Reduce the red component of each pixel.”

We can tell our one-liners are useful if we can read them aloud in the order the functions are called in place of the function’s name and parameters.

Once we start writing code for other people (or our future selves) our docstrings should include:

  1. The name and purpose of every public class, function, and constant in our code.
  2. The name, purpose, and default value (if any) of every parameter to every function.
  3. Any side effects the function has.
  4. The type of value returned by every function.
  5. What exceptions those functions can raise and when.

The word “public” in the first rule is important. We don’t have to write full documentation for helper functions that are only used inside our package and aren’t meant to be called by users, but these should still have at least a comment explaining their purpose. We also don’t have to document unit testing functions: as discussed in Chapter 11, these should have long names that describe what they’re checking so that failure reports are easy to scan.
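As an example of a docstring that covers all five points, consider the function below. It is a hypothetical helper written for this illustration rather than code from the zipf package, and the section layout follows the NumPy docstring convention, which is one common choice among several.

```python
def top_words(counts, num=None):
    """Select the most frequent words from a table of word counts.

    Parameters
    ----------
    counts : dict
        Maps each word (str) to the number of times it occurs (int).
    num : int, optional
        Number of words to keep. The default, None, keeps every word.

    Returns
    -------
    list of tuple
        (word, count) pairs sorted from most to least frequent.

    Raises
    ------
    ValueError
        If num is negative.

    Notes
    -----
    This function has no side effects: counts is not modified.
    """
    if num is not None and num < 0:
        raise ValueError('num must be non-negative')
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked if num is None else ranked[:num]


print(top_words({'the': 3, 'and': 2, 'i': 1}, num=2))
```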

13.6.2 Including Package Level Documentation in the README

When a user first encounters a package, they usually want to know what the package is meant to do, how to install it, and how to use it. We can include these elements in the README.md file we started in Chapter 6. At the moment it reads as follows:

$ cat README.md
# Zipf's Law

These Zipf's Law scripts tally the occurrences of words in text files
and plot each word's rank versus its frequency.

## Contributors

- Amira Khan <amira@zipf.org>
- Sami Virtanen <sami@zipf.org>

This file is currently written in Markdown because GitHub recognizes files ending in .md and displays them nicely. We could continue to do this, but for a Python package we eventually want to create a website for our package documentation (Section 13.6.3). The most popular documentation generator in the Python community uses a format called reStructuredText (reST), so we will switch to that.

Like Markdown, reST is a plain-text markup format that can be rendered into HTML or PDF documents with complex indices and cross-links. GitHub recognizes files ending in .rst as reST files and displays them nicely, so our first task is to rename our existing file:

$ git mv README.md README.rst

We then make a few edits to the file: titles are underlined and overlined, section headings are underlined, and code blocks are set off with two colons (::) and indented:

The ``zipf`` package tallies the occurrences of words in text files
and plots each word's rank versus its frequency
together with a line for the theoretical distribution for Zipf's Law.

Motivation
----------

Zipf’s Law is often stated as an observational pattern seen in the
relationship between the frequency and rank of words in a text:

`"…the most frequent word will occur approximately twice as often
as the second most frequent word,
three times as often as the third most
frequent word, etc."`  
— `wikipedia <https://en.wikipedia.org/wiki/Zipf%27s_law>`_

Many books are available to download in plain text format
from sites such as `Project Gutenberg <https://www.gutenberg.org/>`_,
so we created this package to qualitatively explore how well different books align
with the word frequencies predicted by Zipf's Law.

Installation
------------

``pip install zipf``

Usage
-----

After installing this package,
the following three commands will be available from the command line:

- ``countwords`` for counting the occurrences of words in a text.
- ``collate`` for collating multiple word count files together.
- ``plotcounts`` for visualizing the word counts.

A typical usage scenario would include running the following from your terminal::

    countwords dracula.txt > dracula.csv
    countwords moby_dick.txt > moby_dick.csv
    collate dracula.csv moby_dick.csv > collated.csv
    plotcounts collated.csv --outfile zipf-drac-moby.jpg

Additional information on each command
can be found in its docstring and by appending the ``-h`` flag,
e.g. ``countwords -h``.

Contributors
------------

- Amira Khan <amira@zipf.org>
- Sami Virtanen <sami@zipf.org>
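reST requires the underline of a heading to be at least as long as the heading text, which is easy to get wrong when converting headings from Markdown by hand. The sketch below shows one way to check a file for that mistake. The helper short_underlines is hypothetical (it is not part of docutils or Sphinx) and it only recognizes a handful of common underline characters.

```python
import re

# A line made of one repeated punctuation character, e.g. "-----" or "=====".
UNDERLINE = re.compile(r'^([=\-~^"#*+`])\1+$')


def short_underlines(text):
    """Return (line_number, title) pairs whose underline is shorter than the title."""
    lines = text.splitlines()
    problems = []
    for i in range(1, len(lines)):
        title, underline = lines[i - 1], lines[i]
        is_heading = title.strip() and UNDERLINE.match(underline)
        if is_heading and len(underline) < len(title):
            # i is the 1-based line number of the title line.
            problems.append((i, title))
    return problems
```

Calling short_underlines on the contents of README.rst and getting an empty list back suggests that every heading is underlined correctly.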

13.6.3 Creating a Web Page for Documentation

Docstrings and READMEs are sufficient to describe most simple packages, and are infinitely better than no documentation at all. As our code base grows larger, though, we will want to complement these manually written sections with automatically generated content, references between functions, and search functionality.

The online documentation for most large Python packages is generated using a documentation generator called Sphinx, which is often used in combination with Read The Docs, a free service for hosting online documentation. Let’s install Sphinx and create a docs/ directory at the top of our repository:

$ pip install sphinx
$ mkdir docs
$ cd docs

We can then run Sphinx’s quickstart tool to create a minimal set of documentation that includes the README.rst file we just created and the docstrings we’ve written along the way. It asks us to specify the project’s name, the name of the project’s author, and a release; we can use the default settings for everything else.

$ sphinx-quickstart
Welcome to the Sphinx 3.1.1 quickstart utility.

Please enter values for the following settings (just press Enter to
accept a default value, if one is given in brackets).

Selected root path: .

You have two options for placing the build directory for Sphinx output.
Either, you use a directory "_build" within the root path, or you separate
"source" and "build" directories within the root path.
> Separate source and build directories (y/n) [n]: n

The project name will occur in several places in the built documentation.
> Project name: zipf
> Author name(s): Amira Khan
> Project release []: 0.1

If the documents are to be written in a language other than English,
you can select a language here by its language code. Sphinx will then
translate text that it generates into that language.

For a list of supported codes, see
https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-language.
> Project language [en]:

Creating file /Users/amira/zipf/docs/conf.py.
Creating file /Users/amira/zipf/docs/index.rst.
Creating file /Users/amira/zipf/docs/Makefile.
Creating file /Users/amira/zipf/docs/make.bat.

Finished: An initial directory structure has been created.

You should now populate your master file /Users/amira/zipf/docs/index.rst and create other documentation
source files. Use the Makefile to build the docs, like so:
   make builder
where "builder" is one of the supported builders, e.g. HTML, LaTeX or linkcheck.

quickstart creates a file called conf.py in the docs directory that configures Sphinx. We will make two changes to that file so that another tool called autodoc can find our modules (and their docstrings). The first change relates to the “path setup” section near the head of the file:

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.

Relative to the docs/ directory, our modules (i.e., countwords.py, utilities.py, etc.) are located in the ../zipf directory. We therefore need to uncomment the relevant lines of the path setup section in conf.py to tell Sphinx where those modules are:

import os
import sys

sys.path.insert(0, os.path.abspath('../zipf'))

We will also change the “general configuration” section to add autodoc to the list of Sphinx extensions we want:

extensions = ['sphinx.ext.autodoc']
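If the path added to sys.path is wrong, autodoc cannot import our modules and their pages come out empty. One way to catch this early is to check that Python can find each module once the directory is on the path. The sketch below is only an illustration: unimportable is a hypothetical helper, not something Sphinx provides, and it looks modules up without running them.

```python
import importlib
import importlib.util
import os
import sys


def unimportable(module_names, source_dir):
    """Return the names that cannot be found once source_dir is on sys.path."""
    sys.path.insert(0, os.path.abspath(source_dir))
    try:
        # Make sure recently created files are noticed by the import system.
        importlib.invalidate_caches()
        return [name for name in module_names
                if importlib.util.find_spec(name) is None]
    finally:
        # Undo the temporary change to sys.path.
        sys.path.pop(0)
```

Run from the docs/ directory, a call such as unimportable(['countwords', 'collate', 'plotcounts', 'utilities'], '../zipf') returning an empty list suggests that the path in conf.py is right.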

With those edits complete, we can now run sphinx-apidoc, which generates one .rst file per module in the docs/source directory. Each file contains autodoc directives that pull in the docstrings from our package when the documentation is built:

$ sphinx-apidoc -o source/ ../zipf
Creating file source/collate.rst.
Creating file source/countwords.rst.
Creating file source/plotcounts.rst.
Creating file source/test_zipfs.rst.
Creating file source/utilities.rst.
Creating file source/modules.rst.

We are finally ready to generate our webpage. The docs sub-directory contains a Makefile that was generated by sphinx-quickstart. If we run make html and open docs/_build/html/index.html in a web browser we’ll have a website landing page with minimal documentation (Figure 13.2). If we click on the Module Index link we can access the documentation for the individual modules (Figures 13.3 and 13.4).

Figure 13.2: The default website landing page

Figure 13.3: The module index

Figure 13.4: The countwords documentation

The landing page for the website is the perfect place for the content of our README file, so we can add the line .. include:: ../README.rst to the docs/index.rst file to insert it:

Welcome to Zipf's documentation!
==================================

.. include:: ../README.rst

.. toctree::
   :maxdepth: 2
   :caption: Contents:

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

If we re-run make html, we now get an updated set of web pages that re-uses our README as the introduction to the documentation (Figure 13.5).

Figure 13.5: The new landing page showing the contents of README.rst

Before going on, note that Sphinx is not included in the installation requirements in requirements.txt (Section 11.9). Sphinx isn’t needed to run, develop, or even test our package, but it is needed for building the documentation. To note this requirement, but without requiring everyone installing the package to install Sphinx, let’s create a requirements_docs.txt file that contains this line (where the version number is found by running pip freeze):

Sphinx>=1.7.4

Anyone wanting to build the documentation (including us, on another computer) now only needs to run pip install -r requirements_docs.txt.
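As an alternative to scanning the output of pip freeze, the installed version of a package can be looked up from Python itself. The sketch below assumes Python 3.8 or later, where importlib.metadata is part of the standard library; minimum_requirement is a hypothetical helper name.

```python
from importlib import metadata


def minimum_requirement(package):
    """Return 'package>=installed_version', or None if package is not installed."""
    try:
        return f"{package}>={metadata.version(package)}"
    except metadata.PackageNotFoundError:
        return None
```

In an environment where Sphinx is installed, print(minimum_requirement('sphinx')) produces a line that can be pasted into requirements_docs.txt.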

13.6.4 Hosting Documentation Online

We can host the documentation for our project in several ways. As mentioned above, a very common option for Python projects is Read The Docs, a community-supported site that hosts software documentation free of charge.

Just as continuous integration systems automatically re-test things (Section 11.9), Read The Docs integrates with GitHub so that documentation is automatically re-built every time updates are pushed to the project’s GitHub repository. If we register for Read The Docs with our GitHub account, we can import a project from our GitHub repository. Read The Docs will then build the documentation using make html and host the resulting files.

For this to work, all of the source files need to be checked into your GitHub repository: in our case, this means docs/source/*.rst, docs/Makefile, docs/conf.py, and docs/index.rst. We also need to create and save a Read the Docs configuration file in the root directory of our zipf package:

$ pwd
/Users/amira/zipf
$ cat .readthedocs.yml
# .readthedocs.yml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/conf.py

# Optionally set the version of Python and requirements required to build your docs
python:
  version: 3.7
  install:
    - requirements: requirements.txt

The configuration file uses the now-familiar YAML format (Section 9.1 and Appendix L) to specify the location of the Sphinx configuration script (docs/conf.py) and the dependencies for our package (requirements.txt). If we named our project zipf-docs, our documentation is now available at https://zipf-docs.readthedocs.io/en/latest/.

13.6.5 Creating a FAQ

As projects grow, documentation within functions alone may be insufficient for users to apply code to their own problems. One strategy to help other people understand a project is a FAQ: a list of frequently-asked questions and corresponding answers. A good FAQ uses the terms and concepts that people bring to the software rather than the vocabulary of its authors; putting it another way, the questions should be things that people might search for online, and the answers should give them enough information to solve their problem.

Creating and maintaining a FAQ is a lot of work, and unless the community is large and active, a lot of that effort may turn out to be wasted, because it’s hard for the authors or maintainers of a piece of software to anticipate what newcomers will be mystified by. A better approach is to leverage sites like Stack Overflow, which is where most programmers are going to look for answers anyway:

  1. Post every question that someone actually asks us, whether it’s online, by email, or in person. Be sure to include the name of the software package in the question so that it’s findable.

  2. Answer the question, making sure to mention which version of the software we’re talking about (so that people can easily spot and discard stale answers in the future).

Stack Overflow’s guide to asking a good question has been refined over many years, and is a good guide for any project:

  • Write the most specific title we can. “Why does division sometimes give a different result in Python 2.7 and Python 3.5?” is much better than “Help! Math in Python!!”
  • Give context before giving sample code. A few sentences to explain what we are trying to do and why will help people determine if their question is a close match to ours or not.
  • Provide a minimal reprex. Section 7.6 explains the value of a reproducible example, and why reprexes should be as short as possible. Readers will have a much easier time figuring out if this question and its answers are for them if they can see and understand a few lines of code.
  • Tag, tag, tag. Keywords make everything more findable, from scientific papers to left-handed musical instruments.
  • Use “I” and question words (how/what/when/where/why). Writing this way forces us to think more clearly about what someone might actually be thinking when they need help.
  • Keep each item short. The “minimal manual” approach to instructional design (Carroll 2014) breaks everything down into single-page steps, with half of that page devoted to troubleshooting. This may feel trivializing to the person doing the writing, but is often as much as a person searching and reading can handle. It also helps writers realize just how much implicit knowledge they are assuming.
  • Allow for a chorus of explanations. As discussed earlier, users are all different from one another, and are therefore best served by a chorus of explanations. Do not be afraid of providing multiple explanations to a single question that suggest different approaches or are written for different prior levels of understanding.

13.7 Software Journals

As a final step to releasing our new package, we might want to give it a DOI so that it can be cited by researchers. As we saw in Section 12.2.3, GitHub integrates with Zenodo for precisely this purpose.

While creating a DOI using a site like Zenodo is often the end of the software publishing process, there is the option of publishing a journal paper to describe the software in detail. Some research disciplines have journals devoted to describing particular types of software (e.g., Geoscientific Model Development), and there are also a number of generic software journals such as the Journal of Open Research Software and the Journal of Open Source Software. Packages submitted to these journals are typically assessed against a range of criteria relating to how easy the software is to install and how well it is documented, so the peer review process can be a great way to get critical feedback from people who have seen many research software packages come and go over the years.

Once you have obtained a DOI and possibly published with a software journal, the last step is to tell users how to cite your new software package. This is traditionally done by adding a CITATION file to the associated GitHub repository (alongside README, LICENSE, CONDUCT and similar files discussed in Section 1.4.1), containing a plain text citation that can be copied and pasted into email as well as entries formatted for various bibliographic systems like BibTeX.

$ cat CITATION.md
# Citation

If you use the Zipf package for work/research presented in a publication,
we ask that you please cite:

Khan, A., and Virtanen, S., 2020. Zipf: A Python package for word count analysis.
*Journal of Important Software*, 5(51), 2317, https://doi.org/10.21105/jois.02317

### BibTeX entry

    @article{Khan2020,
        title={Zipf: A Python package for word count analysis.},
        author={Khan, Amira and Virtanen, Sami},
        journal={Journal of Important Software},
        volume={5},
        number={51},
        eid={2317},
        year={2020},
        doi={10.21105/jois.02317},
        url={https://doi.org/10.21105/jois.02317},
    }

13.8 Summary

Thousands of people have helped write the software that our Zipf’s Law example relies on, but their work is only useful because they packaged it and documented how to use it. Doing this is increasingly recognized as a credit-worthy activity by universities, government labs, and other organizations, particularly for research software engineers. It is also deeply satisfying to make strangers’ lives better, if only in small ways.

13.9 Exercises

13.9.1 Fixing warnings

When we ran python setup.py sdist in Section 13.5, setup.py warned us about some missing metadata. Review its output and then fix the problem.

13.9.2 Separating requirements

As well as requirements_docs.txt, developers often create a requirements_dev.txt file to list packages that are not needed by the package’s users, but are required for its development and testing. Pull pytest out of requirements.txt and put it in a new requirements_dev.txt file, using pip freeze to find the minimum required version.

13.9.3 Software review

The Journal of Open Source Software has a checklist that reviewers must follow when assessing a submitted software paper. Run through the checklist (skipping the criteria related to the software paper) and see how the Zipf’s Law package would rate on each criterion.

13.9.4 Data packages

R provides many data packages that can be loaded like any other library but provide a dataset instead of (or as well as) runnable code. Create a Python package called collated that provides a single function getData, so that:

import collated
counts = collated.getData()

assigns a dictionary of word frequencies to the variable counts using the data from the Zipf project’s collated.csv. Do not copy the data into a Python file; instead, have getData read the data from a copy of the CSV file distributed as part of the package. (Hint: the special variable collated.__path__ is a list whose first entry is the directory containing the installed collated package.)

13.9.5 Staying up to date

  1. Run pip list to get a list of the Python packages you have installed. How many are there?

  2. Run pip list -o to get a list of packages that are out of date. (This may take a few seconds.) How many are there, and how can you update them?

13.10 Key Points

  • Use setuptools to build and distribute Python packages.
  • Create a directory named mypackage containing a setup.py script as well as a subdirectory also called mypackage containing the package’s source files.
  • Use semantic versioning for software releases.
  • Use a virtual environment to test how your package installs without disrupting your main Python installation.
  • Use pip to install Python packages.
  • The default repository for Python packages is PyPI.
  • Use TestPyPI to test the distribution of your package.
  • Decide whether your documentation is for novices, competent practitioners, and/or experts.
  • Use docstrings to document modules and functions.
  • Use a README file for package-level documentation.
  • Use Sphinx to generate documentation for a package.
  • Use Read The Docs to host package documentation online.
  • Create a DOI for your package using GitHub’s Zenodo integration.
  • Publish the details of your package in a software journal so that others can cite it.