Chapter 14 Creating Packages with Python

Another response of the wizards, when faced with a new and unique situation, was to look through their libraries to see if it had ever happened before. This was…a good survival trait. It meant that in times of danger you spent the day sitting very quietly in a building with very thick walls.

— Terry Pratchett

The more software we write, the more we think of a programming language as a way to build and combine libraries. Every widely-used language now has an online repository from which people can download and install those libraries. This lesson shows you how to use Python’s tools to create and share libraries of your own.

We will continue with our Zipf’s Law project, which should include the following files:

zipf/
├── .gitignore
├── CONDUCT.md
├── CONTRIBUTING.md
├── KhanVirtanen2020.md
├── LICENSE.md
├── Makefile
├── README.md
├── environment.yml
├── requirements.txt
├── bin
│   ├── book_summary.sh
│   ├── collate.py
│   ├── countwords.py
│   ├── plotcounts.py
│   ├── plotparams.yml
│   ├── script_template.py
│   ├── test_zipfs.py
│   └── utilities.py
├── data
│   ├── README.md
│   ├── dracula.txt
│   └── ...
├── results
│   ├── dracula.csv
│   ├── dracula.png
│   └── ...
└── test_data
    ├── random_words.txt
    └── risk.txt

14.1 Creating a Python Package

A package consists of one or more Python source files in a specific directory structure combined with installation instructions for the computer. Python packages can come from various sources: some are distributed with Python itself, but anyone can create one, and there are thousands that can be downloaded and installed from online repositories.

Terminology

People sometimes refer to packages as modules. Strictly speaking, a module is a single source file, while a package is a directory structure that contains one or more modules.

A generic package folder hierarchy looks like this:

pkg_name
├── pkg_name
│   ├── module1.py
│   └── module2.py
├── README.md
└── setup.py

The top-level directory is named after the package. It contains a directory that is also named after the package, and that contains the package’s source files. It is initially a little confusing to have two directories with the same name, but most Python projects follow this convention because it makes it easier to set up the project for installation. We can get this structure in our Zipf’s Law project by renaming zipf/bin to zipf.

__init__.py

Python packages often contain a file with a special name: __init__.py (two underscores before and after init). Just as importing a module file executes the code in the module, importing a package executes the code in __init__.py. Packages had to have this file before Python 3.3, even if it was empty, but since Python 3.3 it is only needed if we want to run some code as the package is being imported.

To make the Zipf’s Law project work as a Python package, we only need to make one important change to the code itself: changing the syntax for how we import our own modules. Currently, both collate.py and countwords.py contains this line,

import utilities

while test_zipfs.py contains:

import plotcounts
import countwords

These are called implicit relative imports, because it is not clear whether we mean “import a Python package called utilities” or “import a file in our local directory called utilities.py” (which is what we want). To remove this ambiguity we need to be explicit and write,

from zipf import utilities

and

from zipf import plotcounts
from zipf import countwords

These are absolute imports since we are specifying the full location of utilities, plotcounts and countwords inside the zipf package. Absolute imports are the preferred way for parts of a package to import other parts, but we can also use explicit relative imports, which require a little less typing and can sometimes make it easier to restructure very large projects:

from . import utilities

Here, the . signals that utilities exists in the current directory.

Python has several ways to build an installable package. We will show how to use setuptools, which is the lowest common denominator and will allow everyone, regardless of what Python distribution they have, to use our package. To use setuptools, we must create a file called setup.py in the directory above the root directory of the package. (This is why we require the two-level directory structure described earlier.) setup.py must have exactly that name, and must contain lines like these:

from setuptools import setup


setup(
    name='zipf',
    version='0.1.0',
    author='Amira Khan',
    packages=['zipf'])

The name and author parameters are self-explanatory. Most software projects use semantic versioning for software releases. A version number consists of three integers X.Y.Z, where X is the major version, Y is the minor version, and Z is the patch version. Major version zero (0.Y.Z) is for initial development, so we have started with 0.1.0. The first stable public release would be version 1.0.0, and in general, the version number is incremented as follows:

  • Increment major every time there’s an incompatible externally-visible change
  • Increment minor when adding new functionality in a backwards-compatible manner (i.e. without breaking any existing code)
  • Increment patch for backwards-compatible bug fixes that don’t add any new features

Finally, we specify the name of the directory containing the code to be packaged with the packages parameter. This is straightforward in our case because we only have a single package directory. For more complex projects, the find_packages function from setuptools can automatically find all packages by recursively searching the current directory.

14.2 Virtual Environments

We can add additional information to our package later, but this is enough to be able to build it for testing purposes. Before we do that, though, we should create a virtual environment to test how our package installs without breaking anything in our main Python installation.

A virtual environment is a layer on top of an existing Python installation. Whenever Python needs to find a package, it looks in the virtual environment before checking the main Python installation. This gives us a place to install packages that only some projects need without affecting other projects.

Virtual environments also help with package development:

  • We want to be able to easily test install and uninstall our package, without affecting the entire Python environment.
  • We want to answer problems people have with our package with something more helpful than “I don’t know, it works for me.” By installing and running our package in a completely empty environment, we can ensure that we’re not accidentally relying on other packages being installed.

We can manage virtual environments using conda (Appendix E). To create a new virtual environment called zipf we run conda create, specifying the environment’s name with the -n or --name flag and listing python as the base to build on:

$ conda create -n zipf python
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/amira/anaconda3/envs/zipf

...

Proceed ([y]/n)? y

...

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate zipf
#
# To deactivate an active environment, use
#
#     $ conda deactivate

conda creates the directory ~/anaconda3/envs/zipf, which contains the subdirectories needed for a minimal Python installation, such as bin and lib. It also creates ~/anaconda3/envs/zipf/bin/python, which checks for packages in these directories before checking the main installation.

We can switch to the zipf environment by running:

$ conda activate zipf

Once we have done this, the python command runs the interpreter in zipf/bin:

(zipf)$ which python
/home/amira/anaconda3/envs/zipf/bin/python

Notice that every shell command displays (zipf) when that virtual environment is active. Between Git branches and virtual environments, it can be very easy to lose track of what exactly we are working on and with. Prompts like this can make it a little less confusing; using virtual environment names that match the names of your projects (and branches, if you’re testing different environments on different branches) quickly becomes essential.

We can now install packages safely. Everything we install will go into zipf virtual environment without affecting the underlying Python installation. When we are done, we can switch back to the default environment using conda deactivate:

(zipf)$ conda deactivate
$ which python
/usr/bin/python

14.3 Installing a Development Package

Let’s install our package inside this virtual environment. First we re-activate it:

$ conda activate zipf

Next, we go into the upper zipf directory that contains our setup.py file and install our package using pip install -e .. The -e option indicates that we want to to install the package in “editable” mode, which means that any changes we make in the package code are directly available to use without having to reinstall the package; the . means “install from the current directory”:

(zipf)$ cd zipf
(zipf)$ pip install -e .
Processing /home/amira/proj/py-rse/zipf/zipf
Building wheels for collected packages: zipf
  Building wheel for zipf (setup.py) ... done
  Created wheel for zipf: filename=zipf-0.1.0-py3-none-any.whl 
    size=4574 sha256=b7d645f1d07775
  Stored in directory: /tmp/pip-ephem-wheel/wheels/a8/a6/0e/8b2a5cb
Successfully built zipf
Installing collected packages: zipf
Successfully installed zipf-0.1.0

If we look in ~/anaconda3/envs/zipf/lib/python3.8/site-packages/, we can see the zipf package beside all the other locally-installed packages. If we try to use the package at this stage, though, Python will complain that some of the packages it depends on, such as pandas, are not installed. We could install these manually, but it is more reliable to automate this process by listing everything that our package depends on using the install_requires parameter in setup.py:

from setuptools import setup


setup(
    name='zipf',
    version='0.1',
    author='Amira Khan',
    packages=['zipf'],
    install_requires=[
        'matplotlib',
        'pandas',
        'scipy',
        'pyyaml',
        'pytest'])

We don’t have to list numpy explicitly because it will be installed as a dependency for pandas and scipy.

Versioning Dependencies

It is good practice to specify the versions of our dependencies and even better to specify version ranges. For example, if we have only tested our package on pandas version 1.0.1, we could put pandas==1.0.1 or pandas>=1.0.1 instead of just pandas in the list argument passed to the install_requires parameter.

We can now install our package and all its dependencies in a single command:

(zipf)$ pip install -e .
Obtaining file:///home/amira/zipf
Collecting matplotlib
  Downloading matplotlib-3.2.1-cp37-cp37m-manylinux1_x86_64.whl (12.4 MB)
     |████████████████████████████████| 12.4 MB 1.9 MB/s
Collecting pandas
  Downloading pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0 MB)
     |████████████████████████████████| 10.0 MB 16.1 MB/s
Collecting scipy
  Downloading scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1 MB)
     |████████████████████████████████| 26.1 MB 11.4 MB/s
Requirement already satisfied: pyyaml in /home/amira/anaconda3/envs/zipf/lib/python3.7/site-packages (from zipf==0.1) (5.3.1)
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1
  Using cached pyparsing-2.4.6-py2.py3-none-any.whl (67 kB)
Collecting python-dateutil>=2.1
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting kiwisolver>=1.0.1
  Downloading kiwisolver-1.2.0-cp37-cp37m-manylinux1_x86_64.whl (88 kB)
     |████████████████████████████████| 88 kB 8.6 MB/s
Collecting cycler>=0.10
  Using cached cycler-0.10.0-py2.py3-none-any.whl (6.5 kB)
Collecting numpy>=1.11
  Downloading numpy-1.18.2-cp37-cp37m-manylinux1_x86_64.whl (20.2 MB)
     |████████████████████████████████| 20.2 MB 16.3 MB/s
Requirement already satisfied: pytz>=2017.2 in 
  /home/amira/anaconda3/envs/zipf/lib/python3.7/site-packages 
  (from pandas->zipf==0.1) (2019.3)
Requirement already satisfied: six>=1.5 in
  /home/amira/anaconda3/envs/zipf/lib/python3.7/site-packages 
  (from python-dateutil>=2.1->matplotlib->zipf==0.1) (1.14.0)
Installing collected packages: pyparsing, python-dateutil, 
  kiwisolver, cycler, numpy, matplotlib, pandas, scipy, zipf
  Running setup.py develop for zipf
Successfully installed cycler-0.10.0 kiwisolver-1.2.0 
  matplotlib-3.2.1 numpy-1.18.2 pandas-1.0.3 pyparsing-2.4.6 
  python-dateutil-2.8.1 scipy-1.4.1 zipf

(The precise output of this command will change depending on which versions of our dependencies get installed.)

We can now import our package in a script or a Jupyter notebook just as we would any other package. For example, to use the function in utilities, we would write:

from zipf import utilities


utilities.collection_to_csv(...)

However, the useful command-line scripts that we used to count and plot word counts are no longer accessible directly from the terminal. Fortunately, the setuptools package allows us to install programs along with the package. These programs are placed beside those of other packages. We tell setuptools to do this by defining entry points:

from setuptools import setup


setup(
    name='zipf',
    version='0.1',
    author='Amira Khan',
    packages=['zipf'],
    install_requires=[
        'matplotlib',
        'pandas',
        'scipy',
        'pyyaml',
        'pytest'],
    entry_points={
        'console_scripts': [
            'countwords = zipf.countwords:main',
            'collate = zipf.collate:main',
            'plotcounts = zipf.plotcounts:main']})

The right side of the = operator is the location of a function, written as package.module:function; the left side is the name we want to use to call this function from the command line. In this case we want to call each module’s main, which as it stands requires an input argument args containing the command-line arguments given by the user (Section 5.2). For example, the relevant section of our countwords.py program is:

def main(args):
    """Run the command line program."""
    with args.infile as reader:
        word_counts = count_words(reader)
    utilities.collection_to_csv(word_counts, num=args.num)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'), 
                        nargs='?', default='-', 
                        help='Input file name')
    parser.add_argument('-n', '--num', type=int, default=None,
                        help='Limit output to N most frequent words')
    args = parser.parse_args()
    main(args)

We can’t pass any arguments to main when we define entry points in our setup.py file, so we need to change this slightly:

def parse_command_line():
    """Parse the command line for input arguments."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'), 
                        nargs='?', default='-', 
                        help='Input file name')
    parser.add_argument('-n', '--num', type=int, default=None,
                        help='Limit output to N most frequent words')
    args = parser.parse_args()
    return args

def main():
    """Run the command line program."""
    args = parse_command_line()
    with args.infile as reader:
        word_counts = count_words(reader)
    utilities.collection_to_csv(word_counts, num=args.num)

if __name__ == '__main__':
    main()

Once we have made the corresponding change in collate.py and plotcounts.py, we can re-install our package:

(zipf)$ pip install -e .
Defaulting to user installation because normal site-packages is
  not writeable
Obtaining file:///home/amira/zipf
Requirement already satisfied: matplotlib in 
  /usr/lib/python3.8/site-packages (from zipf==0.1) (3.2.1)
Requirement already satisfied: pandas in 
  /home/amira/.local/lib/python3.8/site-packages 
  (from zipf==0.1) (1.0.3)
Requirement already satisfied: scipy in 
  /usr/lib/python3.8/site-packages (from zipf==0.1) (1.4.1)
Requirement already satisfied: pyyaml in 
  /usr/lib/python3.8/site-packages (from zipf==0.1) (5.3.1)
Requirement already satisfied: cycler>=0.10 in 
  /usr/lib/python3.8/site-packages 
  (from matplotlib->zipf==0.1) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in 
  /usr/lib/python3.8/site-packages
  (from matplotlib->zipf==0.1) (1.1.0)
Requirement already satisfied: numpy>=1.11 in 
  /usr/lib/python3.8/site-packages
  (from matplotlib->zipf==0.1) (1.18.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in
  /usr/lib/python3.8/site-packages 
  (from matplotlib->zipf==0.1) (2.4.6)
Requirement already satisfied: python-dateutil>=2.1 in 
  /usr/lib/python3.8/site-packages 
  (from matplotlib->zipf==0.1) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in 
  /usr/lib/python3.8/site-packages 
  (from pandas->zipf==0.1) (2019.3)
Requirement already satisfied: six in 
  /usr/lib/python3.8/site-packages 
  (from cycler>=0.10->matplotlib->zipf==0.1) (1.14.0)
Requirement already satisfied: setuptools in 
  /usr/lib/python3.8/site-packages 
  (from kiwisolver>=1.0.1->matplotlib->zipf==0.1) (46.1.3)
Installing collected packages: zipf
  Running setup.py develop for zipf
Successfully installed zipf

The output looks slightly different than the first run because pip could re-use some packages saved locally by the previous install rather than re-fetching them from online repositories. (If we hadn’t used the -e option to make the package immediately editable, we would have to uninstall it before reinstalling it during development.)

We can now use our commands directly from the terminal without writing the full path to the file and without prefixing it with python.

countwords data/dracula.txt -n 5
the,8036
and,5896
i,4712
to,4540
of,3738

14.4 What Installation Does

Now that we have created and installed a Python package, let’s explore what actually happens during installation. The short version is that the contents of the package are copied into a directory that Python will search when it imports things. In theory we can “install” packages by manually copying source code into the right places, but it’s much more efficient and safer to use a tool specifically made for this purpose, such as conda or pip.

Most of the time, these tools copy packages into the Python installation’s site-packages directory, but this is not the only place Python searches. Just as the PATH environment in the shell contains a list of directories that the shell searches for programs it can execute (Section 4.6), the Python variable sys.path contains a list of the directories it searches. We can look at this list inside the interpreter:

import sys
sys.path
['',
'/home/amira/anaconda3/envs/zipf/lib/python37.zip',
'/home/amira/anaconda3/envs/zipf/lib/python3.7',
'/home/amira/anaconda3/envs/zipf/lib/python3.7/lib-dynload',
'/home/amira/.local/lib/python3.7/site-packages',
'/home/amira/anaconda3/envs/zipf/lib/python3.7/site-packages',
'/home/amira/zipf']

The empty string at the start of the list means “the current directory.” The rest are system paths for our Python installation, and will vary from computer to computer.

14.5 Distributing Packages

Now that our package can be installed, we should distribute it so that anyone can run pip install zipf and start use it. To do this, we need to use setuptools to create a source distribution (known as an sdist in Python packaging jargon):

python setup.py sdist
running sdist
running egg_info
writing zipf.egg-info/PKG-INFO
writing dependency_links to zipf.egg-info/dependency_links.txt
writing entry points to zipf.egg-info/entry_points.txt
writing requirements to zipf.egg-info/requires.txt
writing top-level names to zipf.egg-info/top_level.txt
package init file 'zipf/__init__.py' not found (or not a regular file)
reading manifest file 'zipf.egg-info/SOURCES.txt'
writing manifest file 'zipf.egg-info/SOURCES.txt'
running check
warning: check: missing required meta-data: url

warning: check: missing meta-data: if 'author' supplied,
  'author_email' must be supplied too

creating zipf-0.1.0
creating zipf-0.1.0/zipf
creating zipf-0.1.0/zipf.egg-info
copying files to zipf-0.1.0...
copying README.md -> zipf-0.1.0
copying setup.py -> zipf-0.1.0
copying zipf/collate.py -> zipf-0.1.0/zipf
copying zipf/countwords.py -> zipf-0.1.0/zipf
copying zipf/plotcounts.py -> zipf-0.1.0/zipf
copying zipf/utilities.py -> zipf-0.1.0/zipf
copying zipf.egg-info/PKG-INFO -> zipf-0.1.0/zipf.egg-info
copying zipf.egg-info/SOURCES.txt -> zipf-0.1.0/zipf.egg-info
copying zipf.egg-info/dependency_links.txt -> zipf-0.1.0/zipf.egg-info
copying zipf.egg-info/entry_points.txt -> zipf-0.1.0/zipf.egg-info
copying zipf.egg-info/requires.txt -> zipf-0.1.0/zipf.egg-info
copying zipf.egg-info/top_level.txt -> zipf-0.1.0/zipf.egg-info
Writing zipf-0.1.0/setup.cfg
creating dist
Creating tar archive
removing 'zipf-0.1.0' (and everything under it)

These distribution files can now be distributed via PyPI, the standard repository for Python packages. Before doing that, though, we can put zipf on TestPyPI, which lets us test distribution of our package without having things appear in the main PyPI repository. We must have an account, but they are free to create.

The preferred tool for uploading packages to PyPI is called twine, which we can install with:

$ pip install twine

Following the Python Packaging User Guide, we can now upload our distributions from the dist/ folder using the --repository option to specify the TestPyPI repository:

$ twine upload --repository testpypi dist/*
Enter your username: amirakhan
Enter your password: *********
Uploading distributions to https://test.pypi.org/legacy/
Uploading zipf-0.1.0.tar.gz
100%|█████████████████| 5.59k/5.59k [00:01<00:00, 3.27kB/s]

View at:
https://test.pypi.org/project/zipf/0.1/
Our new project on TestPyPI.

Figure 14.1: Our new project on TestPyPI.

We have now uploaded both types of distribution, allowing people to use the wheel distribution if their system supports it or the source distribution if it does not. We can test that this has worked by creating a virtual environment and installing our package from TestPyPI:

$ conda create -n zipf-test
$ conda activate zipf-test
(zipf-test)$ pip install --index-url https://test.pypi.org/simple zipf
Looking in indexes: https://test.pypi.org/simple
Collecting zipf
  Downloading https://test-files.pythonhosted.org/packages/aa/fb/352af20b6f/zipf-0.1.0.tar.gz (3.1 kB)
Requirement already satisfied: matplotlib in /usr/lib/python3.8/site-packages (from zipf) (3.2.1)
Requirement already satisfied: pandas in ./.local/lib/python3.8/site-packages (from zipf) (1.0.3)
Requirement already satisfied: scipy in /usr/lib/python3.8/site-packages (from zipf) (1.4.1)
Requirement already satisfied: pyyaml in /usr/lib/python3.8/site-packages (from zipf) (5.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/lib/python3.8/site-packages (from matplotlib->zipf) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/lib/python3.8/site-packages (from matplotlib->zipf) (1.1.0)
Requirement already satisfied: numpy>=1.11 in /usr/lib/python3.8/site-packages (from matplotlib->zipf) (1.18.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/lib/python3.8/site-packages (from matplotlib->zipf) (2.4.7)
Requirement already satisfied: python-dateutil>=2.1 in /usr/lib/python3.8/site-packages (from matplotlib->zipf) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/lib/python3.8/site-packages (from pandas->zipf) (2019.3)
Requirement already satisfied: six in /usr/lib/python3.8/site-packages (from cycler>=0.10->matplotlib->zipf) (1.14.0)
Requirement already satisfied: setuptools in /usr/lib/python3.8/site-packages (from kiwisolver>=1.0.1->matplotlib->zipf) (46.1.3)
Installing collected packages: zipf
    Running setup.py install for zipf ... done
Successfully installed zipf-0.1.0

Once again, pip takes advantage of the fact that some packages already existing on our system and doesn’t download them again. Once we are happy with how our package appears in TestPyPI (including its project page), we can go through the same process to put it on the main PyPI repository.

conda installation packages

Given the widespread use of conda for package management, it can be a good idea to post a conda installation package to Anaconda Cloud. The conda documentation has instructions for quickly building a conda package for a Python module that is already available on PyPI. See Appendix E for more information about conda and Anaconda Cloud.

14.6 Documenting Packages

Now that our package has been distributed for others to use, we need to think about whether we have provided sufficient documentation. Docstrings (Section 5.3) and READMEs are sufficient to describe most simple packages, but as our code base grows larger we will want to complement these manually written sections with automatically generated content, references between functions, and search functionality. For most large Python packages, such documentation is generated using a documentation generator called Sphinx, which is often used in combination with a free online hosting service called Read The Docs. In this section we will update our README file with some basic package-level documentation, before using Sphinx and Read The Docs to host that information online along with more detailed function-level documentation. For further advice on writing documentation for larger and more complex packages, see Appendix I.

14.6.1 Including package-level documentation in the README

When a user first encounters a package, they usually want to know what the package is meant to do, instructions on how to install it, and examples of how to use it. We can include these elements in the README.md file we started in Chapter 7. At the moment it reads as follows:

$ cat README.md
# Zipf's Law

These Zipf's Law scripts tally the occurrences of words in text
files and plot each word's rank versus its frequency.

## Contributors

- Amira Khan <amira@zipf.org>
- Sami Virtanen <sami@zipf.org>

This file is currently written in Markdown, but Sphinx uses a format called reStructuredText (reST), so we will switch to that. Like Markdown, reST is a plain-text markup format that can be rendered into HTML or PDF documents with complex indices and cross-links. GitHub recognizes files ending in .rst as reST files and displays them nicely, so our first task is to rename our existing file:

$ git mv README.md README.rst

We then make a few edits to the file: titles are underlined and overlined, section headings are underlined, and code blocks are set off with two colons (::) and indented:

The ``zipf`` package tallies the occurrences of words in text files
and plots each word's rank versus its frequency
together with a line for the theoretical distribution for Zipf's Law.

Motivation
----------

Zipf's Law is often stated as an observational pattern seen in the
relationship between the frequency and rank of words in a text:

`"…the most frequent word will occur approximately twice as often
as the second most frequent word,
three times as often as the third most
frequent word, etc."`  
— `wikipedia <https://en.wikipedia.org/wiki/Zipf%27s_law>`_

Many books are available to download in plain text format
from sites such as `Project Gutenberg <https://www.gutenberg.org/>`_,
so we created this package to qualitatively explore how well
different books align with the word frequencies predicted by
Zipf's Law.

Installation
------------

``pip install zipf``

Usage
-----

After installing this package, the following three commands will
be available from the command line

- ``countwords`` for counting the occurrences of words in a text.
- ``collate`` for collating multiple word count files together.
- ``plotcounts`` for visualizing the word counts.

A typical usage scenario would include running the following from
your terminal::

    countwords dracula.txt > dracula.csv
    countwords moby_dick.txt > moby_dick.csv
    collate dracula.csv moby_dick.csv > collated.csv
    plotcounts collated.csv --outfile zipf-drac-moby.jpg

Additional information on each function
can be found in their docstrings and appending the ``-h`` flag,
e.g. ``countwords -h``.

Contributors
------------

- Amira Khan <amira@zipf.org>
- Sami Virtanen <sami@zipf.org>

14.6.2 Creating a web page for documentation

Now that we’ve added package-level documentation to our README file, we need to think about function-level documentation. We want to provide users with a list of all the functions available in our package along with a short description of what they do and how to use them. We could achieve this by manually cutting and pasting function names and docstrings from our Python modules (i.e. countwords.py, plotcounts.py, etc), but that would be a time consuming process prone to errors as more functions are added over time. Instead, we can use a documentation generator called Sphinx that is capable of scanning Python code for function names and docstrings and can export that information to HTML format for hosting on the web.

To start, let’s install Sphinx and create a docs/ directory at the top of our repository:

$ pip install sphinx
$ mkdir docs
$ cd docs

We can then run Sphinx’s quickstart tool to create a minimal set of documentation that includes the package-level information in the README.rst file we just created and the function-level information in the docstrings we’ve written along the way. It asks us to specify the project’s name, the name of the project’s author, and a release; we can use the default settings for everything else.

$ sphinx-quickstart
Welcome to the Sphinx 3.1.1 quickstart utility.

Please enter values for the following settings (just press Enter
to accept a default value, if one is given in brackets).

Selected root path: .

You have two options for placing the build directory for Sphinx
output. Either, you use a directory "_build" within the root
path, or you separate "source" and "build" directories within the
root path.
> Separate source and build directories (y/n) [n]: n

The project name will occur in several places in the built
documentation.
> Project name: zipf
> Author name(s): Amira Khan
> Project release []: 0.1

If the documents are to be written in a language other than
English, you can select a language here by its language code.
Sphinx will then translate text that it generates into that
language.

For a list of supported codes, see
https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-language.
> Project language [en]:

Creating file /Users/amira/zipf/docs/conf.py.
Creating file /Users/amira/zipf/docs/index.rst.
Creating file /Users/amira/zipf/docs/Makefile.
Creating file /Users/amira/zipf/docs/make.bat.

Finished: An initial directory structure has been created.

You should now populate your master file
/Users/amira/zipf/docs/index.rst and create other documentation
source files. Use the Makefile to build the docs, like so:
   make builder
where "builder" is one of the supported builders, e.g. HTML,
LaTeX or linkcheck.

quickstart creates a file called conf.py in the docs directory that configures Sphinx. We will make two changes to that file so that another tool called autodoc can find our modules (and their docstrings). The first change relates to the “path setup” section near the head of the file:

# -- Path setup -------------------------------------------------

# If extensions (or modules to document with autodoc) are in
# another directory, add these directories to sys.path here. If the
# directory is relative to the documentation root, use
# os.path.abspath to make it absolute, like shown here.

Relative to the docs/ directory, our modules (i.e. countwords.py, utilities.py, etc) are located in the ../zipf directory. We therefore need to uncomment the relevant lines of the path setup section in conf.py to tell Sphinx where those modules are:

import os
import sys

sys.path.insert(0, os.path.abspath('../zipf'))

We will also change the “general configuration” section to add autodoc to the list of Sphinx extensions we want:

extensions = ['sphinx.ext.autodoc']

With those edits complete, we can now generate a Sphinx autodoc script that generates information about each of our modules and puts it in corresponding .rst files in the docs/source directory:

sphinx-apidoc -o source/ ../zipf
Creating file source/collate.rst.
Creating file source/countwords.rst.
Creating file source/plotcounts.rst.
Creating file source/test_zipfs.rst.
Creating file source/utilities.rst.
Creating file source/modules.rst.

At this point, we are ready to generate our webpage. The docs sub-directory contains a Makefile that was generated by sphinx-quickstart. If we run make html and open docs/_build/index.html in a web browser we’ll have a website landing page with minimal documentation (Figure 14.2). If we click on the Module Index link we can access the documentation for the individual modules (Figures 14.3 and 14.4).

The default website landing page.

Figure 14.2: The default website landing page.

The module index.

Figure 14.3: The module index.

The countwords documentation.

Figure 14.4: The countwords documentation.

The landing page for the website is the perfect place for the content of our README file, so we can add the line .. include:: ../README.rst to the docs/index.rst file to insert it:

Welcome to Zipf's documentation!
==================================

.. include:: ../README.rst

.. toctree::
   :maxdepth: 2
   :caption: Contents:

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

If we re-run make html, we now get an updated set of web pages that re-uses our README as the introduction to the documentation (Figure 14.5).

The new landing page showing the contents of README.rst.

Figure 14.5: The new landing page showing the contents of README.rst.

Before going on, note that Sphinx is not included in the installation requirements in requirements.txt (Section 12.9). Sphinx isn’t needed to run, develop, or even test our package, but it is needed for building the documentation. To note this requirement, but without requiring everyone installing the package to install Sphinx, let’s create a requirements_docs.txt file that contains this line (where the version number is found by running pip freeze):

Sphinx>=1.7.4

Anyone wanting to build the documentation (including us, on another computer) now only needs run pip install -r requirement_docs.txt

14.6.3 Hosting documentation online

Now that we have generated our package documentation, we need to host it online. A common option for Python projects is Read The Docs, which is a community-supported site that hosts software documentation free of charge.

Just as continuous integration systems automatically re-test things (Section 12.9), Read The Docs integrates with GitHub so that documentation is automatically re-built every time updates are pushed to the project’s GitHub repository. If we register for Read The Docs with our GitHub account, we can import a project from our GitHub repository. Read The Docs will then build the documentation (using make html just as we did earlier) and host the resulting files.

For this to work, all of the source files need to be checked into your GitHub repository: in our case, this means docs/source/*.rst, docs/Makefile, docs/conf.py, and docs/index.rst. We also need to create and save a Read the Docs configuration file in the root directory of our zipf package:

$ pwd
/Users/amira/zipf
$ cat .readthedocs.yml
# .readthedocs.yml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html 
# for details

# Required
version: 2

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/conf.py

# Optionally set the version of Python and requirements required 
# to build your docs
python:
  version: 3.7
  install:
    - requirements: requirements.txt

The configuration file uses the now-familiar YAML format (Section 10.1 and Appendix H) to specify the location of the Sphinx configuration script (docs/conf.py) and the dependencies for our package (requirements.txt). If we named our project zipf-docs, our documentation is now available at https://zipf-docs.readthedocs.io/en/latest/.

14.7 Software Journals

As a final step to releasing our new package, we might want to give it a DOI so that it can be cited by researchers. As we saw in Section 13.2.3, GitHub integrates with Zenodo for precisely this purpose.

While creating a DOI using a site like Zenodo is often the end of the software publishing process, there is the option of publishing a journal paper to describe the software in detail. Some research disciplines have journals devoted to describing particular types of software (e.g., Geoscientific Model Development), and there are also a number of generic software journals such as the Journal of Open Research Software and the Journal of Open Source Software. Packages submitted to these journals are typically assessed against a range of criteria relating to how easy the software is to install and how well it is documented, so the peer review process can be a great way to get critical feedback from people who have seen many research software packages come and go over the years.

Once you have obtained a DOI and possibly published with a software journal, the last step is to tell users how to cite your new software package. This is traditionally done by adding a CITATION file to the associated GitHub repository (alongside README, LICENSE, CONDUCT and similar files discussed in Section 1.1.1), containing a plain text citation that can be copied and pasted into email as well as entries formatted for various bibliographic systems like BibTeX.

$ cat CITATION.md
# Citation

If you use the Zipf package for work/research presented in a
publication, we ask that you please cite:

Khan, A., and Virtanen, S., 2020. Zipf: A Python package for word
count analysis. *Journal of Important Software*, 5(51), 2317,
https://doi.org/10.21105/jois.02317

### BibTeX entry

    @article{Khan2020,
        title={Zipf: A Python package for word count analysis.},
        author={Khan, Amira and Virtanen, Sami},
        journal={Journal of Important Software},
        volume={5},
        number={51},
        eid={2317},
        year={2020},
        doi={10.21105/jois.02317},
       url={https://doi.org/10.21105/jois.02317},
    }

14.8 Summary

Thousands of people have helped write the software that our Zipf’s Law example relies on, but their work is only useful because they packaged it and documented how to use it. Doing this is increasingly recognized as a credit-worthy activity by universities, government labs, and other organizations, particularly for research software engineers. It is also deeply satisfying to make strangers’ lives better, if only in small ways.

14.9 Exercises

14.9.1 Fixing warnings

When we ran python setup.py sdist in Section 14.5, setup.py warned us about some missing metadata. Review its output and then fix the problem.

14.9.2 Separating requirements

As well as requirements_docs.txt, developers often create a requirements_dev.txt file to list packages that are not needed by the package’s users, but are required for its development and testing. Pull pytest out of requirements.txt and put it in a new requirements_dev.txt file, using pip freeze to find the minimum required version.

14.9.3 Software review

The Journal of Open Source Software has a checklist that reviewers must follow when assessing a submitted software paper. Run through the checklist (skipping the criteria related to the software paper) and see how the Zipf’s Law package would rate on each criteria.

14.9.4 Staying up to date

  1. Run pip list to get a list of the Python packages you have installed. How many are there?

  2. Run pip list -o to get a list of packages that are out of date. (This may take a few seconds.) How many are there, and how can you update them?

14.9.5 Packaging quotations

Each chapter in this book opens with a quote from the British author Terry Pratchett. This script quotes.py contains a function random_quote which prints a random Pratchett quote:

import random

quote_list = ["It's still magic even if you know how it's done.",
              "Everything starts somewhere, "\
              "though many physicists disagree.",
              "Ninety percent of most magic merely consists "\
              "of knowing one extra fact.",
              "Wisdom comes from experience. "\
              "Experience is often a result of lack of wisdom.",
              "There isn't a way things should be. "\
              "There's just what happens, and what we do.",
              "Multiple exclamation marks are a sure sign "\
              "of a diseased mind.",
              "+++ Divide By Cucumber Error. "\
              "Please Reinstall Universe And Reboot +++",
              "It's got three keyboards and a hundred extra knobs, "\
              "including twelve with ‘?' on them.",
              "Evil begins when you begin to treat people as things.",
             ]
                         
def random_quote():
    """Print a random Pratchett quote."""
    print(random.choice(quote_list))

Create a new conda development environment called pratchett and use pip to install a new package called pratchett into that environment. The package should contain quotes.py, and once the package has been installed the user should be able to run:

from pratchett import quotes

quotes.random_quote()

14.10 Key Points

  • Use setuptools to build and distribute Python packages.
  • Create a directory named mypackage containing a setup.py script as well as a subdirectory also called mypackage containing the package’s source files.
  • Use semantic versioning for software releases.
  • Use a virtual environment to test how your package installs without disrupting your main Python installation.
  • Use pip to install Python packages.
  • The default respository for Python packages is PyPI.
  • Use TestPyPI to test the distribution of your package.
  • Decide whether your documentation is for novices, competent practitioners, and/or experts.
  • Use docstrings to document modules and functions.
  • Use a README file for package-level documentation.
  • Use Sphinx to generate documentation for a package.
  • Use Read The Docs to host package documentation online.
  • Create a DOI for your package using GitHub’s Zenodo integration.
  • Publish the details of your package in a software journal so that others can cite it.