When people first started using Python for data science, installing the relevant libraries could be difficult. The main problem was that the Python package installer (pip) only worked for libraries written in pure Python. Many scientific Python libraries have C and/or Fortran dependencies, so it was left up to data scientists (who often do not have a background in system administration) to figure out how to install those dependencies themselves. To overcome this problem, a number of scientific Python distributions have been released over the years. These come with the most popular data science libraries and their dependencies pre-installed, and some also come with a package manager to assist with installing additional libraries that weren’t pre-installed. Today the most popular distribution for data science is Anaconda, which comes with a package (and environment) manager called conda.
I.1 Package management with conda
According to the latest documentation,
Anaconda comes with over 250 of the most widely used data science libraries (and their dependencies) pre-installed.
In addition, there are several thousand libraries available via the
conda install command,
which can be executed at the command line or by using the Anaconda Navigator graphical user interface.
A package manager like conda greatly simplifies the software installation process
by identifying and installing compatible versions of software and all required dependencies.
It also handles the process of updating software as more recent versions become available.
If you don’t want to install the entire Anaconda distribution,
you can install Miniconda instead.
It essentially comes with conda and nothing else.
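As a minimal sketch of how installing and updating fit together, the shell function below installs a package if it is absent from the active environment and updates it otherwise. The function name `install_or_update` is our own invention rather than part of conda. The sketch assumes that conda is on the `PATH` and that `conda list` prints one package per line with the package name in the first column.

```shell
# Hypothetical helper (not part of conda): install a package if it is missing
# from the active environment, otherwise update it to the latest compatible version.
install_or_update() {
    pkg="$1"
    # --full-name restricts the listing to packages whose name matches exactly
    if conda list --full-name "$pkg" | grep -q "^${pkg}[[:space:]]"; then
        conda update --yes "$pkg"
    else
        conda install --yes "$pkg"
    fi
}
```

For example, `install_or_update xarray` would install xarray the first time it is called and update it on later calls (xarray is used here purely as an illustrative package name).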
I.1.1 Anaconda cloud
What happens if we want to install a Python package
that isn’t on the list of the few thousand or so most popular data science packages
(i.e. the ones that are automatically available via the
conda install command)?
The answer is the Anaconda Cloud website,
where the community can post conda installation packages.
The utility of the Anaconda Cloud for research software engineers is best illustrated by an example. A few years ago, an atmospheric scientist by the name of Andrew Dawson wrote a Python package called windspharm for performing computations on global wind fields in spherical geometry. While many of Andrew’s colleagues routinely process global wind fields, atmospheric science is a relatively small field and thus the windspharm package will never have a big enough user base to make the list of popular data science packages supported by Anaconda. Andrew has therefore posted a conda installation package to Anaconda Cloud (Figure I.1) so that users can install windspharm using conda:
$ conda install -c ajdawson windspharm
It turns out there are often multiple installation packages for the same library
up on Anaconda Cloud (e.g. Figure I.2).
To address this duplication problem, conda-forge was launched.
It aims to be a central repository that contains just a single, up-to-date (and working)
version of each installation package on Anaconda Cloud.
You can therefore expand the selection of packages available via
conda install
beyond the chosen few thousand by adding the conda-forge channel to your conda configuration:
$ conda config --add channels conda-forge
The conda-forge website has instructions for adding a conda installation package to the conda-forge repository.
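Before changing the configuration it can be useful to check which channels are already listed. The function below is a sketch of that check, assuming conda is on the `PATH`; the name `ensure_conda_forge` is hypothetical, and the pattern assumes that `conda config --show channels` prints each channel on its own line as a YAML list item.

```shell
# Hypothetical helper (not part of conda): add the conda-forge channel
# only if it is not already present in the conda configuration.
ensure_conda_forge() {
    # `conda config --show channels` prints the current channel list
    if conda config --show channels | grep -q '^[[:space:]]*- conda-forge$'; then
        echo "conda-forge already configured"
    else
        conda config --add channels conda-forge
    fi
}
```

If you would rather not change your configuration at all, the `-c` option used for windspharm above works for conda-forge too: passing `-c conda-forge` to a single `conda install` command searches that channel for that command only.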
I.2 Environment management with conda
If you are working on several data science projects at once, installing all the libraries you need in the same place (i.e. the system default location) can become problematic. This is especially true if the projects rely on different versions of the same package, or if you are developing a new package and need to try new things. The way to avoid these issues is to create different virtual environments for different projects/tasks. The original environment manager for Python development was virtualenv, which has more recently been joined by higher-level tools such as pipenv that build on it. The advantage that conda has over these options is that it is language agnostic (i.e. you can isolate non-Python packages in your environments too) and supports binary packages (i.e. you don’t need to compile the source code after installing), so it has become the environment manager of choice in data science. In this book conda is used to export the details of an environment when documenting the computational methodology for a report (Section 12.2) and to test how a new package installs without disturbing anything in our main Python installation (Section 13.2).
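The two uses mentioned above, creating an isolated environment and exporting its details, can be sketched as a single shell function. This is an illustration under stated assumptions rather than the workflow used in the later chapters: the function name `new_project_env` and the environment name in the usage note are made up, and conda is assumed to be on the `PATH`.

```shell
# Hypothetical helper (not part of conda): create an isolated environment
# and record its full specification in a YAML file named after it.
new_project_env() {
    env_name="$1"   # name of the new environment
    shift           # remaining arguments are package specs, e.g. python=3.11 numpy
    # Export only if the environment was created successfully
    conda create --yes --name "$env_name" "$@" &&
        conda env export --name "$env_name" > "${env_name}.yml"
}
```

Calling `new_project_env wind-analysis python=3.11 numpy` would create an environment called wind-analysis and write its details to `wind-analysis.yml`. The environment can then be entered with `conda activate wind-analysis`, and recreated on another machine with `conda env create --file wind-analysis.yml`.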