Chapter 5 Git at the Command Line

+++ Divide By Cucumber Error. Please Reinstall Universe And Reboot +++

— Terry Pratchett

A version control system records changes to files and helps people share their work with each other. These things can be done by emailing files to colleagues or by using “track changes” in Microsoft Word and Google Docs, but version control does both more accurately and efficiently. Originally developed to support software development, over the past fifteen years it has become the cornerstone of reproducible research.

A version control system stores a master copy of your code in a repository, which you can’t edit directly. Instead, you check out a working copy of the code, edit that code, then commit changes back to the repository. In this way, the system records a complete revision history (i.e., of every commit), so that you can retrieve and compare previous versions at any time. This is useful from an individual viewpoint, because you don’t need to store multiple (but slightly different) copies of the same script (Figure 5.1). It’s also useful from a collaboration viewpoint, because the system keeps a record of who made what changes and when.

Figure 5.1: Without a version control system, managing different versions of the same file can get messy.

There are many different version control systems, such as CVS, Subversion, and Mercurial, but the most widely used version control system today is Git. Many people first encounter it through a GUI like GitKraken or the RStudio IDE. However, these tools are actually wrappers around Git’s original command-line interface, which gives us access to all of Git’s features. This lesson describes how to perform fundamental operations using that interface; Chapter 6 then introduces more advanced operations that can be used to implement a smoother research workflow.

To show how Git works, we will apply it to the Zipf’s Law project. Our project directory should currently include:

zipf/
├── bin
│   ├── book_summary.sh
│   ├── collate.py
│   ├── countwords.py
│   ├── plotcounts.py
│   ├── script_template.py
│   └── utilities.py
├── data
│   ├── README.md
│   ├── dracula.txt
│   ├── frankenstein.txt
│   └── ...
└── results
    ├── dracula.csv
    ├── jane_eyre.csv
    ├── jane_eyre.png
    └── moby_dick.csv

bin/plotcounts.py is the solution to Exercise 4.11.2; over the course of this chapter we will edit it to produce more informative plots. Initially, it looks like this:

"""Plot word counts."""

import argparse
import pandas as pd


def main(args):
    df = pd.read_csv(args.infile, header=None, names=('word', 'word_frequency'))
    df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
    df['inverse_rank'] = 1 / df['rank']
    ax = df.plot.scatter(x='word_frequency', y='inverse_rank',
                         figsize=[12, 6], grid=True, xlim=args.xlim)
    ax.figure.savefig(args.outfile)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
                        default='-', help='Word count csv file name')
    parser.add_argument('--outfile', type=str, default='plotcounts.png',
                        help='Output image file name')
    parser.add_argument('--xlim', type=float, nargs=2, metavar=('XMIN', 'XMAX'),
                        default=None, help='X-axis limits')
    args = parser.parse_args()
    main(args)

5.1 Setting Up

We write Git commands as git VERB [options], where the subcommand VERB tells Git what we want to do and [options] provide whatever additional information that subcommand needs. Using this syntax, the first thing we need to do is configure Git.

$ git config --global user.name "Amira Khan"
$ git config --global user.email "amira@zipf.org"

(Please use your own name and email address instead of the ones shown.) Here, config is the verb and the rest of the command is its options. We put the name in quotation marks because it contains a space; we don’t actually need to quote the email address, but do so for consistency. Since we are going to be using GitHub, the email address should be the one we use, or intend to use, for our GitHub account.

The --global option tells Git to use the settings for all of our projects on this computer, so these two commands only need to be run once. However, we can re-run them any time if we want to change our details. We can also check our settings using the --list option:

$ git config --list
user.name=Amira Khan
user.email=amira@zipf.org
core.autocrlf=input
core.editor=nano
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.ignorecase=true
...

Git Help and Manual

If we forget a Git command, we can list which ones are available using --help:

$ git --help

This option also gives us more information about specific commands:

$ git config --help

5.2 Creating a New Repository

Once Git is configured, we can use it to track work on our Zipf’s Law project. First, we need to make sure we are in the top-level directory of our project:

$ cd ~/zipf
$ ls
 bin       data      results

We want to make this directory a repository, i.e., a place where Git can store versions of our files. We do this using the init command with . to mean “the current directory”:

$ git init .
Initialized empty Git repository in /Users/amira/zipf/.git/

ls seems to show that nothing has changed:

$ ls
bin     data    results

But if we add the -a flag to show everything, we can see that Git has created a hidden directory within zipf called .git:

$ ls -a
.       ..      .git    bin     data    results

Git stores information about the project in this special subdirectory. If we ever delete it, we will lose that history.

We can check that everything is set up correctly by asking Git to tell us the status of our project:

$ git status
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        bin/
        data/
        results/

nothing added to commit but untracked files present (use "git add" to track)

“No commits yet” means that Git hasn’t recorded any history yet, while “Untracked files” means Git has noticed that there are things in bin/, data/ and results/ that it is not yet keeping track of.

Hints from Git

After executing Git commands, you may see messages output that differ slightly from what is printed here. For example, you may see a reference to git restore after executing the command above. This is because newer versions of Git (>=2.23.0) include commands that streamline some common tasks. The commands presented here will still work, and you can consider any deviation you see to be a reminder to continue checking the documentation (e.g., git restore --help) to learn how new features can help your workflow.

5.3 Adding Existing Work

Now that our project is a repository, we can tell Git to start recording its history. To do this, we add things to the list of things Git is tracking using git add. We can do this for single files:

$ git add bin/countwords.py

or entire directories:

$ git add bin

The easiest thing to do with an existing project is to tell Git to add everything in the current directory using .:

$ git add .

We can then check the repository’s status to see what files have been added:

$ git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

        new file:   bin/book_summary.sh
        new file:   bin/collate.py
        new file:   bin/countwords.py
        new file:   bin/plotcounts.py
        new file:   bin/script_template.py
        new file:   bin/utilities.py
        new file:   data/README.md
        new file:   data/dracula.txt
        new file:   data/frankenstein.txt
        new file:   data/jane_eyre.txt
        new file:   data/moby_dick.txt
        new file:   data/sense_and_sensibility.txt
        new file:   data/sherlock_holmes.txt
        new file:   data/time_machine.txt
        new file:   results/dracula.csv
        new file:   results/jane_eyre.csv
        new file:   results/moby_dick.csv
        new file:   results/jane_eyre.png

Adding all of our existing files this way is easy, but we can accidentally add things that should never be in version control, such as files containing passwords or other sensitive information. The output of git status tells us that we can remove such files from the list of things to be saved using git rm --cached; we will practice this in Exercise 5.11.2.

What to Save

We always want to save programs, manuscripts, and everything else we have created by hand in version control. In this project, we have also chosen to save our data files and the results we have generated (including our plots). This is a project-specific decision: if these files are very large, for example, we may decide to save them elsewhere, while if they are easy to re-create, we may not save them at all. We will explore this issue further in Chapter 12.

We no longer have any untracked files, but the tracked files haven’t been committed (i.e., saved permanently in our project’s history). We can do this using git commit:

$ git commit -m "Add scripts, novels, word counts, and word rank plot"
[master (root-commit) 31a216a] Add scripts, novels, word counts, and word rank plot
 18 files changed, 240337 insertions(+)
 create mode 100644 bin/book_summary.sh
 create mode 100644 bin/collate.py
 create mode 100755 bin/countwords.py
 create mode 100644 bin/plotcounts.py
 create mode 100644 bin/script_template.py
 create mode 100644 bin/utilities.py
 create mode 100644 data/README.md
 create mode 100644 data/dracula.txt
 create mode 100644 data/frankenstein.txt
 create mode 100644 data/jane_eyre.txt
 create mode 100644 data/moby_dick.txt
 create mode 100644 data/sense_and_sensibility.txt
 create mode 100644 data/sherlock_holmes.txt
 create mode 100644 data/time_machine.txt
 create mode 100644 results/dracula.csv
 create mode 100644 results/jane_eyre.csv
 create mode 100644 results/jane_eyre.png
 create mode 100644 results/moby_dick.csv

git commit takes everything we have told Git to save using git add and stores a copy permanently inside the repository’s .git directory. This permanent copy is called a commit or a revision. Git gives it a unique identifier, and the first line of output from git commit displays its short identifier 31a216a, which is the first few characters of that unique label.

We use the -m option (short for message) to record a short comment with the commit to remind us later what we did and why. (Once again, we put it in double quotes because it contains spaces.) If we run git status now:

$ git status

the output tells us that all of our existing work is tracked and up to date:

On branch master
nothing to commit, working tree clean

This first commit becomes the starting point of our project’s history: we won’t be able to see changes made before this point. This implies that we should make our project a Git repository as soon as we create it rather than after we have done some work.

5.4 Describing Commits

If we run git commit without the -m option, Git opens a text editor so that we can write a longer commit message. In this message, the first line is referred to as the “subject” and the rest as the “body”, just as in an email.

When we use -m, we are only writing the subject line; this makes things easier in the short run, but if our project’s history fills up with one-liners like “Fixed problem” or “Updated”, our future self will wish that we had taken a few extra seconds to explain things in a little more detail. Following these guidelines will help:

  1. Separate the subject from the body with a blank line so that it is easy to spot.
  2. Limit subject lines to 50 characters so that they are easy to scan.
  3. Write the subject line in Title Case (like a section heading).
  4. Do not end the subject line with a period.
  5. Write as if giving a command (e.g., “Make each plot half the width of the page”).
  6. Wrap the body (i.e., insert line breaks to format text as paragraphs rather than relying on editors to wrap lines automatically).
  7. Use the body to explain what and why rather than how.
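
Some of these guidelines can be checked mechanically. The sketch below is one hypothetical way to do so in Python: the function name check_commit_message and the 72-character body width are our own assumptions (the guidelines above only say to wrap the body), and the capitalization test looks at the first letter only. Guidelines 5 and 7 concern meaning rather than form, so the sketch does not attempt them.

```python
def check_commit_message(message, subject_limit=50, body_width=72):
    """Return a list of guideline violations found in a commit message."""
    problems = []
    lines = message.rstrip('\n').split('\n')
    subject, rest = lines[0], lines[1:]

    if len(subject) > subject_limit:
        problems.append(f'subject is longer than {subject_limit} characters')
    if subject.endswith('.'):
        problems.append('subject ends with a period')
    if subject[:1].islower():
        problems.append('subject does not start with a capital letter')
    if rest and rest[0].strip():
        problems.append('no blank line between subject and body')
    # body_width is an assumption: the guidelines only say to wrap the body.
    if any(len(line) > body_width for line in rest):
        problems.append(f'body line is longer than {body_width} characters')
    return problems


good = ('Plot frequency against rank on log-log axes\n'
        '\n'
        'A log-log plot makes the low-frequency words visible,\n'
        'which the inverse-rank plot squeezed into one corner.\n')
bad = 'updated the plot.\nswitched axes'

print(check_commit_message(good))   # no problems reported
print(check_commit_message(bad))    # three problems reported
```

A check like this is only a reminder: a message can pass every test and still fail to explain what changed and why.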

Which Editor?

The default editor in the Unix shell is called Vim. It has many useful features, but no one has ever claimed that its interface is intuitive. (“How do I exit the Vim editor?” is one of the most frequently read questions on Stack Overflow.) Section E.2 explains how to configure Git to use the nano editor introduced in Chapter 2 instead.

5.5 Saving and Tracking Changes

Our initial commit gave us a starting point. The process to build on top of it is similar: first add the file, then commit changes. Let’s check that we’re in the right directory:

$ pwd
/Users/amira/zipf

Let’s use plotcounts.py to plot the word counts in results/dracula.csv:

$ python bin/plotcounts.py results/dracula.csv --outfile results/dracula.png

If we check the status of our repository again, Git tells us that we have a new file:

$ git status
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

        results/dracula.png

nothing added to commit but untracked files present (use "git add" to track)

Git isn’t tracking this file yet because we haven’t told it to. Let’s do that with git add and then commit our change:

$ git add results/dracula.png
$ git commit -m "Add plot of word counts for 'Dracula'"
[master 65b7e61] Add plot of word counts for 'Dracula'
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 results/dracula.png

If we want to know what we’ve done recently, we can display the project’s history using git log:

$ git log
commit 65b7e6129f978f6b99bae6b16c5704a9ce079afa (HEAD -> master)
Author: Amira Khan <amira@zipf.org>
Date:   Thu Feb 20 10:46:19 2020 -0800

    Add plot of word counts for 'Dracula'

commit 31a216a6119de9a8d2233e5e275af9a2967415af
Author: Amira Khan <amira@zipf.org>
Date:   Wed Feb 19 15:39:04 2020 -0800

    Add scripts, novels, word counts, and word rank plot

git log lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by git commit), the commit’s author, when it was created, and the commit message that we wrote.
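
Git derives these identifiers by hashing content, which is why they look random but are always the same length. Reproducing a commit's identifier requires all of its metadata, so as a rough illustration the sketch below instead computes the identifier Git gives to the contents of a single file (a "blob"). It assumes the default SHA-1 object format; the function name git_blob_id is ours, not part of Git.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 identifier Git assigns to file contents (a blob)."""
    # Git hashes a short header ("blob", the size in bytes, a NUL byte)
    # followed by the contents themselves.
    header = b'blob ' + str(len(data)).encode('ascii') + b'\0'
    return hashlib.sha1(header + data).hexdigest()


full_id = git_blob_id(b'hello world\n')
print(full_id)        # 40 hexadecimal characters
print(full_id[:7])    # the kind of short identifier Git displays
```

Because the identifier depends only on the bytes being hashed, changing a single character in a file produces a completely different identifier.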

The plot we have made is shown in Figure 5.2. It could be better: most of the visual space is devoted to a few very common words, which makes it hard to see what is happening with the other ten thousand or so words.

Figure 5.2: Inverse rank versus word frequency for Dracula

An alternative way to visually evaluate Zipf’s Law is to plot the word frequency against rank on log-log axes. Let’s change the line:

    ax = df.plot.scatter(x='word_frequency', y='inverse_rank',
                         figsize=[12, 6], grid=True, xlim=args.xlim)

to put 'rank' on the y-axis and add loglog=True:

    ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,
                         figsize=[12, 6], grid=True, xlim=args.xlim)

When we run git status now, it prints:

$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   bin/plotcounts.py

no changes added to commit (use "git add" and/or "git commit -a")

The line beginning with “modified:” tells us that a file Git already knows about has been modified. To save those changes in the repository’s history, we must git add and then git commit. Before we do, though, let’s review the changes using git diff. This command shows us the differences between the current state of our repository and the most recently saved version:

$ git diff
diff --git a/bin/plotcounts.py b/bin/plotcounts.py
index 13e7f38..a6005cd 100644
--- a/bin/plotcounts.py
+++ b/bin/plotcounts.py
@@ -8,7 +8,7 @@ def main(args):
     df = pd.read_csv(args.infile, header=None, names=('word', 'word_frequency'))
     df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
     df['inverse_rank'] = 1 / df['rank']
-    ax = df.plot.scatter(x='word_frequency', y='inverse_rank',
+    ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,
                          figsize=[12, 6], grid=True, xlim=args.xlim)
     ax.figure.savefig(args.outfile)

The output is cryptic, even by the standards of the Unix command line, because it is actually a series of commands telling editors and other tools how to turn the file we had into the file we have. If we break it down into pieces:

  1. The first line tells us that Git is producing output in the format of the Unix diff command.
  2. The second line tells us exactly which versions of the file Git is comparing: 13e7f38 and a6005cd are the short identifiers for those versions.
  3. The third and fourth lines once again show the name of the file being changed; the name appears twice in case we are renaming a file as well as modifying it.
  4. The remaining lines show us the changes and the lines on which they occur. A minus sign - in the first column indicates a line that is being removed, while a plus sign + shows a line that is being added. Lines without either plus or minus signs have not been changed, but are provided around the lines that have been changed to add context.

To be specific, this diff tells us that this line in the file was removed:

    ax = df.plot.scatter(x='word_frequency', y='inverse_rank',

and this line was added:

    ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,
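
The same line-by-line format can be produced outside Git. As a minimal sketch, Python's standard difflib module can compare the old and new versions of the three lines around our edit; the lists old and new below are typed in by hand for illustration rather than read from the repository.

```python
import difflib

old = [
    "    df['inverse_rank'] = 1 / df['rank']\n",
    "    ax = df.plot.scatter(x='word_frequency', y='inverse_rank',\n",
    "                         figsize=[12, 6], grid=True, xlim=args.xlim)\n",
]
new = [
    "    df['inverse_rank'] = 1 / df['rank']\n",
    "    ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,\n",
    "                         figsize=[12, 6], grid=True, xlim=args.xlim)\n",
]

# unified_diff yields header lines, a hunk marker, then the context (' '),
# removed ('-'), and added ('+') lines.
diff_lines = list(difflib.unified_diff(
    old, new, fromfile='a/bin/plotcounts.py', tofile='b/bin/plotcounts.py'))
print(''.join(diff_lines), end='')
```

The output has no "diff --git" or "index" lines, since those are added by Git, but the file names, the @@ marker, and the minus and plus lines follow the same conventions.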

Git’s default is to compare line by line, but it can be instructive to instead compare word by word using the --word-diff or --color-words options. These are particularly useful when running git diff on prose rather than code.

After reviewing our change we can commit it just as we did before:

$ git commit -m "Edit to plot frequency against rank on log-log axes"
On branch master
Changes not staged for commit:
        modified:   bin/plotcounts.py

no changes added to commit

Whoops: we forgot to add the file to the set of things we want to commit. Let’s do that and then try the commit again:

$ git add bin/plotcounts.py
$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        modified:   bin/plotcounts.py
$ git commit -m "Edit to plot frequency against rank on log-log axes"
[master b5176bf] Edit to plot frequency against rank on log-log axes
 1 file changed, 1 insertion(+), 1 deletion(-)

The Staging Area

Git insists that we add files to the set we want to commit before actually committing anything. This allows us to commit our changes in stages and capture changes in logical portions rather than only large batches. For example, suppose we add a few citations to the introduction of our thesis, which is in the file introduction.tex. We might want to commit those additions but not commit the changes to conclusion.tex (which we haven’t finished writing yet). To allow for this, Git has a special staging area where it keeps track of things that have been added to the current changeset but not yet committed (Figure 5.3).

Figure 5.3: The staging area.
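
One way to picture the staging area is as a holding place between the files on disk and the saved history. The toy model below is a sketch under that assumption only: the class ToyRepository and its attributes are hypothetical and say nothing about how Git stores data internally.

```python
class ToyRepository:
    """A toy model of Git's three areas; all names here are hypothetical."""

    def __init__(self):
        self.working = {}   # files as they are on disk: path -> content
        self.staged = {}    # changes added but not yet committed
        self.commits = []   # saved history: list of (message, snapshot)

    def add(self, path):
        """Copy one file's current content into the staging area."""
        self.staged[path] = self.working[path]

    def commit(self, message):
        """Save the last snapshot plus everything staged, then clear staging."""
        snapshot = dict(self.commits[-1][1]) if self.commits else {}
        snapshot.update(self.staged)
        self.commits.append((message, snapshot))
        self.staged = {}


repo = ToyRepository()
repo.working['introduction.tex'] = 'Intro with new citations'
repo.working['conclusion.tex'] = 'Unfinished conclusion'

repo.add('introduction.tex')            # stage only the finished file
repo.commit('Add citations to introduction')

message, snapshot = repo.commits[-1]
print(sorted(snapshot))                 # conclusion.tex is not in the commit
```

The unfinished conclusion stays in the working copy, untouched and uncommitted, until we choose to add it.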

Let’s take a look at our new plot (Figure 5.4):

$ python bin/plotcounts.py results/dracula.csv --outfile results/dracula.png

Figure 5.4: Rank versus word frequency, on log-log axes, for Dracula

Interpreting Our Plot

If Zipf’s Law holds, we should still see a linear relationship, although now it will be negative, rather than positive (since we’re plotting the rank instead of the inverse rank). The low-frequency words (below about 120 instances) seem to follow a straight line very closely, but we currently have to make this evaluation by eye. In the next chapter, we’ll write code to fit and add a line to our plot.

Running git status again shows that our plot has been modified:

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   results/dracula.png

no changes added to commit (use "git add" and/or "git commit -a")

Since results/dracula.png is a binary file rather than text, git diff can’t show what has changed. It therefore simply tells us that the new file is different from the old one:

diff --git a/results/dracula.png b/results/dracula.png
index d8162ac..e9fe7f8 100644
Binary files a/results/dracula.png and b/results/dracula.png differ

This is one of the biggest weaknesses of Git (and other version control systems): they are built to handle text. They can track changes to images, PDFs, and other formats, but they cannot do as much to show or merge differences. In a better world than ours, programmers fixed this years ago.
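
Tools usually decide whether a file is text or binary with a simple heuristic rather than by looking at its name. The sketch below shows one such heuristic, checking for a NUL byte near the start of the data. Git's own check works along similar lines, but the function name looks_binary and the 8000-byte sample size should be read as our assumptions rather than as Git's exact behavior.

```python
def looks_binary(data: bytes, sample_size: int = 8000) -> bool:
    """Guess whether data is binary by looking for a NUL byte near the start."""
    return b'\0' in data[:sample_size]


# The signature and first chunk header that begin every PNG file.
png_start = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR'
python_line = b"df['rank'] = df['word_frequency'].rank()\n"

print(looks_binary(png_start))     # the PNG header contains NUL bytes
print(looks_binary(python_line))   # ordinary source code does not
```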

If we are sure we want to save all of our changes, we can add and commit in a single command by giving git commit the -a option:

$ git commit -a -m "Update dracula plot"
[master d77bc5c] Update dracula plot
 1 file changed, 0 insertions(+), 0 deletions(-)
 rewrite results/dracula.png (99%)

5.6 Synchronizing with Other Repositories

Sooner or later our computer will experience a hardware failure, be stolen, or be thrown in the lake by someone who thinks that we shouldn’t spend the entire vacation working on our thesis. Even before that happens we will probably want to collaborate with others, which we can do by linking our local repository to one stored on a hosting service such as GitHub.

Where’s My Repository?

So far we’ve worked with repositories located on your own computer, which we’ll also refer to as local or desktop repositories. The alternative is hosting repositories on GitHub or another server, which we’ll refer to as a remote or GitHub repository.

The first steps are to create an account on GitHub, and then to create a new repository to synchronize with. The remote repository doesn’t have to have the same name as the local one, but we will probably get confused if they are different, so the repository we create on GitHub will also be called zipf.

Next, we need to connect our desktop repository with the one on GitHub. We do this by making the GitHub repository a remote of the local repository. The home page of the repository on GitHub includes the string we need to identify it (Figure 5.5).

Figure 5.5: Where to Find the Repository Link

We can click on “HTTPS” to change the URL from SSH to HTTPS and then copy that URL.

HTTPS vs. SSH

We use HTTPS here because it does not require additional configuration. If we want to set up SSH access so that we do not have to type in our password as often, the tutorials from GitHub, Bitbucket, or GitLab explain the steps required.

Next, let’s go into the local zipf repository and run this command:

$ cd ~/zipf
$ git remote add origin https://github.com/amira-khan/zipf.git

Make sure to use the URL for your repository instead of the one shown: the only difference should be that it includes your username instead of amira-khan.

A Git remote is like a bookmark: it gives a short name to a URL. In this case the remote’s name is origin; we could use anything we want, but origin is Git’s default, so we will stick with it. We can check that the command has worked by running git remote -v (where the -v option is short for verbose):

$ git remote -v
origin  https://github.com/amira-khan/zipf.git (fetch)
origin  https://github.com/amira-khan/zipf.git (push)

Git displays two lines because it’s actually possible to set up a remote to download from one URL but upload to another. Sensible people don’t do this, so we won’t explore this possibility any further.

Now that we have configured a remote, we can push the work we have done so far to the repository on GitHub:

$ git push origin master

This may prompt us to enter our username and password (for HTTPS remotes, GitHub expects a personal access token in place of the account password); once we do that, Git prints a few lines of administrative information:

Enumerating objects: 33, done.
Counting objects: 100% (33/33), done.
Delta compression using up to 4 threads
Compressing objects: 100% (33/33), done.
Writing objects: 100% (33/33), 2.12 MiB | 799.00 KiB/s, done.
Total 33 (delta 5), reused 0 (delta 0)
remote: Resolving deltas: 100% (5/5), done.
To https://github.com/amira-khan/zipf.git
 * [new branch]      master -> master
Branch 'master' set up to track remote branch 'master' from 'origin'.

If we view our GitHub repository in the browser, it now includes all of our project files, along with all of the commits we have made so far (Figure 5.6).

Figure 5.6: Repository history on GitHub

We can also pull changes from the remote repository to the local one:

$ git pull origin master
From https://github.com/amira-khan/zipf
 * branch            master     -> FETCH_HEAD
Already up-to-date.

Pulling has no effect in this case because the two repositories are already synchronized.

5.7 Exploring History

Git lets us look at previous versions of files and restore specific files to earlier states if we want to. In order to do these things, we need to identify the versions we want.

The two ways to do this are analogous to absolute and relative paths. The “absolute” version is the unique identifier that Git gives to each commit. These identifiers are 40 characters long, but in most situations Git will let us use just the first half dozen characters or so. For example, if we run git log right now, it shows us something like this:

commit d77bc5cc204f3140d95f942d0515b143927c6f51 (HEAD -> master, origin/master)
Author: Amira Khan <amira@zipf.org>
Date:   Thu Feb 20 11:44:54 2020 -0800

    Update dracula plot

commit b5176bfd2ce9650ad5e79e117cd68a666c9cdabc
Author: Amira Khan <amira@zipf.org>
Date:   Thu Feb 20 11:18:33 2020 -0800

    Edit to plot frequency against rank on log-log axes

commit 65b7e6129f978f6b99bae6b16c5704a9ce079afa
Author: Amira Khan <amira@zipf.org>
Date:   Thu Feb 20 10:46:19 2020 -0800

    Add plot of word counts for 'Dracula'

commit 31a216a6119de9a8d2233e5e275af9a2967415af
Author: Amira Khan <amira@zipf.org>
Date:   Wed Feb 19 15:39:04 2020 -0800

    Add scripts, novels, word counts, and word rank plot

The commit in which we changed plotcounts.py has the absolute identifier b5176bfd2ce9650ad5e79e117cd68a666c9cdabc, but we can use b5176bf to reference it in almost all situations.
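
A shortened identifier works as long as it matches exactly one commit, which is why it can be used in almost all situations rather than all of them. The sketch below illustrates the idea with the four identifiers from the log above; the function name resolve_prefix is hypothetical, and Git's own lookup covers more cases than this.

```python
COMMITS = [  # full identifiers, newest first, as git log lists them
    'd77bc5cc204f3140d95f942d0515b143927c6f51',
    'b5176bfd2ce9650ad5e79e117cd68a666c9cdabc',
    '65b7e6129f978f6b99bae6b16c5704a9ce079afa',
    '31a216a6119de9a8d2233e5e275af9a2967415af',
]


def resolve_prefix(prefix, commits=COMMITS):
    """Return the one full identifier that starts with prefix."""
    matches = [c for c in commits if c.startswith(prefix)]
    if len(matches) != 1:
        # Zero matches means an unknown commit; several means the
        # prefix is too short to be unambiguous.
        raise ValueError(f'{prefix!r} matches {len(matches)} commits')
    return matches[0]


print(resolve_prefix('b5176bf'))
```

In a repository with many thousands of commits, a prefix may need a few more characters before it identifies a single commit.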

While git log includes the commit message, it doesn’t tell us exactly what changes were made in each commit. If we add the -p option (short for patch), we get the same kind of details git diff provides to describe the changes in each commit:

$ git log -p

The first part of the output is shown below; we have truncated the rest, since it is very long:

commit d77bc5cc204f3140d95f942d0515b143927c6f51 (HEAD -> master, origin/master)
Author: Amira Khan <amira@zipf.org>
Date:   Thu Feb 20 11:44:54 2020 -0800

    Update dracula plot

diff --git a/results/dracula.png b/results/dracula.png
index 8e3ff84..af8e892 100644
Binary files a/results/dracula.png and b/results/dracula.png differ
...

Alternatively, we can use git diff directly to examine the differences between files at any stage in the repository’s history. Let’s explore this with the plotcounts.py file. We no longer need the line of code in plotcounts.py that calculates the inverse rank:

df['inverse_rank'] = 1 / df['rank']

If we delete that line from bin/plotcounts.py, git diff on its own will show the difference between the file as it is now and the most recent version:

diff --git a/bin/plotcounts.py b/bin/plotcounts.py
index a6005cd..d085f22 100644
--- a/bin/plotcounts.py
+++ b/bin/plotcounts.py
@@ -7,7 +7,6 @@ def main(args):
     """Run the command line program."""
     df = pd.read_csv(args.infile, header=None, names=('word', 'word_frequency'))
     df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
-    df['inverse_rank'] = 1 / df['rank']
     ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,
                         figsize=[12, 6], grid=True, xlim=args.xlim)
     ax.figure.savefig(args.outfile)

git diff b5176bf, on the other hand, shows the difference between the current state and the commit referenced by the short identifier. We have not committed any change to bin/plotcounts.py since b5176bf, so that part of the output matches what we have just seen, but the plot we updated in a later commit now appears as well:

diff --git a/bin/plotcounts.py b/bin/plotcounts.py
index a6005cd..d085f22 100644
--- a/bin/plotcounts.py
+++ b/bin/plotcounts.py
@@ -7,7 +7,6 @@ def main(args):
     """Run the command line program."""
     df = pd.read_csv(args.infile, header=None, names=('word', 'word_frequency'))
     df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
-    df['inverse_rank'] = 1 / df['rank']
     ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,
                          figsize=[12, 6], grid=True, xlim=args.xlim)
     ax.figure.savefig(args.outfile)

diff --git a/results/dracula.png b/results/dracula.png
index d8162ac..e9fe7f8 100644
Binary files a/results/dracula.png and b/results/dracula.png differ

Note that you may need to use something other than b5176bf, since Git may have assigned your commit a different unique identifier. Note also that we have not committed this change: we will look at ways of undoing it in the next section.

The “relative” version of history relies on a special identifier called HEAD, which always refers to the most recent version in the repository. When nothing has been staged, git diff HEAD therefore shows the same thing as git diff. Instead of typing in a version identifier to back up one commit, we can use HEAD~1 (where ~ is the tilde symbol). This shorthand is read “HEAD minus one”, and gives us the difference from the previous saved version. git diff HEAD~2 goes back two revisions, and so on. We can also look at the differences between two saved versions by separating their identifiers with two dots .. like this:

$ git diff HEAD~1..HEAD~2
diff --git a/bin/plotcounts.py b/bin/plotcounts.py
index a6005cd..13e7f38 100644
--- a/bin/plotcounts.py
+++ b/bin/plotcounts.py
@@ -8,7 +8,7 @@ def main(args):
     df = pd.read_csv(args.infile, header=None, names=('word', 'word_frequency'))
     df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
     df['inverse_rank'] = 1 / df['rank']
-    ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,
+    ax = df.plot.scatter(x='word_frequency', y='inverse_rank',
                          figsize=[12, 6], grid=True, xlim=args.xlim)
     ax.figure.savefig(args.outfile)
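
Relative identifiers can be thought of as counting backwards through the log. The sketch below models that for the four commits in our history; it assumes a simple straight-line history like ours, and the function name resolve_relative is hypothetical.

```python
HISTORY = [  # short identifiers from git log, newest first
    'd77bc5c',  # Update dracula plot
    'b5176bf',  # Edit to plot frequency against rank on log-log axes
    '65b7e61',  # Add plot of word counts for 'Dracula'
    '31a216a',  # Add scripts, novels, word counts, and word rank plot
]


def resolve_relative(ref, history=HISTORY):
    """Translate 'HEAD' or 'HEAD~N' into an identifier from history."""
    name, tilde, steps = ref.partition('~')
    if name != 'HEAD':
        raise ValueError(f'unsupported reference: {ref!r}')
    if not tilde:
        return history[0]               # plain HEAD: the most recent commit
    back = int(steps) if steps else 1   # a bare HEAD~ means one step back
    return history[back]


for ref in ('HEAD', 'HEAD~1', 'HEAD~2'):
    print(ref, '->', resolve_relative(ref))
```

Read this way, git diff HEAD~1..HEAD~2 compares b5176bf with 65b7e61, which is why the output above shows the log-log edit being reversed.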

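If we want to see the changes made to a file in a particular commit, we can use git show with an identifier followed by the file’s path (joining the two with a colon, as in HEAD~1:bin/plotcounts.py, would instead print the whole file as it was saved in that commit):

$ git show HEAD~1 bin/plotcounts.py
commit b5176bfd2ce9650ad5e79e117cd68a666c9cdabc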
Author: Amira Khan <amira@zipf.org>
Date:   Thu Feb 20 11:18:33 2020 -0800

    Edit to plot frequency against rank on log-log axes

diff --git a/bin/plotcounts.py b/bin/plotcounts.py
index 13e7f38..a6005cd 100644
--- a/bin/plotcounts.py
+++ b/bin/plotcounts.py
@@ -8,7 +8,7 @@ def main(args):
     df = pd.read_csv(args.infile, header=None, names=('word', 'word_frequency'))
     df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
     df['inverse_rank'] = 1 / df['rank']
-    ax = df.plot.scatter(x='word_frequency', y='inverse_rank',
+    ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,
                          figsize=[12, 6], grid=True, xlim=args.xlim)
     ax.figure.savefig(args.outfile)

5.8 Restoring Old Versions of Files

We can see what we changed, but how can we restore it? Suppose we change our mind about the last update to bin/plotcounts.py before we add it or commit it. git status tells us that the file has been changed, but those changes haven’t been staged:

$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   bin/plotcounts.py

no changes added to commit (use "git add" and/or "git commit -a")

We can put things back the way they were in the last saved revision using git checkout:

$ git checkout HEAD bin/plotcounts.py
$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
$ head -12 bin/plotcounts.py | tail -4
    df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
    df['inverse_rank'] = 1 / df['rank']
    df.plot.scatter(x='word_frequency', y='rank', loglog=True,
                    figsize=[12, 6], grid=True, xlim=args.xlim)

As its name suggests, git checkout checks out (i.e., restores) an old version of a file. In this case, we told Git to recover the version of the file saved in the most recent commit. We can use a specific commit identifier rather than HEAD to go back as far as we want:

$ git checkout 65b7e61 bin/plotcounts.py

Doing this does not change the history: git log still shows our four commits. Instead, it replaces the content of the file with the old content:

$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        modified:   bin/plotcounts.py

Notice that the changes have already been added to the staging area for new commits. If we change our mind again, we can return the file to the state of the most recent commit using git checkout:

$ git checkout HEAD bin/plotcounts.py
$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
$ head -12 bin/plotcounts.py | tail -4
    df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
    df['inverse_rank'] = 1 / df['rank']
    df.plot.scatter(x='word_frequency', y='rank', loglog=True,
                    figsize=[12, 6], grid=True, xlim=args.xlim)

Since we didn’t commit the change that removed the line that calculates the inverse rank, that work is now lost: Git can only go back and forth between committed versions of files.
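The whole sequence can be rehearsed in a scratch repository before trying it on real work. The sketch below assumes a hypothetical file notes.txt with two committed versions; it is an illustration of the behavior described above rather than part of the project.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.org"   # hypothetical address

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
echo "second draft" > notes.txt
git commit -q -a -m "Revise notes"

# An uncommitted edit is thrown away when we restore the copy from HEAD.
echo "scribble" > notes.txt
git checkout HEAD notes.txt
after_head=$(cat notes.txt)

# Restoring from an older commit replaces the file and stages the old content.
git checkout HEAD~1 notes.txt
after_old=$(cat notes.txt)
staged=$(git diff --cached --name-only)

# The history itself is untouched: there are still two commits.
commits=$(git rev-list --count HEAD)
echo "$after_head / $after_old / staged: $staged / commits: $commits"
```

Git 2.23 and later also provide git restore for the same job, and the hints printed by git status in those versions mention it in place of git checkout.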

5.9 Ignoring Files

We don’t always want Git to track every file’s history. For example, we might want to track text files with names ending in .txt but not data files with names ending in .dat.

To stop Git from telling us about these files every time we call git status, we can create a file in the root directory of our project called .gitignore. This file can contain filenames like thesis.pdf or wildcard patterns like *.dat. Each must be on a line of its own, and Git will ignore anything that matches any of these lines. For now we only need one entry in our .gitignore file,

__pycache__

which tells Git to ignore any __pycache__ directory created by Python (Section 4.8).
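To check that a pattern does what we expect, we can ask Git directly. The sketch below uses a scratch repository with hypothetical files (results.dat, notes.txt) and adds the *.dat pattern mentioned above purely for illustration.

```shell
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init -q

# One pattern per line: a directory name and a wildcard.
printf '%s\n' '__pycache__' '*.dat' > .gitignore

mkdir __pycache__
touch __pycache__/module.pyc results.dat notes.txt

# The short status marks untracked files with ??; ignored files are left out.
visible=$(git status --short)
echo "$visible"

# git check-ignore -v reports which line of which file matched a path.
git check-ignore -v results.dat
```

Only .gitignore and notes.txt should appear in the status listing, and git check-ignore should point at the *.dat line of .gitignore.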

Remember to Ignore

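Don’t forget to commit .gitignore to your repository. Git reads the file whether or not it has been committed, but committing it means that everyone who works on the project ignores the same files.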

5.10 Summary

The biggest benefit of version control for individual research is that we can always go back to the precise set of files that we used to produce a particular result. While Git is complex (Perez De Rosso and Jackson 2013), being able to back up our changes on sites like GitHub with just a few keystrokes can save us a lot of pain, and some of Git’s advanced features make it even more powerful. We will explore these in the next chapter.

5.11 Exercises

5.11.1 Places to create Git repositories

Along with information about the Zipf’s Law project, Amira would also like to keep some notes on Heaps’ Law. Despite her colleagues’ concerns, Amira creates a heaps-law project inside her zipf project as follows:

$ cd ~/zipf         # go into zipf directory, which is already a Git repository
$ mkdir heaps-law   # make a subdirectory zipf/heaps-law
$ cd heaps-law      # go into heaps-law subdirectory
$ git init          # make heaps-law a Git repository

Is the git init command that she runs inside the heaps-law subdirectory required for tracking files stored there?

5.11.2 Removing before saving

The output of git status tells us that we can take files out of the list of things to be saved using git rm --cached. Try this out:

  1. Create a new file in the repository called example.txt.
  2. Use git add example.txt to add this file.
  3. Use git status to check that Git has noticed it.
  4. Use git rm --cached example.txt to remove it from the list of things to be saved.

What does git status now show? What (if anything) has happened to the file?

5.11.3 Viewing changes

Make a few changes to a file in your Git repository, then view those differences using both git diff and git diff --word-diff. Which output do you find easier to understand?

5.11.4 Committing changes

Which command(s) below would save changes to myfile.txt to a local Git repository?

  1.    $ git commit -m "Add recent changes"
  2.    $ git init myfile.txt
       $ git commit -m "Add recent changes"
  3.    $ git add myfile.txt
       $ git commit -m "Add recent changes"
  4.    $ git commit -m myfile.txt "Add recent changes"

5.11.5 Committing multiple files

The staging area can hold changes from any number of files that you want to commit as a single snapshot.

  1. Create a new file about.txt and add a one sentence summary of the project.

  2. Create another new file project-members.txt and add your name.

  3. Add changes from both files to the staging area and commit those changes.

5.11.6 Write your biography

  1. Create a new Git repository on your computer called bio.
  2. Write a three-line biography for yourself in a file called me.txt and commit your changes.
  3. Modify one line and add a fourth line.
  4. Display the differences between the file’s original state and its updated state.

5.11.7 Ignoring nested files

Suppose our project has a directory results with two subdirectories called data and plots. How would we ignore all of the files in results/plots but not ignore files in results/data?

5.11.8 Including specific files

How would you ignore all .dat files in your root directory except for final.dat? (Hint: find out what the exclamation mark ! means in a .gitignore file.)

5.11.9 Exploring the GitHub interface

Browse to your zipf repository on GitHub. Under the Code tab, find and click on the text that says “NN commits” (where “NN” is some number). Hover over and click on the three buttons to the right of each commit. What information can you gather/explore from these buttons? How would you get that same information in the shell?

5.11.10 GitHub timestamps

  1. Create a remote repository on GitHub.
  2. Push the contents of your local repository to the remote.
  3. Make changes to your local repository and push these changes as well.
  4. Go to the repo you just created on GitHub and check the timestamps of the files.

How does GitHub record times, and why?

5.11.11 Push versus commit

Explain in one or two sentences how git push is different from git commit.

5.11.12 License and README files

When we initialized our GitHub repo, we didn’t add a README.md or license file. If we had, what would have happened when we tried to link our local and remote repositories?

5.11.13 Recovering older versions of a file

Amira made changes this morning to a shell script called data_cruncher.sh that she has been working on for weeks. Her changes broke the script, and she has now spent an hour trying to get it back in working order. Luckily, she has been keeping track of her project’s versions using Git. Which of the commands below can she use to recover the last committed version of her script?

  1. $ git checkout HEAD
  2. $ git checkout HEAD data_cruncher.sh
  3. $ git checkout HEAD~1 data_cruncher.sh
  4. $ git checkout <unique ID of last commit> data_cruncher.sh
  5. Both 2 and 4

5.11.14 Workflow and history

What is the output of the last command in the sequence below?

$ cd zipf
$ echo "Zipf's Law describes the relationship between the frequency and rarity of words." > motivation.txt
$ git add motivation.txt
$ echo "Zipf's Law suggests the frequency of any word is inversely proportional to its rank." > motivation.txt
$ git commit -m "Motivate project"
$ git checkout HEAD motivation.txt
$ cat motivation.txt

  1. Zipf's Law describes the relationship between the frequency and rarity of words.
  2. Zipf's Law suggests the frequency of any word is inversely proportional to its rank.
  3. Zipf's Law describes the relationship between the frequency and rarity of words.
     Zipf's Law suggests the frequency of any word is inversely proportional to its rank.
  4. An error message because we have changed motivation.txt without committing first.

5.11.15 Understanding git diff

  1. What do you expect the command git diff HEAD~2 bin/plotcounts.py to do?
  2. Run it. What does it actually do?
  3. What does git diff HEAD bin/plotcounts.py do?

5.11.16 Getting rid of staged changes

git checkout can be used to restore a previous commit when unstaged changes have been made, but will it also work for changes that have been staged but not committed? To find out:

  1. Change bin/plotcounts.py.
  2. Use git add on those changes to bin/plotcounts.py.
  3. Use git checkout to see if you can remove your change.

Does it work?

5.11.17 Figuring out who did what

Run the command git blame bin/plotcounts.py. What does each line of the output show?

5.12 Key Points

  • Use git config with the --global option to configure your user name, email address, and other preferences once per machine.
  • git init initializes a repository.
  • Git stores all repository management data in the .git subdirectory of the repository’s root directory.
  • git status shows the status of a repository.
  • git add puts files in the repository’s staging area.
  • git commit saves the staged content as a new commit in the local repository.
  • git log lists previous commits.
  • git diff shows the difference between two versions of the repository.
  • Synchronize your local repository with a remote repository on a forge such as GitHub.
  • git remote manages bookmarks pointing at remote repositories.
  • git push copies changes from a local repository to a remote repository.
  • git pull copies changes from a remote repository to a local repository.
  • git checkout recovers old versions of files.
  • The .gitignore file tells Git what files to ignore.