Chapter 6 Advanced Git

It’s got three keyboards and a hundred extra knobs, including twelve with ‘?’ on them.

— Terry Pratchett

Git would be worth using if all it did was keep track of our work, but two of its more advanced features allow us to do much more. Branches let us work on multiple things simultaneously in a single repository; pull requests (PRs) let us submit our work for review, get feedback, and make updates. Used together, they allow us to go through the write-review-revise cycle familiar to anyone who has ever written a journal paper in hours rather than weeks.

Your zipf project directory should now include:

zipf/
├── .gitignore
├── bin
│   ├── book_summary.sh
│   ├── countwords.py
│   ├── collate.py
│   ├── plotcounts.py
│   ├── script_template.py
│   └── utilities.py
├── data
│   ├── README.md
│   ├── dracula.txt
│   ├── frankenstein.txt
│   └── ...
└── results
    ├── dracula.csv
    ├── dracula.png
    ├── jane_eyre.csv
    ├── jane_eyre.png
    └── moby_dick.csv

All of these files should also be tracked in your version history. We’ll use them and some additional analyses to explore Zipf’s Law using Git’s advanced features.

6.1 What’s a Branch?

So far we have only used a sequential timeline with Git: each change builds on the one before, and only on the one before. However, there are times when we want to try things out without disrupting our main work. To do this, we can use branches to work on separate tasks in parallel. Each branch is a parallel timeline; changes made on the branch only affect that branch unless and until we explicitly combine them with work done in another branch.

We can see what branches exist in a repository using this command:

$ git branch
* master

When we initialize a repository, Git automatically creates a branch called master. It is often considered the “official” version of the repository. The asterisk * indicates that it is currently active, i.e., that all changes we make will take place in this branch by default. (The active branch is like the current working directory in the shell.)

In the previous chapter, we foreshadowed some experimental changes that we could try and make to plotcounts.py.

Making sure our project directory is our working directory, we can inspect our current plotcounts.py:

$ cd ~/zipf
$ cat bin/plotcounts.py
"""Plot word counts."""

import argparse
import pandas as pd


def main(args):
    """Run the command line program."""
    df = pd.read_csv(args.infile, header=None, names=('word', 'word_frequency'))
    df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
    ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,
                         figsize=[12, 6], grid=True, xlim=args.xlim)
    ax.figure.savefig(args.outfile)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
                        default='-', help='Word count csv file name')
    parser.add_argument('--outfile', type=str, default='plotcounts.png',
                        help='Output image file name')
    parser.add_argument('--xlim', type=float, nargs=2, metavar=('XMIN', 'XMAX'),
                        default=None, help='X-axis limits')
    args = parser.parse_args()
    main(args)

We used this version of plotcounts.py to display the word counts for Dracula on a log-log plot (Figure 5.4). The relationship between word count and rank looked linear, but since the eye is easily fooled, we should fit a curve to the data. Doing this will require more than just a trivial change to the script, so to ensure that this version of plotcounts.py keeps working while we try to build a new one, we will do our work in a separate branch. Once we have successfully added curve fitting to plotcounts.py, we can decide if we want to merge our changes back into the master branch.

6.2 Creating a Branch

To create a new branch called fit, we run:

$ git branch fit

We can check that the branch exists by running git branch again:

$ git branch
* master
  fit

Our branch is there, but the asterisk * shows that we are still in the master branch. (By analogy, creating a new directory doesn’t automatically move us into that directory.) As a further check, let’s see what our repository’s status is:

$ git status
On branch master
nothing to commit, working directory clean

To switch to our new branch we can use the checkout command that we first saw in Section 5.8:

$ git checkout fit
$ git branch
  master
* fit

git checkout doesn’t just check out a file from a specific commit: it can also check out the whole repository, i.e., switch it from one saved state to another. We should choose the name to signal the purpose of the branch, just as we choose the names of files and variables to indicate what they are for. We haven’t made any changes since switching to the fit branch, so at this point master and fit are at the same point in the repository’s history. Commands like ls and git log therefore show that the files and history haven’t changed.

Where Are Branches Saved?

Git saves every version of every file in the .git directory that it creates in the project’s root directory. When we switch from one branch to another, it replaces the files we see with their counterparts from the branch we’re switching to. It also rearranges directories as needed so that those files are in the right places.

6.3 What Curve Should We Fit?

Before we make any changes to our new branch, we need to figure out how to fit a line to the word count data. Zipf’s Law says that:

The second most common word in a body of text appears half as often as the most common, the third most common appears a third as often, and so on.

In other words, the frequency of a word (\(f\)) is proportional to its inverse rank (\(r\)), \[ f \propto \frac{1}{r^\alpha} \] with a value of \(\alpha\) close to one. The reason \(\alpha\) must be close to one for Zipf’s Law to hold becomes clear if we include it in a modified version of the earlier definition:

The most frequent word will occur approximately \(2^\alpha\) times as often as the second most frequent word, \(3^\alpha\) times as often as the third most frequent word, and so on.

This mathematical expression for Zipf’s Law is an example of a power law. In general, when two variables \(x\) and \(y\) are related through a power law, so that \[ y = ax^b \] taking logarithms of both sides yields a linear relationship: \[ \log(y) = \log(a) + b\log(x) \]

Hence, plotting the variables on a log-log scale reveals this linear relationship. If Zipf’s Law holds, we should have \[ r = cf^{\frac{-1}{\alpha}} \] where \(c\) is a constant of proportionality. The linear relationship between the log word frequency and log rank is then \[ \log(r) = \log(c) - \frac{1}{\alpha}\log(f) \] This suggests that the points on our log-log plot should fall on a straight line with a slope of \(- \tfrac{1}{\alpha}\) and intercept \(\log(c)\). Our goal is to estimate the value of \(\alpha\); we’ll see later that \(c\) is completely defined.

In order to determine the best method for estimating \(\alpha\) we turn to Moreno-Sánchez, Font-Clos, and Corral (2016), which suggests using a method called maximum likelihood estimation. The likelihood function is the probability of our observed data as a function of the parameters in the statistical model that we assume generated it. We estimate the parameters in the model by choosing them to maximize this likelihood; computationally, it is often easier to minimize the negative log likelihood function. Moreno-Sánchez, Font-Clos, and Corral (2016) define the likelihood using a parameter \(\beta\), which is related to the \(\alpha\) parameter in our definition of Zipf’s Law through \(\alpha = \tfrac{1}{\beta-1}\). Under their model, the value of \(c\) is the total number of unique words, or equivalently the largest value of the rank.

Expressed as a Python function, the negative log likelihood function is:

import numpy as np

def nlog_likelihood(beta, counts):
    """Log-likelihood function."""
    likelihood = - np.sum(np.log((1/counts)**(beta - 1) - (1/(counts + 1))**(beta - 1)))
    return likelihood

Obtaining an estimate of \(\beta\) (and thus \(\alpha\)) then becomes a numerical optimization problem, for which we can use the scipy.optimize library. Again following Moreno-Sánchez, Font-Clos, and Corral (2016), we use Brent’s Method with \(1 < \beta \leq 4\).

from scipy.optimize import minimize_scalar

def get_power_law_params(word_counts):
    """Get the power law parameters."""
    mle = minimize_scalar(nlog_likelihood, bracket=(1 + 1e-10, 4),
                          args=word_counts, method='brent')
    beta = mle.x
    alpha = 1 / (beta - 1)
    return alpha

We can then plot the fitted curve on the plot axes (ax) defined in the plotcounts.py script,

def plot_fit(curve_xmin, curve_xmax, max_rank, alpha, ax):
    """
    Plot the power law curve that was fitted to the data.

    Parameters
    ----------
    curve_xmin : float
        Minimum x-bound for fitted curve
    curve_xmax : float
        Maximum x-bound for fitted curve
    max_rank : int
        Maximum word frequency rank.
    alpha : float
        Estimated alpha parameter for the power law.
    ax : matplotlib axes
        Scatter plot to which the power curve will be added.
    """
    xvals = np.arange(curve_xmin, curve_xmax)
    yvals = max_rank * (xvals**(-1 / alpha))
    ax.loglog(xvals, yvals, color='grey')

where the maximum word frequency rank corresponds to \(c\), and \(-1 / \alpha\) the exponent in the power law.

6.4 Verifying Zipf’s Law

Now that we have defined the functions required to fit a curve to our word count plots, we can update plotcounts.py so the entire script reads as follows:

"""Plot word counts."""

import argparse
import numpy as np
import pandas as pd
from scipy.optimize import minimize_scalar


def nlog_likelihood(beta, counts):
    """Log-likelihood function."""
    likelihood = - np.sum(np.log((1/counts)**(beta - 1) - (1/(counts + 1))**(beta - 1)))
    return likelihood


def get_power_law_params(word_counts):
    """Get the power law parameters."""
    mle = minimize_scalar(nlog_likelihood, bracket=(1 + 1e-10, 4),
                          args=word_counts, method='brent')
    beta = mle.x
    alpha = 1 / (beta - 1)
    return alpha


def set_plot_params(param_file):
    """Set the matplotlib rc parameters."""
    if param_file:
        with open(param_file, 'r') as reader:
            param_dict = yaml.load(reader, Loader=yaml.BaseLoader)
    else:
        param_dict = {}
    for param, value in param_dict.items():
        mpl.rcParams[param] = value


def plot_fit(curve_xmin, curve_xmax, max_rank, alpha, ax):
    """
    Plot the power law curve that was fitted to the data.

    Parameters
    ----------
    curve_xmin : float
        Minimum x-bound for fitted curve
    curve_xmax : float
        Maximum x-bound for fitted curve
    max_rank : int
        Maximum word frequency rank.
    alpha : float
        Estimated alpha parameter for the power law.
    ax : matplotlib axes
        Scatter plot to which the power curve will be added.
    """
    xvals = np.arange(curve_xmin, curve_xmax)
    yvals = max_rank * (xvals**(-1 / alpha))
    ax.loglog(xvals, yvals, color='grey')


def main(args):
    """Run the command line program."""
    df = pd.read_csv(args.infile, header=None, names=('word', 'word_frequency'))
    df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
    ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,
                         figsize=[12, 6], grid=True, xlim=args.xlim)

    alpha = get_power_law_params(df['word_frequency'].to_numpy())
    print('alpha:', alpha)
    # Since the ranks are already sorted, we can take the last one instead of
    # computing which row has the highest rank
    max_rank = df['rank'].to_numpy()[-1]
    # Use the range of the data as the boundaries when drawing the power law curve
    curve_xmin = df['word_frequency'].min()
    curve_xmax = df['word_frequency'].max()

    plot_fit(curve_xmin, curve_xmax, max_rank, alpha, ax)
    ax.figure.savefig(args.outfile)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
                        default='-', help='Word count csv file name')
    parser.add_argument('--outfile', type=str, default='plotcounts.png',
                        help='Output image file name')
    parser.add_argument('--xlim', type=float, nargs=2, metavar=('XMIN', 'XMAX'),
                        default=None, help='X-axis limits')
    args = parser.parse_args()
    main(args)

We can then run the script to obtain the \(\alpha\) value for Dracula and a new plot with a line fitted.

python bin/plotcounts.py results/dracula.csv --outfile results/dracula.png
alpha: 1.0866646252515038

So according to our fit, the most frequent word will occur approximately \(2^{1.1}=2.1\) times as often as the second most frequent word, \(3^{1.1}=3.3\) times as often as the third most frequent word, and so on. Figure 6.1 shows the plot.

Word frequency distribution for the book *Dracula*

Figure 6.1: Word frequency distribution for the book Dracula

The script appears to be working as we’d like, so we can go ahead and commit our changes to the fit development branch:

$ git add bin/plotcounts.py results/dracula.png
$ git commit -m "Added fit to word count data"
[fit 3ff8195] Added fit to word count data
 2 files changed, 62 insertions(+), 1 deletion(-)

If we look at the last couple of commits using git log, we see our most recent change:

$ git log --oneline -n 2
3ff8195 Added fit to word count data
ffb5cee adding ignore file

(We use --oneline and -n 2 to shorten the log display.) But if we switch back to the master branch:

$ git checkout master
$ git branch
* master
  fit

and look at the log, our change is not there:

$ git log --oneline -n 2
ffb5cee adding ignore file
d77bc5c Update dracula plot

We have not lost our work: it just isn’t included in this branch. We can prove this by switching back to the fit branch and checking the log again:

$ git checkout fit
$ git log --oneline -n 2
3ff8195 Added fit to word count data
d77bc5c Update dracula plot

We can also look inside plotcounts.py and see our changes. If we make another change and commit it, that change will also go into the fit branch. For instance, we could add some additional information to one of our docstrings to make it clear what equations were used in estimating \(\alpha\).

def get_power_law_params(word_counts):
    """
    Get the power law parameters.

    References
    ----------
    Moreno-Sanchez et al (2016) define alpha (Eq. 1),
      beta (Eq. 2) and the maximum likelihood estimation (mle)
      of beta (Eq. 6).

    Moreno-Sanchez I, Font-Clos F, Corral A (2016)
      Large-Scale Analysis of Zipf's Law in English Texts.
      PLoS ONE 11(1): e0147073.
      https://doi.org/10.1371/journal.pone.0147073
    """
    mle = minimize_scalar(nlog_likelihood, bracket=(1 + 1e-10, 4),
                          args=word_counts, method='brent')
    beta = mle.x
    alpha = 1 / (beta - 1)
    return alpha
$ git add bin/plotcounts.py
$ git commit -m "Adding Moreno-Sanchez et al (2016) reference"
[fit db1d03f] Adding Moreno-Sanchez et al (2016) reference
 1 file changed, 14 insertions(+), 1 deletion(-)

Finally, if we want to see the differences between two branches, we can use git diff with the same double-dot .. syntax used to view differences between two revisions:

$ git diff master..fit
diff --git a/bin/plotcounts.py b/bin/plotcounts.py
index 4501eaf..d57fd63 100644
--- a/bin/plotcounts.py
+++ b/bin/plotcounts.py
@@ -1,15 +1,89 @@
 """Plot word counts."""
 import argparse
+import numpy as np
 import pandas as pd
+from scipy.optimize import minimize_scalar
+
+
+def nlog_likelihood(beta, counts):
+    """Log-likelihood function."""
+    likelihood = - np.sum(np.log((1/counts)**(beta - 1) - (1/(counts + 1))**(beta - 1)))
+    return likelihood
+
+
+def get_power_law_params(word_counts):
+    """
+    Get the power law parameters.
+
+    References
+    ----------
+    Moreno-Sanchez et al (2016) define alpha (Eq. 1),
+      beta (Eq. 2) and the maximum likelihood estimation (mle)
+      of beta (Eq. 6).
+
+    Moreno-Sanchez I, Font-Clos F, Corral A (2016)
+      Large-Scale Analysis of Zipf's Law in English Texts.
+      PLoS ONE 11(1): e0147073.
+      https://doi.org/10.1371/journal.pone.0147073
+    """
+    mle = minimize_scalar(nlog_likelihood, bracket=(1 + 1e-10, 4),
+                          args=word_counts, method='brent')
+    beta = mle.x
+    alpha = 1 / (beta - 1)
+    return alpha
+
+
+def set_plot_params(param_file):
+    """Set the matplotlib rc parameters."""
+    if param_file:
+        with open(param_file, 'r') as reader:
+            param_dict = yaml.load(reader, Loader=yaml.BaseLoader)
+    else:
+        param_dict = {}
+    for param, value in param_dict.items():
+        mpl.rcParams[param] = value
+
+
+def plot_fit(curve_xmin, curve_xmax, max_rank, alpha, ax):
+    """
+    Plot the power law curve that was fitted to the data.
+
+    Parameters
+    ----------
+    curve_xmin : float
+        Minimum x-bound for fitted curve
+    curve_xmax : float
+        Maximum x-bound for fitted curve
+    max_rank : int
+        Maximum word frequency rank.
+    alpha : float
+        Estimated alpha parameter for the power law.
+    ax : matplotlib axes
+        Scatter plot to which the power curve will be added.
+    """
+    xvals = np.arange(curve_xmin, curve_xmax)
+    yvals = max_rank * (xvals**(-1 / alpha))
+    ax.loglog(xvals, yvals, color='grey')


 def main(args):
+    """Run the command line program.""" 
     df = pd.read_csv(args.infile, header=None, names=('word', 'word_frequency'))
     df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
-    df['inverse_rank'] = 1 / df['rank']
     ax = df.plot.scatter(x='word_frequency', y='rank', loglog=True,
                          figsize=[12, 6], grid=True, xlim=args.xlim)
+
+    alpha = get_power_law_params(df['word_frequency'].to_numpy())
+    print('alpha:', alpha)
+    # Since the ranks are already sorted, we can take the last one instead of
+    # computing which row has the highest rank
+    max_rank = df['rank'].to_numpy()[-1]
+    # Use the range of the data as the boundaries when drawing the power law curve
+    curve_xmin = df['word_frequency'].min()
+    curve_xmax = df['word_frequency'].max()
+
+    plot_fit(curve_xmin, curve_xmax, max_rank, alpha, ax)
     ax.figure.savefig(args.outfile)


diff --git a/results/dracula.png b/results/dracula.png
index d34c46d..1c42702 100644
Binary files a/results/dracula.png and b/results/dracula.png differ

Why Branch?

Why go to all this trouble? Imagine we are in the middle of debugging a change like this when we are asked to make final revisions to a paper that was created using the old code. If we revert plotcount.py to its previous state we might lose our changes. If instead we have been doing the work on a branch, we can switch branches, create the plot, and switch back in complete safety.

6.5 Merging

Now that we are happy with our curve fitting we have three options:

  1. Add our changes to plotcounts.py once again in the master branch.
  2. Stop working in master and start using the fit branch for future development.
  3. Merge the fit and master branches.

The first option is tedious and error-prone; the second will lead to a bewildering proliferation of branches, but the third option is simple, fast, and reliable. To start, let’s make sure we’re in the master branch:

$ git checkout master
$ git branch
  fit
* master

We can now merge the changes in the fit branch into our current branch with a single command:

$ git merge fit
Updating d77bc5c..db1d03f
Fast-forward
 bin/plotcounts.py   |  76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 results/dracula.png | Bin 22351 -> 31378 bytes
 2 files changed, 75 insertions(+), 1 deletion(-)

Merging doesn’t change the source branch fit, but once the merge is done, all of the changes made in fit are also in the history of master:

$ git log --oneline -n 4
db1d03f (HEAD -> master, fit) Adding Moreno-Sanchez et al (2016) reference
3ff8195 Added fit to word count data
ffb5cee adding ignore file
d77bc5c Update dracula plot

Note that Git automatically creates a new commit (in this case, db1d03f) to represent the merge. If we now run git diff master..fit, Git doesn’t print anything because there aren’t any differences to show.

Now that we have merged all of the changes from fit into master there is no need to keep the fit branch, so we can delete it:

$ git branch -d fit
Deleted branch fit (was db1d03f).

Not Just the Command Line

We have been creating, merging, and deleting branches on the command line, but we can do all of these things using GitKraken, the RStudio IDE, and other GUIs. The operations stay the same; all that changes is how we tell the computer what we want to do.

6.6 Handling Conflicts

A conflict occurs when a line has been changed in different ways in two separate branches or when a file has been deleted in one branch but edited in the other. Merging fit into master went smoothly because there were no conflicts between the two branches, but if we are going to use branches, we must learn how to merge conflicts.

To start, let’s add a README.md file to the master branch containing just the project’s title:

$ nano README.md
$ cat README.md
# Zipf's Law
$ git add README.md
$ git commit -m "Initial commit of README file"
[master b07c14a] Initial commit of README file
 1 file changed, 1 insertion(+)
 create mode 100644 README.md

Now let’s create a new development branch called docs to work on improving the documentation for our code. We will use git checkout -b to create a new branch and switch to it in a single step:

$ git checkout -b docs
Switched to a new branch 'docs'
$ git branch
* docs
  master

On this new branch, let’s add some information to the README file:

# Zipf's Law

These Zipf's Law scripts tally the occurrences of words in text files
and plot each word's rank versus its frequency.
$ git add README.md
$ git commit -m "Added repository overview"
[docs a41a6ea] Added repository overview
 1 file changed, 3 insertions(+)

In order to create a conflict, let’s switch back to the master branch. The changes we made in the docs branch are not present:

$ git checkout master
Switched to branch 'master'
$ cat README.md
# Zipf's Law

Let’s add some information about the contributors to our work:

# Zipf's Law

## Contributors

- Amira Khan <amira@zipf.org>
$ git add README.md
$ git commit -m "Added contributor list"
[master a102c83] Added contributor list
 1 file changed, 4 insertions(+)

We now have two branches, master and docs, in which we have changed README.md in different ways:

$ git diff docs..master
diff --git a/README.md b/README.md
index fd3de28..cf317ea 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,5 @@
 # Zipf's Law

-These Zipf's Law scripts tally the occurrences of words in text files
-and plot each word's rank versus its frequency.
+## Contributors
+
+- Amira Khan <amira@zipf.org>

When we try to merge docs into master, Git doesn’t know which of these changes to keep:

$ git merge docs master
Auto-merging README.md
CONFLICT (content): Merge conflict in README.md
Automatic merge failed; fix conflicts and then commit the result.

If we look in README.md, we see that Git has kept both sets of changes, but has marked which came from where:

$ cat README.md
# Zipf's Law

<<<<<<< HEAD
## Contributors

- Amira Khan <amira@zipf.org>
=======
These Zipf's Law scripts tally the occurrences of words in text files
and plot each word's rank versus its frequency.
>>>>>>> docs

The lines from <<<<<<< HEAD to ======= are what was in master, while the lines from there to >>>>>>> docs show what was in docs. If there were several conflicting regions in the same file, Git would mark each one this way.

We have to decide what to do next: keep the master changes, keep those from docs, edit this part of the file to combine them, or write something new. Whatever we do, we must remove the >>>, ===, and <<< markers. Let’s combine the two sets of changes:

$ nano README.md
$ cat README.md
# Zipf's Law

These Zipf's Law scripts tally the occurrences of words in text files
and plot each word's rank versus its frequency.

## Contributors

- Amira Khan <amira@zipf.org>

We can now add the file and commit the change, just as we would after any other edit:

$ git add README.md
$ git commit -m "Merging README additions"
[master 4ffeaa4] Merging README additions

Our branch’s history now shows a single sequence of commits, with the master changes on top of the earlier docs changes:

$ git log --oneline -n 4
4ffeaa4 (HEAD -> master) Merging README additions
a102c83 Added contributors list
a41a6ea (docs) Added repository overview
b07c14a Initial commit of README file

If we want to see what really happened, we can add the --graph option to git log:

$ git log --oneline --graph -n 4
*   4ffeaa4 (HEAD -> master) Merging README additions
|\  
| * a41a6ea (docs) Added repository overview
* | a102c83 Added contributors list
|/  
* b07c14a Initial commit of README file

At this point we can delete the docs branch:

$ git branch -d docs
Deleted branch docs (was 4ffeaa4).

Alternatively, we can keep using docs for documentation updates. Each time we switch to it, we merge changes from master into docs, do our editing (while switching back to master or other branches as needed to work on the code), and then merge from docs to master once the documentation is updated.

6.7 A Branch-Based Workflow

Now that we’re familiar with the core concepts and commands for branching, we need to consider how best to incorporate them into our regular coding practice. If we are working on our own computer, this workflow will help us keep track of what we are doing:

  1. git checkout master to make sure we are in the master branch.

  2. git checkout -b name-of-feature to create a new branch. We always create a branch when making changes, since we never know what else might come up. The branch name should be as descriptive as a variable name or filename would be.

  3. Make our changes. If something occurs to us along the way—for example, if we are writing a new function and realize that the documentation for some other function should be updated—we do not do that work in this branch just because we happen to be there. Instead, we commit our changes, switch back to master, and create a new branch for the other work.

  4. When the new feature is complete, we git merge master name-of-feature to get any changes we merged into master after creating name-of-feature and resolve any conflicts. This is an important step: we want to do the merge and test that everything still works in our feature branch, not in master.

  5. Finally, we switch back to master and git merge name-of-feature master to merge our changes into master. We should not have any conflicts, and all of our tests should pass.

Most experienced developers use this branch-per-feature workflow, but what exactly is a “feature”? These rules make sense for small projects:

  1. Anything cosmetic that is only one or two lines long can be done in master and committed right away. Here, “cosmetic” means changes to comments or documentation: nothing that affects how code runs, not even a simple variable renaming.

  2. A pure addition that doesn’t change anything else is a feature and goes into a branch. For example, if we run a new analysis and save the results, that should be done on its own branch because it might take several tries to get the analysis to run, and we might interrupt ourselves to fix things that we discover aren’t working.

  3. Every change to code that someone might want to undo later in one step is a feature. For example, if a new parameter is added to a function, then every call to the function has to be updated. Since neither alteration makes sense without the other, those changes are considered a single feature and should be done in one branch.

The hardest thing about using a branch-per-feature workflow is sticking to it for small changes. As the first point in the list above suggests, most people are pragmatic about this on small projects; on large ones, where dozens of people might be committing, even the smallest and most innocuous change needs to be in its own branch so that it can be reviewed (which we discuss below).

6.8 Using Other People’s Work

So far we have used Git to manage individual work, but it really comes into its own when we are working with other people. We can do this in two ways:

  1. Everyone has read and write access to a single shared repository.

  2. Everyone can read from the project’s main repository, but only a few people can commit changes to it. The project’s other contributors fork the main repository to create one that they own, do their work in that, and then submit their changes to the main repository.

The first approach works well for teams of up to half a dozen people who are all comfortable using Git, but if the project is larger, or if contributors are worried that they might make a mess in the master branch, the second approach is safer.

Git itself doesn’t have any notion of a “main repository”, but forges like GitHub, GitLab, and BitBucket all encourage people to use Git in ways that effectively create one. Suppose, for example, that Sami wants to contribute to the Zipf’s Law code that Amira is hosting on GitHub at https://github.com/amira-khan/zipf. Sami can go to that URL and click on the “Fork” button in the upper right corner (Figure 6.2). GitHub immediately creates a copy of Amira’s repository within Sami’s account on GitHub’s own servers.

Forking

Figure 6.2: Forking

When the command completes, the setup on GitHub now looks like Figure 6.3. Nothing has happened yet on Sami’s own machine: the new repository exists only on GitHub. When Sami explores its history, they see that it contains all of the changes Amira made.

After Forking

Figure 6.3: After Forking

A copy of a repository is called a clone. In order to start working on the project, Sami needs a clone of their repository (not Amira’s) on their own computer. We will modify Sami’s prompt to include their desktop user ID (sami) and working directory (initially ~) to make it easier to follow what’s happening:

sami:~ $ git clone https://github.com/sami-virtanen/zipf.git
Cloning into 'zipf'...
remote: Enumerating objects: 60, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 60 (delta 17), reused 59 (delta 16), pack-reused 0
Unpacking objects: 100% (60/60), done.

This command creates a new directory with the same name as the project, i.e., zipf. When Sami goes into this directory and runs ls and git log, they see that all of the project’s files and history are there:

sami:~ $ cd zipf
sami:~/zipf $ ls
README.md       bin             data             results
sami:~/zipf $ git log --oneline -n 4
4ffeaa4 (HEAD -> master) Merging README additions
a102c83 Added contributors list
a41a6ea Added repository overview
b07c14a Initial commit of README file

Sami also sees that Git has automatically created a remote for their repository that points back at their repository on GitHub:

sami:~/zipf $ git remote -v
origin  https://github.com/sami-virtanen/zipf.git (fetch)
origin  https://github.com/sami-virtanen/zipf.git (push)

Sami can pull changes from their fork and push work back there, but needs to do one more thing before getting the changes from Amira’s repository:

sami:~/zipf $ git remote add upstream https://github.com/amira-khan/zipf.git
sami:~/zipf $ git remote -v
origin      https://github.com/sami-virtanen/zipf.git (fetch)
origin      https://github.com/sami-virtanen/zipf.git (push)
upstream    https://github.com/amira-khan/zipf.git (fetch)
upstream    https://github.com/amira-khan/zipf.git (push)

Sami has called their new remote upstream because it points at the repository theirs is derived from. They could use any name, but upstream is a nearly universal convention.

With this remote in place, Sami is finally set up. Suppose, for example, that Amira has modified the project’s README.md file to add Sami as a contributor. (Again, we show Amira’s user ID and working directory in her prompt to make it clear who’s doing what).

amira:~/zipf $ pwd
/Users/amira/zipf
amira:~/zipf $ nano README.md
amira:~/zipf $ cat README.md
# Zipf's Law

These Zipf's Law scripts tally the occurrences of words in text files
and plot each word's rank versus its frequency.

## Contributors

- Amira Khan <amira@zipf.org>
- Sami Virtanen

Amira commits her changes and pushes them to her repository on GitHub:

amira:~/zipf $ git commit -a -m "Adding Sami as a contributor"
[master 766c2cd] Adding Sami as a contributor
 1 file changed, 1 insertion(+)
amira:~/zipf $ git push origin master
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 315 bytes | 315.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/amira-khan/zipf.git
   b0c3fc6..766c2cd  master -> master

Amira’s changes are now on her desktop and in her GitHub repository but not in either of Sami’s repositories (local or remote). Since Sami has created a remote that points at Amira’s GitHub repository, though, they can easily pull those changes to their desktop:

sami:~/zipf $ git pull upstream master
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 2), reused 3 (delta 2), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/amira-khan/zipf
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> upstream/master
Updating 1c63b8b..893fc9f
Fast-forward
 README.md | 1 +
 1 file changed, 1 insertion(+)

Pulling from a repository owned by someone else is no different than pulling from a repository we own. In either case, Git merges the changes and asks us to resolve any conflicts that arise. The only significant difference is that, as with git push and git pull, we have to specify both a remote and a branch: in this case, upstream and master.

6.9 Pull Requests

Sami can now get Amira’s work, but how can Amira get Sami’s? She could create a remote that pointed at Sami’s repository on GitHub and periodically pull in Sami’s changes, but that would lead to chaos, since we could never be sure that everyone’s work was in any one place at the same time. Instead, almost everyone uses pull requests. They aren’t part of Git itself, but are supported by all major online forges.

A pull request is essentially a note saying, “Someone would like to merge branch A of repository B into branch X of repository Y”. The pull request does not contain the changes, but instead points at two particular branches. That way, the difference displayed is always up to date if either branch changes.

But a pull request can store more than just the source and destination branches: it can also store comments people have made about the proposed merge. Users can comment on the pull request as a whole, or on particular lines, and mark comments as out of date if the author of the pull request updates the code that the comment is attached to. Complex changes can go through several rounds of review and revision before being merged, which makes pull requests the review system we all wish journals actually had.

To see this in action, suppose Sami wants to add their email address to README.md. They create a new branch and switch to it:

sami:~/zipf $ git checkout -b adding-email
Switched to a new branch 'adding-email'

then make a change and commit it:

sami:~/zipf $ nano README.md
sami:~/zipf $ git commit -a -m "Adding my email address"
[master b8938eb] Adding my email address
 1 file changed, 1 insertion(+), 1 deletion(-)
sami:~/zipf $ git diff -r HEAD~1
diff --git a/README.md b/README.md
index a55a9bb..eb24a3f 100644
--- a/README.md
+++ b/README.md
@@ -6,4 +6,4 @@ and plot each word's rank versus its frequency.
 ## Contributors
 
 - Amira Khan <amira@zipf.org>
-- Sami Virtanen
+- Sami Virtanen <sami@zipf.org>

Sami’s changes are only in their local repository. They cannot create a pull request until those changes are on GitHub, so they push their new branch to their repository on GitHub:

sami:~/zipf $ git push origin adding-email
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 315 bytes | 315.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
remote: 
remote: Create a pull request for 'adding-email' on GitHub by visiting:
remote:      https://github.com/sami-virtanen/zipf/pull/new/adding-email
remote: 
To https://github.com/sami-virtanen/zipf.git
 * [new branch]      adding-email -> adding-email

When Sami goes to their GitHub repository in the browser, GitHub notices that they have just pushed a new branch and asks them if they want to create a pull request (Figure 6.4).

After Sami Pushes

Figure 6.4: After Sami Pushes

When Sami clicks on the button, GitHub displays a page showing the default source and destination of the pull request and a pair of editable boxes for the pull request’s title and a longer comment (Figure 6.5).

Starting Pull Request

Figure 6.5: Starting Pull Request

If they scroll down, Sami can see a summary of the changes that will be in the pull request (Figure 6.6).

Summary of Pull Request

Figure 6.6: Summary of Pull Request

The top (title) box is autofilled with the previous commit message, so Sami adds an extended explanation to provide additional context before clicking on “Create Pull Request” (Figure 6.7). When they do, GitHub displays a page showing the new pull request, which has a unique serial number (Figure 6.8). Note that this pull request is displayed in Amira’s repository rather than Sami’s since it is Amira’s repository that will be affected if the pull request is merged.

Filling In Pull Request

Figure 6.7: Filling In Pull Request

New Pull Request

Figure 6.8: New Pull Request

Some time later, Amira checks her repository and sees that there is a pull request (Figure 6.9). Clicking on the “Pull requests” tab brings up a list of PRs (Figure 6.10) and clicking on the pull request link itself displays its details (Figure 6.11), Which appears identical to the details seen by .

Viewing Pull Request

Figure 6.9: Viewing Pull Request

Listing Pull Requests

Figure 6.10: Listing Pull Requests

Pull Request Details

Figure 6.11: Pull Request Details

Since there are no conflicts, GitHub will let Amira merge the PR immediately using the “Merge pull request” button. She could also discard or reject it without merging using the “Close pull request” button. Instead, she clicks on the “Files changed” tab to see what Sami has changed (Figure 6.12).

Files Changed

Figure 6.12: Files Changed

If she moves her mouse over particular lines,) a white-on-blue cross appears near the numbers to indicate that she can add comments (Figure 6.13). She clicks on the marker beside her own name and writes a comment: She only wants to make one comment rather than write a lengthier multi-comment review, so she chooses “Add single comment” (Figure 6.14). GitHub redisplays the page with her remarks inserted (Figure 6.15).

Comment Marker

Figure 6.13: Comment Marker

Writing Comment

Figure 6.14: Writing Comment

Pull Request With Comment

Figure 6.15: Pull Request With Comment

While Amira is working, GitHub has been emailing notifications to both Sami and Amira. When Sami clicks on the link in their email notification, it takes them to the PR and shows Amira’s comment. Sami changes README.md, commits, and pushes, but does not create a new pull request or do anything to the existing one. As explained above, a PR is a note asking that two branches be merged, so if either end of the merge changes, the PR updates automatically.

Sure enough, when Amira looks at the PR again a few moments later she sees Sami’s changes (Figure 6.16). Satisfied, she goes back to the “Conversation” tab and clicks on “Merge”. The icon at the top of the PR’s page changes text and color to show that the merge was successful (Figure 6.17).

Pull Request With Fix

Figure 6.16: Pull Request With Fix

Successful Merge

Figure 6.17: Successful Merge

To get those changes from GitHub to her desktop repository, Amira uses git pull:

amira:~/zipf $ git pull origin master
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 7 (delta 4), reused 4 (delta 2), pack-reused 0
Unpacking objects: 100% (7/7), done.
From https://github.com/amira-khan/zipf
   893fc9f..4bf4b68  master     -> origin/master
Updating 893fc9f..4bf4b68
Fast-forward
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

To get the change they just made from their adding-email branch into their master branch, Sami could use git merge on the command line. It’s a little clearer, though, if they also use git pull from their upstream repository (i.e., Amira’s repository) so that they’re sure to get any other changes that Amira may have merged:

sami:~/zipf $ git checkout master
Switched to branch 'master'
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)
sami:~/zipf $ git pull upstream master
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 4 (delta 2), reused 3 (delta 2), pack-reused 0
Unpacking objects: 100% (4/4), done.
From https://github.com/amira-khan/zipf
 * branch            master     -> FETCH_HEAD
   893fc9f..4bf4b68  master     -> upstream/master
Updating 893fc9f..4bf4b68
Fast-forward
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

All four repositories are now synchronized.

6.10 Handling Conflicts in Pull Requests

Finally, suppose that Amira and Sami have decided to collaborate more extensively on this project. Amira has added Sami as a collaborator to the GitHub repository. Now Sami can make contributions directly to the repository, rather than via a pull request from a forked repository.

Sami makes a change to README.md in the master branch on GitHub. Meanwhile, Amira is making a conflicting change to the same file in a different branch. When Amira creates her pull request, GitHub will detect the conflict and report that the PR cannot be merged automatically (Figure 6.18).

Conflict in a Pull Request

Figure 6.18: Conflict in a Pull Request

Amira can solve this problem with the tools she already has. If she has made her changes in a branch called editing-readme, the steps are:

  1. Pull Sami’s on the master branch of the GitHub repository into the master branch of her desktop repository.

  2. Merge from the master branch of her desktop repository to the editing-readme branch in the same repository.

  3. Push her updated editing-readme branch to her repository on GitHub. The pull request from there back to the master branch of the main repository will update automatically.

GitHub and other forges do allow people to merge conflicts through their browser-based interfaces, but doing it on our desktop means we can use our favorite editor to resolve the conflict. It also means that if the change affects the project’s code, we can run everything to make sure it still works.

But what if Sami or someone else merges another change while Amira is resolving this one, so that by the time she pushes to her repository there is another, different, conflict? In theory this cycle could go on forever; in practice, it reveals a communication problem that Amira (or someone) needs to address. If two or more people are constantly making incompatible changes to the same files, they should discuss who’s supposed to be doing what, or rearrange the project’s contents so that they aren’t stepping on each other’s toes.

6.11 Summary

Branches and pull requests seem complicated at first, but they quickly become second nature. Everyone involved in the project can work at their own pace on what they want to, picking up others’ changes and submitting their own whenever they want. More importantly, this workflow gives everyone has a chance to review each other’s work. As we discuss in Section K.5, doing reviews doesn’t just prevent errors from creeping in: it is also an effective way to spread understanding and skills.

6.12 Exercises

6.12.1 Explaining options

  1. What do the --oneline and -n options for git log do?
  2. What other options does git log have that you would find useful?

6.12.2 Modifying prompt

Modify your shell prompt so that it shows the branch you are on when you are in a repository.

6.12.3 Ignoring files

GitHub maintains a collection of .gitignore files for projects of various kinds. Look at the sample .gitignore file for Python: how many of the ignored files do you recognize? Where could you look for more information about them?

6.12.4 Creating the same file twice

Create a branch called same. In it, create a file called same.txt that contains your name and the date.

Switch back to master. Check that same.txt does not exist, then create the same file with exactly the same contents.

  1. What will git diff master..same show? (Try to answer the question before running the command.)

  2. What will git merge same master do? (Try to answer the question before running the command.)

6.12.5 Deleting a branch without merging

Create a branch called experiment. In it, create a file called experiment.txt that contains your name and the date, then switch back to master.

  1. What happens when you try to delete the experiment branch using git branch -d experiment? Why?

  2. What option can you give Git to delete the experiment branch? Why should you be very careful using it?

  3. What do you think will happen if you try to delete the branch you are currently on using this flag?

6.12.6 Tracing changes

Chartreuse and Fuchsia are collaborating on a project. Describe what is in each of the four repositories involved after each of the steps below.

  1. Chartreuse creates a repository containing a README.md file on GitHub and clones it to their desktop.
  2. Fuchsia forks that repository on GitHub and clones their copy to their desktop.
  3. Fuchsia adds a file fuchsia.txt to the master branch of their desktop repository and pushes that change to their repository on GitHub.
  4. Fuchsia creates a pull request from the master branch of their repository on GitHub to the master branch of Chartreuse’s repository on GitHub.
  5. Chartreuse does not merge Fuchsia’s PR. Instead, they add a file chartreuse.txt to the master branch of their desktop repository and push that change to their repository on GitHub.
  6. Fuchsia adds a remote to their desktop repository called upstream that points at Chartreuse’s repository on GitHub and runs git pull upstream master, then merges any changes or conflicts.
  7. Fuchsia pushes from the master branch of their desktop repository to the master branch of their GitHub repository.
  8. Chartreuse merges Fuchsia’s pull request.
  9. Chartreuse runs git pull origin master on the desktop.

6.13 Key Points

  • Use a branch-per-feature workflow to develop new features while leaving the master branch in working order.
  • git branch creates a new branch.
  • git checkout switches between branches.
  • git merge merges changes from another branch into the current branch.
  • Conflicts occur when files or parts of files are changed in different ways on different branches.
  • Version control systems do not allow people to overwrite changes silently; instead, they highlight conflicts that need to be resolved.
  • Forking a repository makes a copy of it on a server.
  • Cloning a repository with git clone creates a local copy of a remote repository.
  • Create a remote called upstream to point to the repository a fork was derived from.
  • Create pull requests to submit changes from your fork to the upstream repository.