H Solutions

Chapter 2

Exercise 2.16.1

The -l option makes ls use a long listing format, showing not only the file/directory names but also additional information, such as the file size and the time of its last modification. Adding the -h option to -l makes the file size “human readable”, i.e. displaying something like 5.3K instead of 5369.
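
To see the difference (illustrative output; names, sizes, and dates will differ on your machine):

$ ls -l data.txt
-rw-r--r-- 1 amanda staff 5369 Jan 16 09:00 data.txt
$ ls -lh data.txt
-rw-r--r-- 1 amanda staff 5.3K Jan 16 09:00 data.txt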

Exercise 2.16.2

The files/directories in each directory are sorted by time of last change.

Exercise 2.16.3

  1. No: . stands for the current directory.
  2. No: / stands for the root directory.
  3. No: Amanda’s home directory is /Users/amanda.
  4. No: this goes up two levels, i.e. ends in /Users.
  5. Yes: ~ stands for the user’s home directory, in this case /Users/amanda.
  6. No: this would navigate into a directory home in the current directory if it exists.
  7. Yes: unnecessarily complicated, but correct.
  8. Yes: shortcut to go back to the user’s home directory.
  9. Yes: goes up one level.

Exercise 2.16.4

  1. No: there is a directory backup in /Users.
  2. No: this is the content of /Users/thing/backup, but with .. we asked for one level further up.
  3. No: see previous explanation.
  4. Yes: ../backup/ refers to /Users/backup/.

Exercise 2.16.5

  1. No: pwd is not the name of a directory.
  2. Yes: ls without directory argument lists files and directories in the current directory.
  3. Yes: uses the absolute path explicitly.

Exercise 2.16.6

  1. The touch command updates a file’s timestamp. If no file exists with the given name, touch will create one. You can observe this newly generated file by typing ls at the command line prompt. my_file.txt can also be viewed in your GUI file explorer.

  2. When you inspect the file with ls -l, note that the size of my_file.txt is 0 bytes. In other words, it contains no data. If you open my_file.txt using your text editor it is blank.

  3. Some programs do not generate output files themselves, but instead require that empty files have already been generated. When the program is run, it searches for an existing file to populate with its output. The touch command allows you to efficiently generate a blank text file to be used by such programs.

Exercise 2.16.7

$ rm -i thesis_backup/quotations.txt
rm: remove regular file 'thesis_backup/quotations.txt'? y

The -i option will prompt before (every) removal (use y to confirm deletion or n to keep the file). The Unix shell doesn’t have a trash bin, so all the files removed will disappear forever. By using the -i option, we have the chance to check that we are deleting only the files that we want to remove.

Exercise 2.16.8

$ mv ../analyzed/sucrose.dat ../analyzed/maltose.dat .

Recall that .. refers to the parent directory (i.e. one above the current directory) and that . refers to the current directory.

Exercise 2.16.9

  1. No. While this would create a file with the correct name, the incorrectly named file still exists in the directory and would need to be deleted.
  2. Yes, this would work to rename the file.
  3. No, the period (.) indicates where to move the file, but does not provide a new file name; identical file names cannot be created.
  4. No, the period (.) indicates where to copy the file, but does not provide a new file name; identical file names cannot be created.

Exercise 2.16.10

We start in the /Users/jamie/data directory, and create a new folder called recombine. The second line moves (mv) the file proteins.dat to the new folder (recombine). The third line makes a copy of the file we just moved. The tricky part here is where the file was copied to. Recall that .. means “go up a level”, so the copied file is now in /Users/jamie. Notice that .. is interpreted with respect to the current working directory, not with respect to the location of the file being copied. So, the only thing that will show using ls (in /Users/jamie/data) is the recombine folder.
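
Reconstructed from the explanation, the three commands were likely:

$ mkdir recombine
$ mv proteins.dat recombine/
$ cp recombine/proteins.dat ../proteins-saved.dat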

  1. No, see explanation above. proteins-saved.dat is located at /Users/jamie.
  2. Yes.
  3. No, see explanation above. proteins.dat is located at /Users/jamie/data/recombine.
  4. No, see explanation above. proteins-saved.dat is located at /Users/jamie.

Exercise 2.16.11

If given more than one file name followed by a directory name (i.e. the destination directory must be the last argument), cp copies the files to the named directory.

If given three file names, cp throws an error such as the one below, because it is expecting a directory name as the last argument.

cp: target 'morse.txt' is not a directory

Exercise 2.16.12

The solution is 3.

1. shows all files whose names contain zero or more characters (*) followed by the letter t, then zero or more characters (*) followed by ane.pdb. This gives ethane.pdb methane.pdb octane.pdb pentane.pdb.

2. shows all files whose names start with zero or more characters (*) followed by the letter t, then a single character (?), then ne. followed by zero or more characters (*). This will give us octane.pdb and pentane.pdb but doesn’t match anything which ends in thane.pdb.

3. fixes the problems of option 2 by matching two characters (??) between t and ne. This is the solution.

4. only shows files whose names start with ethane. (note the trailing dot).
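
Assuming option 3 was *t??ne.pdb and the directory holds the standard six .pdb files, running it gives:

$ ls *t??ne.pdb
ethane.pdb  methane.pdb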

Exercise 2.16.13

mv *.dat analyzed

Jamie needs to move her files fructose.dat and sucrose.dat to the analyzed directory. The shell will expand *.dat to match all .dat files in the current directory. The mv command then moves the list of .dat files to the “analyzed” directory.

Exercise 2.16.14

The first two sets of commands achieve this objective. The first set uses relative paths to create the top level directory before the subdirectories.

The third set of commands will give an error because mkdir won’t create a subdirectory of a non-existent directory: the intermediate level folders must be created first.

The final set of commands generates the ‘raw’ and ‘processed’ directories at the same level as the ‘data’ directory.
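
Note that mkdir’s -p option avoids the error in the third set by creating any missing intermediate directories (directory names reconstructed from the exercise):

$ mkdir -p data/raw data/processed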

Exercise 2.16.15

In the first example with >, the string “hello” is written to testfile01.txt, but the file gets overwritten each time we run the command.

We see from the second example that the >> operator also writes “hello” to a file (in this case testfile02.txt), but appends the string to the file if it already exists (i.e. when we run it for the second time).
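
To see the difference, run each command twice:

$ echo hello > testfile01.txt     # testfile01.txt contains a single "hello" after every run
$ echo hello >> testfile02.txt    # testfile02.txt gains one more "hello" each run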

Exercise 2.16.16

Option 3 is correct. For option 1 to be correct we would only run the head command. For option 2 to be correct we would only run the tail command. For option 4 to be correct we would have to pipe the output of head into tail -n 2 by running head -n 3 animals.txt | tail -n 2 > animals-subset.txt.

Exercise 2.16.17

Option 4 is the solution. The pipe character | is used to feed the standard output from one process to the standard input of another. > is used to redirect standard output to a file. Try it in the data-shell/molecules directory!
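
For instance, a pipeline of this shape combines both ideas, feeding wc through sort into head (a sketch, not necessarily the exact option text):

$ wc -l *.pdb | sort -n | head -n 3    # the three shortest .pdb files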

Exercise 2.16.18

$ sort salmon.txt | uniq

Exercise 2.16.19

The head command extracts the first 5 lines from animals.txt. Then, the last 3 lines are extracted from the previous 5 by using the tail command. With the sort -r command those 3 lines are sorted in reverse order and finally, the output is redirected to a file final.txt. The content of this file can be checked by executing cat final.txt. The file should contain the following lines:

2012-11-06,rabbit
2012-11-06,deer
2012-11-05,raccoon
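
Reconstructed from the description above, the full pipeline was likely:

$ head -n 5 animals.txt | tail -n 3 | sort -r > final.txt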

Exercise 2.16.20

cut selects substrings from a line by:

  • breaking the string into pieces wherever it finds a separator (-d ,), which in this case is a comma, and
  • keeping one or more of the resulting fields (-f 2).

Any single character can be used as a separator, but there is no way to escape characters: for example, if the string a,"b,c",d is split on commas, all three commas take effect.
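
For example, using a line in the same format as animals.txt:

$ echo 2012-11-05,deer | cut -d , -f 2
deer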

Exercise 2.16.21

Option 4 is the correct answer. If you have difficulty understanding why, try running the commands or sub-sections of the pipelines (e.g., the code between pipes). Make sure you are in the data-shell/data directory.

Exercise 2.16.22

  1. A solution using two wildcard expressions:

     $ ls *A.txt
     $ ls *B.txt

  2. The output from the new commands is separated because there are two commands.
  3. When there are no files ending in A.txt, or there are no files ending in B.txt.

Exercise 2.16.23

  1. This would remove .txt files with one-character names.
  2. This is the correct answer.
  3. The shell would expand * to match everything in the current directory, so the command would try to remove all matched files and an additional file called .txt.
  4. The shell would expand *.* to match all files with any extension, so this command would delete all files.

Exercise 2.16.24

The second version is the one we want to run. This prints to screen everything enclosed in the quote marks, expanding the loop variable name because we have prefixed it with a dollar sign.

The first version redirects the output from the command echo analyze $file to a file, analyzed-$file. A series of files is generated: analyzed-cubane.pdb, analyzed-ethane.pdb etc.

Try both versions for yourself to see the output! Be sure to open the analyzed-*.pdb files to view their contents.
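
Reconstructed from the explanation (the loop variable and file list are inferred), the two versions were likely:

# First version: redirects into analyzed-*.pdb files
for file in *.pdb
do
    echo analyze $file > analyzed-$file
done

# Second version: prints to the screen
for file in *.pdb
do
    echo "analyze $file"
done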

Exercise 2.16.25

The first code block gives the same output on each iteration through the loop. Bash expands the wildcard *.pdb within the loop body (as well as before the loop starts) to match all files ending in .pdb and then lists them using ls. The expanded loop would look like this:

$ for datafile in cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
> do
>   ls cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
> done
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb
cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb

The second code block lists a different file on each loop iteration. The value of the datafile variable is evaluated using $datafile, and then listed using ls.

cubane.pdb
ethane.pdb
methane.pdb
octane.pdb
pentane.pdb
propane.pdb

Exercise 2.16.26

Part 1

4 is the correct answer. * matches zero or more characters, so any file name starting with the letter c, followed by zero or more other characters will be matched.

Part 2

4 is the correct answer. * matches zero or more characters, so a file name with zero or more characters before a letter c and zero or more characters after the letter c will be matched.
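
In other words, the two matching patterns were likely:

$ ls c*     # Part 1: names that start with c
$ ls *c*    # Part 2: names that contain c anywhere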

Exercise 2.16.27

Part 1

  1. The text from each file in turn gets written to the alkanes.pdb file. However, the file gets overwritten on each loop iteration, so the final content of alkanes.pdb is the text from the propane.pdb file.
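
The loop being described was likely of this shape (reconstructed; Part 2 replaces > with >>):

for datafile in *.pdb
do
    cat $datafile > alkanes.pdb
done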

Part 2

3 is the correct answer. >> appends to a file, rather than overwriting it with the redirected output from a command. Given that the output from the cat command has been redirected, nothing is printed to the screen.

Exercise 2.16.28

If a command causes something to crash or hang, it might be useful to know what that command was, in order to investigate the problem. If the command were only recorded after it finished running, we would have no record of the last command run in the event of a crash.

Exercise 2.16.29

novel-????-[ab]*.{txt,pdf} matches:

  • Files whose names start with the letters novel-,
  • which is then followed by exactly four characters (since each ? matches one character),
  • followed by another literal -,
  • followed by either the letter a or the letter b,
  • followed by zero or more other characters (the *),
  • followed by .txt or .pdf.
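
For example, among these hypothetical names, only the first would match:

novel-1722-a_tale.txt         # matches
novel-1722-the_tale.txt       # does not match: t is neither a nor b
novels-1722-a_tale.txt        # does not match: no - immediately after novel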

Chapter 3

Exercise 3.8.1

$ cd ~/zipf
Change into the zipf directory, which is located in the home directory (designated by ~).

$ for file in $(find . -name "*.bak")
> do
>   rm $file
> done

Find all the files ending in .bak and remove them one by one.

$ rm bin/summarize_all_books.sh
Remove the summarize_all_books.sh script.

$ rm -r results
Recursively remove each file in the results directory and then remove the directory itself. (It is necessary to remove all the files first because you cannot remove a non-empty directory.)

Exercise 3.8.2

The correct answer is 2.

The special variables $1, $2 and $3 represent the command line arguments given to the script, such that the commands run are:

$ head -n 1 cubane.pdb ethane.pdb octane.pdb pentane.pdb propane.pdb
$ tail -n 1 cubane.pdb ethane.pdb octane.pdb pentane.pdb propane.pdb

The shell does not expand '*.pdb' because it is enclosed by quote marks. As such, the first argument to the script is '*.pdb' which gets expanded within the script by head and tail.

Exercise 3.8.3

# Shell script which takes two arguments:
#    1. a directory name
#    2. a file extension
# and prints the name of the file in that directory
# with the most lines which matches the file extension.

wc -l $1/*.$2 | sort -n | tail -n 2 | head -n 1
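
A usage sketch (the script name is hypothetical):

$ bash longest.sh data txt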

Exercise 3.8.4

In each case, the shell expands the wildcard in *.pdb before passing the resulting list of file names as arguments to the script.

Script 1 would print out a list of all files containing a dot in their name. The arguments passed to the script are not actually used anywhere in the script.

Script 2 would print the contents of the first 3 files with a .pdb file extension. $1, $2, and $3 refer to the first, second, and third argument respectively.

Script 3 would print all the arguments to the script (i.e. all the .pdb files), followed by .pdb. $@ refers to all the arguments given to a shell script.

cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb.pdb

Exercise 3.8.5

The correct answer is 3, because the -w option looks only for whole-word matches. The other options will also match “of” when part of another word.
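
A quick illustration of the difference:

$ echo Software is fun | grep of        # matches the "of" inside "Software"
$ echo Software is fun | grep -w of     # no output: "of" never appears as a whole word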

Exercise 3.8.6

# Obtain unique years from multiple comma-delimited lists of titles and publication years
# Usage: bash year.sh file1.txt file2.txt ...

for filename in $*
do
  cut -d , -f 2 $filename | sort -n | uniq
done

Exercise 3.8.7

for sister in Harriet Marianne
do
    echo $sister:
    grep -ow $sister sense_and_sensibility.txt | wc -l
done

An alternative but slightly inferior solution is:

for sister in Harriet Marianne
do
    echo $sister:
    grep -ocw $sister sense_and_sensibility.txt
done

This solution is inferior because grep -c only reports the number of lines matched. The total number of matches reported by this method will be lower if there is more than one match per line.
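
A quick demonstration (two lines, three matches):

$ printf 'one match\ntwo match match\n' | grep -c match
2
$ printf 'one match\ntwo match match\n' | grep -o match | wc -l
3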

Exercise 3.8.8

The correct answer is 1. Putting the match expression in quotes prevents the shell expanding it, so it gets passed to the find command.

Option 2 is incorrect because the shell expands *s.txt instead of passing the wildcard expression to find.

Option 3 is incorrect because it searches the contents of the files for lines which do not match “temp”, rather than searching the file names.
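
The correct command (option 1) was therefore of this form (starting directory assumed):

$ find . -name '*s.txt'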

Exercise 3.8.9

  1. Find all files with a .dat extension recursively from the current directory
  2. Count the number of lines each of these files contains
  3. Sort the output from step 2 numerically
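
One pipeline matching this description (the exact mechanism for combining find and wc is assumed):

$ wc -l $(find . -name '*.dat') | sort -n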

Exercise 3.8.10

Assuming that Ahmed’s home is our working directory, we type:

$ find ./ -type f -mtime -1 -user ahmed

Chapter 4

Exercise 4.11.1

Running Python statements directly from the command line is useful as a basic calculator and for simple string operations. Since anything more complicated than that usually requires more than one statement, it is often convenient to separate commands with semi-colons, as in:

$ python -c "import math; print(math.log(123))"

Exercise 4.11.2

The plotcounts.py script should read as follows:

"""Plot word counts."""

import argparse
import pandas as pd


def main(args):
    df = pd.read_csv(args.infile, header=None, names=('word', 'word_frequency'))
    df['rank'] = df['word_frequency'].rank(ascending=False, method='max')
    df['inverse_rank'] = 1 / df['rank']
    ax = df.plot.scatter(x='word_frequency', y='inverse_rank',
                         figsize=[12, 6], grid=True)
    ax.figure.savefig(args.outfile)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
                        default='-', help='Word count csv file name')
    parser.add_argument('--outfile', type=str, default='plotcounts.png',
                        help='Output image file name')
    parser.add_argument('--xlim', type=float, nargs=2, metavar=('XMIN', 'XMAX'),
                        default=None, help='X-axis limits')
    args = parser.parse_args()
    main(args)

Chapter 5

Exercise 5.11.1

Amira does not need to make the heaps-law subdirectory a Git repository because the zipf repository will track everything inside it regardless of how deeply nested.

Amira shouldn’t run git init in heaps-law because nested Git repositories can interfere with each other. If someone commits something in the inner repository, Git will not know whether to record the changes in that repository, the outer one, or both.

Exercise 5.11.2

  • git status now shows:

    On branch book
    Untracked files:
      (use "git add <file>..." to include in what will be committed)
    
            example.txt
    
    nothing added to commit but untracked files present (use "git add" to track)
  • Nothing has happened to the file; it still exists, but Git no longer has it in the staging area.

Exercise 5.11.3

Using git diff --word-diff might be easier since it shows exactly what has been changed.

Exercise 5.11.4

  1. Would only create a commit if files have already been staged.
  2. Would try to create a new repository.
  3. Is correct: first add the file to the staging area, then commit.
  4. Would try to commit a file “my recent changes” with the message myfile.txt.

Exercise 5.11.5

  1. Change names.txt and old-computers.txt using an editor like Nano.
  2. Add both files to the staging area with git add *.txt.
  3. Check that both files are there with git status.
  4. Commit both files at once with git commit.
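
In command form (the commit message is made up):

$ nano names.txt old-computers.txt
$ git add *.txt
$ git status
$ git commit -m "Update notes about names and old computers"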

Exercise 5.11.6

  1. Go into your home directory with cd ~.
  2. Create a new folder called bio with mkdir bio.
  3. Go into it with cd bio.
  4. Turn it into a repository with git init.
  5. Create your biography using Nano or another text editor.
  6. Add it with git add and commit it with git commit -m "Some message". (Note that git commit -a only stages changes to files Git already tracks, so a brand-new file must be added explicitly.)
  7. Modify the file.
  8. Use git diff to see the differences.
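
As commands (the file name and commit message are made up):

$ cd ~
$ mkdir bio
$ cd bio
$ git init
$ nano me.txt
$ git add me.txt
$ git commit -m "Add biography"
$ nano me.txt
$ git diff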

Exercise 5.11.7

To ignore only the contents of results/plots, add this line to .gitignore:

results/plots/

Exercise 5.11.8

Add the following lines to .gitignore (Git does not support comments on the same line as a pattern, so each comment goes on its own line):

# Ignore all data files...
*.dat
# ...except final.dat
!final.dat

The exclamation point ! includes a previously-excluded entry.

Note also that if we have previously committed .dat files in this repository they will not be ignored once these rules are added to .gitignore. Only future .dat files will be ignored.
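
To make Git stop tracking a previously committed file so that an ignore rule can take effect (standard Git usage; the file name is hypothetical):

$ git rm --cached intermediate.dat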

Exercise 5.11.9

The left button (with the picture of a clipboard) copies the full identifier of the commit to the clipboard. In the shell, git log shows the full commit identifier for each commit.

The middle button shows all of the changes that were made in that particular commit; green shaded lines indicate additions and red lines indicate removals. We can show the same thing in the shell using git diff or git diff FROM..TO (where FROM and TO are commit identifiers).

The right button lets us view all of the files in the repository at the time of that commit. To do this in the shell, we would need to check out the repository as it was at that commit using git checkout ID, where ID is the tag, branch name, or commit identifier. If we do this, we need to remember to put the repository back to the right state afterward.

Exercise 5.11.10

GitHub displays timestamps in a human-readable relative format (e.g. “22 hours ago” or “three weeks ago”). However, if we hover over the timestamp we can see the exact time at which the last change to the file occurred.

Exercise 5.11.11

Committing updates our local repository. Pushing sends any commits we have made locally that aren’t yet in the remote repository to the remote repository.

Exercise 5.11.12

When GitHub creates a README.md file while setting up a new repository, it actually creates the repository and then commits the README.md file. When we try to pull from the remote repository to our local repository, Git detects that their histories do not share a common origin and refuses to merge them.

$ git pull origin master
warning: no common commits
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/frances/eniac
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master
fatal: refusing to merge unrelated histories

We can force Git to merge the two repositories with the option --allow-unrelated-histories. Please check the contents of the local and remote repositories carefully before doing this.
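
For example:

$ git pull --allow-unrelated-histories origin master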

Exercise 5.11.13

The answer is (5): both 2 and 4.

The checkout command restores files from the repository, overwriting the files in our working directory. Answers 2 and 4 both restore the latest version in the repository of the file data_cruncher.sh. Answer 2 uses HEAD to indicate the latest, while answer 4 uses the unique ID of the last commit, which is what HEAD means.

Answer 3 gets the version of data_cruncher.sh from the commit before HEAD, which is not what we want.

Answer 1 can be dangerous: without a filename, git checkout will restore all files in the current directory (and all directories below it) to their state at the commit specified. This command will restore data_cruncher.sh to the latest commit version, but will also reset any other files we have changed to that version, which will erase any unsaved changes you may have made to those files.

Exercise 5.11.14

The answer is 2.

The command git add history.txt adds the current version of history.txt to the staging area. The changes to the file from the second echo command are only applied to the working copy, not the version in the staging area.

As a result, when git commit -m "Origins of ENIAC" is executed, the version of history.txt committed to the repository is the one from the staging area with only one line.

However, the working copy still has the second line. (git status will show that the file is modified.) git checkout HEAD history.txt therefore replaces the working copy with the most recently committed version of history.txt, so cat history.txt prints:

ENIAC was the world's first general-purpose electronic computer.

Exercise 5.11.15

  1. git diff HEAD~9 bin/plotcounts.py compares what has changed between the current bin/plotcounts.py and the same file 9 commits ago.
  2. The previous git command takes the state of the file at each point and then compares them.
  3. git diff HEAD bin/plotcounts.py compares what has been changed in bin/plotcounts.py with the previous commit.

Exercise 5.11.16

Using git checkout on a staged file does not unstage it. That’s because the changes are in the staging area and checkout would affect the working directory.

Exercise 5.11.17

Each line of output corresponds to a line in the file, showing who last modified that line, when the change was made, and what the file was called at the time.
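
The command being described is git blame, e.g. (file name assumed):

$ git blame README.md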

Chapter 6

Exercise 6.12.1

  1. --oneline shows each commit on a single line with the short identifier at the start and the title of the commit beside it. -n NUMBER limits the number of commits to show.

  2. --since and --after can be used to show commits in a range of dates or times; --author can be used to show commits by a particular person; and -w tells Git to ignore whitespace when comparing commits.
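
For example (the date and author are made up):

$ git log --oneline -n 10
$ git log --since=2020-01-01 --author=Amira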

Exercise 6.12.2

An online search for “show Git branch in Bash prompt” turns up several approaches, one of the simplest of which is to add this line to our ~/.bashrc file:

export PS1="\\w + \$(git branch 2>/dev/null | grep '^*' | colrm 1 2) \$ "

Breaking it down:

  1. Setting the PS1 variable defines the primary shell prompt.

  2. \\w in a shell prompt string means “the current directory”.

  3. The + is a literal + sign between the current directory and the Git branch name.

  4. The command that gets the name of the current Git branch is in $(...). (We need to escape the $ as \$ so Bash doesn’t just run it once when defining the string.)

  5. The git branch command shows all the branches, so we pipe that to grep and select the one marked with a *.

  6. Finally, we remove the first column (i.e., the one containing the *) to leave just the branch name.

So what’s 2>/dev/null about? That redirects any error messages to /dev/null, a special “file” that consumes input without saving it. We need that because sometimes we will be in a directory that isn’t inside a Git repository, and we don’t want error messages showing up in our shell prompt.

None of this is obvious, and we didn’t figure it out ourselves. Instead, we did a search and pasted various answers into explainshell.com until we had something we understood and trusted.

Exercise 6.12.3

https://github.com/github/gitignore/blob/master/Python.gitignore ignores 76 files or patterns. Of those, we recognized less than half. Searching online for some of these, like "*.pot file", turns up useful explanations. Searching for others like var/ does not; in that case, we have to look at the category (in this case, “Python distribution”) and set aside time to do more reading.

Exercise 6.12.4

  1. git diff master..same does not print anything because there are no differences between the two branches.

  2. git merge same master prints merging because Git combines histories even when the files themselves do not differ. After running this command, git history shows a commit for the merge.

Exercise 6.12.5

  1. Git refuses to delete a branch with unmerged commits because it doesn’t want to destroy our work.

  2. Using the -D (capital-D) option to git branch will delete the branch anyway. This is dangerous because any content that exists only in that branch will be lost.

  3. Even with -D, git branch will not delete the branch we are currently on.
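
For example (the branch name is hypothetical):

$ git branch -D abandoned-experiment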

Exercise 6.12.6

  1. Chartreuse has repositories on GitHub and their desktop containing identical copies of README.md and nothing else.
  2. Fuchsia has repositories on GitHub and their desktop with exactly the same content as Chartreuse’s repositories.
  3. fuchsia.txt is in both of Fuchsia’s repositories but not in Chartreuse’s repositories.
  4. fuchsia.txt is still in both of Fuchsia’s repositories but still not in Chartreuse’s repositories.
  5. chartreuse.txt is in both of Chartreuse’s repositories but not yet in either of Fuchsia’s repositories.
  6. chartreuse.txt is in Fuchsia’s desktop repository but not yet in their GitHub repository.
  7. chartreuse.txt is in both of Fuchsia’s repositories.
  8. fuchsia.txt is in Chartreuse’s GitHub repository but not in their desktop repository.
  9. All four repositories contain both fuchsia.txt and chartreuse.txt.

Chapter 7

Exercise 7.14.2

The CONDUCT.md file should have contents that mimic those given in Section 7.3.

Exercise 7.14.3

The newly created LICENSE.md should have something like this (if MIT was chosen):

# MIT License

Copyright (c) YYYY YOUR NAME 

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Exercise 7.14.4

The text in the README.md might look something like:

## Contributing

Interested in contributing? Check out the [CONTRIBUTING.md](CONTRIBUTING.md)
file for guidelines on how to contribute. Please note that this project
is released with a [Contributor Code of Conduct](CODE_OF_CONDUCT.md). By
contributing to this project, you agree to abide by its terms.

Your CONTRIBUTING.md file might look something like the following:

# Contributing

Thank you for your interest in contributing to the Zipf's Law package!

If you are new to the package and/or collaborative code development on GitHub,
feel free to discuss any suggested changes via issue or email.
We can then walk you through the pull request process if need be.
As the project grows,
we intend to develop more detailed guidelines for submitting
bug reports and feature requests.

We also have a code of conduct (see [`CONDUCT.md`](CONDUCT.md)).
Please follow it in all your interactions with the project.

Exercise 7.14.5

Be sure to tag the new issue as a feature request to help triage.

Exercise 7.14.6

We often delete the duplicate label: when we mark an issue that way, we (almost) always add a comment saying which issue it’s a duplicate of, in which case it’s just as sensible to label the issue wontfix.

Exercise 7.14.7

Some solutions could be:

  • Give the team member their own office space so they don’t distract others.
  • Buy noise cancelling headphones for the employees that find it distracting.
  • Re-arrange the work spaces so that there is a “quiet office” and a regular office space and have the team member with the attention disorder work in the regular office.

Exercise 7.14.8

Possible solutions:

  • Change the rule so that anyone who contributes to the project, in any way, gets included as a co-author.
  • Update the rule to include a contributor list on all projects with descriptions of duties, roles, and tasks the contributor provided for the project.

Exercise 7.14.9

We obviously can’t say which description fits you best, but:

  • Use three sticky notes and interruption bingo to stop Anna from cutting people off.

  • Tell Bao that the devil doesn’t need more advocates, and that he’s only allowed one “but what about” at a time.

  • Hediyeh’s lack of self-confidence will take a long time to remedy. Keeping a list of the times she’s been right and reminding her of them frequently is a start, but the real fix is to create and maintain a supportive environment.

  • Unmasking Kenny’s hitchhiking will feel like nit-picking, but so does the accounting required to pin down other forms of fraud. The most important thing is to have the discussion in the open so that everyone realizes he’s taking credit for everyone else’s work as well as theirs.

  • Melissa needs a running partner—someone to work beside her so that she starts when she should and doesn’t get distracted. If that doesn’t work, the project may need to assign everything mission-critical to someone else (which will probably lead to her leaving).

  • Petra can be managed with a one-for-one rule: each time she builds or fixes something that someone else needs, she can then work on something she thinks is cool. However, she’s only allowed to add whatever it is to the project if someone else will publicly commit to maintaining it.

  • Get Frank and Raj off your project as quickly as you can.

Chapter 8

Exercise 8.11.1

make -n target will show commands without running them.

Exercise 8.11.2

  1. The -B option makes everything, even files that aren’t out of date.

  2. The -C option tells Make to change directories before executing, so that make -C ~/myproject runs Make in ~/myproject regardless of the directory it is invoked from.

  3. By default, Make looks for (and runs) a file called Makefile or makefile. If you use another name for your Makefile (which is necessary if you have multiple Makefiles in the same directory), then you need to specify the name of that Makefile using the -f option.
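
For example (the directory and file names are hypothetical):

$ make -B                  # rebuild all targets, even up-to-date ones
$ make -C ~/myproject      # run Make in ~/myproject
$ make -f other.mk         # use other.mk instead of Makefile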

Exercise 8.11.3

mkdir -p some/path makes one or more nested directories if they don’t exist, and does nothing (without complaining) if they already exist. It is useful for creating the output directories for build rules.

Exercise 8.11.4

The build rule for regenerating the result for any book should now be:

## results/%.csv : regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
    @bash $(SUMMARY) $<
    python $(COUNT) $< > $@

where SUMMARY is defined earlier in the Makefile as

SUMMARY=bin/book_summary.sh

and the settings build rule now includes:

@echo SUMMARY: $(SUMMARY)

Exercise 8.11.5

Since we already have a variable RESULTS that contains all of the results files, all we need is a phony target that depends on them:

.PHONY: results # and all the other phony targets

## results : regenerate result for all books.
results : ${RESULTS}

Exercise 8.11.6

If we use a shell wildcard in a rule like this:

results/collated.csv : results/*.csv
    python $(COLLATE) $^ > $@

then if results/collated.csv already exists, the rule tells Make that the file depends on itself.
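
One way to avoid this is to list the dependencies explicitly, for example via the RESULTS variable used elsewhere in the Makefile (a sketch, assuming RESULTS holds only the per-book result files):

results/collated.csv : $(RESULTS)
    python $(COLLATE) $^ > $@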

Exercise 8.11.7

Our rule is:

help :
        @grep -h -E '^##' ${MAKEFILE_LIST} | sed -e 's/## //g' | column -t -s ':'

  • The -h option to grep tells it not to print filenames, while the -E option tells it to interpret ^## as a pattern.

  • MAKEFILE_LIST is an automatically-defined variable with the names of all the Makefiles in play. (There might be more than one because Makefiles can include other Makefiles.)

  • sed can be used to do string substitution.

  • column formats text nicely in columns.

Exercise 8.11.8

The new config.mk file reads as follows:

COUNT=bin/countwords.py
COLLATE=bin/collate.py
PLOT=bin/plotcounts.py

The contents of that file can then be included in the Makefile:

.PHONY: results all clean help settings

include config.mk

# ... the rest of the Makefile

Chapter 9

Exercise 9.8.1

The build rule involving plotcounts.py should now read:

## results/collated.png: plot the collated results.
results/collated.png : results/collated.csv $(PARAMS)
    python $(PLOT) $< --outfile $@ --plotparams $(word 2,$^)

where PARAMS is defined earlier in the Makefile along with all the other variables and also included later in the settings build rule:

COUNT=bin/countwords.py
COLLATE=bin/collate.py
PARAMS=bin/plotparams.yml
PLOT=bin/plotcounts.py
SUMMARY=bin/book_summary.sh
DATA=$(wildcard data/*.txt)
RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))
## settings : show variables' values.
settings :
    @echo COUNT: $(COUNT)
    @echo DATA: $(DATA)
    @echo RESULTS: $(RESULTS)
    @echo COLLATE: $(COLLATE)
    @echo PARAMS: $(PARAMS)
    @echo PLOT: $(PLOT)
    @echo SUMMARY: $(SUMMARY)

Exercise 9.8.2

  1. Make the following additions to plotcounts.py:

Import matplotlib.pyplot:

import matplotlib.pyplot as plt

Define the new --style option:

parser.add_argument('--style', type=str, choices=plt.style.available,
                    default=None, help='matplotlib style')

Use the style at the top of the main function:

def main(args):
    """Run the command line program."""
    if args.style:
        plt.style.use(args.style)
  2. Add nargs='*' to the definition of the --style option:

parser.add_argument('--style', type=str, nargs='*', choices=plt.style.available,
                    default=None, help='matplotlib style')
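
A usage sketch (file names assumed; with nargs='*', several styles can be given at once):

$ python bin/plotcounts.py results/jane_eyre.csv --outfile jane_eyre.png --style ggplot dark_background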

Exercise 9.8.3

FIXME: GVW to write solution

Exercise 9.8.4

FIXME

Exercise 9.8.5

import configparser

import matplotlib as mpl

def set_plot_params(param_file):
    """Set the matplotlib parameters."""
    if param_file:
        config = configparser.ConfigParser()
        config.read(param_file)
        for section in config.sections():
            for param in config[section]:
                value = config[section][param]
                mpl.rcParams[param] = value

TODO: Add answers to the following questions:

  • Which file format do you find easier to work with?
  • What other factors should influence your choice of a configuration file syntax?

Exercise 9.8.6

FIXME

Chapter 10

Exercise 10.6.1

Add a new command line argument to collate.py,

parser.add_argument('-v', '--verbose', action="store_true", default=False,
                    help="Change logging threshold from WARNING to DEBUG")

and two new lines to the beginning of the main function,

log_level = logging.DEBUG if args.verbose else logging.WARNING
logging.basicConfig(level=log_level)

such that the full collate.py script now reads as follows:

"""Combine multiple word count CSV-files into a single cumulative count."""
import csv
import argparse
from collections import Counter
import logging
import utilities


ERROR_MESSAGES = {
    'not_csv_file_suffix' : '{file_name}: The filename must end in `.csv`',
}

def update_counts(reader, word_counts):
    """Update word counts with data from another reader/file."""
    for word, count in csv.reader(reader):
        word_counts[word] += int(count)

def main(args):
    """Run the command line program."""
    log_level = logging.DEBUG if args.verbose else logging.WARNING
    logging.basicConfig(level=log_level)
    word_counts = Counter()
    logging.info('Processing files...')
    for file_name in args.infiles:
        logging.debug(f'Reading in {file_name}...')
        if file_name[-4:] != '.csv':
            raise OSError(ERROR_MESSAGES['not_csv_file_suffix'].format(file_name=file_name))
        with open(file_name, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    utilities.collection_to_csv(word_counts, num=args.num)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infiles', type=str, nargs='*', help='Input file names')
    parser.add_argument('-n', '--num', type=int, default=None,
                        help='Limit output to N most frequent words')
    parser.add_argument('-v', '--verbose', action="store_true", default=False,
                        help="Change logging threshold from WARNING to DEBUG")
    args = parser.parse_args()
    main(args)

Exercise 10.6.2

Add a new command line argument to collate.py,

parser.add_argument('-l', '--logfile', type=str, default='collate.log',
                    help='Name of the log file')

and pass the name of the log file to logging.basicConfig using the filename argument,

logging.basicConfig(level=log_level, filename=args.logfile)

such that the collate.py script now reads as follows:

"""Combine multiple word count CSV-files into a single cumulative count."""
import csv
import argparse
from collections import Counter
import logging
import utilities


ERROR_MESSAGES = {
    'not_csv_file_suffix' : '{file_name}: The filename must end in `.csv`',
}

def update_counts(reader, word_counts):
    """Update word counts with data from another reader/file."""
    for word, count in csv.reader(reader):
        word_counts[word] += int(count)

def main(args):
    """Run the command line program."""
    log_level = logging.DEBUG if args.verbose else logging.WARNING
    logging.basicConfig(level=log_level, filename=args.logfile)
    word_counts = Counter()
    logging.info('Processing files...')
    for file_name in args.infiles:
        logging.debug(f'Reading in {file_name}...')
        if file_name[-4:] != '.csv':
            raise OSError(ERROR_MESSAGES['not_csv_file_suffix'].format(file_name=file_name))
        with open(file_name, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    utilities.collection_to_csv(word_counts, num=args.num)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infiles', type=str, nargs='*', help='Input file names')
    parser.add_argument('-n', '--num', type=int, default=None,
                        help='Limit output to N most frequent words')
    parser.add_argument('-v', '--verbose', action="store_true", default=False,
                        help="Change logging threshold from WARNING to DEBUG")
    parser.add_argument('-l', '--logfile', type=str, default='collate.log',
                        help='Name of the log file')
    args = parser.parse_args()
    main(args)

Exercise 10.6.3

  1. The loop in collate.py that reads/processes each input file should now read as follows:
for file_name in args.infiles:
    try:
        logging.debug(f'Reading in {file_name}...')
        if file_name[-4:] != '.csv':
            raise OSError(ERROR_MESSAGES['not_csv_file_suffix'].format(file_name=file_name))
        with open(file_name, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    except Exception as error:
        logging.warning(f'{file_name} not processed: {error}')
  2. To catch specific kinds of errors before the general case, the loop should instead read as follows:
for file_name in args.infiles:
    try:
        logging.debug(f'Reading in {file_name}...')
        if file_name[-4:] != '.csv':
            raise OSError(ERROR_MESSAGES['not_csv_file_suffix'].format(file_name=file_name))
        with open(file_name, 'r') as reader:
            logging.debug('Computing word counts...')
            update_counts(reader, word_counts)
    except FileNotFoundError:
        logging.warning(f'{file_name} not processed: File does not exist')
    except PermissionError:
        logging.warning(f'{file_name} not processed: No permission to read file')
    except Exception as error:
        logging.warning(f'{file_name} not processed: {error}')

Exercise 10.6.4

  1. The convention is to use ALL_CAPS_WITH_UNDERSCORES when defining global variables.

  2. Python’s f-strings interpolate variables that are in scope: there is no easy way to interpolate values from a lookup table. In contrast, str.format can be given any number of named keyword arguments (Appendix K), so we can look up a string and then interpolate whatever values we want.

  3. Once ERROR_MESSAGES has been moved to the utilities module all references to it in collate.py must be updated to utilities.ERROR_MESSAGES.

Exercise 10.6.5

A traceback is an object that records where an exception was raised, what stack frames were on the call stack when the error occurred, and other details that are helpful for debugging. Python’s traceback library can be used to get and print information from these objects.

Chapter 11

Exercise 11.12.1

  • The first assertion checks that the input sequence values is not empty. An empty sequence such as [] will make it fail.

  • The second assertion checks that each value in the list can be turned into an integer. Input such as [1, 2,'c', 3] will make it fail.

  • The third assertion checks that the total of the list is greater than 0. Input such as [-10, 2, 3] will make it fail.

Exercise 11.12.2

We can test that the first assertion fails when values is not a non-empty list as follows:

def test_fails_for_non_list():
    try:
        total('not a list')
        assert False, 'Should have raised AssertionError'
    except AssertionError:
        pass
    except:
        assert False, 'Should have raised AssertionError'

In order:

  1. The first assert False will only happen if total runs without raising any kind of exception.

  2. The except AssertionError branch does nothing (pass) if the correct exception is raised.

  3. The catch-all except at the end makes the test fail if the wrong kind of exception is raised.

This pattern is so common that pytest provides a shorthand notation for it:

import pytest

def test_fails_for_non_list():
    with pytest.raises(AssertionError):
        total('not a list')

If the call to total doesn’t raise AssertionError, this test fails.

Exercise 11.12.3

To test that the Travis status display is working correctly, try committing a test that deliberately fails to version control.
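
For example, committing a throwaway test like this should turn the build red:

def test_deliberate_failure():
    assert False, 'checking that failures are reported'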

Exercise 11.12.4

If every configuration setting has a sensible default, we can create configuration files that each change exactly one parameter. Alternatively, if our program uses overlay configuration, we can have each test use a standard configuration file plus a second one that changes one setting. In either case, we can either run the program and check its output, or check that the data structure storing configuration information has the right values (and trust other tests to make sure that those settings do what they’re supposed to).

Exercise 11.12.5

There are three approaches to testing when pseudo-random numbers are involved:

  1. Run the function once with a known seed, check and record its output, and then compare the output of subsequent runs to that saved output. (Basically, if the function does the same thing it did the first time, we trust it.)

  2. Replace the pseudo-random number generator with a function of our own that generates a predictable series of values. For example, if we are randomly partitioning a list into two equal halves, we could instead use a function that puts odd-numbered values in one partition and even-numbered values in another (which is a legal but unlikely outcome of truly random partitioning).

  3. Instead of checking for an exact result, check that the result lies within certain bounds, just as we would with the result of a physical experiment.
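
A minimal sketch of the first approach (the function, seed, and values are all made up):

import random

def random_split(values, seed):
    """Partition values into two halves using a seeded RNG."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def test_split_is_reproducible():
    # The same seed always produces the same "random" result, so we can
    # record a trusted output once and compare later runs against it.
    assert random_split(range(10), 12345) == random_split(range(10), 12345)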

Exercise 11.12.6

This result seems counter-intuitive to many people because relative error is a measure of a single value, but in this case we are looking at a distribution of values: each result is off by 0.1 compared to a range of 0–2, which doesn’t “feel” infinite. In this case, a better measure might be the largest absolute error divided by the standard deviation of the data.

Chapter 12

Exercise 12.4.1

You can get an ORCID by registering at https://orcid.org. Please add this 16-digit identifier to all of your published works and to your online profiles.

Exercise 12.4.2

If possible, compare your answers with those of a colleague who works with the same data. Where did you agree and disagree, and why?

Exercise 12.4.3

  1. 51 solicitors were interviewed as the participants.

  2. Interview data and data from a database on court decisions.

  3. This information is not available within the documentation. Information on their jobs and opinions is there, but the participant demographics are only described within the associated article. The difficulty is that the article is not linked within the documentation or the metadata.

  4. We can search the dataset name and author name to try to find this. A search for “National Science Foundation (1228602)”, which is the grant information, finds the grant page https://www.nsf.gov/awardsearch/showAward?AWD_ID=1228602. Two articles are linked there, but both the DOI links are broken. We can search with the citation for each paper to find them. The Forced Migration article can be found at https://www.fmreview.org/fragilestates/meili but uses a different subset of interviews and does not mention demographics or link to the deposited dataset. The Boston College Law Review article at https://lawdigitalcommons.bc.edu/cgi/viewcontent.cgi?article=3318&context=bclr has the same two problems of different data and no dataset citation.

    Searching more broadly through Meili’s work, we can find http://dx.doi.org/10.2139/ssrn.2668259. This lists the dataset as a footnote and reports the 51 interviews with demographic data on reported gender of the interviewees. This paper lists data collection as 2010-2014, while the other two say 2010-2013. We might come to a conclusion that this extra year is where the extra 9 interviews come in, but that difference is not explained anywhere.

Exercise 12.4.4

  1. The software requirements are documented in README.md. In addition to the tools used in the zipf/ project (Python, Make and Git), the project also requires ImageMagick. No information on installing ImageMagick or a required version of ImageMagick is provided.

    To re-create the conda environment you would need the file my_environment.yml. Instructions for creating and using the environment are provided in README.md.

  2. Like zipf, the data processing and analysis steps are documented in a Makefile. The README includes instructions for re-creating the results using make all.

  3. There doesn’t seem to be a DOI for the archived code and data, but the GitHub repo does have a release v1.0 with the description “Published manuscript (1.0)”. A zip file of this release could be downloaded with the link https://github.com/borstlab/reversephi_paper/archive/v1.0.zip.

Exercise 12.4.6

You’ll know you’ve completed this exercise when you have a URL that points to a ZIP archive for a specific release of your repository on GitHub, e.g. https://github.com/DamienIrving/zipf/archive/KhanVirtanen2020.zip

Exercise 12.4.7

Some steps to publishing your project’s code would be:

  1. Upload the code to GitHub.
  2. Use a standard folder and file structure as taught in this book.
  3. Include README, CONTRIBUTING, CONDUCT, and LICENSE files.
  4. Make sure these files explain how to install and configure the required software and tell people how to run the code in the project.
  5. Include a requirements.txt file for Python package dependencies.

Chapter 13

Exercise 13.9.1

Depending on how well the package was set up before running python setup.py sdist, there will be either very few warnings or a lot.

Exercise 13.9.2

The new requirements_dev.txt file will have this inside it:

pytest
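
Collaborators can then install the development dependencies with:

$ pip install -r requirements_dev.txt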

Exercise 13.9.3

Depending on what was done before using the checklist, there will be either very little that needs to be updated or a lot. If this course was followed, most items would be checked off the list.

Exercise 13.9.4

FIXME

Exercise 13.9.5

Depending on how often you update your Python packages, you will have very little or a lot to update.