More Commits via the GitHub API

I wrote a bit ago about making commits via the GitHub API. That post outlined making changes in two simplified situations: making changes to a single file and making updates to two existing files at the root of the repository. Here I show a more general solution that allows arbitrary changes anywhere in the repo.

I want to be able to specify a repo and branch and say "here are the contents of files that have changed or been created and here are the names of files that have been deleted, please take all that and this message and make a new commit for me." Because the GitHub API is so rudimentary when it comes to making commits that will end up being a many-stepped process, but it’s mostly the same steps repeated many times so it’s not a nightmare to code up. At a high level the process goes like this:

  • Get the current repo state from GitHub
    • This is the names and hashes of all the files and directories, but not the actual file contents.
  • Construct a local, malleable representation of the repo
  • Modify the local representation according to the given updates, creations, and deletions
  • Walk though the modified local "repo" and upload new/changed files and directories to GitHub
    • This must be done from the bottom up because a change at the low level means every directory above that level will need to be changed.
  • Make a new commit pointed at the new root tree (I’ll explain trees soon.)
  • Update the working branch to point to the new commit

This blob post is readable as an IPython Notebook at I’ve also reproduced the notebook below. Read More »

Making Commits via the GitHub API

For fun I’ve been learning a bit about the GitHub API. Using the API it’s possible to do just about everything you can do on GitHub itself, from commenting on PRs to adding commits to a repo. Here I’m going to show how to do add commits to a repo on GitHub. A notebook demonstrating things with code is available here, but you may want to read this post first for the high level view.

Choosing a Client Library

The GitHub API is an HTTP interface so you can talk to it via any tool that speaks HTTP, including things like curl. To make programming with the API simpler there are a number of libraries that allow communicate with GitHub via means native to whatever language you’re using. I’m using Python and I went with the library based on its Python 3 compatibility, active development, and good documentation.

Making Commits

The repository api is the gateway for doing anything to a repo. In this is corresponds to the repository module.

Modifying a Single File

The special case of making a commit affecting a single file is much simpler than affecting multiple files. Creating, updating, and deleting a file can be done via a single API call once you have enough information to specify what you want done.

Modifying Multiple Files

Making a commit affecting multiple files requires making multiple API calls and some understanding of Git’s internal data store. That’s because to change multiple files you have to add all the changes to the repo one at a time before making a commit. The process is outlined in full in the API docs about Git data.

I should note that I think deleting multiple files in a single commit requires a slightly different procedure, one I’ll cover in another post.

That’s the overview, look over the notebook for the code!

Data Provenance with GitPython

Data Provenance

When running scientific software it is considered a best practice to automatically record the versions of all the software you use. This practice is sometimes referred to as recording the provenance of the results and helps make your analysis more reproducible. Almost all software libraries will have a version number that you can somehow access from your own software. For example, NumPy’s version number is recorded in the variable numpy.__version__ and most Python packages will having something similar. Python’s version is in the variable sys.version (and, alternatively, sys.version_info).

However, a lot of personal or lab software doesn’t have a version number. The software might change so fast and be modified by so many people that manually incrememented version numbers aren’t very practical. There’s still hope in this situation, though, if the software is under version control. (Your software is under version control, isn’t it?) In Subversion the keyword properties feature is often used to record provenance. There isn’t a compatible feature in Git, but for Python software in Git repositories we can engineer a provenance solution using the GitPython package.

Returning to Previous States with Git

When you make a commit in Git the state of the repository is recorded and given a label based on a hash of the commit data. We can use the commit hash to return to any recorded state of the repository using the “git checkout” command. This means that if you know the commit hash of your software when you created a certain set of results, you can always set your software back to that state to reproduce the same results. Very handy!

Recording the Commit Hash

When you import a Python module, code at the global level of the module is actually executed. This is often used to set global variables within the module, which is what we’ll do here. GitPython lets us interact with Git repos from Python and one thing we can do is query a repo to get the commit hash of the current “HEAD“. (HEAD is a label in Git pointing to the latest commit of whatever state the repository is currently in.)

What we can do with that is make it so that when our software modules are imported they set a global variable containing the commit hash of their HEAD at the time the software was run. That hash can then be inserted into data products as a record of the software version used to create them. Here’s some code that gets and stores the hash of the HEAD of a repo:

from git import Repo
MODULE_HASH = Repo('/path/to/repo/').head.commit.hexsha

If the module we’re importing is actually inside a Git repo we can use a bit of Python magic to get the HEAD hash without manually listing the path to the repo:

import os.path
from git import Repo
repo_dir = os.path.abspath(os.path.dirname(__file__))
MODULE_HASH = Repo(repo_dir).head.commit.hexsha

(__file__ is a global variable Python automatically sets in imported modules.)

Versioned Data

Some data formats, especially those that are text based, can be easily stored in version control. If you can put your data in a Git repo then the same strategy as above can be used to get and store the HEAD commit of the data repo when you run your analysis, allowing you to reproduce both your software and data states during later runs. If your data does not easily fit into Git it’s still a good idea to record a unique identifier for the dataset, but you may need to develop that yourself (such as a simple list of all the data files that were used as inputs).