You—Yes, You—Should Go to SciPy 2013

The SciPy 2013 conference is coming up on June 24-29 in Austin, Texas and you should go. Here are some good reasons:

You’ll learn something. There are beginning, intermediate, and advanced tutorial tracks this year, so it would be pretty much impossible for you not to learn something at those. I’ll even be there with Katy Huff teaching a tutorial on version control and testing. Even if you can’t make it to the tutorials there will be lots of great talks and BOF sessions.

There are domain specific mini-symposia. If your field is represented you can go for a concentrated dose of relevant talks and to meet other Python users in your field. Here are the specific domains this year:

  • Astronomy & astrophysics
  • Bio-informatics
  • GIS – Geospatial Data Analysis
  • Medical imaging
  • Meteorology, climatology, and atmospheric and oceanic science

It’ll be fun! The scientific Python community is chock full of really nice people. Even if you’re new and just learning how to use Python you’ll meet people who are eager to talk and make you feel welcome. (If you find this is not the case, email me or tweet me and I will see if I can help.)

Diversity at SciPy

I’ve been going to SciPy since 2010 and every year the attendees and speakers have been disappointingly white and male. Last year Andy Terrel and I chided the conference organizers about this and it looks like this year the organizers (who include Andy) are actually trying to do something about diversity: there is a Diversity Statement, a Code of Conduct, and PyLadies will be there as a community sponsor.

If you’re not sure about SciPy because you’re worried you won’t fit in or won’t be welcome, I want to be the first to tell you that you don’t need to worry and that you should come. Everyone who comes to SciPy has agreed to abide by the Code of Conduct and the conference organizers are there to help if you experience any problems. SciPy is a conference for everyone and having a more diverse community is good for all of us.

Data Provenance with GitPython

Data Provenance

When running scientific software it is considered a best practice to automatically record the versions of all the software you use. This practice is sometimes referred to as recording the provenance of the results and helps make your analysis more reproducible. Almost all software libraries will have a version number that you can somehow access from your own software. For example, NumPy’s version number is recorded in the variable numpy.__version__ and most Python packages will have something similar. Python’s version is in the variable sys.version (and, alternatively, sys.version_info).
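Here is a minimal sketch of what that automatic recording might look like. The function name collect_versions is my own invention for illustration, and it assumes each package follows the __version__ convention, which is common but not guaranteed:

```python
import importlib
import platform


def collect_versions(module_names):
    """Return a dict mapping 'python' and each module name to a version string."""
    versions = {"python": platform.python_version()}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
            continue
        # Most packages expose __version__, but it is a convention, not a rule.
        versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


if __name__ == "__main__":
    # Print the versions of a couple of packages next to the Python version.
    for name, version in collect_versions(["numpy", "scipy"]).items():
        print("{}: {}".format(name, version))
```

The resulting dictionary can be written into a log file or a data header next to your results.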

However, a lot of personal or lab software doesn’t have a version number. The software might change so fast and be modified by so many people that manually incremented version numbers aren’t very practical. There’s still hope in this situation, though, if the software is under version control. (Your software is under version control, isn’t it?) In Subversion the keyword properties feature is often used to record provenance. There isn’t a compatible feature in Git, but for Python software in Git repositories we can engineer a provenance solution using the GitPython package.

Returning to Previous States with Git

When you make a commit in Git the state of the repository is recorded and given a label based on a hash of the commit data. We can use the commit hash to return to any recorded state of the repository using the “git checkout” command. This means that if you know the commit hash of your software when you created a certain set of results, you can always set your software back to that state to reproduce the same results. Very handy!

Recording the Commit Hash

When you import a Python module, code at the global level of the module is actually executed. This is often used to set global variables within the module, which is what we’ll do here. GitPython lets us interact with Git repos from Python and one thing we can do is query a repo to get the commit hash of the current “HEAD”. (HEAD is a label in Git pointing to the commit that is currently checked out, which is normally the latest commit on the current branch.)

What we can do with that is make it so that when our software modules are imported they set a global variable containing the commit hash of their HEAD at the time the software was run. That hash can then be inserted into data products as a record of the software version used to create them. Here’s some code that gets and stores the hash of the HEAD of a repo:

from git import Repo
MODULE_HASH = Repo('/path/to/repo/').head.commit.hexsha

If the module we’re importing is actually inside a Git repo we can use a bit of Python magic to get the HEAD hash without manually listing the path to the repo:

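import os.path
from git import Repo
# Directory containing this module file (which lives in the Git repo)
repo_dir = os.path.abspath(os.path.dirname(__file__))
# Hash of the commit currently checked out in that repo
MODULE_HASH = Repo(repo_dir).head.commit.hexsha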

(__file__ is a global variable Python automatically sets in imported modules.)
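If you want something a little more defensive, here is a sketch that returns None instead of crashing when GitPython is missing or the code isn’t in a repo, and then stores the hash next to a result. The names current_commit_hash and save_result and the JSON layout are just my illustration, not part of GitPython. The search_parent_directories argument is a real Repo option in recent GitPython releases that lets the module live in a subdirectory of the repo; older releases may not accept it.

```python
import json


def current_commit_hash(path):
    """Return the HEAD commit hash of the repo containing `path`, or None."""
    try:
        from git import Repo
        from git.exc import InvalidGitRepositoryError, NoSuchPathError
    except ImportError:
        # GitPython (or the git executable it needs) is unavailable.
        return None
    try:
        repo = Repo(path, search_parent_directories=True)
        return repo.head.commit.hexsha
    except (InvalidGitRepositoryError, NoSuchPathError, ValueError):
        # Not a repo, a missing path, or a repo with no commits yet.
        return None


def save_result(result, out_path, code_path):
    """Write `result` as JSON together with the hash of the code that made it."""
    record = {
        "software_commit": current_commit_hash(code_path),
        "result": result,
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)
    return record
```

Inside a module you would call it with the module’s own directory, for example current_commit_hash(os.path.dirname(os.path.abspath(__file__))).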

Versioned Data

Some data formats, especially those that are text based, can be easily stored in version control. If you can put your data in a Git repo then the same strategy as above can be used to get and store the HEAD commit of the data repo when you run your analysis, allowing you to reproduce both your software and data states during later runs. If your data does not easily fit into Git it’s still a good idea to record a unique identifier for the dataset, but you may need to develop that yourself (such as a simple list of all the data files that were used as inputs).
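For data that doesn’t live in Git, one way to build such an identifier is sketched below. Going beyond a plain list of file names and adding SHA-256 checksums is my own assumption about what would be useful, and the function names are hypothetical:

```python
import hashlib
import os


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_manifest(paths):
    """Return sorted (file name, checksum) pairs for the input files."""
    return sorted((os.path.basename(p), file_sha256(p)) for p in paths)


def dataset_id(paths):
    """Combine the manifest into one identifier for the whole dataset."""
    digest = hashlib.sha256()
    for name, checksum in dataset_manifest(paths):
        digest.update("{} {}\n".format(name, checksum).encode("utf-8"))
    return digest.hexdigest()
```

Because the manifest is sorted, the identifier doesn’t depend on the order in which you list the files, but it does change if any file’s contents change. Note that it uses only base names, so two inputs with the same name in different directories would need a different scheme.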

Install Scientific Python on Mac OS X

These instructions detail how I install the scientific Python stack on my Mac. You can always check the Install Python page for other installation options.

I’m running the latest OS X Mountain Lion (10.8) but I think these instructions should work back to Snow Leopard (10.6). These instructions differ from my previous set primarily in that I now use Homebrew to install NumPy, SciPy, and matplotlib. I do this because Homebrew makes it easier to compile these with non-standard options that work around an issue with SciPy on OS X.

I’ll show how I install Python and the basic scientific Python stack: NumPy, SciPy, matplotlib, IPython, and pandas.

If you need other libraries they can most likely be installed via pip and any dependencies can probably be installed via Homebrew.

Command Line Tools

The first order of business is to install the Apple command line tools. These include important things like development headers, gcc, and git. Head over to developer.apple.com/downloads, register for a free account, and download (then install) the latest “Command Line Tools for Xcode” for your version of OS X.

If you’ve already installed Xcode on Lion or Mountain Lion then you can install the command line tools from the preferences. If you’ve installed Xcode on Snow Leopard then you already have the command line tools.

Homebrew

Homebrew is my favorite package manager for OS X. It builds packages from source, intelligently re-uses libraries that are already part of OS X, and encourages best practices like installing Python packages with pip.

To install Homebrew paste the following in a terminal:

ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

The brew command and any executables it installs will go in the directory /usr/local/bin so you want to make sure that goes at the front of your system’s PATH. As long as you’re at it, you can also add the directory where Python scripts get installed. Add the following line to your .profile, .bash_profile, or .bashrc file:

export PATH=/usr/local/bin:/usr/local/share/python:$PATH

At this point you should close your terminal and open a new one so that this PATH setting is in effect for the rest of the installation.

Python

Now you can use brew to install Python:

brew install python

Afterwards you should be able to run the commands

which python
which pip

and see

/usr/local/bin/python
/usr/local/bin/pip

for each, respectively. (It’s also possible to install Python 3 using Homebrew: brew install python3.)
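If you’d rather check the PATH ordering from Python itself, here is a small sketch. The helper name homebrew_first is hypothetical, and it assumes the default locations used above (/usr/local/bin for Homebrew and /usr/bin for the system tools):

```python
import os
import shutil
import sys


def homebrew_first(path_var, homebrew_bin="/usr/local/bin", system_bin="/usr/bin"):
    """Return True if homebrew_bin comes before system_bin in path_var."""
    entries = [e.rstrip("/") for e in path_var.split(os.pathsep) if e]
    if homebrew_bin not in entries:
        return False
    if system_bin not in entries:
        return True
    return entries.index(homebrew_bin) < entries.index(system_bin)


if __name__ == "__main__":
    print("Homebrew bin first on PATH:", homebrew_first(os.environ.get("PATH", "")))
    print("python found at:", shutil.which("python"))
    print("this interpreter:", sys.executable)
```

If the first line prints False, the export line probably hasn’t taken effect in your current terminal.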

NumPy

It is possible to use pip to install NumPy, but I use a Homebrew recipe so I avoid some problems with SciPy. The recipe isn’t included in stock Homebrew, though; it requires “tapping” two other sources of Homebrew formulae:

brew tap homebrew/science
brew tap samueljohn/python

You can learn more about these at their respective repositories.

With those repos tapped you can almost install NumPy, but first you’ll have to use pip to install nose:

pip install nose

I compile NumPy against OpenBLAS to avoid a SciPy issue. Compiling OpenBLAS requires gfortran, which you can get via Homebrew:

brew install gfortran
brew install numpy --with-openblas

SciPy

And then you’re ready for SciPy:

brew install scipy --with-openblas

matplotlib

matplotlib generally installs just fine via pip but the custom Homebrew formula takes care of installing optional dependencies too:

brew install matplotlib

IPython

You’ll want Notebook support with IPython and that requires some extra dependencies, including ZeroMQ via brew:

brew install zeromq
pip install jinja2
pip install tornado
pip install pyzmq
pip install ipython

pandas

Pandas should install via pip:

pip install pandas

Testing It Out

The most basic test you can do to make sure everything worked is open up an IPython session and type in the following:

import numpy
import scipy
import matplotlib
import pandas

If there are no errors then you’re ready to get started! Congratulations and enjoy!

PyCon 2013 Review

PyCon 2013 was my first PyCon and it was, bar none, the best conference I’ve ever been to. And it wasn’t just the free Raspberry Pi or the Wreck-it-Ralph swag from Disney or the fact that I stood next to Guido for a minute during the poster session. No, PyCon is just good people. The Python community is diverse and accepting, and I can’t list all the awesome, kind people I met there.

There were, unfortunately, disappointments, but what other tech conference has a sold-out full-day education summit, or raises $10k for a community group, or raises $6k for cancer research and the John Hunter Memorial fund with a 5k fun run? And PyCon attendees were 20% women! It’s amazing to have been a part of a conference where community, generosity, and outreach were put front and center. I tried to do my small part by giving people directions during the tutorials.

Anyway, on to the specifics of what I did:

Tutorials

The first tutorial I went to was called “A beginner’s introduction to Pydata: how to build a minimal recommendation engine”. The intent of the tutorial was to introduce NumPy and pandas. I was hoping to learn some pandas-fu but I found the material poorly organized and didn’t feel like I was getting a good idea of why/when to use particular pandas features. The video for this one doesn’t seem to be up yet.

The second tutorial I went to was called “Bayesian statistics made simple” and this one was awesome! I was comfortable with Bayesian stats beforehand but a refresher never hurts and the instructor (Allen Downey) gave terrific explanations. He had a little Bayesian stats library for us to use in the programmatic examples, which was fun. (Though I had to re-compile NumPy and SciPy to get it to work. It used the one little corner of SciPy that’s often broken on Macs.) If you’re interested in learning more, Downey is working on a new book called Think Bayes that you can read for free, Fernando Perez has posted his notebook from the course, and you can watch the video.

Education Summit

The PyCon Education Summit brought together educators from all kinds of backgrounds from K-12 teachers to those teaching adults. I went due to my interest as an instructor for Software Carpentry. Most of the discussion focused on teaching Python/computation in long-form courses to people who have zero programming experience.

I didn’t take much concrete away from the summit, but I was impressed with the sheer level of energy going into the Python/education nexus. There are many people out there experimenting with Python in education and developing lessons that use Python. There are also a lot of user groups around the country (like the Boston Python Workshop) that are actively working to bring new people into the Python world. Many people do this in their spare time! That’s the kind of community devotion I love about Python.

I gave a five minute lightning talk at the summit that was part a preview of my PyCon talk and part showing off ipythonblocks. The Notebooks for that are at nbviewer.ipython.org/5165758.

Talks

The first and most important thing you should know about the talks is that they were all recorded and the videos are online. There were about a million concurrent talk sessions and I’m still catching up on all the great stuff I missed. I highly recommend starting with the opening/closing statements from Jesse Noller and the Raspberry Pi keynote from Eben Upton.

I think there were standing ovations during each of those. And then there were the great regular talks I saw in person:

  • Python: A Toy Language by Dave Beazley
    • Do not miss a chance to see Dave Beazley talk. You will be thoroughly entertained and leave wondering why you do such boring things with your code. Here Beazley talks about using Python to control a hobby CNC mill.
  • How the Internet works by Jessica McKellar
    • Learn about the underlying structure and protocol of the web!
  • Awesome Big Data Algorithms by Titus Brown
    • Titus gives a great introduction to some algorithms and data structures that help deal with Data of Unusual Size. Also check out his blog post on the talk with links to his notebooks.
  • Who’s there? – Home Automation with Arduino/Raspberry Pi by Rupa Dachere
    • Rupa tells us how she built an automated front door camera. This talk was standing room only!
  • What Teachers Really Need from Us by Selena Deckelmann
    • Selena relates her experience getting to know teachers and how we as developers can best help them.

My Talk

I gave a talk titled “Teaching with the IPython Notebook” that focused on how the IPython Notebook can help students learning Python. (Primarily by simplifying their interface to Python.) It seemed to go well and I’m really glad I did it! The video is up and my presentation notebook is at nbviewer.ipython.org/5165431.

Posters

I stopped by Simeon Franklin’s poster about making Python more beginner friendly and I was really impressed with the level of interest surrounding the topic. Even Guido was there seriously engaged in this discussion. With engagement of this magnitude at that level I think we’ll see people putting serious effort into making Python more user friendly right out of the box, which will be wonderful.

Observations

As Wes McKinney noted on Twitter, there were two things everywhere at PyCon this year: the IPython Notebook and Raspberry Pis. It seemed like every other talk and tutorial was using the Notebook, and it’s no surprise: the Notebook is fantastic for presenting code plus supporting material and then sharing it. It’s a major boon to Python.

Everyone at the conference (plus some kids who came for free tutorials) left with a Raspberry Pi. These amazing little computers enable all kinds of projects, often attached to an Arduino for talking to hardware. In Eben Upton’s keynote I learned that the “Pi” in “Raspberry Pi” is for Python since much of the system is built on Python. The site raspberry.io has been set up as a community of projects that use RPis but I’m sure Googling turns up a ton more. A small, cheap, low powered, easy to program computer just has so many possibilities! I haven’t had a chance to start hacking on mine yet but I’m looking forward to it!

Thanks

A big thanks to STScI for sending me. Thanks to Greg Wilson for suggesting the talk idea and thanks to Titus Brown and Ethan White for looking over my proposal.