Introducing Palettable

March 6, 2015 jiffyclubbrewer2mpl, colorbrewer, data visualization, data viz, palettable, pythonLeave a comment

I wrote brewer2mpl a couple years ago to help people use colorbrewer2 color palettes in Python. Since then it’s expanded to include palettes from Tableau and the whimsical Wes Anderson Palettes Tumblr; and there’s plenty of room for more palettes from other sources. To encompass the growing scope, brewer2mpl has been renamed to Palettable! (Thanks to Paul Ivanov for the name.)

The Palettable API has also been updated for the IPython age. All available palettes are now loaded at import and are available for your tab-completion pleasure. Need the YlGnBu palette with nine colors? That’s now available at palettable.colorbrewer.sequential.YlGnBu_9. Reversed palettes are also available with a _r suffix.

I hope you find Palettable useful! You can find it on the web at:

Docs: https://jiffyclub.github.io/palettable/
GitHub: https://github.com/jiffyclub/palettable
PyPI: https://pypi.python.org/pypi/palettable/

P.S.: Here’s a little demo notebook.

Snakeviz 0.2

December 1, 2014December 1, 2014 jiffyclubpython snakeviz profiling5 Comments

It has been over two years since Erik Bray and I made the first release of SnakeViz 0.1, a tool for visualizing performance profiles of Python code. It had multiple performance bottlenecks, but it worked just well enough that it took me a long time to prioritize making improvements. That time has finally come around and I’m happy to announce that SnakeViz 0.2 is now available!

What’s New

The look and feel of SnakeViz remains much the same (see a screenshot), but there are some new things on the screen:

SnakeViz screenshot

Detailed function information when hovering over the visualization
Call stack list for tracking where you are when zooming the visualization
Control the depth of the displayed call tree
Limit the display of functions that take up relatively little time

Under the Hood

The first release of SnakeViz had some performance bottlenecks:

It tried to transfer a complete call tree from the server to the client as JSON
It tried to display the entire call tree in the sunburst visualization

Those limited the usefulness of SnakeViz with profiles that contained calls to a lot of functions. The version 0.2 release is an almost complete rewrite in order to make SnakeViz work with larger profiles.

The first limitation is addressed by moving the building of call trees into the client application. Profile data is passed from the server to the client in close to the same form as it’s available from Python’s pstats module. Once in the client, the profile data is used to construct call trees on demand for visualization.

The second limitation is addressed by limiting how much of the profile is visualized at once. Call trees are built only to a user specified depth and users can opt to omit functions that do not use much time from display. (The “depth” and “cutoff” controls.)

I and others have tested SnakeViz 0.2 with some fairly large profiles and found it works. You can read more about SnakeViz in the updated docs. Please give it a try! Issues can be reported on GitHub.

Testing With NumPy and Pandas

October 7, 2014October 7, 2014 jiffyclubnumpy, pandas, python, testing, unit testing9 Comments

Testing Python results is often as straightforward as assert result == expected, especially with builtin types. But that doesn’t work with NumPy or Pandas data structures because using == with those doesn’t return True or False. Instead, == results in new arrays filled with boolean values. This is useful for boolean indexing, but leads to this error when testing:

In [2]: a = np.arange(10)

In [3]: b = np.arange(10)

In [4]: assert a == b
------------------------------------------------
ValueError     Traceback (most recent call last)
<ipython-input-4-6bf76ad3480a> in <module>()
----> 1 assert a == b

ValueError: The truth value of an array with more than one element is ambiguous.
            Use a.any() or a.all()

You can check whether all the elements in two arrays are equal using the .all() method:

In [5]: (a == b).all()
Out[5]: True

But that errs if the arrays are different sizes/shapes, and the result is an uninformative True or False when they are the same size. Luckily, NumPy has this situation covered.

Library Versions

For reference, these are the versions of NumPy and Pandas I’m currently using:

In [43]: np.version.version
Out[43]: '1.9.0'

In [44]: pd.version.version
Out[44]: '0.14.1'

Testing with NumPy

NumPy has an entire module devoted to testing support. I like to import it via import numpy.testing as npt in my tests. I’ll be focusing here on two functions, assert_array_equal and assert_allclose.

`assert_array_equal`

assert_array_equal raises an AssertionError when to arrays are not exactly equal. It can take anything array-like as inputs, including lists.

In [10]: npt.assert_array_equal([1, 2, 3], [1, 2, 3])

In [11]: npt.assert_array_equal([1, 2, 3], [1, 2, 3, 4, 5])
----------------------------------------------------
AssertionError     Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(shapes (3,), (5,) mismatch)
 x: array([1, 2, 3])
 y: array([1, 2, 3, 4, 5])

In [12]: npt.assert_array_equal([1, 2, 3], [99, 2, 3])
----------------------------------------------------
AssertionError     Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(mismatch 33.33333333333333%)
 x: array([1, 2, 3])
 y: array([99,  2,  3])

The examples show how you get somewhat descriptive output when the comparisons fail, including if the shapes are mismatched and what percentage of elements differ between the two arrays.

Similar functionality is available in the array_equal function, which returns True or False instead of raising an exception.

`assert_allclose`

assert_array_equal checks for exact equality. That’s fine for integer and boolean values, but often fails with floating point values because of very slight differences in the results of values calculated different ways or on different computers. For comparing floating point values I use assert_allclose.

In [17]: npt.assert_array_equal([np.pi], [np.sqrt(np.pi) ** 2])
-------------------------------------------------------
AssertionError        Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(mismatch 100.0%)
 x: array([ 3.141593])
 y: array([ 3.141593])

In [18]: npt.assert_allclose([np.pi], [np.sqrt(np.pi) ** 2])

assert_allclose takes atol and rtol arguments for specifying the absolute and relative tolerance of the comparison. For the most part I leave these at their defaults: atol=0 and rtol=1e-07. That’s a small enough tolerance that I’m confident the numbers are quite close, but large enough to let floating point noise go through. Sometimes, though, it’s useful to choose custom tolerances. For example, I was once writing tests based on numbers I copied out of a paper. The numbers were provided to four decimal places so in my tests I used npt.assert_allclose(result, expected, atol=0.0001). Choosing appropriate tolerances for testing with assert_allclose can be tricky depending how accurate you expect your code to be. Unfortunately, I don’t have any great advise on that.

assert_allclose also has a non-assertion version: allclose.

Notes

One very handy thing about assert_array_equal (and its scalar friendly cousin assert_equal) is that it handles values like nan intelligently. Normally nan compared to anything else, even nan, results in False. That’s the official, expected behavior, but it does make testing harder. assert_array_equal handles this for you.

In [29]: (np.array([np.nan, 2, 3]) == np.array([np.nan, 2, 3])).all()
Out[29]: False

In [30]: npt.assert_array_equal([np.nan, 2, 3], [np.nan, 2, 3])

Note that array_equal and equal behave in the official manner and will always return False for comparisons to nan.

Testing with Pandas

Pandas also has a testing module, but it is apparently meant more for internal testing of Pandas itself than for Pandas users. There is no documentation page for it, but it’s still available and I use it in testing. I import it via import pandas.util.testing as pdt.

The three main things I use are assert_frame_equal, assert_series_equal, and assert_index_equal. assert_frame_equal and assert_series_equal take arguments that let you control whether the comparisons are exact or approximate, and whether to compare types in addition to value equality. By default they use an allclose-like comparison.

In [39]: s1 = pd.Series([1, 2, 3], dtype='int')

In [40]: s2 = pd.Series([1, 2, 3], dtype='float')

In [41]: pdt.assert_series_equal(s1, s2)
-------------------------------------------------------
AssertionError        Traceback (most recent call last)
<truncated>

AssertionError: attr is not equal [dtype]: dtype('int64') != dtype('float64')

In [42]: pdt.assert_series_equal(s1, s2, check_dtype=False)

assert_frame_equal is sensitive to the order of columns and rows in the tables. I’ve found this is not always what I want, sometimes it’s fine if ordering changes as long as the same column names and index labels are in both tables. I’ve made my own assert_frames_equal function for testing that case.

Just because you’re using complex data containers like arrays and DataFrames in your code doesn’t mean you can’t test it. NumPy and Pandas are themselves heavily tested and you can test your own code using the same utilities the NumPy and Pandas developers use. Happy testing!

Resources for Learning Python

April 17, 2014 jiffyclublearning python, pythonLeave a comment

Yesterday I asked my followers on Twitter for their advice on the best resources for people learning programming and Python:

Please help: best resources using Python for people who are new to programming?

— Matt Davis @jiffyclub@hachyderm.io (@jiffyclub) April 15, 2014

You can see their responses on Twitter and below.

olgabot: Code Academy
modernscientist: Learn Python the Hard Way
AlexVianaPro: Think Python
oceankidbilly: A list of resources posted on reddit
ptone: How to Think Like a Computer Scientist
pani5ue: Google’s Python Class

Of those, I think Think Python and How to Think Like a Computer Scientist are especially targetted at people who are brand new to programming in any language.

These are some of the resources I learned from back when I picked up Python, though I should note that I already knew some programming at the time:

Thanks to everyone who responded!

Docker via Homebrew

March 9, 2014 jiffyclubdocker, homebrew, mac9 Comments

Docker is a great tool for getting lightweight, isolated Linux environments. It uses technology that doesn’t work natively on Macs. Until now you’ve had to boot into a VM to install and use Docker, but it’s now a little easier than that.

As of Docker 0.8 it can be run on Macs thanks to a specially developed, lightweight VirtualBox VM. There are official instructions for installing Docker on Mac, but with Homebrew and cask it’s even easier.

Follow the instructions on the cask homepage to install it. Cask is an extension to Homebrew for installing Mac binary packages via the command line. Think things like Chrome or Steam. Or VirtualBox. Running Docker on Mac requires VirtualBox so if you don’t have it already:

brew cask install virtualbox

Then install Docker and the helper tool boot2docker:

brew install docker
brew install boot2docker

boot2docker takes care of the VM that Docker runs in. To get things started it needs to download the Docker VM and start a daemon that the docker command line tool will talk to:

boot2docker init
boot2docker up

The docker command line tool should now be able to talk to the daemon and if you run docker version you should see a report for both a server and a client. (Note: When I ran boot2docker up it told me that the default port the daemon uses was already taken. I had to specify a different port via the DOCKER_HOST environment variable, which I now set in my shell configuration.)

If everything has gone well to this point you should now be able to start up a Docker instance. This command will drop you into a bash shell in Ubuntu:

docker run -i -t ubuntu /bin/bash

Use ctrl-D to exit. I find this especially helpful for very quickly getting to a Linux command line from my Mac for testing this or that, like checking what versions of software are installing by apt-get.

Visit the Docker documentation to learn more about what you can do with Docker and how to do it.

Making Commits via the GitHub API

February 26, 2014 jiffyclubgit, github, github api, github3.py, pythonLeave a comment

For fun I’ve been learning a bit about the GitHub API. Using the API it’s possible to do just about everything you can do on GitHub itself, from commenting on PRs to adding commits to a repo. Here I’m going to show how to do add commits to a repo on GitHub. A notebook demonstrating things with code is available here, but you may want to read this post first for the high level view.

Choosing a Client Library

The GitHub API is an HTTP interface so you can talk to it via any tool that speaks HTTP, including things like curl. To make programming with the API simpler there are a number of libraries that allow communicate with GitHub via means native to whatever language you’re using. I’m using Python and I went with the github3.py library based on its Python 3 compatibility, active development, and good documentation.

Making Commits

The repository api is the gateway for doing anything to a repo. In github3.py this is corresponds to the repository module.

Modifying a Single File

The special case of making a commit affecting a single file is much simpler than affecting multiple files. Creating, updating, and deleting a file can be done via a single API call once you have enough information to specify what you want done.

Modifying Multiple Files

Making a commit affecting multiple files requires making multiple API calls and some understanding of Git’s internal data store. That’s because to change multiple files you have to add all the changes to the repo one at a time before making a commit. The process is outlined in full in the API docs about Git data.

I should note that I think deleting multiple files in a single commit requires a slightly different procedure, one I’ll cover in another post.

That’s the overview, look over the notebook for the code! http://nbviewer.ipython.org/gist/jiffyclub/9235955

Data Provenance with GitPython

April 25, 2013 jiffyclubgit, provenance, pythonLeave a comment

Data Provenance

When running scientific software it is considered a best practice to automatically record the versions of all the software you use. This practice is sometimes referred to as recording the provenance of the results and helps make your analysis more reproducible. Almost all software libraries will have a version number that you can somehow access from your own software. For example, NumPy’s version number is recorded in the variable numpy.__version__ and most Python packages will having something similar. Python’s version is in the variable sys.version (and, alternatively, sys.version_info).

However, a lot of personal or lab software doesn’t have a version number. The software might change so fast and be modified by so many people that manually incrememented version numbers aren’t very practical. There’s still hope in this situation, though, if the software is under version control. (Your software is under version control, isn’t it?) In Subversion the keyword properties feature is often used to record provenance. There isn’t a compatible feature in Git, but for Python software in Git repositories we can engineer a provenance solution using the GitPython package.

Returning to Previous States with Git

When you make a commit in Git the state of the repository is recorded and given a label based on a hash of the commit data. We can use the commit hash to return to any recorded state of the repository using the “git checkout” command. This means that if you know the commit hash of your software when you created a certain set of results, you can always set your software back to that state to reproduce the same results. Very handy!

Recording the Commit Hash

When you import a Python module, code at the global level of the module is actually executed. This is often used to set global variables within the module, which is what we’ll do here. GitPython lets us interact with Git repos from Python and one thing we can do is query a repo to get the commit hash of the current “HEAD“. (HEAD is a label in Git pointing to the latest commit of whatever state the repository is currently in.)

What we can do with that is make it so that when our software modules are imported they set a global variable containing the commit hash of their HEAD at the time the software was run. That hash can then be inserted into data products as a record of the software version used to create them. Here’s some code that gets and stores the hash of the HEAD of a repo:

from git import Repo
MODULE_HASH = Repo('/path/to/repo/').head.commit.hexsha

If the module we’re importing is actually inside a Git repo we can use a bit of Python magic to get the HEAD hash without manually listing the path to the repo:

import os.path
from git import Repo
repo_dir = os.path.abspath(os.path.dirname(__file__))
MODULE_HASH = Repo(repo_dir).head.commit.hexsha

(__file__ is a global variable Python automatically sets in imported modules.)

Versioned Data

Some data formats, especially those that are text based, can be easily stored in version control. If you can put your data in a Git repo then the same strategy as above can be used to get and store the HEAD commit of the data repo when you run your analysis, allowing you to reproduce both your software and data states during later runs. If your data does not easily fit into Git it’s still a good idea to record a unique identifier for the dataset, but you may need to develop that yourself (such as a simple list of all the data files that were used as inputs).

Some git Aliases

March 7, 2013 jiffyclubgit, git aliasesLeave a comment

These are a few git aliases I’ve made recently to make my life a little easier. The first two are for displaying the log, and the rest for merging. See the Git wiki for more on how to add aliases.

Logs

I pretty much always want the one-line log, so this is the basic version of that:

ls = log --oneline --decorate

--decorate shows any labels on commits, such as branch names. Sometimes I want the above but with the graph view turned on:

lg = log --oneline --decorate --graph

Merging

Merges in git fall into two basic categories: fast-forward merges in which the branch label is simply updated to a new commit (but no new commit is made), and all other merges in which a new commit is made with multiple parents.

By default the merge command will attempt to do a fast-forward merge. If that won’t work it will move to some other “real” merge strategy. I think the difference between fast-forward and the other merges is sufficient that they shouldn’t happen with the same command so I’ve set up aliases to separate them. First the alias for doing a fast-forward. This will fail if a fast-forward is not possible:

ff = merge --ff-only

And then an alias for forcing a non-fast-forward merge even in situations where a fast-forward would be possible:

mrg = merge --no-ff

(This is the sort of merge GitHub does when you merge a pull request.)

Pulling

The pull command is really fetch + merge so it takes most of the same options. When I do a pull I generally only want it to succeed if it’s possible to fast-forward my local branch. If git must do a real merge something is probably wrong, so I have a fast-forward-only pull alias:

ffpull = pull --ff-only

A Styled HTML Document from Markdown

February 22, 2013February 22, 2013 jiffyclubmarkdown, pythonLeave a comment

There are many, many command line converters for turning Markdown into HTML, but for the most part these make HTML fragments, not full documents with CSS styling. That’s fine most of the time (e.g. when I’m writing blog posts), but sometimes I want a full, pretty document so I can print it out (typically for presentation notes).

To fill this hole I put together a small script that converts Markdown and wraps the HTML result in a template that includes Bootstrap CSS. I set the fonts to sans-serif and monospace so that they are taken from the defaults for your browser, making it easier to use your favorite fonts.

The script requires the Python libraries Python Markdown, mdx_smartypants (a Python-Markdown extension), and Jinja2.

	#!/usr/bin/env python

	import argparse
	import sys

	import jinja2
	import markdown

	TEMPLATE = """<!DOCTYPE html>
	<html>
	<head>
	<link href="http://netdna.bootstrapcdn.com/twitter-bootstrap/2.3.0/css/bootstrap-combined.min.css" rel="stylesheet">
	<style>
	body {
	font-family: sans-serif;
	}
	code, pre {
	font-family: monospace;
	}
	h1 code,
	h2 code,
	h3 code,
	h4 code,
	h5 code,
	h6 code {
	font-size: inherit;
	}
	</style>
	</head>
	<body>
	<div class="container">
	{{content}}
	</div>
	</body>
	</html>
	"""


	def parse_args(args=None):
	d = 'Make a complete, styled HTML document from a Markdown file.'
	parser = argparse.ArgumentParser(description=d)
	parser.add_argument('mdfile', type=argparse.FileType('r'), nargs='?',
	default=sys.stdin,
	help='File to convert. Defaults to stdin.')
	parser.add_argument('-o', '--out', type=argparse.FileType('w'),
	default=sys.stdout,
	help='Output file name. Defaults to stdout.')
	return parser.parse_args(args)


	def main(args=None):
	args = parse_args(args)
	md = args.mdfile.read()
	extensions = ['extra', 'smarty']
	html = markdown.markdown(md, extensions=extensions, output_format='html5')
	doc = jinja2.Template(TEMPLATE).render(content=html)
	args.out.write(doc)


	if __name__ == '__main__':
	sys.exit(main())

view raw markdown_doc hosted with ❤ by GitHub

Approaching git from svn

February 13, 2013 jiffyclubgit, subversion, svn1 Comment

“Yield and overcome” – Lao Tsu, Tao Te Ching

Over the past year or so I have on a few occasions taught git to people
accustomed to using Subversion. These are experienced software developers so it
seems like it should be easy, but it is almost always a difficult process
filled with acrimony, resistance, and gasps. (And this from people who seem
to use Emacs, vim, and Linux without complaint.)

And yet when I teach git to people who don’t already have a preferred
version control system it seems to go pretty well. It takes time, and there
may be confusion, but we demo and practice and learn as with any topic.
There is no hate. I think I’m beginning to understand the situation:

Git and svn are different.
People do not like their workflow changed.

Git isn’t just a little different, it’s completely different. But, dear
svn users, git is not different because it doesn’t like you. Git is different
because it’s optimized to facilitate the collaboration of a number of
developers working on a rapidly changing code base. Some examples:

Branching is easy in git so that you can work in peace isolated from
upstream changes until you are ready to send your work back.
When it’s time to merge back git provides rebase to facilitate cleanly
integrating upstream changes.
Git is distributed so that you and everyone else can make all manner of
branches and experiments without dirtying up the master repo.

Git is different from svn, but it does well what it is designed to do.
This brings me back to the Lao Tsu quote from the top and a simple message
to svn users learning git: You will have to drop your svn workflow to make
the most of git, but it’s not because git is bad, it’s just different. Yield,
use git in the way it was meant to be used, and be happy.

Pen and Pants

Don't Leave Home Without Them

Programming