Introducing Palettable

I wrote brewer2mpl a couple years ago to help people use colorbrewer2 color palettes in Python. Since then it’s expanded to include palettes from Tableau and the whimsical Wes Anderson Palettes Tumblr; and there’s plenty of room for more palettes from other sources. To encompass the growing scope, brewer2mpl has been renamed to Palettable! (Thanks to Paul Ivanov for the name.)

The Palettable API has also been updated for the IPython age. All available palettes are now loaded at import and are available for your tab-completion pleasure. Need the YlGnBu palette with nine colors? That’s now available at palettable.colorbrewer.sequential.YlGnBu_9. Reversed palettes are also available with a _r suffix.

I hope you find Palettable useful! You can find it on the web at:

P.S.: Here’s a little demo notebook.

Introducing Palettable

Snakeviz 0.2

It has been over two years since Erik Bray and I made the first release of SnakeViz 0.1, a tool for visualizing performance profiles of Python code. It had multiple performance bottlenecks, but it worked just well enough that it took me a long time to prioritize making improvements. That time has finally come around and I’m happy to announce that SnakeViz 0.2 is now available!

What’s New

The look and feel of SnakeViz remains much the same (see a screenshot), but there are some new things on the screen:

SnakeViz screenshot

  • Detailed function information when hovering over the visualization
  • Call stack list for tracking where you are when zooming the visualization
  • Control the depth of the displayed call tree
  • Limit the display of functions that take up relatively little time

Under the Hood

The first release of SnakeViz had some performance bottlenecks:

  • It tried to transfer a complete call tree from the server to the client as JSON
  • It tried to display the entire call tree in the sunburst visualization

Those limited the usefulness of SnakeViz with profiles that contained calls to a lot of functions. The version 0.2 release is an almost complete rewrite in order to make SnakeViz work with larger profiles.

The first limitation is addressed by moving the building of call trees into the client application. Profile data is passed from the server to the client in close to the same form as it’s available from Python’s pstats module. Once in the client, the profile data is used to construct call trees on demand for visualization.

The second limitation is addressed by limiting how much of the profile is visualized at once. Call trees are built only to a user specified depth and users can opt to omit functions that do not use much time from display. (The “depth” and “cutoff” controls.)


I and others have tested SnakeViz 0.2 with some fairly large profiles and found it works. You can read more about SnakeViz in the updated docs. Please give it a try! Issues can be reported on GitHub.

Snakeviz 0.2

State of Conda, Oct. 2014

I have been using Conda (via Miniconda) for managing my Python development environments and packages for close to a year now, so I thought I’d write up my thoughts so far for others.

Conda is both an environment manager (an alternative to virtualenv) and an installation tool (an alternative to pip). You can also use Conda to build your packages and distribute them via Binstar.

So, what does Conda do well and what needs improvement? Continue reading “State of Conda, Oct. 2014”

State of Conda, Oct. 2014

Testing With NumPy and Pandas

Testing Python results is often as straightforward as assert result == expected, especially with builtin types. But that doesn’t work with NumPy or Pandas data structures because using == with those doesn’t return True or False. Instead, == results in new arrays filled with boolean values. This is useful for boolean indexing, but leads to this error when testing:

In [2]: a = np.arange(10)

In [3]: b = np.arange(10)

In [4]: assert a == b
------------------------------------------------
ValueError     Traceback (most recent call last)
<ipython-input-4-6bf76ad3480a> in <module>()
----> 1 assert a == b

ValueError: The truth value of an array with more than one element is ambiguous.
            Use a.any() or a.all()

You can check whether all the elements in two arrays are equal using the .all() method:

In [5]: (a == b).all()
Out[5]: True

But that errs if the arrays are different sizes/shapes, and the result is an uninformative True or False when they are the same size. Luckily, NumPy has this situation covered.

Library Versions

For reference, these are the versions of NumPy and Pandas I’m currently using:

In [43]: np.version.version
Out[43]: '1.9.0'

In [44]: pd.version.version
Out[44]: '0.14.1'

Testing with NumPy

NumPy has an entire module devoted to testing support. I like to import it via import numpy.testing as npt in my tests. I’ll be focusing here on two functions, assert_array_equal and assert_allclose.

assert_array_equal

assert_array_equal raises an AssertionError when to arrays are not exactly equal. It can take anything array-like as inputs, including lists.

In [10]: npt.assert_array_equal([1, 2, 3], [1, 2, 3])

In [11]: npt.assert_array_equal([1, 2, 3], [1, 2, 3, 4, 5])
----------------------------------------------------
AssertionError     Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(shapes (3,), (5,) mismatch)
 x: array([1, 2, 3])
 y: array([1, 2, 3, 4, 5])

In [12]: npt.assert_array_equal([1, 2, 3], [99, 2, 3])
----------------------------------------------------
AssertionError     Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(mismatch 33.33333333333333%)
 x: array([1, 2, 3])
 y: array([99,  2,  3])

The examples show how you get somewhat descriptive output when the comparisons fail, including if the shapes are mismatched and what percentage of elements differ between the two arrays.

Similar functionality is available in the array_equal function, which returns True or False instead of raising an exception.

assert_allclose

assert_array_equal checks for exact equality. That’s fine for integer and boolean values, but often fails with floating point values because of very slight differences in the results of values calculated different ways or on different computers. For comparing floating point values I use assert_allclose.

In [17]: npt.assert_array_equal([np.pi], [np.sqrt(np.pi) ** 2])
-------------------------------------------------------
AssertionError        Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(mismatch 100.0%)
 x: array([ 3.141593])
 y: array([ 3.141593])

In [18]: npt.assert_allclose([np.pi], [np.sqrt(np.pi) ** 2])

assert_allclose takes atol and rtol arguments for specifying the absolute and relative tolerance of the comparison. For the most part I leave these at their defaults: atol=0 and rtol=1e-07. That’s a small enough tolerance that I’m confident the numbers are quite close, but large enough to let floating point noise go through. Sometimes, though, it’s useful to choose custom tolerances. For example, I was once writing tests based on numbers I copied out of a paper. The numbers were provided to four decimal places so in my tests I used npt.assert_allclose(result, expected, atol=0.0001). Choosing appropriate tolerances for testing with assert_allclose can be tricky depending how accurate you expect your code to be. Unfortunately, I don’t have any great advise on that.

assert_allclose also has a non-assertion version: allclose.

Notes

One very handy thing about assert_array_equal (and its scalar friendly cousin assert_equal) is that it handles values like nan intelligently. Normally nan compared to anything else, even nan, results in False. That’s the official, expected behavior, but it does make testing harder. assert_array_equal handles this for you.

In [29]: (np.array([np.nan, 2, 3]) == np.array([np.nan, 2, 3])).all()
Out[29]: False

In [30]: npt.assert_array_equal([np.nan, 2, 3], [np.nan, 2, 3])

Note that array_equal and equal behave in the official manner and will always return False for comparisons to nan.

Testing with Pandas

Pandas also has a testing module, but it is apparently meant more for internal testing of Pandas itself than for Pandas users. There is no documentation page for it, but it’s still available and I use it in testing. I import it via import pandas.util.testing as pdt.

The three main things I use are assert_frame_equal, assert_series_equal, and assert_index_equal. assert_frame_equal and assert_series_equal take arguments that let you control whether the comparisons are exact or approximate, and whether to compare types in addition to value equality. By default they use an allclose-like comparison.

In [39]: s1 = pd.Series([1, 2, 3], dtype='int')

In [40]: s2 = pd.Series([1, 2, 3], dtype='float')

In [41]: pdt.assert_series_equal(s1, s2)
-------------------------------------------------------
AssertionError        Traceback (most recent call last)
<truncated>

AssertionError: attr is not equal [dtype]: dtype('int64') != dtype('float64')

In [42]: pdt.assert_series_equal(s1, s2, check_dtype=False)

assert_frame_equal is sensitive to the order of columns and rows in the tables. I’ve found this is not always what I want, sometimes it’s fine if ordering changes as long as the same column names and index labels are in both tables. I’ve made my own assert_frames_equal function for testing that case.


Just because you’re using complex data containers like arrays and DataFrames in your code doesn’t mean you can’t test it. NumPy and Pandas are themselves heavily tested and you can test your own code using the same utilities the NumPy and Pandas developers use. Happy testing!

Testing With NumPy and Pandas

Performance of Pandas Series vs NumPy Arrays

I recently spent a day working on the performance of a Python function and learned a bit about Pandas and NumPy array indexing. The function is iterative, looping over data and updating some row weights until it meets convergence criteria. I tried to do as much processing as I could before the loops, but some indexing (and of course arithmetic) had to stay inside the loops.

When I looked at profiles of the function almost all of the time was being spent doing indexing on Pandas Series objects. A quick investigation shows that indexing Series objects is quite slow compared to NumPy arrays. First, some setup: Continue reading “Performance of Pandas Series vs NumPy Arrays”

Performance of Pandas Series vs NumPy Arrays

If You Want to Build the NumPy and SciPy Docs

This week docs.scipy.org has been down, but folks still need their NumPy and SciPy docs. To fill the gap until docs.scipy.org is back up I built the docs for only the latest stable releases and uploaded them to GitHub pages:

How to Build

(Note that I’m working on a Mac and these instructions are a little Mac/Linux oriented. The procedure on Windows would not be drastically different, though.)

Continue reading “If You Want to Build the NumPy and SciPy Docs”

If You Want to Build the NumPy and SciPy Docs

Resources for Learning Python

Yesterday I asked my followers on Twitter for their advice on the best resources for people learning programming and Python:

You can see their responses on Twitter and below.

Of those, I think Think Python and How to Think Like a Computer Scientist are especially targetted at people who are brand new to programming in any language.

These are some of the resources I learned from back when I picked up Python, though I should note that I already knew some programming at the time:

Thanks to everyone who responded!

Resources for Learning Python

More Commits via the GitHub API

I wrote a bit ago about making commits via the GitHub API. That post outlined making changes in two simplified situations: making changes to a single file and making updates to two existing files at the root of the repository. Here I show a more general solution that allows arbitrary changes anywhere in the repo.

I want to be able to specify a repo and branch and say "here are the contents of files that have changed or been created and here are the names of files that have been deleted, please take all that and this message and make a new commit for me." Because the GitHub API is so rudimentary when it comes to making commits that will end up being a many-stepped process, but it’s mostly the same steps repeated many times so it’s not a nightmare to code up. At a high level the process goes like this:

  • Get the current repo state from GitHub
    • This is the names and hashes of all the files and directories, but not the actual file contents.
  • Construct a local, malleable representation of the repo
  • Modify the local representation according to the given updates, creations, and deletions
  • Walk though the modified local "repo" and upload new/changed files and directories to GitHub
    • This must be done from the bottom up because a change at the low level means every directory above that level will need to be changed.
  • Make a new commit pointed at the new root tree (I’ll explain trees soon.)
  • Update the working branch to point to the new commit

This blob post is readable as an IPython Notebook at http://nbviewer.ipython.org/gist/jiffyclub/10809459. I’ve also reproduced the notebook below. Continue reading “More Commits via the GitHub API”

More Commits via the GitHub API

Using Conda Environments and the Fish Shell

I recently started over with a fresh development environment and decided to try something new: I’m using Python 3 via miniconda. The first real hiccup I’ve run into is that conda’s environment activation/deactivation scheme only works in bash or zsh. I use fish. There is an open PR to get fish support for conda but in the meantime I hacked something together to help me out.

"Activating" a conda environment does a couple of things:

  • Puts the environment’s "bin" directory at the front of the PATH environment variable.
  • Sets a CONDA_DEFAULT_ENV environment variable that tells conda in which environment to do things when none is specified.
  • Adds the environment name to the prompt ala virtualenv.

Deactivating the environment resets everything to its pre-activation state. The fish functions I put together work like this:

~ > type python
python is /Users/---/miniconda3/bin/python
~ > condactivate env-name
(env-name) ~ > type python
python is /Users/---/miniconda3/envs/env-name/bin/python
(env-name) ~ > deactivate
~ > type python
python is /Users/---/miniconda3/bin/python

Here’s the text of the functions:

function condalist -d 'List conda environments.'
for dir in (ls $HOME/miniconda3/envs)
echo $dir
end
end
function condactivate -d 'Activate a conda environment' -a cenv
if test -z $cenv
echo 'Usage: condactivate <env name>'
return 1
end
# condabin will be the path to the bin directory
# in the specified conda environment
set condabin $HOME/miniconda3/envs/$cenv/bin
# check whether the condabin directory actually exists and
# exit the function with an error status if it does not
if not test -d $condabin
echo 'Environment not found.'
return 1
end
# deactivate an existing conda environment if there is one
if set -q __CONDA_ENV_ACTIVE
deactivate
end
# save the current path
set -xg DEFAULT_PATH $PATH
# put the condabin directory at the front of the PATH
set -xg PATH $condabin $PATH
# this is an undocumented environmental variable that influences
# how conda behaves when you don't specify an environment for it.
# https://github.com/conda/conda/issues/473
set -xg CONDA_DEFAULT_ENV $cenv
# set up the prompt so it has the env name in it
functions -e __original_fish_prompt
functions -c fish_prompt __original_fish_prompt
function fish_prompt
set_color blue
echo -n '('$CONDA_DEFAULT_ENV') '
set_color normal
__original_fish_prompt
end
# flag for whether a conda environment has been set
set -xg __CONDA_ENV_ACTIVE 'true'
end
function deactivate -d 'Deactivate a conda environment'
if set -q __CONDA_ENV_ACTIVE
# set PATH back to its default before activating the conda env
set -xg PATH $DEFAULT_PATH
set -e DEFAULT_PATH
# unset this so that conda behaves according to its default behavior
set -e CONDA_DEFAULT_ENV
# reset to the original prompt
functions -e fish_prompt
functions -c __original_fish_prompt fish_prompt
functions -e __original_fish_prompt
set -e __CONDA_ENV_ACTIVE
end
end
# aliases so condactivate and deactivate can have shorter names
function ca -d 'Activate a conda environment'
condactivate $argv
end
function cda -d 'Deactivate a conda environment'
deactivate $argv
end
# complete conda environment names when activating
complete -c condactivate -xA -a "(condalist)"
complete -c ca -xA -a "(condalist)"
view raw conda.fish hosted with ❤ by GitHub

Or you can download it from https://gist.github.com/jiffyclub/9679788.

To use these, add them to the ~/.config/fish/ directory and source them from the end of the ~/.config/fish/config.fish file:

source $HOME/.config/fish/conda.fish
Using Conda Environments and the Fish Shell

Making Commits via the GitHub API

For fun I’ve been learning a bit about the GitHub API. Using the API it’s possible to do just about everything you can do on GitHub itself, from commenting on PRs to adding commits to a repo. Here I’m going to show how to do add commits to a repo on GitHub. A notebook demonstrating things with code is available here, but you may want to read this post first for the high level view.

Choosing a Client Library

The GitHub API is an HTTP interface so you can talk to it via any tool that speaks HTTP, including things like curl. To make programming with the API simpler there are a number of libraries that allow communicate with GitHub via means native to whatever language you’re using. I’m using Python and I went with the github3.py library based on its Python 3 compatibility, active development, and good documentation.

Making Commits

The repository api is the gateway for doing anything to a repo. In github3.py this is corresponds to the repository module.

Modifying a Single File

The special case of making a commit affecting a single file is much simpler than affecting multiple files. Creating, updating, and deleting a file can be done via a single API call once you have enough information to specify what you want done.

Modifying Multiple Files

Making a commit affecting multiple files requires making multiple API calls and some understanding of Git’s internal data store. That’s because to change multiple files you have to add all the changes to the repo one at a time before making a commit. The process is outlined in full in the API docs about Git data.

I should note that I think deleting multiple files in a single commit requires a slightly different procedure, one I’ll cover in another post.


That’s the overview, look over the notebook for the code! http://nbviewer.ipython.org/gist/jiffyclub/9235955

Making Commits via the GitHub API