Testing With NumPy and Pandas

Testing Python results is often as straightforward as assert result == expected, especially with builtin types. But that doesn’t work with NumPy or Pandas data structures because using == with those doesn’t return True or False. Instead, == results in new arrays filled with boolean values. This is useful for boolean indexing, but leads to this error when testing:

In [2]: a = np.arange(10)

In [3]: b = np.arange(10)

In [4]: assert a == b
------------------------------------------------
ValueError     Traceback (most recent call last)
<ipython-input-4-6bf76ad3480a> in <module>()
----> 1 assert a == b

ValueError: The truth value of an array with more than one element is ambiguous.
            Use a.any() or a.all()

You can check whether all the elements in two arrays are equal using the .all() method:

In [5]: (a == b).all()
Out[5]: True

But that errs if the arrays are different sizes/shapes, and the result is an uninformative True or False when they are the same size. Luckily, NumPy has this situation covered.

Library Versions

For reference, these are the versions of NumPy and Pandas I’m currently using:

In [43]: np.version.version
Out[43]: '1.9.0'

In [44]: pd.version.version
Out[44]: '0.14.1'

Testing with NumPy

NumPy has an entire module devoted to testing support. I like to import it via import numpy.testing as npt in my tests. I’ll be focusing here on two functions, assert_array_equal and assert_allclose.

assert_array_equal

assert_array_equal raises an AssertionError when to arrays are not exactly equal. It can take anything array-like as inputs, including lists.

In [10]: npt.assert_array_equal([1, 2, 3], [1, 2, 3])

In [11]: npt.assert_array_equal([1, 2, 3], [1, 2, 3, 4, 5])
----------------------------------------------------
AssertionError     Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(shapes (3,), (5,) mismatch)
 x: array([1, 2, 3])
 y: array([1, 2, 3, 4, 5])

In [12]: npt.assert_array_equal([1, 2, 3], [99, 2, 3])
----------------------------------------------------
AssertionError     Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(mismatch 33.33333333333333%)
 x: array([1, 2, 3])
 y: array([99,  2,  3])

The examples show how you get somewhat descriptive output when the comparisons fail, including if the shapes are mismatched and what percentage of elements differ between the two arrays.

Similar functionality is available in the array_equal function, which returns True or False instead of raising an exception.

assert_allclose

assert_array_equal checks for exact equality. That’s fine for integer and boolean values, but often fails with floating point values because of very slight differences in the results of values calculated different ways or on different computers. For comparing floating point values I use assert_allclose.

In [17]: npt.assert_array_equal([np.pi], [np.sqrt(np.pi) ** 2])
-------------------------------------------------------
AssertionError        Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(mismatch 100.0%)
 x: array([ 3.141593])
 y: array([ 3.141593])

In [18]: npt.assert_allclose([np.pi], [np.sqrt(np.pi) ** 2])

assert_allclose takes atol and rtol arguments for specifying the absolute and relative tolerance of the comparison. For the most part I leave these at their defaults: atol=0 and rtol=1e-07. That’s a small enough tolerance that I’m confident the numbers are quite close, but large enough to let floating point noise go through. Sometimes, though, it’s useful to choose custom tolerances. For example, I was once writing tests based on numbers I copied out of a paper. The numbers were provided to four decimal places so in my tests I used npt.assert_allclose(result, expected, atol=0.0001). Choosing appropriate tolerances for testing with assert_allclose can be tricky depending how accurate you expect your code to be. Unfortunately, I don’t have any great advise on that.

assert_allclose also has a non-assertion version: allclose.

Notes

One very handy thing about assert_array_equal (and its scalar friendly cousin assert_equal) is that it handles values like nan intelligently. Normally nan compared to anything else, even nan, results in False. That’s the official, expected behavior, but it does make testing harder. assert_array_equal handles this for you.

In [29]: (np.array([np.nan, 2, 3]) == np.array([np.nan, 2, 3])).all()
Out[29]: False

In [30]: npt.assert_array_equal([np.nan, 2, 3], [np.nan, 2, 3])

Note that array_equal and equal behave in the official manner and will always return False for comparisons to nan.

Testing with Pandas

Pandas also has a testing module, but it is apparently meant more for internal testing of Pandas itself than for Pandas users. There is no documentation page for it, but it’s still available and I use it in testing. I import it via import pandas.util.testing as pdt.

The three main things I use are assert_frame_equal, assert_series_equal, and assert_index_equal. assert_frame_equal and assert_series_equal take arguments that let you control whether the comparisons are exact or approximate, and whether to compare types in addition to value equality. By default they use an allclose-like comparison.

In [39]: s1 = pd.Series([1, 2, 3], dtype='int')

In [40]: s2 = pd.Series([1, 2, 3], dtype='float')

In [41]: pdt.assert_series_equal(s1, s2)
-------------------------------------------------------
AssertionError        Traceback (most recent call last)
<truncated>

AssertionError: attr is not equal [dtype]: dtype('int64') != dtype('float64')

In [42]: pdt.assert_series_equal(s1, s2, check_dtype=False)

assert_frame_equal is sensitive to the order of columns and rows in the tables. I’ve found this is not always what I want, sometimes it’s fine if ordering changes as long as the same column names and index labels are in both tables. I’ve made my own assert_frames_equal function for testing that case.


Just because you’re using complex data containers like arrays and DataFrames in your code doesn’t mean you can’t test it. NumPy and Pandas are themselves heavily tested and you can test your own code using the same utilities the NumPy and Pandas developers use. Happy testing!

Performance of Pandas Series vs NumPy Arrays

I recently spent a day working on the performance of a Python function and learned a bit about Pandas and NumPy array indexing. The function is iterative, looping over data and updating some row weights until it meets convergence criteria. I tried to do as much processing as I could before the loops, but some indexing (and of course arithmetic) had to stay inside the loops.

When I looked at profiles of the function almost all of the time was being spent doing indexing on Pandas Series objects. A quick investigation shows that indexing Series objects is quite slow compared to NumPy arrays. First, some setup: Read More »

Install Scientific Python on Mac OS X

These instructions detail how I install the scientific Python stack on my Mac. You can always check the Install Python page for other installation options.

I’m running the latest OS X Mountain Lion (10.8) but I think these instructions should work back to Snow Leopard (10.6). These instructions differ from my previous set primarily in that I now use Homebrew to install NumPy, SciPy, and matplotlib. I do this because Homebrew makes it easier to compile these with non-standard options that work around an issue with SciPy on OS X.

I’ll show how I install Python and the basic scientific Python stack:

If you need other libraries they can most likely be installed via pip and any dependencies can probably be installed via Homebrew.

Command Line Tools

The first order of business is to install the Apple command line tools. These include important things like development headers, gcc, and git. Head over to developer.apple.com/downloads, register for a free account, and download (then install) the latest “Command Line Tools for Xcode” for your version of OS X.

If you’ve already installed Xcode on Lion or Mountain Lion then you can install the command line tools from the preferences. If you’ve installed Xcode on Snow Leopard then you already have the command line tools.

Homebrew

Homebrew is my favorite package manager for OS X. It builds packages from source, intelligently re-uses libraries that are already part of OS X, and encourages best practices like installing Python packages with pip.

To install Homebrew paste the following in a terminal:

ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

The brew command and any executables it installs will go in the directory /usr/bin/local so you want to make sure that goes at the front of your system’s PATH. As long as you’re at it, you can also add the directory where Python scripts get installed. Add the following line to your .profile, .bash_profile, or .bashrc file:

export PATH=/usr/local/bin:/usr/local/share/python:$PATH

At this point you should close your terminal and open a new one so that this PATH setting is in effect for the rest of the installation.

Python

Now you can use brew to install Python:

brew install python

Afterwards you should be able to run the commands

which python
which pip

and see

/usr/local/bin/python
/usr/local/bin/pip

for each, respectively. (It’s also possible to install Python 3 using Homebrew: brew install python3.)

NumPy

It is possible to use pip to install NumPy, but I use a Homebrew recipe so I avoid some problems with SciPy. The recipe isn’t included in stock Homebrew though, it requires “tapping” two other sources of Homebrew formula:

brew tap homebrew/science
brew tap samueljohn/python

You can learn more about these at their respective repositories:

With those repos tapped you can almost install NumPy, but first you’ll have
to use pip to install nose:

pip install nose

I compile NumPy against OpenBLAS to avoid a SciPy issue. Compiling OpenBLAS requires gfortran, which you can get via Homebrew:

brew install gfortran
brew install numpy --with-openblas

SciPy

And then you’re ready for SciPy:

brew install scipy --with-openblas

matplotlib

matplotlib generally installs just fine via pip but the custom Homebrew formula takes care of installing optional dependencies too:

brew install matplotlib

IPython

You’ll want Notebook support with IPython and that requires some extra dependencies, including ZeroMQ via brew:

brew install zeromq
pip install jinja2
pip install tornado
pip install pyzmq
pip install ipython

pandas

Pandas should install via pip:

pip install pandas

Testing It Out

The most basic test you can do to make sure everything worked is open up an IPython session and type in the following:

import numpy
import scipy
import matplotlib
import pandas

If there are no errors then you’re ready to get started! Congratulations and enjoy!