So, what does Conda do well and what needs improvement? (more…)
Testing Python results is often as straightforward as
assert result == expected, especially with builtin types. But that doesn’t work with NumPy or Pandas data structures, because using
== with those doesn’t return a single boolean. Instead,
== results in new arrays filled with boolean values. This is useful for boolean indexing, but leads to this error when testing:
```
In : a = np.arange(10)

In : b = np.arange(10)

In : assert a == b
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-6bf76ad3480a> in <module>()
----> 1 assert a == b

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```
You can check whether all the elements in two arrays are equal using the .all() method:
```
In : (a == b).all()
Out: True
```
But that raises an error if the arrays are different sizes/shapes, and when they are the same size a failure yields only an uninformative
False. Luckily, NumPy has this situation covered.
For reference, these are the versions of NumPy and Pandas I’m currently using:
```
In : np.version.version
Out: '1.9.0'

In : pd.version.version
Out: '0.14.1'
```
Testing with NumPy
assert_array_equal (from NumPy’s testing module, imported here as npt) raises an
AssertionError when two arrays are not exactly equal. It can take anything array-like as input, including lists.
```
In : npt.assert_array_equal([1, 2, 3], [1, 2, 3])

In : npt.assert_array_equal([1, 2, 3], [1, 2, 3, 4, 5])
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(shapes (3,), (5,) mismatch)
 x: array([1, 2, 3])
 y: array([1, 2, 3, 4, 5])

In : npt.assert_array_equal([1, 2, 3], [99, 2, 3])
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(mismatch 33.33333333333333%)
 x: array([1, 2, 3])
 y: array([99, 2, 3])
```
The examples show how you get somewhat descriptive output when the comparisons fail, including if the shapes are mismatched and what percentage of elements differ between the two arrays.
Similar functionality is available in the
array_equal function, which returns
False instead of raising an exception.
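A quick demonstration of that non-assertion form, showing that mismatched shapes simply come back as not-equal rather than raising:

```python
import numpy as np

# array_equal returns a plain bool instead of raising an exception,
# and it treats mismatched shapes as simply "not equal".
same = np.array_equal([1, 2, 3], [1, 2, 3])         # True
diff_shape = np.array_equal([1, 2, 3], [1, 2])      # False, no exception
diff_value = np.array_equal([1, 2, 3], [99, 2, 3])  # False

print(same, diff_shape, diff_value)
```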
assert_array_equal checks for exact equality. That’s fine for integer and boolean values, but it often fails with floating point values because of very slight differences between values calculated in different ways or on different computers. For comparing floating point values I use assert_allclose:
```
In : npt.assert_array_equal([np.pi], [np.sqrt(np.pi) ** 2])
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<truncated>

AssertionError:
Arrays are not equal

(mismatch 100.0%)
 x: array([ 3.141593])
 y: array([ 3.141593])

In : npt.assert_allclose([np.pi], [np.sqrt(np.pi) ** 2])
```
assert_allclose takes atol and
rtol arguments for specifying the absolute and relative tolerance of the comparison. For the most part I leave these at their defaults: atol=0 and
rtol=1e-07. That’s a small enough tolerance that I’m confident the numbers are quite close, but large enough to let floating point noise go through. Sometimes, though, it’s useful to choose custom tolerances. For example, I was once writing tests based on numbers I copied out of a paper. The numbers were provided to four decimal places, so in my tests I used
npt.assert_allclose(result, expected, atol=0.0001). Choosing appropriate tolerances for testing with
assert_allclose can be tricky depending on how accurate you expect your code to be. Unfortunately, I don’t have any great advice on that.
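To illustrate that paper-derived tolerance choice, here is a minimal sketch (the numbers are made up for the example, not from any real paper):

```python
import numpy as np
import numpy.testing as npt

# Suppose a paper reports a value to four decimal places...
published = 0.1234
# ...and our own computation recovers it with more precision.
computed = 0.123449

# With atol=0.0001 the comparison accepts any result within one
# unit in the fourth decimal place, so this passes silently.
npt.assert_allclose(computed, published, atol=0.0001)
```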
assert_allclose also has a non-assertion version: np.allclose, which returns True or False instead of raising an exception.
One very handy thing about
assert_array_equal (and its scalar friendly cousin
assert_equal) is that it handles values like
nan intelligently. Normally
nan compared to anything else, even
nan, results in
False. That’s the official, expected behavior, but it does make testing harder.
assert_array_equal handles this for you.
```
In : (np.array([np.nan, 2, 3]) == np.array([np.nan, 2, 3])).all()
Out: False

In : npt.assert_array_equal([np.nan, 2, 3], [np.nan, 2, 3])
```
(The non-assertion versions like array_equal behave in the official manner and will always return
False for comparisons to nan.)
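The scalar version handles nan the same way, which is easy to check:

```python
import numpy as np
import numpy.testing as npt

# Plain comparison follows the official IEEE semantics: nan != nan.
plain = (np.nan == np.nan)  # False

# assert_equal (the scalar-friendly cousin) treats matching nans as
# equal, so these pass silently instead of raising.
npt.assert_equal(np.nan, np.nan)
npt.assert_equal([np.nan, 2, 3], [np.nan, 2, 3])
```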
Testing with Pandas
Pandas also has a testing module, but it is apparently meant more for internal testing of Pandas itself than for Pandas users. There is no documentation page for it, but it’s still available and I use it in testing. I import it via
import pandas.util.testing as pdt.
The three main things I use are assert_frame_equal, assert_series_equal, and assert_index_equal. assert_frame_equal and
assert_series_equal take arguments that let you control whether the comparisons are exact or approximate, and whether to compare types in addition to value equality. By default they use an exact comparison that also checks types:
```
In : s1 = pd.Series([1, 2, 3], dtype='int')

In : s2 = pd.Series([1, 2, 3], dtype='float')

In : pdt.assert_series_equal(s1, s2)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<truncated>

AssertionError: attr is not equal [dtype]: dtype('int64') != dtype('float64')

In : pdt.assert_series_equal(s1, s2, check_dtype=False)
```
assert_frame_equal is sensitive to the order of columns and rows in the tables. I’ve found this is not always what I want: sometimes it’s fine if the ordering changes, as long as the same column names and index labels appear in both tables. I’ve made my own
assert_frames_equal function for testing that case.
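The original function isn’t reproduced here, but a minimal sketch of the idea might look like this (my own names, and the modern pandas.testing import rather than the version above):

```python
import pandas as pd
import pandas.testing as pdt  # pandas.util.testing in older versions


def assert_frames_equal(actual, expected, **kwargs):
    # Hypothetical reimplementation of the order-insensitive check
    # described above: require the same labels in both frames, then
    # align both to a common ordering before comparing values.
    assert set(actual.columns) == set(expected.columns), 'column names differ'
    assert set(actual.index) == set(expected.index), 'index labels differ'
    pdt.assert_frame_equal(
        actual.sort_index().sort_index(axis=1),
        expected.sort_index().sort_index(axis=1),
        **kwargs)


# Same data, different column and row order: passes.
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({'b': [4, 3], 'a': [2, 1]}, index=[1, 0])
assert_frames_equal(df1, df2)
```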
Just because you’re using complex data containers like arrays and DataFrames in your code doesn’t mean you can’t test it. NumPy and Pandas are themselves heavily tested and you can test your own code using the same utilities the NumPy and Pandas developers use. Happy testing!
I recently spent a day working on the performance of a Python function and learned a bit about Pandas and NumPy array indexing. The function is iterative, looping over data and updating some row weights until it meets convergence criteria. I tried to do as much processing as I could before the loops, but some indexing (and of course arithmetic) had to stay inside the loops.
When I looked at profiles of the function almost all of the time was being spent doing indexing on Pandas Series objects. A quick investigation shows that indexing Series objects is quite slow compared to NumPy arrays. First, some setup: (more…)
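The full setup is truncated here, but the kind of comparison the post goes on to make can be sketched with a toy benchmark (my own code, not the post’s actual setup):

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.arange(10000))
a = s.values  # the underlying NumPy array

# Time repeated scalar indexing on the Series vs. the raw array.
series_time = timeit.timeit(lambda: s[5000], number=10000)
array_time = timeit.timeit(lambda: a[5000], number=10000)

print('Series:', series_time)
print('array: ', array_time)
```

On typical setups the raw-array lookups run substantially faster, which matches the profiling result described above.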
This week docs.scipy.org has been down, but folks still need their NumPy and SciPy docs. To fill the gap until docs.scipy.org is back up I built the docs for only the latest stable releases and uploaded them to GitHub pages:
How to Build
(Note that I’m working on a Mac and these instructions are a little Mac/Linux oriented. The procedure on Windows would not be drastically different, though.)
Yesterday I asked my followers on Twitter for their advice on the best resources for people learning programming and Python:
You can see their responses on Twitter and below.
- olgabot: Code Academy
- modernscientist: Learn Python the Hard Way
- AlexVianaPro: Think Python
- oceankidbilly: A list of resources posted on reddit
- ptone: How to Think Like a Computer Scientist
- pani5ue: Google’s Python Class
These are some of the resources I learned from back when I picked up Python, though I should note that I already knew some programming at the time:
Thanks to everyone who responded!
I wrote a bit ago about making commits via the GitHub API. That post outlined making changes in two simplified situations: making changes to a single file and making updates to two existing files at the root of the repository. Here I show a more general solution that allows arbitrary changes anywhere in the repo.
I want to be able to specify a repo and branch and say "here are the contents of files that have changed or been created and here are the names of files that have been deleted, please take all that and this message and make a new commit for me." Because the GitHub API is so rudimentary when it comes to making commits that will end up being a many-stepped process, but it’s mostly the same steps repeated many times so it’s not a nightmare to code up. At a high level the process goes like this:
- Get the current repo state from GitHub
- This is the names and hashes of all the files and directories, but not the actual file contents.
- Construct a local, malleable representation of the repo
- Modify the local representation according to the given updates, creations, and deletions
- Walk though the modified local "repo" and upload new/changed files and directories to GitHub
- This must be done from the bottom up because a change at the low level means every directory above that level will need to be changed.
- Make a new commit pointed at the new root tree (I’ll explain trees soon.)
- Update the working branch to point to the new commit
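The middle steps can be sketched with the repo modeled as nested dicts, directories mapping names to subtrees and files mapping to contents (all names here are my own, and the actual GitHub API calls are omitted):

```python
import copy


def apply_changes(tree, updates, deletions):
    """Return a copy of `tree` with files updated/created and deleted.

    `updates` maps paths like 'docs/index.md' to new file contents;
    `deletions` is an iterable of paths to remove.
    """
    tree = copy.deepcopy(tree)

    def descend(parts, create):
        # Walk down to the directory containing the final path element,
        # creating intermediate directories only for updates.
        node = tree
        for name in parts[:-1]:
            node = node.setdefault(name, {}) if create else node[name]
        return node

    for path, contents in updates.items():
        parts = path.split('/')
        descend(parts, create=True)[parts[-1]] = contents
    for path in deletions:
        parts = path.split('/')
        del descend(parts, create=False)[parts[-1]]
    return tree


repo = {'README.md': 'hi', 'docs': {'index.md': 'old'}}
new = apply_changes(repo,
                    {'docs/index.md': 'new', 'src/app.py': 'x = 1'},
                    ['README.md'])
# new == {'docs': {'index.md': 'new'}, 'src': {'app.py': 'x = 1'}}
```

From here the bottom-up walk would upload each changed file and directory to GitHub, working from the leaves toward the root tree.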
This blog post is readable as an IPython Notebook at http://nbviewer.ipython.org/gist/jiffyclub/10809459. I’ve also reproduced the notebook below. (more…)
Docker is a great tool for getting lightweight, isolated Linux environments. It uses technology that doesn’t work natively on Macs. Until now you’ve had to boot into a VM to install and use Docker, but it’s now a little easier than that.
As of Docker 0.8 it can be run on Macs thanks to a specially developed, lightweight VirtualBox VM. There are official instructions for installing Docker on Mac, but with Homebrew and cask it’s even easier.
Follow the instructions on the cask homepage to install it. Cask is an extension to Homebrew for installing Mac binary packages via the command line. Think things like Chrome or Steam. Or VirtualBox. Running Docker on Mac requires VirtualBox so if you don’t have it already:
```
brew cask install virtualbox
```
Then install Docker and the helper tool boot2docker:

```
brew install docker
brew install boot2docker
```
boot2docker takes care of the VM that Docker runs in. To get things started it needs to download the Docker VM and start a daemon that the
docker command line tool will talk to:
```
boot2docker init
boot2docker up
```
The docker command line tool should now be able to talk to the daemon, and if you run
docker version you should see a report for both a server and a client. (Note: When I ran
boot2docker up it told me that the default port the daemon uses was already taken. I had to specify a different port via the
DOCKER_HOST environment variable, which I now set in my shell configuration.)
If everything has gone well to this point you should now be able to start up a Docker instance. This command will drop you into a bash shell in Ubuntu:
```
docker run -i -t ubuntu /bin/bash
```
Use ctrl-D to exit. I find this especially helpful for very quickly getting to a Linux command line from my Mac for testing this or that, like checking what versions of software are installed by default.
Visit the Docker documentation to learn more about what you can do with Docker and how to do it.