# pandas in the IPython Notebook

I finally got around to playing a tiny bit with pandas today and was delighted when I saw its representation in the IPython notebook. Take a look in the PDF.

IPython detects a special _repr_html_ method on (in this case) the pandas DataFrame object and uses that for the output instead of the usual __str__ or __repr__ methods. Here _repr_html_ returns the contents of the DataFrame in an HTML table, which is also useful for posting to a blog:
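As a rough illustration of the hook, here is a minimal sketch of a class that provides its own `_repr_html_`. The `Table` class and its attributes are hypothetical stand-ins, not pandas code; the only real convention assumed is that the notebook calls `_repr_html_` when it exists.

```python
class Table:
    """Tiny stand-in for a DataFrame (hypothetical, for illustration only)."""

    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows

    def __repr__(self):
        # Plain-text fallback used by ordinary consoles.
        return f"Table(columns={self.columns!r}, rows={len(self.rows)})"

    def _repr_html_(self):
        # The notebook looks for this method and renders the returned HTML.
        header = "".join(f"<th>{c}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{v}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><tr>{header}</tr>{body}</table>"


t = Table(["Ant", "Bat"], [[0, 1], [4, 5]])
html = t._repr_html_()
print(html)
```

In a notebook, leaving `t` as the last expression in a cell should show the rendered table rather than the `__repr__` text.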

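|   | Ant | Bat | Cat | Dog |
|---|-----|-----|-----|-----|
| 0 | 0   | 1   | 2   | 3   |
| 1 | 4   | 5   | 6   | 7   |
| 2 | 8   | 9   | 10  | 11  |
| 3 | 12  | 13  | 14  | 15  |
| 4 | 16  | 17  | 18  | 19  |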

This is another great feature of the IPython notebook and I look forward to it popping up in more places!

# Fuzzy Floating Point Comparison

I was writing some tests today and I ran into a peculiar floating point issue. I had generated a sequence of numbers using numpy.linspace:

>>> np.linspace(0.1, 1, 10)
array([ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

Part of the code I was testing ended up checking whether the value 0.3 was in the range 0.3 – 0.8, including the end points. The answer should of course be yes, but there is a twist due to the actual values in the array returned by linspace:

>>> a = np.linspace(0.1, 1, 10)
>>> 0.3 in a
False
>>> 0.3 < a[2]
True

What’s happening is that the 0.3 returned by linspace is really 0.30000000000000004, while the literal 0.3 I type is really 0.29999999999999999. It’s not clear whether this situation would ever actually arise in the normal usage of the code I was testing, but I wanted to make sure this wouldn’t cause problems. My solution was to make a function which would test whether a value was in a given range with a tiny bit of fuzziness at the edges.

NumPy has a useful function for comparing floating point values within tolerances called allclose. But that’s for comparing equality; I needed fuzzy (but not very fuzzy) less than / greater than comparisons. To provide just that little bit of fuzziness I turned to the numpy.nextafter function.
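To make the distinction concrete, here is a small sketch. It uses `0.1 + 0.2` as a stand-in for the linspace value (an assumption for illustration; it produces the same 0.30000000000000004), and shows that a tolerant equality test passes while an ordering test still fails.

```python
import numpy as np

a = 0.1 + 0.2   # 0.30000000000000004, one step above the literal 0.3
b = 0.3

close = bool(np.allclose(a, b))   # tolerant equality
ordered = a <= b                  # plain ordering, no tolerance

print(close, ordered)  # True False
```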

nextafter gives the next representable floating point number after the first input value. The second input value controls the direction so you can get the next value either up or down. It turns out that the two numbers that are tripping me up are right next to each other in their floating point representation:

>>> np.nextafter(0.29999999999999999, 1)
0.30000000000000004
>>> np.nextafter(0.30000000000000004, 0)
0.29999999999999999

So to catch this case my range checking function only needs one ULP of fuzziness (which is not much at all) to handle this floating point error. To allow for this I wrote a function called fuzzy_between that takes a value and the lower and upper bounds of the test range and expands the test range by a couple ULP before doing a simple minval <= val <= maxval comparison:


import numpy as np

def fuzzy_between(val, minval, maxval, fuzz=2, inclusive=True):
    """
    Test whether a value is within some range with some fuzziness at the edges
    to allow for floating point noise.

    The fuzziness is implemented by expanding the range at each end fuzz steps
    using the numpy.nextafter function. For example, with the inputs
    minval = 1, maxval = 2, and fuzz = 2; the range would be expanded to
    minval = 0.99999999999999978 and maxval = 2.0000000000000009 before doing
    comparisons.

    Parameters
    ----------
    val : float
        Value being tested.

    minval : float
        Lower bound of range. Must be lower than maxval.

    maxval : float
        Upper bound of range. Must be higher than minval.

    fuzz : int, optional
        Number of times to expand bounds using numpy.nextafter.

    inclusive : bool, optional
        Set whether endpoints are within the range.

    Returns
    -------
    is_between : bool
        True if val is between minval and maxval, False otherwise.

    """
    # expand bounds outward, one representable float per step
    # (range replaces the Python 2 xrange; stepping toward +/-inf
    # works for bounds of any magnitude)
    for _ in range(fuzz):
        minval = np.nextafter(minval, -np.inf)
        maxval = np.nextafter(maxval, np.inf)

    if inclusive:
        return minval <= val <= maxval

    else:
        return minval < val < maxval
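For anyone without NumPy at hand, the same idea can be sketched with the standard library. This is not the function above but a hypothetical variant, `fuzzy_between_std`, which assumes Python 3.9 or later (where `math.nextafter` exists). The lower bound `0.1 + 0.2` again stands in for the troublesome linspace value.

```python
import math


def fuzzy_between_std(val, minval, maxval, fuzz=2, inclusive=True):
    """Stdlib-only sketch of the range check (hypothetical name)."""
    # Widen the range by `fuzz` representable floats at each end.
    for _ in range(fuzz):
        minval = math.nextafter(minval, -math.inf)
        maxval = math.nextafter(maxval, math.inf)
    if inclusive:
        return minval <= val <= maxval
    return minval < val < maxval


lower = 0.1 + 0.2  # 0.30000000000000004, one step above the literal 0.3

plain = lower <= 0.3 <= 0.8                     # exact comparison
fuzzy = fuzzy_between_std(0.3, lower, 0.8)      # bounds widened by 2 ULP
outside = fuzzy_between_std(0.29, lower, 0.8)   # genuinely out of range

print(plain, fuzzy, outside)  # False True False
```

The widening is only a couple of ULP, so values that are really outside the range are still rejected.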


For a great discussion on comparing floating point numbers see this randomascii post, and for some interesting discussion on the fallibility of range functions see this post on Google+ by Guido van Rossum. Guido actually calls out numpy.linspace as a range function not susceptible to floating point drift (since it’s calculating intervals, not adding numbers), but it’s always possible to get surprises with floating point numbers.
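The drift point can be seen in plain Python. The sketch below is only an illustration of the two strategies, not NumPy's actual implementation: it compares repeatedly adding a step against computing the end point from its index.

```python
step = 0.1
n = 10

# Accumulating: every addition rounds, and the rounding errors pile up.
accumulated = 0.0
for _ in range(n):
    accumulated += step

# Interval style: one multiplication per point, so errors do not compound.
computed = 0.0 + n * step

print(accumulated)  # 0.9999999999999999
print(computed)     # 1.0
```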

# Python imhead and hedit

Working at STScI I frequently deal with FITS files, and while working on the new CALACS pipeline I often needed to see and change header values to run CALACS under different regimes. This can be done with the IRAF tasks imhead and hedit, but I try to use IRAF as little as possible.

PyFITS has convenience functions for working with FITS headers, but as of version 3.0.7 it doesn’t come with any scripts that make them accessible from the command line. I expect that soon PyFITS and/or AstroPy will have these scripts, but in the meantime here are the ones I use.

# IPython HTML Notebook

Until recently I had never been a fan of IPython but with their HTML notebook they’ve finally won me over. What I like about this tool is that it makes it easy to go back and forth between interactive prototyping and a script. Being able to continuously edit and re-run code in an interactive session is a powerful tool.

The notebook also makes a great tutorial and demo tool. Here’s a PDF of my session developing a Python replacement for IDL’s GAUSSFIT function.

# Installation

The IPython notebook requires a few extra packages, but if you have a setup like mine it’s easy to get everything installed:

brew install zeromq
pip install pyzmq
pip install ipython

After doing this you may also want to locally install MathJax for JavaScript equation rendering:

from IPython.external.mathjax import install_mathjax
install_mathjax()

To launch the notebook from whatever directory you want to work in:

ipython notebook

This will launch the IPython notebook dashboard in your default browser, from which you can make new notebooks or resume working on existing ones.

See the docs for all you can do with the notebook, and enjoy!

# IDL’s GAUSSFIT in Python

A colleague recently asked for help getting the functionality of IDL’s GAUSSFIT function working in Python. This was a perfect opportunity to use the handy curve_fit function from SciPy. Here’s the code:

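import numpy as np
from scipy.optimize import curve_fit

# read the two data columns into separate arrays
xdata, ydata = np.loadtxt('focus_output.dat', unpack=True)

def fit_func(x, a0, a1, a2, a3, a4, a5):
    # Gaussian plus quadratic, as in GAUSSFIT with NTERMS=6
    z = (x - a1) / a2
    y = a0 * np.exp(-z**2 / 2) + a3 + a4 * x + a5 * x**2
    return y

# least-squares fit; returns best-fit parameters and their covariance matrix
parameters, covariance = curve_fit(fit_func, xdata, ydata)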



The file focus_output.dat just contains some data in two columns of numbers. For more info on loadtxt see my post on reading text tables. fit_func defines the function we want to fit to the data. In this case it is a Gaussian plus a quadratic, the same as used in GAUSSFIT when NTERMS=6. Now, to plot the results:

import matplotlib.pyplot as plt

fitdata = fit_func(xdata, *parameters)

fig = plt.figure(figsize=(6,4), frameon=False)
ax = fig.add_axes([0, 0, 1, 1], facecolor='k')  # facecolor replaces the removed axisbg keyword

ax.plot(xdata, ydata, 'c-', xdata, fitdata, 'm-', linewidth=3)

ax.set_ylim(0.38, 1.02)

fig.savefig('gauss_fit_demo.png')
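To try the fit without focus_output.dat, here is a sketch on synthetic data. Everything in it is made up for illustration: the "true" parameters, the noise level, and the starting guess `p0`. The model uses the IDL form of the Gaussian term, exp(-z²/2). By default curve_fit starts every parameter at 1, so passing a rough `p0` taken from the data tends to be more reliable.

```python
import numpy as np
from scipy.optimize import curve_fit


def gauss_quad(x, a0, a1, a2, a3, a4, a5):
    # Gaussian plus quadratic background (GAUSSFIT's NTERMS=6 form).
    z = (x - a1) / a2
    return a0 * np.exp(-z**2 / 2) + a3 + a4 * x + a5 * x**2


# Made-up parameters and noise, purely for the demonstration.
true_params = (1.0, 0.5, 1.2, 0.2, 0.01, 0.002)
rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 200)
y = gauss_quad(x, *true_params) + rng.normal(scale=0.01, size=x.size)

# Rough starting guess: peak height, peak position, unit width, flat background.
p0 = (y.max() - y.min(), x[np.argmax(y)], 1.0, 0.0, 0.0, 0.0)

params, cov = curve_fit(gauss_quad, x, y, p0=p0)
errors = np.sqrt(np.diag(cov))  # one-sigma uncertainties from the covariance

for name, value, err in zip(("a0", "a1", "a2", "a3", "a4", "a5"), params, errors):
    print(f"{name} = {value:.4f} +/- {err:.4f}")
```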



# Reading Text Tables with Python

Reading tables is a pretty common thing to do and there are a number of ways to read tables besides writing a read function yourself. That’s not to say these are magic bullets. Every table is different and can have its own eccentricities. If you find yourself reading the same type of quirky file over and over again it could be worth your effort to write your own reader that does things just the way you like. That said, here are some other options.
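As one example of such an option, here is a minimal sketch using numpy.loadtxt on a small whitespace-separated table. The table contents are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np

# Hypothetical two-column table; lines starting with '#' are skipped by default.
text = io.StringIO("""\
# x  y
1.0  10.0
2.0  20.0
3.0  30.0
""")

# unpack=True returns one array per column instead of one 2-D array.
x, y = np.loadtxt(text, unpack=True)

print(x)
print(y)
```

Passing a filename instead of the buffer works the same way.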

# Install Python, NumPy, SciPy, and matplotlib on Mac OS X – Double Click

Update: These instructions are over a year old, though they may still work for you. See the “Install Python” page for the most up-to-date instructions.

I’ve already written a post about installing Python, NumPy, SciPy, and matplotlib on Lion, but it involves a lot of working at the command line, modifying your .bash_profile and dealing with compiler problems. That’s what I’ll call the compile-it-yourself (CIY) method. What I’ll describe below I’ll call the “double click” method.

I personally use the CIY method because it allows me to very easily control what’s installed. With Homebrew and pip I can uninstall and upgrade different things at will, or choose to install bleeding-edge versions. But it’s more hassle than many people want, and there’s now an easier way using double-click installers.

Until recently the CIY method was the only way to get everything working on Lion, but now the developers of NumPy, SciPy, and matplotlib have all caught up and it’s possible to just download and double-click on a few DMG files to get a basic scientific Python installation working. Once you get to know Python, though, you will undoubtedly want to install some other packages, and when that time comes I suggest you use pip.

# Teaching Python at Software Carpentry – Toronto, February 2012

Last week I had the privilege of teaching Python (and generally helping out) at a Software Carpentry bootcamp at the University of Toronto. Mike Fletcher and I worked together to make a lesson plan targeted at students completely new to Python. Overall it was a great experience and I had a lot of fun meeting some other Pythonistas. Here I’d like to talk about some lessons learned from our teaching. (Update: Mike has written up his thoughts as well.)

The overall narrative of our lessons was an attempt to read a CSV data file. We built from the very basics (data types, if, for) up through functions and modules in about 5 hours total. We tried to alternate between a few minutes of lecturing and a few minutes of exercises that students typed and ran on their own computers using template files we had given them. I think we built a good narrative and I particularly liked how we segued into writing functions, writing modules, and re-using code. However, for the time available and the students we had I think we could make improvements.

Our students were mostly science graduate students. They were largely not experienced programmers and none of them (as far as I know) had experience with Python. We had just a few hours with them, though they will be continuing with online lessons. Our particular failing in this situation was that the complexity of our exercises quickly outpaced the experience of our students, making them confused and slowing everything down. When I do this again I will probably iterate on the same content but update the exercise templates to be much more filled in, requiring the students to fill in only a couple of key lines. This will hopefully allow us to move through more content in less time without sacrificing much in terms of what the students practice.

Teaching in such a short amount of time is a challenge. How much time do you spend selling Python, all the great modules in the standard library, and the great third-party packages that make Python an excellent choice for scientific computing? If you do some really compelling demos maybe the students will be more inspired to continue using Python on their own and it won’t matter that you’ve sacrificed time you could have spent teaching them the basics. How much time do you spend teaching them about installing and managing Python and third-party packages? It’s hard to make a case that you should teach this at all in the first class but it’s bound to come up for everyone when they stumble across cool-package-X.

I definitely feel some initial demos are in order, just enough to say, “Look, you can do all this great stuff with Python!” Then you’ve got to start teaching something and keep it moving as you do.

# The Udacity Teaching Model

I’m currently taking CS373 at Udacity (they also offer CS101). They have a pretty cool teaching model with a few minutes of instruction followed by short quizzes so that you get immediate feedback on how you’re doing. The quizzes even involve programming: you’re given a nice text editor (implemented in JavaScript, I believe) that does nice syntax highlighting and automatic indenting. It’s usually pre-populated with some starting code and test variables. You write your code and click “Run” to see the output. When it’s satisfactory you click “Submit” to have your output scored against what is expected. If you like you can write your code in your own editor/environment and copy it back when you’re ready to submit.

The Udacity classes are not live but I can imagine a similar tool being useful in a live classroom setting because it gives the teacher the chance to see what students are writing and what output they’re getting. The code is run on the server side so it can be consistent for everyone. And though it’s not an ideal development environment, it does help for students who might not have Python installed on their computers. One drawback is that I don’t know how it would work when it comes to teaching students to read and write files.

# Install Python, NumPy, SciPy, and matplotlib on Mac OS X

Update: These instructions are over a year old, though they may still work for you. See the “Install Python” page for the most recent instructions.

A bit ago a friend and I both had fresh Mac OS X Lion installs so I helped him set up his computers with a scientific Python setup and did mine at the same time.

These instructions are for Lion but should work on Snow Leopard or Mountain Lion without much trouble. On Snow Leopard you won’t install Xcode via the App Store; you’ll have to download it from Apple.

After I’d helped my friend I found this blog post describing a procedure pretty much the same as below.

Update: If doing all the stuff below doesn’t seem like your cup of tea, it’s also possible to install Python, NumPy, SciPy, and matplotlib using double-click binary installers (resulting in a much less flexible installation). See this post to learn how.