Software Carpentry at Johns Hopkins

This week Joshua Smith and I hosted a Software Carpentry boot camp at Johns Hopkins University in Baltimore. We also had awesome teaching help from Sasha Wood and Mike Droettboom.

We opted for a small class since this was our first time hosting a boot camp and because of space limitations. Based on the rate at which people signed up for the class it didn’t seem like there was massive local demand anyway, but we were pleasantly surprised when we had a student from Brooklyn, NY and a student commuting from Virginia. There is definitely some existing demand for the skills Software Carpentry offers and I’m glad we could put on an accessible boot camp for those people.  Most of the rest of the students were physics and astronomy grad students or post-docs from JHU and STScI.

The boot camp was broken into four half-day courses: shell, Python, version control, and software engineering. Mike and I co-taught the Python and software engineering sessions.

The overall feedback from the students was quite positive and I’m looking forward to doing this again. (Here is the requisite good/bad Software Carpentry feedback post: http://software-carpentry.org/2012/06/feedback-from-johns-hopkins/.) Below I have some notes on the sections I taught, plus some overall thoughts.

pandas in the IPython Notebook

I finally got around to playing a tiny bit with pandas today and was delighted when I saw its representation in the IPython notebook. Take a look in the PDF.

IPython detects a special _repr_html_ method on (in this case) the pandas DataFrame object and uses that for the output instead of the usual __str__ or __repr__ methods. Here _repr_html_ returns the contents of the DataFrame in an HTML table, which is also useful for posting to a blog:

   Ant  Bat  Cat  Dog
0    0    1    2    3
1    4    5    6    7
2    8    9   10   11
3   12   13   14   15
4   16   17   18   19

This is another great feature of the IPython notebook and I look forward to it popping up in more places!
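To make the mechanism concrete, here is a minimal sketch of a class that opts into the same rich display. The `Table` class and its attributes are hypothetical names made up for illustration; only the `_repr_html_` method name comes from IPython's display protocol. This is not how pandas implements it, just the smallest thing that behaves similarly.

```python
from html import escape


class Table:
    """Hypothetical container used only to illustrate _repr_html_."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]

    def __repr__(self):
        # Plain-text fallback used by a regular Python prompt.
        return "Table(columns={!r}, rows={!r})".format(self.columns, self.rows)

    def _repr_html_(self):
        # The notebook calls this method, if present, and renders the
        # returned string as HTML instead of showing __repr__.
        header = "".join(
            "<th>{}</th>".format(escape(str(name))) for name in self.columns
        )
        body = "".join(
            "<tr>"
            + "".join("<td>{}</td>".format(escape(str(v))) for v in row)
            + "</tr>"
            for row in self.rows
        )
        return "<table><tr>{}</tr>{}</table>".format(header, body)


table = Table(["Ant", "Bat"], [[0, 1], [4, 5]])
html = table._repr_html_()
print(html)
```

Leaving `table` as the last expression in a notebook cell should show a rendered table, while a plain terminal session falls back to `__repr__`.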

Udacity: What Motivates Me?

A recent Udacity blog post asks its readers to describe what motivates them to complete online courses. I’ve completed four of these courses now so I should be in a good position to describe my motivation.

I take these courses for two reasons: to hopefully make myself a better job candidate and to learn cool new stuff from really experienced people.

I don’t have a formal computer science education so Stanford’s AI Class and Udacity’s How to Program a Robotic Car were good (and free) opportunities to learn about topics like search, planning, and filters. I expanded my vocabulary with things like A-star and breadth first search, Kalman and particle filters, and dynamic programming. How to Program a Robotic Car even had us writing programs using these topics, which I found to be a big help in learning them.

Another area I don’t have much experience with is web programming so I took Udacity’s Web Application Engineering with Steve Huffman. It was a fun and practical course. I learned about HTTP requests and responses, cookies, tracking, securely storing passwords, databases, caching, and more. I made a functioning web application with Google App Engine.

I didn’t have any trouble staying motivated to finish the courses. I found the material interesting enough that I was always looking forward to the next class. I love the digital certificate I get at the end. (I put them on Dropbox so I can link to them from my résumé.) One advantage I had in the Udacity courses is that I’m already a Python programmer so I could focus entirely on the content of the courses without the language getting in the way. (The instructors of How to Program a Robotic Car went out of their way to make their code as un-Pythonic as possible, though. I think they did it to make the code a bit less intimidating to people coming from other languages.)

The Udacity folks have been experimenting with classes with and without deadlines. The current MO seems to be to have deadlines the first time a course is offered and then leave the same material up and offer the class without deadlines thereafter. (The final is still scheduled with a deadline.) I think my wife really likes the deadlines and schedules because it means I can only spend so much time on a class in one week and I can point to a definite point in the future when the class will be done. Left to my own devices I would probably try to finish these courses in one short burst. I also feel like the deadlines prevent me from indefinitely putting the courses off.

Whether these courses will help me the next time I go looking for a job remains to be seen. Udacity is starting to open up info on its students to companies looking to hire but I doubt I will stand out from the other computing industry professionals taking these courses (and judging from the forums there seem to be a lot of them). I can say, though, that I’ve learned a lot and enjoyed doing it. Next up: Software Testing: How to Make Software Fail.

Fuzzy Floating Point Comparison

I was writing some tests today and I ran into a peculiar floating point issue. I had generated a sequence of numbers using numpy.linspace:

>>> np.linspace(0.1, 1, 10) 
array([ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

Part of the code I was testing ended up checking whether the value 0.3 was in the range 0.3 to 0.8, including the end points. The answer should of course be yes, but there is a twist due to the actual values in the array returned by linspace:

>>> a = np.linspace(0.1, 1, 10)
>>> 0.3 in a
False
>>> 0.3 < a[2]
True

What’s happening is that the 0.3 returned by linspace is really 0.30000000000000004, but the 0.3 when I type 0.3 is really 0.29999999999999999. It’s not clear whether this situation would ever actually arise in the normal usage of the code I was testing, but I wanted to make sure this wouldn’t cause problems. My solution was to make a function which would test whether a value was in a given range with a tiny bit of fuzziness at the edges.
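One way to see the two values exactly is to convert each float to a `Decimal`, which prints the full binary value with no rounding. In this sketch the linspace result is stood in for by the literal 0.30000000000000004 quoted above, so the snippet does not depend on how a particular NumPy version computes `linspace`; the variable names are my own.

```python
from decimal import Decimal

typed = 0.3                          # what the literal 0.3 becomes
from_linspace = 0.30000000000000004  # value reported for a[2] above

# Decimal(float) shows the exact value stored in the float.
print(Decimal(typed))
print(Decimal(from_linspace))

# The true decimal 0.3 sits strictly between the two stored values.
exact = Decimal("0.3")
typed_is_below = Decimal(typed) < exact
linspace_is_above = exact < Decimal(from_linspace)
print(typed_is_below, linspace_is_above)
```

Neither float is exactly 0.3: the typed literal rounds down and the computed value lands one step above.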

NumPy has a useful function for comparing floating point values within tolerances called allclose. But that’s for comparing equality, and I need fuzzy (but not very fuzzy) less than / greater than comparisons. To provide just that little bit of fuzziness I turned to the numpy.nextafter function.
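A small sketch of the gap between the two kinds of comparison, using 0.1 + 0.2 as a stand-in for the computed lower bound (it evaluates to the same 0.30000000000000004). The variable names are hypothetical.

```python
import numpy as np

lower_bound = 0.1 + 0.2  # stand-in for the computed 0.30000000000000004
value = 0.3

# allclose answers "are these equal within a tolerance?"
approximately_equal = bool(np.allclose(value, lower_bound))

# An ordinary ordering comparison has no tolerance at all.
passes_strict_check = lower_bound <= value

print(approximately_equal)  # True
print(passes_strict_check)  # False
```

So allclose can confirm the two numbers are the same for practical purposes, but it does not help with a `<=` test on its own.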

nextafter gives the next representable floating point number after the first input value. The second input value controls the direction so you can get the next value either up or down. It turns out that the two numbers that were tripping me up are right next to each other in their floating point representation:

>>> np.nextafter(0.29999999999999999, 1)
0.30000000000000004
>>> np.nextafter(0.30000000000000004, 0)
0.29999999999999999
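The same adjacency can be checked by counting steps. `steps_apart` below is a hypothetical helper written for this illustration, not a NumPy function; `np.nextafter`, `np.spacing`, and `np.finfo` are the real NumPy calls it leans on.

```python
import numpy as np


def steps_apart(low, high, max_steps=10):
    """Count nextafter steps needed to walk from low up to high.

    Hypothetical helper; gives up after max_steps so it never loops
    for long on values that are far apart.
    """
    steps = 0
    current = low
    while current < high and steps < max_steps:
        current = np.nextafter(current, np.inf)
        steps += 1
    return steps


typed = 0.3
computed = 0.1 + 0.2  # 0.30000000000000004

print(steps_apart(typed, computed))  # 1

# np.spacing gives the gap to the adjacent float, which here is
# exactly the difference between the two values.
gap_matches = bool(np.spacing(typed) == computed - typed)
print(gap_matches)  # True
```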

So to catch this case my range checking function only needs one ULP (unit in the last place, the gap between two adjacent floating point numbers) of fuzziness, which is not much at all. To allow for this I wrote a function called fuzzy_between that takes a value and the lower and upper bounds of the test range and expands the test range by a couple of ULPs before doing a simple minval <= val <= maxval comparison:


import numpy as np

def fuzzy_between(val, minval, maxval, fuzz=2, inclusive=True):
    """
    Test whether a value is within some range with some fuzziness at the edges
    to allow for floating point noise.

    The fuzziness is implemented by expanding the range at each end `fuzz` steps
    using the numpy.nextafter function. For example, with the inputs
    minval = 1, maxval = 2, and fuzz = 2; the range would be expanded to
    minval = 0.99999999999999978 and maxval = 2.0000000000000009 before doing
    comparisons.

    Parameters
    ----------
    val : float
        Value being tested.

    minval : float
        Lower bound of range. Must be lower than `maxval`.

    maxval : float
        Upper bound of range. Must be higher than `minval`.

    fuzz : int, optional
        Number of times to expand bounds using numpy.nextafter.

    inclusive : bool, optional
        Set whether endpoints are within the range.

    Returns
    -------
    is_between : bool
        True if `val` is between `minval` and `maxval`, False otherwise.

    """
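    # expand bounds; step toward -inf/+inf so the bounds always move, even
    # for values so large that adding or subtracting 1e6 would not change them
    for _ in range(fuzz):
        minval = np.nextafter(minval, -np.inf)
        maxval = np.nextafter(maxval, np.inf)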

    if inclusive:
        return minval <= val <= maxval

    else:
        return minval < val < maxval
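As a quick check, here is the function applied to the case that started all this. The definition is repeated in condensed form only so the snippet runs by itself; it is a sketch that assumes Python 3 (`range`) and steps the bounds toward infinity, and 0.1 + 0.2 again stands in for the computed lower bound.

```python
import numpy as np


def fuzzy_between(val, minval, maxval, fuzz=2, inclusive=True):
    """Condensed restatement of the function above, for illustration."""
    for _ in range(fuzz):
        minval = np.nextafter(minval, -np.inf)
        maxval = np.nextafter(maxval, np.inf)
    if inclusive:
        return minval <= val <= maxval
    return minval < val < maxval


lower = 0.1 + 0.2  # 0.30000000000000004, one step above the literal 0.3
upper = 0.8

strict = lower <= 0.3 <= upper
fuzzy = bool(fuzzy_between(0.3, lower, upper))
unfuzzed = bool(fuzzy_between(0.3, lower, upper, fuzz=0))
far_outside = bool(fuzzy_between(0.29, lower, upper))

print(strict)       # False
print(fuzzy)        # True
print(unfuzzed)     # False
print(far_outside)  # False
```

The fuzz only rescues values within a couple of steps of a bound; anything meaningfully outside the range is still rejected.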

For a great discussion on comparing floating point numbers see this randomascii post, and for some interesting discussion on the fallibility of range functions see this post on Google+ by Guido van Rossum. Guido actually calls out numpy.linspace as a range function not susceptible to floating point drift (since it’s calculating intervals, not adding numbers), but it’s always possible to get surprises with floating point numbers.
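The drift Guido describes is easy to reproduce. This sketch compares a hand-rolled range that repeatedly adds a step with linspace over the same interval; the variable names are my own.

```python
import numpy as np

step = 0.1

# Build 0.0, 0.1, ..., 1.0 by repeated addition; rounding error accumulates.
accumulated = []
x = 0.0
for _ in range(11):
    accumulated.append(x)
    x += step

# linspace computes each point from the interval instead of adding steps.
interpolated = np.linspace(0.0, 1.0, 11)

print(accumulated[-1])   # 0.9999999999999999
print(interpolated[-1])  # 1.0
```

The accumulated endpoint misses 1.0 by one step, while linspace hits the endpoint exactly. As the 0.3 example shows, though, interior points can still differ from the literal you type.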