Getting the Right Answer

Our previous lessons have introduced the basic tools of programming: variables and lists, file I/O, loops, conditionals, and most importantly, functions. What they haven’t done is show us how to tell if a program is getting the right answer. For the sake of argument, if each line we write has a 99% chance of being right, then a 70-line program will be wrong more than half the time. We need to do better than that, which means we need to:

  • write programs that check their own operation; and
  • write tests to catch the mistakes those self-checks miss.
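The arithmetic behind that opening claim is easy to check: if each of 70 lines is independently right with probability 0.99, the chance the whole program is right is 0.99 raised to the 70th power.

```python
# Probability that all 70 independently-written lines are correct,
# assuming each line is right 99% of the time.
p_program_correct = 0.99 ** 70

# This works out to roughly 0.49, so under these (admittedly
# simplified) assumptions the program is wrong more than half the time.
```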

Along the way, we will learn:

  • how Python reports and handles errors;
  • how to use a unit testing framework;
  • when it’s useful to write tests before writing code.

Defensive Programming

We made several mistakes while writing the programs in our first few lessons. How can we be sure that there aren’t still errors lurking in the code we have? And how can we guard against introducing new errors in code as we modify it?

The first step is to use defensive programming, i.e., to assume that mistakes will happen and to guard against them. One way to do this is to add assertions to our code so that it checks itself as it runs. An assertion is simply a statement that something must be true at a certain point in a program. When Python sees one, it checks the assertion's condition. If it's true, Python does nothing, but if it's false, Python halts the program immediately and prints the error message provided. For example, this piece of code halts as soon as the loop encounters a value that isn't positive:

numbers = [1.5, 2.3, 0.7, -0.001, 4.4]
total = 0.0
for n in numbers:
    assert n >= 0.0, 'Data should only contain positive values'
    total += n
print 'total is:', total
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-1-33d87ea29ae4> in <module>()
      2 total = 0.0
      3 for n in numbers:
----> 4     assert n >= 0.0, 'Data should only contain positive values'
      5     total += n
      6 print 'total is:', total

AssertionError: Data should only contain positive values

Programs like the Firefox browser are littered with assertions: 10-20% of the code they contain is there to check that the other 80-90% is working correctly. Broadly speaking, assertions fall into three categories:

  • A precondition is something that must be true at the start of a function in order for it to work correctly.
  • A postcondition is something that the function guarantees is true when it finishes.
  • An invariant is something that is always true at a particular point inside a piece of code.

For example, suppose we are representing rectangles using a tuple of four coordinates (x0, y0, x1, y1). In order to do some calculations, we need to normalize the rectangle so that it is at the origin and 1.0 units long on its longest axis. This function does that, but checks that its input is correctly formatted and that its result makes sense:

def normalize_rectangle(rect):
    assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
    x0, y0, x1, y1 = rect
    assert x0 < x1, 'Invalid X coordinates'
    assert y0 < y1, 'Invalid Y coordinates'

    dx = x1 - x0
    dy = y1 - y0
    if dx > dy:
        scaled = float(dx) / dy
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'

    return (0, 0, upper_x, upper_y)

The preconditions on lines 2, 4, and 5 catch invalid inputs:

print normalize_rectangle( (0.0, 1.0, 2.0) ) # missing the fourth coordinate
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-22-3a97b1dcab70> in <module>()
----> 1 print normalize_rectangle( (0.0, 1.0, 2.0) ) # missing the fourth coordinate

<ipython-input-21-fdb49ef456c2> in normalize_rectangle(rect)
      1 def normalize_rectangle(rect):
----> 2     assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
      3     x0, y0, x1, y1 = rect
      4     assert x0 < x1, 'Invalid X coordinates'
      5     assert y0 < y1, 'Invalid Y coordinates'

AssertionError: Rectangles must contain 4 coordinates
print normalize_rectangle( (4.0, 2.0, 1.0, 5.0) ) # X axis inverted
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-23-f05ae7878a45> in <module>()
----> 1 print normalize_rectangle( (4.0, 2.0, 1.0, 5.0) ) # X axis inverted

<ipython-input-21-fdb49ef456c2> in normalize_rectangle(rect)
      2     assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
      3     x0, y0, x1, y1 = rect
----> 4     assert x0 < x1, 'Invalid X coordinates'
      5     assert y0 < y1, 'Invalid Y coordinates'
      6 

AssertionError: Invalid X coordinates

The postconditions help us catch bugs by telling us when our calculations cannot have been correct. For example, if we normalize a rectangle that is taller than it is wide, everything seems OK:

print normalize_rectangle( (0.0, 0.0, 1.0, 5.0) )
(0, 0, 0.2, 1.0)

but if we normalize one that’s wider than it is tall, the assertion is triggered:

print normalize_rectangle( (0.0, 0.0, 5.0, 1.0) )
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-27-5f0ef7954aeb> in <module>()
----> 1 print normalize_rectangle( (0.0, 0.0, 5.0, 1.0) )

<ipython-input-21-fdb49ef456c2> in normalize_rectangle(rect)
     15 
     16     assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
---> 17     assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'
     18 
     19     return (0, 0, upper_x, upper_y)

AssertionError: Calculated upper Y coordinate invalid

Re-reading our function, we realize that line 10 should divide dy by dx rather than dx by dy. (You can display line numbers by typing Ctrl-M, then L.) If we had left out the assertion at the end of the function, we would have created and returned something that looked like a valid answer, but wasn’t; detecting and debugging that would almost certainly have taken more time in the long run than writing the assertion.

But assertions aren’t just about catching errors: they also help people understand programs. Each assertion gives the person reading the program a chance to check (consciously or otherwise) that their understanding matches what the code is doing.

Most good programmers follow two rules when adding assertions to their code. The first is, “fail early, fail often”. The greater the distance between when and where an error occurs and when it’s noticed, the harder the error will be to debug, so good code catches mistakes as early as possible.

The second rule is, “turn bugs into assertions or tests”. If you made a mistake in a piece of code, the odds are good that you have made other mistakes nearby, or will make the same mistake (or a related one) the next time you change it. Writing assertions to check that you haven’t regressed (i.e., haven’t re-introduced an old problem) can save a lot of time in the long run, and helps to warn people who are reading the code (including your future self) that this bit is tricky.
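For example, suppose an earlier version of a function crashed when given an empty list. Rather than fixing the bug silently, we can leave an assertion behind that records the fix. (The function below is a hypothetical illustration, not part of this lesson's code.)

```python
def mean(values):
    # Regression guard: an earlier version divided by zero on empty
    # input, so we now fail early with a clear message instead.
    assert len(values) > 0, 'mean() requires a non-empty list'
    return float(sum(values)) / len(values)
```

Anyone who later modifies mean will be warned immediately if they re-introduce the old problem, and the assertion itself documents that empty input was once a source of trouble.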

Challenges

  1. Suppose you are writing a function called average that calculates the average of the numbers in a list. What pre-conditions and post-conditions would you write for it? Compare your answer to your neighbor’s: can you think of a function that will pass your tests but not hers, or vice versa?

  2. Explain in words what the assertions in this code check, and for each one, give an example of input that will make that assertion fail.

    def running(values):
        assert len(values) > 0
        result = [values[0]]
        for v in values[1:]:
            assert result[-1] >= 0
            result.append(result[-1] + v)
        assert result[-1] >= result[0]
        return result

Exceptions

Assertions help us catch errors in our code, but things can go wrong for other reasons, like missing or badly-formatted files. Most modern programming languages allow programmers to use exceptions to separate what the program should do if everything goes right from what it should do if something goes wrong. Doing this makes both cases easier to read and understand.

For example, here’s a small piece of code that tries to read parameters and a grid from two separate files, and reports an error if either goes wrong:

try:
    params = read_params(param_file)
    grid = read_grid(grid_file)
except:
    log.error('Failed to read input file(s)')
    sys.exit(ERROR)

We join the normal case and the error-handling code using the keywords try and except. These work together like if and else: the statements under the try are what should happen if everything works, while the statements under except are what the program should do if something goes wrong.

We have actually seen exceptions before without knowing it, since by default, when an exception occurs, Python prints it out and halts our program. For example, trying to open a nonexistent file triggers a type of exception called an IOError, while an out-of-bounds index to a list triggers an IndexError:

open('nonexistent-file.txt', 'r')
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-13-58cbde3dd63c> in <module>()
----> 1 open('nonexistent-file.txt', 'r')

IOError: [Errno 2] No such file or directory: 'nonexistent-file.txt'
values = [0, 1, 2]
print values[999]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-14-7fed13afc650> in <module>()
      1 values = [0, 1, 2]
----> 2 print values[999]

IndexError: list index out of range

We can use try and except to deal with these errors ourselves if we don’t want the program simply to fall over:

try:
    reader = open('nonexistent-file.txt', 'r')
except IOError:
    print 'Whoops!'
Whoops!

When Python executes this code, it runs the statement inside the try. If that works, it skips over the except block without running it. If an exception occurs inside the try block, though, Python compares the type of the exception to the type specified by the except. If they match, it executes the code in the except block.

IOError is the particular kind of exception Python raises when there is a problem related to input and output, such as files not existing or the program not having the permissions it needs to read them. We can put as many lines of code in a try block as we want, just as we can put many statements under an if. We can also handle several different kinds of errors afterward. For example, here’s some code to calculate the entropy at each point in a grid:

try:
    params = read_params(param_file)
    grid = read_grid(grid_file)
    entropy = lee_entropy(params, grid)
    write_entropy(entropy_file, entropy)
except IOError:
    report_error_and_exit('IO error')
except ArithmeticError:
    report_error_and_exit('Arithmetic error')

Python tries to run the four functions inside the try as normal. If an error occurs in any of them, Python immediately jumps down and tries to find an except of the corresponding type: if the exception is an IOError, Python jumps into the first error handler, while if it’s an ArithmeticError, Python jumps into the second handler instead. It will only execute one of these, just as it will only execute one branch of a series of if/elif/else statements.

This layout has made the code easier to read, but we’ve lost something important: the message printed out by the IOError branch doesn’t tell us which file caused the problem. We can do better if we capture and hang on to the object that Python creates to record information about the error:

try:
    params = read_params(param_file)
    grid = read_grid(grid_file)
    entropy = lee_entropy(params, grid)
    write_entropy(entropy_file, entropy)
except IOError as err:
    report_error_and_exit('Cannot read/write ' + err.filename)
except ArithmeticError as err:
    report_error_and_exit(err.message)

If something goes wrong in the try, Python creates an exception object, fills it with information, and assigns it to the variable err. (There’s nothing special about this variable name—we can use anything we want.) Exactly what information is recorded depends on what kind of error occurred; Python’s documentation describes the properties of each type of error in detail, but we can always just print the exception object. In the case of an I/O error, we print out the name of the file that caused the problem. And in the case of an arithmetic error, printing out the message embedded in the exception object is what Python would have done anyway.
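Here is a minimal, self-contained sketch of capturing the error object, using the same nonexistent file as before (in Python 3, IOError is an alias for OSError, so this code behaves the same way there):

```python
try:
    reader = open('nonexistent-file.txt', 'r')
except IOError as err:
    # err.filename records which file caused the problem;
    # str(err) would give the full human-readable message.
    message = 'Cannot read ' + err.filename
```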

So much for how exceptions work: how should they be used? Some programmers use try and except to give their programs default behaviors. For example, if this code can’t read the grid file that the user has asked for, it creates a default grid instead:

try:
    grid = read_grid(grid_file)
except IOError:
    grid = default_grid()

Other programmers would explicitly test for the grid file, and use if and else for control flow:

if file_exists(grid_file):
    grid = read_grid(grid_file)
else:
    grid = default_grid()

It’s mostly a matter of taste, but we prefer the second style. As a rule, exceptions should only be used to handle exceptional cases. If the program knows how to fall back to a default grid, that’s not an unexpected event. Using if and else instead of try and except sends different signals to anyone reading our code, even if they do the same thing.
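Note that file_exists isn't a built-in; in real code, the explicit check in the second style would typically use os.path.exists, as in this small sketch:

```python
import os

def file_exists(path):
    '''Return True if path names an existing file or directory.'''
    return os.path.exists(path)
```

One caveat worth knowing: checking first and then opening leaves a small window in which the file could disappear, so in programs where files may be created or deleted concurrently, the try/except style is actually the safer of the two.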

Before we go further, there's something in our example that you might not have noticed. Exceptions can actually be thrown a long way: they don’t have to be handled immediately. Take another look at this code:

try:
    params = read_params(param_file)
    grid = read_grid(grid_file)
    entropy = lee_entropy(params, grid)
    write_entropy(entropy_file, entropy)
except IOError as err:
    report_error_and_exit('Cannot read/write ' + err.filename)
except ArithmeticError as err:
    report_error_and_exit(err.message)

The four lines in the try block are all function calls. They might catch and handle exceptions themselves, but if an exception occurs in one of them that isn’t handled internally, Python looks in the calling code for a matching except. If it doesn’t find one there, it looks in that function’s caller, and so on. If we get all the way back to the main program without finding an exception handler, Python’s default behavior is to print an error message like the ones we’ve been seeing all along.

This behavior is the origin of the rule “throw low, catch high”. There are many places in our program where an error might occur. There are only a few, though, where errors can sensibly be handled. For example, a linear algebra library doesn’t know whether it’s being called directly from the Python interpreter, or whether it’s being used as a component in a larger program. In the latter case, the library doesn’t know if the program that’s calling it is being run from the command line or from a GUI. The library therefore shouldn’t try to handle or report errors itself, because it has no way of knowing what the right way to do this is. It should instead just raise an exception, and let its caller figure out how best to handle it.

Finally, we can raise exceptions ourselves if we want to. In fact, we should do this, since it’s the standard way in Python to signal that something has gone wrong. Here, for example, is a function that reads a grid and checks its consistency:

def read_grid(grid_file):
    '''Read grid, checking consistency.'''

    data = read_raw_data(grid_file)
    if not grid_consistent(data):
        raise Exception('Inconsistent grid: ' + grid_file)
    result = normalize_grid(data)

    return result

The raise statement creates a new exception with a meaningful error message. Since read_grid itself doesn’t contain a try/except block, this exception will always be thrown up and out of the function, to be caught and handled by whoever is calling read_grid. We can define new types of exceptions if we want to. And we should, so that errors in our code can be distinguished from errors in other people’s code. However, this involves classes and objects, which is outside the scope of these lessons.
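The same pattern works for any kind of input checking. Here is a self-contained sketch (the function name and validation rule are made up for illustration; they are not part of the grid example):

```python
def read_count(text):
    '''Convert text to a non-negative integer count,
    raising a meaningful exception on bad input.'''
    value = int(text)  # int() itself raises ValueError for non-numeric text
    if value < 0:
        raise ValueError('Count cannot be negative: ' + text)
    return value
```

Because read_count contains no try/except of its own, both exceptions propagate up to whoever called it, which is exactly the "throw low, catch high" behavior described above.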

Challenges

  1. Modify the program below so that it prints three lines of output.

    try:
        for number in [-1, 0, 1]:
            print 1.0/number
    except ZeroDivisionError:
        print 'whoops'

The Limits to Testing

Like any other piece of experimental apparatus, a complex program requires a much higher investment in testing than a simple one. Putting it another way, a small script that is only going to be used once, to produce one figure, probably doesn’t need separate testing: its output is either correct or not. A linear algebra library that will be used by thousands of people in twice that number of applications over the course of a decade, on the other hand, definitely does.

Unfortunately, it’s practically impossible to prove that a program will always do what it’s supposed to. To see why, consider a function that checks whether a character string contains only the letters ‘A’, ‘C’, ‘G’, and ‘T’. These four tests clearly aren’t sufficient:

assert is_all_bases('A')
assert is_all_bases('C')
assert is_all_bases('G')
assert is_all_bases('T')

because this version of is_all_bases passes them:

def is_all_bases(bases):
    return True

Adding these tests isn’t enough:

assert not is_all_bases('X')
assert not is_all_bases('Y')
assert not is_all_bases('Z')

because this version still passes:

def is_all_bases(bases):
    return bases[0] in 'ACGT'

We can add yet more tests:

assert is_all_bases('ACGCGA')
assert not is_all_bases('CGAZ')

but no matter how many we have, we can always write a function that passes them, but does the wrong thing in other cases. And as we add more tests, we have to start worrying about whether the tests themselves are correct, and about whether we can afford the time needed to write them. After all, if we really want to check that the square root function is correct for all values between 0.0 and 1.0, we need to write over a billion test cases; that’s a lot of typing, and the chances of us getting every one right are effectively zero.
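One practical alternative to enumerating cases is to check a defining property of the function across a handful of representative values. For square root, squaring the result should get us back (approximately) where we started:

```python
import math

# Instead of a billion individual cases, test the defining property
# sqrt(x)**2 == x at a few representative points, allowing for
# floating-point round-off.
for x in [0.0, 0.25, 0.5, 0.99, 1.0]:
    assert abs(math.sqrt(x) ** 2 - x) < 1e-12
```

This still isn't a proof, but a well-chosen property check exercises far more of the function's behavior per line of test code than a list of hand-picked input/output pairs.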

Testing is still worth doing, though: it’s one of those things that doesn’t work in theory, but is surprisingly effective in practice. If we choose our tests carefully, they can give us strong, well-founded confidence in our software, much as a well-designed experiment gives us confidence in a theory.

Ensuring that we have the right answer is only one reason to test software. The other is that it speeds up development by reducing the amount of re-work we have to do. Even small programs can be quite complex, and changing one thing can all too easily break something else. If we test changes as we make them, and automatically re-test things we’ve already done, we can catch and fix errors while the changes are still fresh in our minds.

Unit Testing

Most people don’t enjoy writing tests, so if we want them to actually do it, it must be easy to:

  • add or change tests,
  • understand the tests that have already been written,
  • run those tests, and
  • understand those tests’ results.

Test results must also be reliable. If a testing tool says that code is working when it’s not, or reports problems when there actually aren’t any, people will lose faith in it and stop using it.

The simplest kind of test is a unit test that checks the behavior of one component of a program. As an example, suppose we’re testing a function called rectangle_area that returns the area of an (x0, y0, x1, y1) rectangle. We’ll start by testing our code directly using assert. Here, we call the function three times with different arguments, checking that the right value is returned each time.

from rectangle import rectangle_area

assert rectangle_area([0, 0, 1, 1]) == 1.0
assert rectangle_area([1, 1, 4, 4]) == 9.0
assert rectangle_area([0, 1, 4, 7]) == 24.0
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-16-ebf7f5f1c120> in <module>()
      3 assert rectangle_area([0, 0, 1, 1]) == 1.0
      4 assert rectangle_area([1, 1, 4, 4]) == 9.0
----> 5 assert rectangle_area([0, 1, 4, 7]) == 24.0

AssertionError: 

This result is useful, in the sense that we know something’s wrong, but look closely at what happens if we run the tests in a different order:

assert rectangle_area([0, 1, 4, 7]) == 24.0
assert rectangle_area([1, 1, 4, 4]) == 9.0
assert rectangle_area([0, 0, 1, 1]) == 1.0
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-17-548f3f32c981> in <module>()
----> 1 assert rectangle_area([0, 1, 4, 7]) == 24.0
      2 assert rectangle_area([1, 1, 4, 4]) == 9.0
      3 assert rectangle_area([0, 0, 1, 1]) == 1.0

AssertionError: 

Python halts at the first failed assertion, so the second and third tests aren’t run at all. It would be more helpful if we could get data from all of our tests every time they’re run, since the more information we have, the faster we’re likely to be able to track down bugs. It would also be helpful to have some kind of summary report: if our test suite includes thirty or forty tests (as it well might for a complex function or library that’s widely used), we’d like to know how many passed or failed.

Here’s a different approach. First, let’s put each test in a function with a meaningful name:

def test_unit_square():
    assert rectangle_area([0, 0, 1, 1]) == 1.0

def test_large_square():
    assert rectangle_area([1, 1, 4, 4]) == 9.0

def test_actual_rectangle():
    assert rectangle_area([0, 1, 4, 7]) == 24.0

Next, import a library called ears and ask it to run our tests for us:

import ears
ears.run()
..f
2 pass, 1 fail, 0 error
----------------------------------------
fail: test_actual_rectangle
Traceback (most recent call last):
  File "ears.py", line 45, in run
    test()
  File "<ipython-input-18-643689ad0a0f>", line 8, in test_actual_rectangle
    assert rectangle_area([0, 1, 4, 7]) == 24.0
AssertionError

ears.run looks in the calling program for functions whose names start with the letters 'test_' and runs each one. If the function completes without an assertion being triggered, we count the test as a success. If an assertion fails, we count the test as a failure, but if any other exception occurs, we count it as an error because the odds are that the test itself is broken.

ears is an xUnit testing library. The name “xUnit” comes from the fact that many of these frameworks are imitations of a Java testing library called JUnit. The Wikipedia page on the subject lists dozens of similar frameworks in almost as many languages, all of which have a similar structure: each test is a single function that follows some naming convention (e.g., starts with 'test_'), and the framework runs them in some order and reports how many passed, failed, or were broken.

Challenges

  1. A colleague of yours has written a function that calculates the running total of the values in a list, e.g., running([0, 1, 2]) produces the list [0, 1, 3]. Load this function into your notebook using from running import running, and then write some unit tests for it using the ears library to see what bugs you can find.

  2. Some programmers put assertions in their programs to catch errors when they occur; others prefer to write unit tests to check that the program is behaving properly. Which do you think makes programs easier to read? Which do you think makes them easier to maintain as they change over time?

Test-Driven Development

Libraries like ears can’t think of test cases for us. We still have to decide what to test and how many tests to run. Our best guide here is economics: we want the tests that are most likely to give us useful information that we don’t already have. For example, if rectangle_area([0, 0, 1, 1]) works, there’s probably not much point testing rectangle_area([0, 0, 2, 2]), since it’s hard to think of a bug that would show up in one case but not in the other.

We should therefore try to choose tests that are as different from each other as possible, so that we force the code we’re testing to execute in all the different ways it can. Another way of thinking about this is that we should try to find boundary cases. If a function works for zero, one, and a million values, it will probably work for eighteen values.

Using boundary values as tests has another advantage: it can help us design our software. To see how, consider this test case for our rectangle area function:

def test_inverted_rectangle():
    assert rectangle_area([1, 5, 5, 2]) == -12.0

Is that test correct? I.e., are rectangles with x1<x0 or y1<y0 legal, and do they have negative area? Or should the test be:

def test_inverted_rectangle():
    try:
        rectangle_area([1, 5, 5, 2])
        assert False, 'Function did not raise exception for invalid rectangle'
    except ValueError:
        pass # rectangle_area failed with the expected kind of exception
    except Exception:
        assert False, 'Function did not raise correct kind of exception for invalid rectangle'

The logic in this second version may take a moment to work out, but the idea is straightforward: we want to check that rectangle_area raises a ValueError exception if it’s given a rectangle whose upper edge is below or to the left of its lower edge.

Here’s another test case that can help us design our software:

def test_zero_width():
    assert rectangle_area([2, 1, 2, 8]) == 0

We might decide that rectangles with negative areas aren’t allowed, but what about rectangles with zero area, i.e., rectangles that are actually lines? Any actual implementation of rectangle_area will do something with one of these; writing unit tests for boundary cases is a good way to specify exactly what that something is.
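Once those decisions are made, one possible implementation that satisfies them (written here as an illustration, not as this lesson's official rectangle_area) is:

```python
def rectangle_area(rect):
    '''Area of an (x0, y0, x1, y1) rectangle.

    Raises ValueError for inverted rectangles; zero-width or
    zero-height rectangles are allowed and have area 0.
    '''
    assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
    x0, y0, x1, y1 = rect
    if x1 < x0 or y1 < y0:
        raise ValueError('Invalid rectangle: ' + str(rect))
    return (x1 - x0) * (y1 - y0)
```

Notice how the boundary-case tests dictated both the ValueError for inverted rectangles and the decision to let zero-area rectangles through: the tests are acting as a specification.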

Unit tests are actually such a good way to define how functions ought to behave that many programmers use a practice called test-driven development (TDD). Instead of writing code, then figuring out how to test it, these programmers:

  1. write some unit tests for a function that doesn’t exist yet,
  2. write that function,
  3. modify it until it passes all of the tests, then
  4. clean up the function, i.e., make it more readable or more efficient without breaking any of the tests.

The mantra often used during TDD is “red, green, refactor”: get a red light (i.e., some failing tests), make it turn green (i.e., get something working), and then clean it up by refactoring. This cycle should take anywhere from a couple of minutes to an hour or so. If it takes longer than that, the change being made is probably too large, and should be broken down into smaller (and more comprehensible) steps.

TDD’s proponents argue that it helps people produce better code for two reasons. First, it encourages them to write code in small, self-contained chunks, and to actually write tests for those chunks. Second, it frees them from confirmation bias: since they haven’t written their function yet, their subconscious cannot steer their testing toward proving it correct rather than finding errors.

Empirical studies of TDD have had mixed results: some have found it beneficial, while others have found no effect. But even if you don’t use it day to day, trying it a few times helps you learn how to design functions and programs that are easier to test.

Challenges

  1. Write a function called something that passes the following unit tests:

    def test_empty():
        assert something([]) == []
    
    def test_single_value():
        assert something(['a']) == []
    
    def test_two_values():
        assert something(['a', 'b']) == [('a', 'b')]
    
    def test_three_values():
        assert something(['a', 'b', 'c']) == [('a', 'b'), ('a', 'c'), ('b', 'c')]

Next Steps

We have now seen the basics of building and testing Python code in the IPython Notebook. The last thing we need to learn is how to build command-line programs that we can use in pipelines and shell scripts, so that we can integrate our tools with other people’s work. This will be the subject of our next and final lesson.