Dec 29, 2020

Implementing ON ERROR RESUME NEXT in Python

Summary

I wrote ON ERROR RESUME NEXT for Python as a proof of concept. The code is at https://github.com/nfd/on_error_resume_next.

Intro

On the whole, BASIC is an unfairly maligned programming language. By the 90s it had come a long way from the GOTO-peppered, line-numbered morass which earned Dijkstra's ire. Modern BASICs have great support for structured programming -- not just procedures and while loops, but classes and modules, too. However, there is one famous BASIC feature which is every bit as evil as it sounds.

Errors in early BASICs tended to be fatal, so writing robust programs meant doing a lot of careful checking before any potentially dangerous operation. (Writing non-robust programs meant getting used to restarting them a lot.) In many ways, the error handling environment of BASIC was harder to work with than that of other languages of the time, such as C, with the result that many people wished they could just make errors go away so they didn't have to think about them. Enter ON ERROR RESUME NEXT.

When a BASIC interpreter encounters ON ERROR RESUME NEXT, it stops reporting errors to the program. That's it. Oh sure, they still happen, but you don't need to worry about them now. When an error occurs, BASIC will just continue on the next line of the program. Open a file which doesn't exist? No worries. Divide by zero? Sure. Every path is the happy path now. (1)

So anyway, I've been deep in Python internals recently, and one night while drinking heavily (2) the thought occurred to me: could we implement ON ERROR RESUME NEXT in Python? Could we just have the interpreter swallow exceptions and keep going? Turns out we pretty much can!

Just to be totally clear because humour is difficult on the Internet

Yes, I did actually write this. No, I don't recommend anyone use it. But the implementation, and what it reveals about how Python works, is pretty interesting.

How it ought to work

Python uses the “termination” model of error handling: an exception handler can find out what happened and continue execution at an outer level, but it cannot repair the cause of the error and retry the failing operation.

Python documentation, sweet summer child

When something goes wrong in Python, it raises an exception. If the exception isn’t handled by the function which generated it, it propagates back up the call stack until it reaches an exception handler. If the exception handler can’t handle that type of exception, it continues its journey upwards. If there are no suitable exception handlers, the exception propagates all the way up to the top level, and the Python interpreter itself handles it by printing a traceback and exiting. Very BASIC-like, that part.

The interesting thing about Python is that its tracebacks are very rich. As well as the line number of the failing instruction, they also contain one frame object per traceback level. Frame objects are the magic on which the present abomination turns, because frame objects contain a pointer to the compiled code that was running, the exact Python opcode that failed, the locals and globals which were active in that frame… in fact, almost everything one might need to re-start execution.

If the above has given you a terrible idea, then we should be friends. What it means is that if function1() calls function2() and function2 throws an exception, the traceback will contain frame objects for function1 and function2. Both functions' local variables, instruction pointers, and so on are faithfully preserved in their respective frame objects, so if we were to somehow re-enter function2, restoring all its state, and jump directly to the line of code after the one which threw the exception, we could effectively RESUME NEXT.
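As a concrete illustration (the function names here are just for the example), you can walk a traceback and poke at the frame objects it carries:

import sys

def boom():
    return 1 / 0              # raises ZeroDivisionError

def caller():
    local_state = "still here"
    return boom()

try:
    caller()
except ZeroDivisionError:
    tb = sys.exc_info()[2]
    while tb is not None:
        frame = tb.tb_frame
        # co_name: which function was running; f_lasti: offset of the failing
        # (or calling) opcode; f_locals: its local variables (names only here).
        print(frame.f_code.co_name, frame.f_lasti, list(frame.f_locals))
        tb = tb.tb_next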

Basically, then, we are going to

  1. Take the bytecode from the frame object and add a portion at the beginning which
  2. Restores the function context we pulled out of the frame stack and
  3. Jumps directly to either the instruction which performs the function call to the next-lowest function in the traceback (if this wasn’t the frame which produced the exception), or the instruction corresponding to the next line of code (if this was the frame that produced the exception). Then we
  4. Take the top-level patched function, the one at the outermost level of the traceback, and run it.

The top-level function will call its way down the call hierarchy pulled out of the traceback, finally reaching the function which raised the exception. That final, leaf function will continue on the line after the one which raised. Simple! ON ERROR RESUME NEXT.

Diversion: catching top-level exceptions

You might be thinking (I was), “Nicholas, this isn’t very BASIC if I have to wrap all my code in an exception handler. That’s exactly the sort of error shenanigans I was hoping to avoid!” Well, fortunately, Python offers sys.excepthook as a last-resort way of handling top-level exceptions. If sys.excepthook is set, exceptions which reach the top level are passed to it. So we can implement our RESUME NEXT from there. More on this later. For now, let’s dive into the horrifying world of Python’s four different stacks.
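A minimal sketch of that hook might look like this (resume_from is a hypothetical stand-in for the machinery described below):

import sys

def on_error_resume_next(exc_type, exc_value, tb):
    # Called for any exception which reaches the top level. Instead of
    # printing a traceback and exiting, hand the traceback to our
    # (hypothetical) resume machinery.
    resume_from(tb)

sys.excepthook = on_error_resume_next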

How it actually works

There are lots of implementations of Python, but unless you’re doing something quite specialised you’ll be using the one available from python.org known as CPython. CPython bytecode is stack-based. In fact, Python’s creators loved stacks so much, they used four of them. They are:

  1. The Python call stack. This is an array of frame objects, one per function invocation. When you call a function, Python creates a new frame object and puts it on this stack. When the function returns, Python pops its frame object off the call stack. Simple.
  2. The C stack. CPython is written in C, hence the name, and its interpreter uses the C stack internally in the way all C programs do, for storing call frames and local variables.
  3. The block stack. When you write an exception handler, Python puts the location of its finally clause (which may be empty if you didn’t write one) on the block stack. It’s also used for with, and a few other things. Each frame has its own block stack, and when an exception is raised this is what’s consulted to find a handler.
  4. The value stack. In a stack-based virtual machine like CPython’s, all data is passed on the stack. A line like value = 42 is translated to bytecode which first pushes 42 onto the stack, and then pops the top of the stack and stores the result in value. Each frame has its own value stack, too.
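You can watch the value stack in action with the dis module; the assignment above compiles to a push followed by a pop:

import dis

def assign():
    value = 42

dis.dis(assign)
# On CPython 3.9 this prints, roughly:
#   LOAD_CONST   1 (42)      <- push 42 onto the value stack
#   STORE_FAST   0 (value)   <- pop it and store it in the local
#   LOAD_CONST   0 (None)
#   RETURN_VALUE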

I wrote above that tracebacks contain “almost everything” required to re-run the code causing an exception. Guess what tracebacks don’t contain? That’s right, the stacks!

(Except the Python call stack, of course. We get that one.)

Getting the stack information even if it kills us

The real killer is the value stack. Because Python uses the value stack for everything, it’s really important that we restore it. As an example, consider this code snippet:

def f(value):
    result = explodey(value)
    # ... more processing here ...
    return result

This disassembles to the following bytecode:

     0 LOAD_GLOBAL              0 (explodey)
     2 LOAD_FAST                0 (value)
     4 CALL_FUNCTION            1
     6 STORE_FAST               1 (result)
     8 LOAD_FAST                1 (result)
    10 RETURN_VALUE

Let’s say that explodey throws an exception, resulting in CALL_FUNCTION failing. We want to resume at the next instruction (3) here, which is STORE_FAST. So we will just jump right back into the function at STORE_FAST, index 6, and… immediately hit a stack underflow error because STORE_FAST is expecting to pop something (the thing it’s storing fast!) off the value stack.

Okay, no problem. The value stack is there, in the frame object. It’s just not obviously accessible to us. Once we have the value stack and the value stack pointer, which tells us how many value stack objects are active, we can pull that number of objects out of the frame and push them back onto the stack when we call the function again to retry. Sounds… simple?

Finding the number of items on the value stack

You can’t. Python doesn’t expose this information in any way, and in fact it couldn’t even if it wanted to because the stack pointer is a local variable stored on the interpreter’s (C) stack, and by the time we’ve got the traceback those (C) stack frames have disappeared.

Alright, so how do we get this information if it’s literally not there? Well, it is there. In a way.

Look again at the disassembly above. We see LOAD_GLOBAL, LOAD_FAST, and so on. Looking at the documentation for these opcodes, we see that both of them push a value onto the stack. CALL_FUNCTION (1) consumes the number of arguments supplied (1), plus the callable itself, but it then pushes the function’s return value. STORE_FAST pops a value. And so on.

For this function, execution stopped at CALL_FUNCTION. That means we executed LOAD_GLOBAL (add 1 to the stack pointer), LOAD_FAST (add 1 again), and CALL_FUNCTION (subtract 1 for the argument, subtract another 1 for the callable, then add 1 for the return value). So just by looking at the code, we see that by the time CALL_FUNCTION has finished, we have one value on the stack.

Can we formalise this? Of course we can, in the form of an abstract interpreter. We “execute” the Python bytecode, but the only thing we keep track of is the current stack depth. When we reach the instruction of interest (i.e. the one which produced the exception, i.e. CALL_FUNCTION in this example), we stop.

Abstract interpretation sounds hard, but it’s actually easier than you might expect. For example, the stack level before a loop is the same as the stack level after a loop, no matter how many times the loop executes. So we don’t need to worry about how many times a loop will execute. No matter what a function does, CALL_FUNCTION will always push exactly one value. So we don’t need to trace our way through function calls. We do need to worry about opcodes like JUMP_ABSOLUTE, which change the program counter, but that’s relatively straightforward. We also need to worry about conditional jumps — but we can handle those by trying both options (jump taken and jump not taken), one after the other. Our abstract interpreter just wombles through the code, not worrying too much about what it does but maintaining a stack pointer, until its instruction pointer matches the one we’re interested in. Then we stop interpreting and return the stack depth at that point.
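Here's a minimal sketch of that abstract interpreter, leaning on dis.stack_effect to do the per-opcode bookkeeping. It ignores the block stack, so exception-handler edges aren't modelled, and the function and variable names are mine, not the real implementation's:

import dis

def stack_depth_at(code, target_offset):
    # Walk the bytecode keeping track only of stack depth, exploring both
    # sides of conditional jumps, until we reach target_offset. Returns the
    # depth on entry to that instruction, or None if we never reach it.
    instrs = {i.offset: i for i in dis.get_instructions(code)}
    seen = set()

    def walk(offset, depth):
        while offset in instrs and offset not in seen:
            seen.add(offset)
            instr = instrs[offset]
            if offset == target_offset:
                return depth
            if instr.opname in ('JUMP_ABSOLUTE', 'JUMP_FORWARD'):
                offset = instr.argval          # unconditional: follow the jump
                continue
            if instr.opname in ('RETURN_VALUE', 'RAISE_VARARGS', 'RERAISE'):
                return None                    # this path is a dead end
            if instr.opcode in dis.hasjabs or instr.opcode in dis.hasjrel:
                # Conditional jump: try the taken branch first, then fall through.
                found = walk(instr.argval,
                             depth + dis.stack_effect(instr.opcode, instr.arg, jump=True))
                if found is not None:
                    return found
                depth += dis.stack_effect(instr.opcode, instr.arg, jump=False)
            else:
                depth += dis.stack_effect(instr.opcode, instr.arg)
            offset += 2
        return None

    return walk(0, 0)

This returns the depth on entry to the instruction of interest; add the stack effect of that instruction itself if you want the depth after it completes (one, in the CALL_FUNCTION example above).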

Alright! So now we know the number of items we can expect to have on the value stack! Now to retrieve them!

Retrieving the items on the value stack

You can’t. Python doesn’t expose the value stack to Python code in any way. 

Soooo it’s time to pull out the ctypes module and get digging. The frame object is represented in CPython by a struct _frame, and if you look at its definition in CPython’s frameobject.h you can see a tantalising PyObject **f_valuestack. (There is also an f_stacktop, which would be really handy to find the number of items on the stack, if Python ever set it!). There are a couple of caveats to just gleefully pulling values out of f_valuestack though. The most important is that CPython uses NULL return values internally (and in extension modules) to signal an exception. If an opcode which writes to the stack produces an exception, the stack value at that point will be NULL, so we need to take care to replace that with something sensible, like None.

At first it might seem like quite a limitation to have stack values nulled out like this on exceptions, but it makes sense: what is the “return value” of a function which raises an exception? Any possible answer has the same information content (i.e. none), so we don’t lose anything by having to deal with NULL.
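Here's roughly what that digging looks like. This is an assumption-heavy sketch: the struct layout below mirrors a normal (non-debug) CPython 3.9 build, is not a stable API, and the names are mine, not the real implementation's:

import ctypes

class PyFrameHead(ctypes.Structure):
    # Partial mirror of CPython 3.9's struct _frame, only as far as the
    # field we care about (f_valuestack). Layout is an assumption.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),            # PyObject_VAR_HEAD
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),
        ("f_back", ctypes.c_void_p),
        ("f_code", ctypes.c_void_p),
        ("f_builtins", ctypes.c_void_p),
        ("f_globals", ctypes.c_void_p),
        ("f_locals", ctypes.c_void_p),
        ("f_valuestack", ctypes.POINTER(ctypes.c_void_p)),
    ]

def peek_value_stack(frame, depth):
    # Read `depth` items off the frame's value stack, replacing the NULLs
    # CPython uses to signal "no value here" with None.
    raw = PyFrameHead.from_address(id(frame))
    items = []
    for i in range(depth):
        addr = raw.f_valuestack[i]
        items.append(ctypes.cast(addr, ctypes.py_object).value if addr else None)
    return items

Calling peek_value_stack(tb.tb_frame, depth) with the depth computed by the abstract interpreter gives us the live stack values, Nones and all.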

Function necromancy

As I hinted above, we can “re-enter” the code that raised an exception by patching the functions at every level of the traceback so that each one restores its internal state to what it was prior to the exception and then resumes where it left off. To make this happen we need to patch the bytecode for each function, and because you can’t modify bytecode in place, what that actually means is that we’ll be creating temporary proxy functions with new bytecode. Doing things this way is also safer, because we can restore different states for the same function appearing in different frames (think recursion). There are a few things we need to restore:

  • Function locals: The frame object contains the locals. You can’t set function locals directly, but you can pass them as parameters. So the proxy function includes one parameter for each local variable.
  • Stack restoration: We can pass the stack in as a function const (a tuple of values). Restoring it involves loading the const and unpacking it. To accomplish this we add some bytecode to the start of the function (sketched after this list).
  • Resuming at the right instruction: we do this by adding a JUMP_ABSOLUTE after the stack restoration.
  • Jump offset fixups: adding code at the start of the function means that all absolute jump targets in the code are invalidated, so we fix those up (by adding the length of our inserted code to them).
  • Other fun things: patching an already-patched function would get confusing, so we also store (in the function’s consts) the original frame (containing the original code object and line numbers etc). That way if we have multiple errors in the same function, we can always work from the “original”. This also means that we need a way to detect an already-patched function, which we do through a special signature (NOP with a nonzero argument as the first opcode).
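To give a flavour of that prologue, here's a hypothetical sketch of the bytes it might contain (argument encoding, EXTENDED_ARG handling, stack ordering, and the jump-offset fixups from the list above are all glossed over, and build_prologue is my name, not the repo's):

import dis

def build_prologue(saved_stack_const, n_stack_items, resume_offset):
    # NOP with a nonzero argument marks the function as already patched,
    # then we push the saved stack values (stored as a tuple in co_consts),
    # spread them onto the value stack, and jump to the resume point.
    return bytes([
        dis.opmap['NOP'], 1,
        dis.opmap['LOAD_CONST'], saved_stack_const,
        dis.opmap['UNPACK_SEQUENCE'], n_stack_items,
        dis.opmap['JUMP_ABSOLUTE'], resume_offset,
    ])

Note that UNPACK_SEQUENCE leaves the tuple's first element on top of the stack, so the saved values have to be stored in the matching order, and (as per the last two bullets) the jump target and every other absolute jump need fixing up once the prologue shifts all the offsets.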

It’s also easy to forget that traceback frames don’t necessarily represent functions. In fact, the first level of a traceback is quite commonly a module object. So we can’t make function-specific assumptions (such as that locals() is unique, or even distinct from globals()).

Wait a minute, what about the block stack

Well yes, about that… it’s not actually vital to get a simple proof of concept going, and, well, I guess you can see where I’m heading with this. It should be fairly straightforward to rebuild it using the same abstract interpreter that calculates stack depth, though, because all the important information (jump targets) is static.

Handling multiple exceptions

Actually, the most fiddly part of this is also a rather boring bit: we will be re-entering our call stack from the sys.excepthook exception handler, but if that exception handler itself throws an exception then Python will unceremoniously kill us off. What this would mean is that we can't ON ERROR RESUME NEXT more than once. The solution, of course, is to wrap the code we're resuming in a try: except: block. This adds some complexity around patching already-patched functions, which I handle by keeping a copy of the original function around. See the code for more details. And speaking of the code...
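One way that wrapper could look from the excepthook's point of view (this is a sketch of the idea rather than the repo's actual code, and build_resumer is a hypothetical stand-in for the patching machinery):

import sys

def _resume_next(tb):
    # Build a patched call chain from the traceback and run it. If the
    # resumed code raises again, catch it ourselves -- we're already inside
    # sys.excepthook, so letting it escape would end the process -- and
    # resume again from the fresh traceback.
    while tb is not None:
        resumed = build_resumer(tb)      # hypothetical patching step
        try:
            resumed()
            return
        except Exception:
            tb = sys.exc_info()[2]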

Code please

The code is at https://github.com/nfd/on_error_resume_next. A good place to start reading is at the beginning of _resume, which is called from sys.excepthook with a fresh traceback.

Because it’s reaching around inside CPython internals, it requires, specifically, CPython 3.9. (It may work on 3.8 as well, possibly with some minor changes to the abstract interpreter.)

Examples

handle = open("probably doesn't exist")
if err():
    return False

val = 42
while more_numbers := input():
    val += int(more_numbers)

import is_this_module_installed
if err():
    # oh no
    pass

Recapping

I wrote an abstract interpreter, bytecode rewriter and patcher, and CPython frame object extractor just for a stupid joke about error handling.

On the plus side, I now know quite a bit more about how Python works internally.

Does it work?

Well, I’m certain there are loads of bugs. There are also glaring omissions (see block stack). Also it’s a terrible idea. But I put all of these things into a program running ON ERROR RESUME NEXT and the answer appears to be True.

Related work

The sensitive reader will be disquieted to learn that there is in fact an existing implementation of, effectively, ON ERROR RESUME NEXT in Python. It's called fuckit.py and it works by wrapping every single line of code in a try: ... except Exception: pass block. Which is kind of brutal but presumably significantly more robust than this… thing.

(1) To be fair to ON ERROR RESUME NEXT, what you're supposed to do is check err() in the next instruction, which gives you C-style error handling. But it is of course super tempting to just ignore all the errors instead. Again, rather like C.

(2) I'm pretty sure I was sober, actually, but if I admit that then what's my excuse?

(3) Next instruction? What happened to next line? Well, it makes sense to resume on the next line in the function which generates the exception, but not in parent (caller) functions. What if explodey did some useful work in the lines after the exception and managed to return a useful value?