Errors and Partial Functions

July 10, 2019
programming

What is an error? and should you model them with exceptions or return values? Here’s an overview for what an error is and how you should write code to handle it.

A Motivating Example 🔗︎

Let’s start by considering the Spin-Wheel action within the game of “Wheel of Fortune”. Most people would divide the possible outcomes in the following ways:

  • Valid Outputs:
    • $500
    • $1000
    • JACKPOT
    • BANKRUPT
    • LOSE_TURN
  • Errors:
    • WEAK_SPIN
    • WHEEL_ON_FIRE

This raises the question - “Why do we think of WEAK_SPIN as an error?” Wouldn’t it be just as valid to model the Spin-Wheel action so that it returns a SpinWheelResult instead of a Wedge, which could then have WheelLogicViolation as a return value, making WEAK_SPIN an actual output value instead of an error.

This is one illustration of the idea that nothing is normatively an error - they are simply different branches of behavior that need to be modeled - and it is a matter of perspective that we consider them to be errors.

SpinWheel => Wedge | Error
SpinWheel => SpinWheelResult = Wedge | LogicViolation

Indeed, if errors are simply other branches of behavior, it raises another question: Why do we even need error constructs? We already have mechanisms such as if/else statements and polymorphism that allow us to handle branches - why don’t we just use those for everything?

Errors are ultimately a structural concept derived from a useful underlying formalization with common patterns for handling these branches. As we’ll explore later on, it’s exactly right that we’ll want to use the language features we’re already comfortable with in order to handle them.

Partial Functions 🔗︎

A function f: Domain -> Codomain is a mapping for elements of a Domain to elements of a Codomain.

A total function is one that is defined for all elements of the Domain, in contrast to a partial function, which is undefined for certain inputs.

A simple example of this is division /: (R,R) -> R, which is undefined for the input denominator of 0. Of course, if you had instead defined division as /: (R,R\{0}) -> R, it would be a total function.

Arbitrariness of Totality 🔗︎

Totality and Partiality are ultimately modeling properties whose ultimate expression is limited by the strength of your type system. The more complex Domain-Sets you can express in your type system, the more functions you can directly express as Total without having to resort to some of the other tricks we’ll discuss later.

But in many ways, the functions you define are independent from your type system modeling - the two division operators above are both “division” in the sense that matters, but one of them better precludes inputs that are nonsensical for that operator.

Partiality-Error Equivalence 🔗︎

The second insight is that our concept of “Error” maps 1-for-1 with any cause of Partiality for a Partial Function.

For example, a DivideByZero error is simply a cause of Partiality for a division operator that allows 0 in its Domain.

Pure Partiality 🔗︎

There are two major kinds of Partiality that exist when writing software. The first is Pure Partiality, which is concerned with the errors that relate to input values with no corresponding mapped output value in the Codomain.

There are two sources for this:

  • Static Enforcement: The point of describing your Domain and Codomain is to only have those values enter and exit your function. A dynamic type system, while allowing you to describe your function, does nothing to prevent invalid values from runtime evaluation.
    • Example: passing a string into the division function in Python
  • Imprecise Definition: Your Domain may be specified in a way such that there are values with no corresponding mapped values in the Codomain.
    • Example: passing in zero to the denominator of the division function in Java
    • Example: querying a database for an ID that doesn’t exist

In a sense, these two kinds of “Pure Partiality” errors are variants of each other. Errors are ultimately something that happen when your software runs, so if your Type System doesn’t statically prevent invalid values from being passed in, you can think of every function you write as being of type f: Any -> Any, which means that you have an extremely wide variety of Imprecise Definition errors to handle at runtime.

This is one of the reasons I believe type-inferred static typing to be a major advantage in language design. A dynamic language is fundamentally typed at runtime with a single type Any, which means that to properly handle the Imprecise Definition errors, you need to assert the correct type and then handle other inputs for every function. It’s simply less code to use a static type system, which gives you a concise way to make the type assertion and enforces types at compile time, which obviates the need to write the error handling logic at all.

It’s important not to go off the deep end with strong type systems, however, as a consequence of the Halting Problem is that no matter how powerful your type system is, there will always be a function whose “exact” Domain cannot be expressed, since a compiler is ultimately a program and cannot enforce the Domain of an undecidable function.

IO Partiality 🔗︎

There is a second kind of Partiality of deep and fundamental importance, which is that our functions run on machines and do things in the real world, which has two main sources:

  • Physical Layer: every function ultimately runs against a real machine despite our virtualized programming model. Therefore, every single function is Partial, since there are physical constraints (such as OutOfMemory) that can cause potentially any function to fail.
  • Side-Effect: Virtually every function that is of practical use will perform some side-effect in the real world that may fail.

Good Error Code 🔗︎

Given what we now know, let us consider the ideal properties our error code should possess:

  • Authors of code should be encouraged to write total functions by explicitly modeling all errors.
  • Callers of code should be forced to handle all branches of behavior explicitly.
  • These branches should ideally be handled with the same branching mechanisms all other code in the language is written with.

Total Your Functions 🔗︎

The essence of robust software isn’t to eliminate errors, as errors simply represent unusual branches of behavior in our modeling. Rather, robust software exhaustively handles all branches in intentional fashion, instead of defaulting to the language runtime’s behavior (which may or may not be correct).

Error handling mechanisms exist to help you convert Partial Functions into Total Functions, and if there’s one thing you should do, it’s to always attempt to “Total” your functions with these mechanisms.

Consider:

public static Rational divide(Rational numerator, Rational denominator) throws ArithmeticException

This signature is essentially /: (Q,Q) -> Rational || ArithmeticException, which is Total!

Sentinel Values 🔗︎

I call it my billion-dollar mistake. It was the invention of the null reference in 1965.

— Tony Hoare

The first error handling mechanism we’ll consider is the Sentinel Value. This is the act of returning a value that acts as the “sentinel” and represents an error (such as null or -1). There are a number of risks with sentinel values:

  • Sentinel Value errors are implicitly contextualized (-1 means different things depending on where the error occurred), which means that your handling must be maximally localized - invoke the function and handle the error immediately before passing the return value anywhere else (ideally before even binding it to a variable).
  • This also implies that you should impeccably document your return values since that and reading source code is the only way your callers will know those errors exist.
  • Try to use things like a coalescing operator to keep this compact.
$username = $_GET['username'] ?? 'not passed';

Exceptions 🔗︎

The second mechanism we’ll consider is the Exception. As an evolution in response to Sentinel Values, Exceptions have two major things going for them:

  1. A separate value with explicit control flow implications made it very difficult for people to silently and accidentally ignore an error.
  2. Unchecked exceptions also allow us to avoid contaminating literally every single function signature with a error return type because of the partiality causes that result from physical-layer failures, such as OutOfMemory.

Unfortunately, Exceptions also cause a few problems:

  1. They are essentially a sum-type for the return value, but inhabit a completely parallel space in the function signature.
  2. Try/Catch is a completely parallel control flow structure to the normal branching mechanisms of most languages.
  3. The coupling of a particular control flow to errors makes it extremely easy for errors to leak well beyond the domains they should be encapsulated within.

Keep the following in mind:

  • Bias towards using checked exceptions. Representing your true return values in your type signature will recruit the compiler to help force your callers to robustly handle all cases as well as minimize documentation.
  • Localize your try/catch clauses as much as possible. Handle the exception and then translate to a default or other value as soon as possible so you can go back to using normal branching mechanisms (if/else, switch, polymorphism, etc.)
  • Wrap exceptions instead of blindly propagating them. If your system calls a database under the hood, don’t catch-log-rethrow - instead, catch and throw a more domain-specific exception (e.g. MySystemException())

Error Monads 🔗︎

The last model we’ll consider are the error monads. These are the Option, Maybe, Try, and Results of the world.

They are containers, which means that you’ll have to crack them open (and handle errors) in order to get at the real value, but it also means that the errors are all well-contextualized, making it possible to pass them around as values safely. When you have the power of these tools available to you:

  • Avoid using them as if they were alternatives for if/else statements - methods like .isPresent() or .get() should generally be avoided. Methods like .filter() or .getOrElse() are generally preferable.
  • Use .map() in order to defer accessing the actual value, and use .flatMap() in order to cleanly chain with other partial functions.

Code Example: Division 🔗︎

If you look at a basic division example, a typical method might have a signature like this:

public static Integer divide(Rational numerator, Rational denominator) throws DivideByZeroException

Whereas I find an approach like this more compelling (intentionally made more verbose in order to illustrate what’s happening):

public static Try<Integer> divide(Rational numerator, Rational denominator) {
    if (denominator.isZero()) {
        return Try.of(() -> { throw new DivideByZeroException; });
    } else {
        return Try.of(numerator / denominator)
    }
}

The compiler forces you to handle this robustly:

_.divide(1, 0) + 1; // Doesn't compile
_.divide(1, 0).map(d -> d.toString).getOrElse("Oops"); // "Oops"

But you can still chain together these operations without too much rigamarole either:

_.divide(10,5).flatMap(d -> _.divide(d, 0)).map(d -> d.toString).getOrElse("Oops");
// "Oops"

If you want to experiment more with this style, I’ve put together a couple of basic exercises that you can try out.

Well-Modeled Errors 🔗︎

Error handling code is still code, which means those remediation routines should follow the same Modularity principles that all good code follows.

There are a few heuristics that are particularly useful:

  • Authority
  • End-to-End Principle
  • Volatility Risk
  • Branch Elimination

Authority 🔗︎

The first major question to ask is: Is the system with enough knowledge and context the one to properly handle the error?

Let’s say you were trying to send a package with UPS.

If the error was “a truck broke down”, the entire contract I have with UPS is that I don’t want to care about how a package was transported - UPS would be the correct authority for handling that error (by re-routing the package, etc.) and it should never bubble up to me.

On the other hand, if the error was “the address is invalid”, the only possible authority for remedy-ing this error is me since I’m the only one who really knows who I am trying to send the package to. Magical auto-correct of addresses can be attempted, but at the end of the day, I am the authority for this error.

Physical Layer 🔗︎

There’s an interesting corollary of the Authority Heuristic in terms of how to deal with the Physical Layer Failures.

The truth about the Physical Layer is that those errors are both inevitable and unhandleable. Code running on a machine cannot remediate failures of the machine, which makes those errors not even worth modeling.

The reality is that errors are errors within their context, but there are always other systems that watch systems, and to those systems - errors are simply eventualities to be managed. For example, the processes of a webservice are managed by a daemon, which are monitored by a metrics system, which are watched by a human, which is usually arrayed in a self-replicating redundant form.

WHEEL_ON_FIRE (from the Wheel of Fortune case study) is a perfect example of this - the meta-context of the game (i.e. the TV show production) will have processes in place to handle errors that cascade through all layers (Turn off the TV broadcast and display a “We Are Experiencing Technical Difficulties” message, while people try and figure out what to do to get the game state fixed).

When inevitable and unhandleable errors occur, Process Teardown should be how we proceed, and recovery must be delegated to the meta-system (whether daemon or human).

This then means that all code should be written as if it could fail and terminate at any moment of time, and that the resulting state should be such that the meta-system’s recovery process will result in correct application state.

End-to-End Principle 🔗︎

When combined with the Authority Heuristic, the End-to-End principle is critical for understanding Retries.

Devs typically think of retries as error handling, but it turns out that they do not help with error handling.

A retry policy can reduce the propagation of errors but cannot eliminate any error. A retry policy will either generate an infinite loop (infinite retries == infinite loop) or the error will simply propagate outward more rarely (but will still propagate).

Fundamentally, some system has the authority to execute full and total remediation of the error, and no retry policy can be inserted in between that gets around the fact that that system must understand what to do with that error.

Retries are ultimately a reliability optimization, not a substitute for robustness.

Volatility Risk 🔗︎

Error handling code is code, which means that the same consideration of avoiding entangling components that are vulnerable to change you would apply to your happy path should apply here. Your error values and their associated remediation should encapsulate so that the enclosing scope has no need to understand the error.

A common area for this is DB IO:

val myVal = query_database_for_id(1)

This method results in SQLException propagating to who knows where, meaning anyone could write catch (SQLException e) and now the fact you are using a SQL database has leaked across your code.

Branch Elimination 🔗︎

The core idea behind branch elimination is that since we have the flexibility to define our domains however we wish, we should define them in terms that eliminate branches entirely.

John Ousterhout describes this as “define errors out of existence”, and his first example is an unset command that takes a variable and removes it. If the variable doesn’t exist, his implementation, which he now regrets, throws an error. Alternatively, however, it could simply succeed.

These differences boil down to the semantics you wish to model with your function - is this a specific synchronous operation you’d like to be performed? Or is it perhaps a declarative statement of what you’d like the world to look like (this variable should be unset)? Either is perfectly reasonable, and which is valid depends on your caller’s needs, but one of them certainly has fewer cases that need to be exposed as an “error”.

Static Service Configuration

May 30, 2019
programming

Conceptual Arbitrariness

May 15, 2019
programming

Algebra For Accounting

March 27, 2019
programming