Remain resilient, fail fast

The network is not reliable. We know it and we build our systems accordingly. We use design patterns such as circuit breakers and bulkheads to isolate problems and contain their repercussions. Whenever our PostgreSQL instance, our Kafka cluster, or a neighbouring microservice becomes unavailable, we patiently wait. We continue to operate, albeit in a limited capacity. We make sure that our systems withstand temporary problems and can recover from them.

There is one moment in our application’s life cycle when this principle does not apply: the application start-up phase. If a newly started program cannot establish a connection to a service it depends on, it should immediately crash. Don’t try again, don’t open your circuit breaker. Give up right away. Shut down and return a non-zero exit status.

The idea of failing fast at the code level is not new. It’s a good principle. If the invoked function cannot compute a correct result, if the arguments you passed to a constructor are invalid, or if the called object is in an incorrect state, don’t try to make things up. Throw an exception or use your language's idioms to make sure that someone notices the problem. Faults swept under the rug will result in a crash at the least expected moment, along with the least obvious stack trace.
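
As a minimal sketch of this principle, here is a constructor that rejects invalid arguments instead of making something up. The DatabaseSettings class and its fields are hypothetical names chosen for illustration only.

```python
class DatabaseSettings:
    """Hypothetical settings object, used only to illustrate failing fast."""

    def __init__(self, host: str, port: int) -> None:
        # Reject bad input right here, where the mistake is still obvious,
        # rather than letting it surface later in an unrelated stack trace.
        if not host:
            raise ValueError("host must not be empty")
        if not 1 <= port <= 65535:
            raise ValueError(f"port must be between 1 and 65535, got {port}")
        self.host = host
        self.port = port
```

With this in place, DatabaseSettings("", 5432) raises a ValueError at construction time, long before anything tries to open a connection with it.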

Initialise, or else

We can divide the runtime behaviour of most software we write into two parts: the start-up phase and the operating phase. Consider grep, the familiar POSIX utility. It has two phases. First, it reads its command line arguments, analyses the given regular expression, and opens its input files. Then, it enters a loop: it reads data, searches for the pattern, and prints matching results. There is no point in reading any input data if the regular expression is malformed.
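
The two phases can be sketched in a few lines of Python. This is a toy illustration, not how grep itself is implemented; the function names compile_pattern and matching_lines are made up for this example.

```python
import re
from typing import Iterable, Iterator


def compile_pattern(expression: str) -> re.Pattern:
    """Start-up phase: validate the regular expression before reading any input."""
    try:
        return re.compile(expression)
    except re.error as exc:
        # A malformed pattern will not fix itself, so give up right away.
        # SystemExit with a message terminates the process with a non-zero status.
        raise SystemExit(f"invalid regular expression {expression!r}: {exc}")


def matching_lines(pattern: re.Pattern, lines: Iterable[str]) -> Iterator[str]:
    """Operating phase: read data, search for the pattern, yield the matches."""
    for line in lines:
        if pattern.search(line):
            yield line
```

A caller would invoke compile_pattern once, for example with the first command line argument, and only then start feeding lines from standard input into matching_lines.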

The same applies to back-end software which relies on remote services to provide the desired functionality. Just like grep, our imaginary application runs in two phases. It starts up, reads configuration from a file or requests it from a service discovery daemon, opens the necessary TCP ports and connections, and sometimes runs a health check. Then, just like grep, it enters a loop. It reads data from a socket, interprets it as HTTP requests, runs some business-specific rules on it, and inserts a row in a PostgreSQL database.

Let’s assume that for some reason our PostgreSQL instance is inaccessible. Our application starts up, notices that it cannot open a connection pool to the database, and instead of crashing it opens a circuit breaker. It waits in vain for PostgreSQL to come back online. Luckily, the application exposes an HTTP health check resource, which our load balancer regularly requests. With PostgreSQL inaccessible, health checks are red, and after a handful of requests the load balancer evicts the faulty instance.

What would the situation look like had our application failed fast? Upon its initialisation it would attempt to connect to PostgreSQL. With the database inaccessible, the application would simply crash. The process would terminate right away, perhaps throwing an exception with an explanation of the problem. The responsibility would move elsewhere. Your process would most likely be supervised by a tool like systemd, run in a Docker container, or managed in a Kubernetes pod. Either way, its supervisor would learn immediately that the application crashed. The reaction and the follow-up recovery would happen without delay. Your feedback loop would become tighter.
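
A fail-fast start-up check could look roughly like the sketch below. It only verifies that a TCP connection can be opened; a real application would more likely open its connection pool through a PostgreSQL driver at this point. The function name require_dependency and the default timeout are assumptions made for illustration.

```python
import socket
import sys


def require_dependency(host: str, port: int, timeout_seconds: float = 3.0) -> socket.socket:
    """Try once to open a TCP connection; exit with a non-zero status on failure."""
    try:
        return socket.create_connection((host, port), timeout=timeout_seconds)
    except OSError as exc:
        # No retry and no circuit breaker during start-up. Report the problem
        # and terminate, so that the supervisor sees the failure immediately.
        print(f"cannot reach {host}:{port} at start-up: {exc}", file=sys.stderr)
        sys.exit(1)
```

Called at the very beginning of the program, before any listening port is opened, this turns an unreachable dependency into an exit status of 1 that systemd, Docker, or Kubernetes can react to.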

No time for optimism

Now, you might rightfully ask, what if PostgreSQL is inaccessible just for a second or two? What if it comes back soon? Why shouldn’t I use the resilience mechanisms I have in place to reconnect to it? After all, a network loss is not a malformed regex; it can heal itself.

The decision boils down to our prior observations. Let’s assume that your application started successfully and was connected to all the services it depends on. If the connection is lost, you know that at some point in the past there was a service available under that IP address. Trying to connect again will likely fix the problem.

However, if your application cannot tell whether the service it needs to communicate with has ever been available under a given host name, is your optimism justified? What if there is a typo in the configuration? What if the EC2 instance you deployed your application on belongs to a security group that is not allowed to access your database? Waiting will not help. Just crash and provide feedback immediately. Bring attention where it is needed.
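
The difference between the two situations can be expressed directly in code. The sketch below is one possible shape, under the assumption that connecting is wrapped in a zero-argument callable which raises ConnectionError on failure; the function names and the back-off parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_at_startup(connect: Callable[[], T]) -> T:
    """Start-up phase: a single attempt. Any error propagates and ends the process."""
    return connect()


def reconnect_while_operating(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Operating phase: the service was reachable before, so retry with back-off."""
    if attempts < 1:
        raise ValueError(f"attempts must be >= 1, got {attempts}")
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the last error propagate
            # Wait 0.5 s, 1 s, 2 s, ... (with the defaults) before trying again.
            sleep(base_delay_seconds * 2 ** attempt)
    raise AssertionError("unreachable")
```

The sleep parameter exists only so that the waiting can be replaced in tests. In production code the retry branch would usually live inside a circuit breaker or a connection pool rather than a hand-written loop.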

Making our software able to withstand connectivity issues is the sine qua non of resilient systems. The tools you use to achieve this goal should allow you to differentiate between the two phases of an application's life cycle. If your application is up and running, make sure it won’t trip when it loses a connection. At start-up time, however, there’s no time for optimism. There’s always someone, or something, relying on feedback; don’t keep them waiting.
