Recently I read a tweet from Victor Klang the deputy CTO of Typesafe and one of the core contributors to Akka since the project was initiated:

<wall-of-consent type="twitter'>

While some developers may have a feeling where this is supposed to point us to, others may struggle with similar questions, especially given the current trend towards Self-Contained-Systems or Microservices, which can make teams and systems more independent but also more vulnerable to the fallacies of distributed systems. Initially I thought this blog post is going to be about how to build reliable web clients with Akka and Akka Persistence. While this is an interesting topic by itself, I now think it is more crucial to first understand the underlying problems and trade-offs before we learn how to get around them.

Scenario

A common use case is the creation of a customer across system boundaries. One system collects customer information, the other system is responsible for validating and storing this data.

Creating a customer in a remote system
Creating a customer in a remote system

Once the process is finished the interaction with the customer system should be started. We expect both systems to be in a consistent state after the operation. Lets assume the communication between the services is done via HTTP, and the interaction with the resource for creating a customer looks like this:

curl -X POST -d '{"name": "Tobias", "country": "de"}'
 http://customerservice.mycompany/customer

What could possibly go wrong? What most HTTP clients give us is a so called at-most-once guarantee, which can be translated to: „A request may arrive and be processed by the receiving end“. While this is certainly enough for query like scenarios, it is a very weak guarantee for command like operations, which should result in a side-effect, like a database update or another service being called. In this example, a lost update in the customer system would result in a completed customer opening process without an existing customer, which most likely is not the intended behavior.

At-least-once Messaging

But how can we improve the situation if we care about (eventual) consistent state? One common approach would be to retry requests in an error situation like a timeout. While this can prevent some lost updates from happening, it is not a silver bullet, especially when a system restarts or crashes happen once a request was initiated. To be safe in those situations it is required to persist the request state so that it can be continued after a system restart.

Idempotent Resources and HTTP Errors

As I have said before retries in error situations are an important aspect to achieve at-least-once behavior. On the other side at-least-once also implies that the same request can be processed multiple times by a receiver. But what is an error situation to recover from, and what operations can be retried safely?

When using HTTP one class of errors are general network errors which result in systems not being accessible. Other situations are timeouts which are especially important in situations where systems respond slowly. Another category are the so called HTTP Server Errors (5xx), which in most practical contexts can be retried safely. Other HTTP response classes do not indicate that situations can be recovered with a retry and will likely lead to endless retry / error loops.

But even when you identified the recoverable error situations it is not necessarily safe to retry the associated operations. Luckily HTTP helps us again by associating idempotency rules with HTTP methods. If you want to do a query with a GET, it is safe to do that because the method is idempotent. The POST method we use to interact with the customer resource is not idempotent in the general case. In order to fix that we can change the Customer Information System to allow idempotent calls using the HTTP PUT method. PUT is safe in the general case because the client already decides on the resource where the customer or any other entity should live. Also PUT expects implementations to send complete representations of the resource, which makes it possible for the receiving side to deduplicate requests. The customer resource interaction now would look like this:

curl -X PUT -d '{"name": "Tobias", "country": "de"}'
 http://customerservice.mycompany/customer/c3cb13e3-78f3-4cb4-a984-ad3d733e9236

The HTTP contract now indicates that this method is safe. Which still does not shield you from erroneous server implementations but thats about the best guarantee you can get anyways ;). If you can not modify the receiving side, you can only try to minimize resends. As a consequence you will have to live with occasional message duplications. If thats not an option in your business scenario, you should stay away from at-least-once messaging after all.

Are you asking yourself why exactly-once is not an option? It is because there is no such thing as a general exactly-once in distributed systems, as this article points out nicely.

Compensating Actions

If we implement a client to support retries correctly, and a server to guarantee idempotence for that operation, we already made the execution more reliable in a distributed setting. But things can still go wrong: Usually the validity of a operation execution is only given within a time bound. In the customer example we would expect the customer to be accessible in seconds or at worst in one minute. If our retry logic now hits that limit, we have to provide a fallback or provide a compensating action which returns the system to a clean state.

Those compensations can be very business specific. In our case we could cancel the process and notify the user which was initially assigned to the customer creation process.

This would bring the COP system in a clean state but not necessarily the Customer Information System. Due to the nature of distributed systems we could have send a command, which was received and processed, without being aware of the response. If we would only compensate the sender system, we may have left the receiving system in a faulty state. Since we don’t know, if the operation on the receiving system was processed, we have to include it into the compensation logic. The good thing is that we used ‚PUT‘ to create the customer so we already know about its location. This makes it easy to build a compensation like this:

curl -X DELETE http://customerservice.mycompany/customer/c3cb13e3-78f3-4cb4-a984-ad3d733e9236

While this seems easy when looking at the operation, we have to apply the same principals as we did for the customer creation. With other words we have to implement an at-least-once behaviour to make sure that the compensation will be applied safely.

While errors like this may sound unlikely and may never happen during development they happen in production systems and therefore we have to adapt our implementations to cope with those scenarios or we have to live with the consequences.

Everything was so easy with traditional ACID

Probably thats an exaggeration but certainly a big category of problems goes away when you can go with local database operations. A consequence of that should be that distribution is avoided where possible. But wait doesn’t that contradict the whole Microservices idea? It does, but only for some incarnations of this model.

The biggest problem with the described business case is not the technical realization. The complexities of distribution are rather a consequence of a wrong functional service mapping. Why should a Customer Onboarding Process and a Customer Information System exist on different systems since their context, their language is so closely related to each other? If this would have been one system in the first place, the extent of distributed agreement would have been reduced. The realization that functionally integrated systems are worth having is nothing new but still valid even in times when fine grained distribution is becoming more fashionable.

This should give you a feeling that while it is possible to develop reliable web clients you should always focus on finding good functional service boundaries. As a consequence we can reduce the amount of distributed operations and therefore the overall complexity of a system of systems. Often it is better to avoid any distribution and go with a monolith. Martin Fowler wrote a great article on what he calls the Microservice Premium.

How to implement this

This blog posts demonstrates one approach and some general principles which are important for building reliable web clients. In reality the problem is that those techniques are not well supported in most development environments. In my next blog post I will demonstrate how Akka Persistence can help with building these abstractions in a meaningful way, so stay tuned.