TLA+ (standing for Temporal Logic of Actions) is a formal specification language designed by Leslie Lamport for the specification of system behavior. Roughly speaking, it allows you to model all the states that your system can have, and check certain properties on these states.
TLA+ is a rather large and complex language, with another language built on top of it called PlusCal. This is why I’m not going to attempt a systematic introduction into the language, but we’ll be doing a whirlwind tour of TLA+ based on a simple example that most people are familiar with: Complicated legislative procedures on a Greek island. Just kidding, we’ll model how you’d order pizza at a party if you’re a nerd.
Two-phase Pizza Party
Let’s take the process of ordering pizza for a large party of people. First, you say “Hey everyone, let’s order Pizza!”. This makes you the coordinator. Everyone chooses a pizza and tells you their choice. You try to remember the choices (or better yet, write them down), and make sure everyone’s selected a choice. Then, you tell everyone what you’re ordering. If no one complains, you call the pizza place and have them bring you 7 hawaiians with extra tuna. As discrete steps:
- Tell everyone you’re ordering pizza
- Receive choices from everyone
- Once everyone’s selected a pizza, tell everyone their choice
- Everyone confirms their choice, or says “Wait, let’s do that again” to restart the process
- You order the pizzas
I’ve modified step three a bit, compared to “real life”, where you just read out the list of pizzas you’re ordering. This process ensures that everyone gets the pizza they wanted, even if they changed their mind. The key is step four, where everyone’s asked to confirm their choice. If someone doesn’t confirm, you’ll have to do the process again to make sure everyone’s happy with their choice. Note that you’re only allowed to change your mind between steps 2 and 4. Once you’ve confirmed your pizza choice, you’re stuck with it (hawaiian with extra tuna it is, sorry).
The reason you’re doing this complicated process is that you want to make sure that at some point in the future, you’re ordering exactly the pizzas that are wanted. I’m phrasing it this way to exclude the pizza place, who might screw up your order. Modelling a restaurant is beyond the scope of this article, so we’re only looking at the ordering process.
“Eventually, you order exactly those pizzas that are wanted” is a property that we’ll be able to verify with TLA+. But first, let’s try and start specifying our party, and then let’s see how we can verify the correctness of our order.
Temporal Logic of Pizza: Modelling Pizza Ordering
I’ll try and explain the syntax as we go, but you can use the TLA+ syntax cheatsheet to find any symbols you don’t understand. The files for the different steps are linked in the respective sections so you don’t have to do copy-pasting.
First, some fundamentals. TLA+ is heavily based on set theory. Writing a TLA+ specification entails defining an initial state for the system, and expressing the “state transitions” to the next state. The TLA+ model checker (“tlc”) then applies the “transition” operators to the initial state, until no more states can be reached. TLC also tracks the path to each state, so we can tell how the sytem “got to” a certain state. You’ll see an example of these traces in a later section, when we’ll first encounter a failed verification.
We express initial states and transitions with “operators”, something that we would call a function in other programming languages. Let’s see an example for defining the initial state of our pizza party with an operator called
Init, some variable definitions and some initial operators. I must warn you, TLA+’s syntax is uh… eclectic.
Truly a syntax only the creator could love (and it’ll get worse). But let’s discuss the semantics. The first line declares two variables, one for what every person is ordering, and one for what order the coordinator has written down for every person. You’ll notice there are no type definitions. TLA+ is untyped, so there’s no need. There’s a way to mitigate this slightly that I’ll introduce later.
Then, we define some constants: The set of people and the set of available pizzas. Sets are introduced with curly brackets. Finally, we have a constant for when somebody hasn’t ordered a pizza at all. We have three people, and three types of pizza.
In line six, we define our initial state for the variables. Both variables are something that TLA+ calls a function, but I tend to think of as a map. We initialize both ordering and ordered to a map mapping every person to the
no_pizza constant. That’s because no one has ordered anything yet.
You’ll also notice the
/\` at the beginning of the lines ofInit`. This is something called “bullet point” syntax, where all clauses on the same indentation level (yes, whitespace is significant in this case) are combined with the leading operator. We’ll also see nesting bullet points later.
Now, let’s model step two in a bit more detail, where people decide on a pizza, and they tell the coordinator their choices.
More syntax: We’re defining two operators that express a person deciding on a pizza, and a person
telling the coordinator their pizza choice. Line three uses the existential quantifier ∃. This
expresses that there is at least one pizza for which the following clause is true. In practical
terms, we’re selecting any pizza from the set of all pizzas. In line four, we update the ordering
map. The quote mark (called “prime operator” by TLA+) means that we are defining the value of
ordering in the next state. The syntax on the right means “update the map for a single value”. In
this case, we update
ordering to store our decided-upon pizza. The last line of the operator tells
TLA+ that the
ordered variable is not changed. If you forget this, TLA+ will complain, because the
ordered could be anything after the operator was evaluated. Note that we’re not allowing
people to change their decision at the moment. You can also see the nested bullet point syntax here.
TellCoordinator operator should be almost clear now. Two tidbits: Square brackets do lookup in a map, and the hash mark stands for “not equal”. Basically, we’re checking to see if we’ve told the coordinator our pizza wish yet. If not, we extend the
ordered map by setting it to the pizza we’ve decided on.
Now, we get to step three, the “confirm” round.
We first define a helper operator
ConfirmPerson that checks whether the pizza a person ordered is identical to what the coordinator has received. We then define a
ConfirmAll operator that checks that everyone has ordered: For all people, it must be true that we’ve received a pizza choice. Then, we can confirm that this pizza choice is correct. You’ll notice that this is an operator that doesn’t change any variables. This is because we’ve reached our end state: We’re confirming the order and everyone gets what they wanted, so we can just stay in this state forever.
We’re almost done. Now we need to glue those operators together with some extra syntax (the
\\/ operator is the Boolean “or”).
Traditionally, the operator representing the spec is called “Spec”, the operator for setting up the
initial conditions is called “Init”, and the operator for selecting the next possible states is
called “Next”. The two square brackets are the first temporal operator we’ve seen. It translates to
“always”. In this example, it means that the operator
Next must always be true. If that is not the
case, we’ve reached a deadlock in our system, because we cannot change state any more. The
underscore syntax relates to something called “stuttering”: TLA+ always allows nothing to happen.
Steps where neither
ordered change are called “stuttering steps”, and they’re useful
when combining specs. The underscore syntax also allows steps that change neither
ordered. I’ll explain more about stuttering later, when we’re running the spec.
Next makes a
person either decide on a pizza, tell the coordinator about their choice, or confirms the pizza
But for some syntax boilerplate, we’re done! See the complete file plus the config file here. Now, you’ll need a way to run this spec (and the ones we’re about to write), so let’s do a quick detour into how to run the model checker.
Verifying your Party: Running TLC
Download the files we’ve just written and extract them. To run the TLA+ model checker, I use VSCode with the TLA+ extension. Installation instructions for the extension can be found here.
When you open the
.tla file with VScode, you’ll see some more syntax (that’s it though, I promise!) at the beginning and end:
These define the module boundaries, and I always copy-paste them between files. Note that the module name needs to be identical to the
.tla file name and the
.cfg file name. Have a look at the
.cfg file. There’s only one line, telling TLC where to find the spec we’re trying to verify.
Now, you can check the spec by executing the “Check model with TLC” action (either via the command pallette, or a keybinding you’ve defined beforehand).
When you run the model, you’ll get a success and some debug output.
You can see that TLC reached all of our operators. The number of states is incorrect, which I think is a race condition in the TLA+ plugin. If you look at the full output, you’ll see that TLC has generated 343 distinct states, one for each possible variable assignment.
What have we verified? When looking at
TellCoordinator, we see they can only
be executed once: Once someone has decided, their pizza choice is no longer
no_pizza, so they
can’t decide again. When someone has told the coordinator their choice, their ordered pizza is no
no_pizza, so they can’t tell the coordinator again. If we did not have the
operator, we would get a deadlock: When everyone has decided on a pizza and told the coordinator
their choice, there’s no longer a next state in the system. With the
ConfirmAll operator, a
successful verification means that we’ve reached our “final state”, and the partygoers get the right pizza. Note that the concept of “final state” is not a TLA+ concept. TLA has no problems with systems running forever. It is up to us as the specifier to model the system behavior that we want.
Now, our verification result is not a surprise: When people aren’t allowed to change their mind, none of this two-phase ordering rigmarole is necessary. Let’s see if we can model indecisive eaters as well.
I’ll Have… uh…: Modelling Indecisive Guests
Let’s allow people to change their mind about their pizza exactly once, and see if our process still works. We’ll need to introduce a new variable to store which person has changed their mind already, so we add that. Additionally, we initialize this variable to the empty set, because no one has yet changed their mind. Also, we need to keep track of who already told the coordinator their pizza choice. Otherwise, they could just tell the coordinator again after changing their mind, which is something we don’t allow.
First, let’s adapt the
TellCoordinator operator to store whether we’ve already stated our pizza preference:
Make sure we haven’t already told the coordinator our choice and that we’ve selected a pizza. Then, we update the
ordered map, and add ourselves to the
told_coordinator set to show we’ve already committed. Finally, the rest of the variables are unchanged.
Also, let’s model a person changing their mind:
This is very similar: In order to change your mind about a pizza, you must have already chosen a pizza, and not have already changed your mind. Then, you select a pizza that is different from the one you were thinking of ordering previously and update the
ordering variable. Finally, store the information that we’ve changed our mind. The rest of the variables are unchanged. Don’t forget to update the
UNCHANGED statements for the other operators. The files for the next example are here. Now, try verifying your model again.
No Pizza for us: We’ve got a deadlock
This time, you’ll get a failure, namely a deadlock. This means we can no longer go to a new state. Why is that? Helpfully, TLC provides an execution trace showing us how we got to the deadlock.
We can see here that Twoflower decided on a pizza (double beech, yum!), told the coordinator about
it, and then changed his mind that he’d rather want hawaai with extra tuna instead (blegh). However, he (and everyone else) have already told the coordinator of their choice, and can’t tell him again. So there’s nothing we can do next. Since
ordering is different from
ordered, we can’t place our order, as the pizzas would be wrong. What do we do?
We can do the whole thing again.
One more time: Retrying
Let’s model this. Compared to the previous modification, this is simple: We add a
Restart operator that empties the
told_coordinator set and the
ordered map. To do a restart, we must find out we’re in a deadlock. This has happened when everyone has told the coordinator about their choice, but the
ordering maps are different. We define this operator as follows:
Note that we’re not resetting the
mind_changed variable: We’re only allowing people to change
their mind once. Otherwise, we could have an endless loop of people changing their mind, and the
process restarting, without ever being able to put in an order. Proving that we would be able to
eventually submit our order is interesting as well, and we’ll do it in a later section.
Finally, we add
Restart to the
Next operator to make it possible to restart.
You can find the complete code here.
And when we run this, we get a nice green success. We can understand why by thinking about all the ways pizza orders can happen: The happy case is everyone deciding on a pizza, and telling the coordinator. We are done. The other happy case is everyone deciding on a pizza and some people change their mind before telling the coordinator. Then, everyone tells the coordinator what their choice was, and our order is correct. The unhappy case is someone changing their mind after telling the coordinator. We notice our order is incorrect when reading it back, and start the process again, this time with people not being allowed to change their mind if they did in the last order round. Eventually, we’ll have collected all the correct orders.
For this simple example, we can go through all the combinations of states in our head, but for larger systems, something like TLA+ is a godsent. In our example, we included the possibility of people changing their minds before telling the coordinator, which is something that would have been easy to forget if we modelled the system informally.
Make sure you’re ordering Pizza, not Burgers: Invariants
Before forging onwards to fairness, let’s talk about invariants. Invariants are something that should always be true of a system, no matter its state. After deadlocks, invariants are probably the next most useful thing that TLA+ allows us to define. Since I struggle to think of a useful invariant for our pizza ordering example, let’s instead use it to mitigate TLA+’s weak typing (as I hinted previously). It’s common to define an invariant that makes sure your variables are in the correct range. Let’s first write a type invariant operator called
We’ve used some additional language features:
\subseteq, as you can probably guess, is the subset
operator. We use it to make sure that
told_coordinator are always a subset of
people, and nothing else gets added in. The
DOMAIN function returns the domain of a function,
i.e. all values for which it is defined. When thinking of functions as maps, it returns the keys of
that map. Here, we check that we only store information about people we’ve defined. If a pizza became sentient and started ordering itself a family, that would be bad. Finally, we check that people have only ordered pizzas (or the
no_pizza placeholder), and not, for example other people. That would, again, be bad.
How do we use this invariant? We add it to the configuration file. This is the first time we’ve done something interesting with the config file. So far, the contents were always
SPECIFICATION Spec, but now we’re adding a new line to specify our invariant. The complete source code is here.
This tells the model checker to verify that the operator
TypeInv is true for every single state of
the system. Invariants can be incredibly useful for verifying that things you think should hold for
every state of the system actually are true in every state. For example, a bank account’s balance
should always be greater than or equal to zero. Verifying that this is true is not always easy in
the presence of
without introducing the concept of transactions.
You can run TLC to get a nice green result, showing you that your variables are always in the range you think they should be.
As a small aside (and a nice segue to the next section): The
INVARIANT configuration option is a special case. We can also define our type invariant using the “always” operator
. Of course, we need to change our configuration file to
PROPERTY AlwaysCorrectlyTyped as well.
AlwaysCorrectlyTyped specifies that
TypeInv should always hold. The other temporal operator that we’ll discuss in the next section is the “eventually” operator
But I Wanted That: Fairness and Eventual Properties
In the previous examples, we’ve added a “final” operator called
ConfirmAll that makes sure our system doesn’t deadlock when our order is complete. In this section, we’ll not include
ConfirmAll in the spec, but try to verify it as a property of the system: “Eventually, we will have the correct order”.
To implement this, we modify our
Restart operator to not check whether we’ve reached a deadlock by removing the check for an incorrect order, and resetting the
mind_changed variable after a restart. Additionally, we remove the
ConfirmAll operator from the
What we’ve done is allowed a restart with any order state. Remember that we want to do our verification as a separate invariant. When we verify this spec, we get a success. Why is that? Well, constantly restarting the order is something that TLA+ is absolutely fine with. As long as there’s no deadlock, the system is working. It is up to us to specify what we want the system to do. So let’s add a property that makes sure that we’ll eventually get a correct order. We do this by removing the
ConfirmPerson operators, and defining two new operators:
OrderCorrect is almost identical to the old
ConfirmAll operator, but without the
UNCHANGED call. For verification, it’s just a Boolean function without any effect on the state of the system. The complete code is here.
Let’s try and verify our system. We’ll get a “Temporal properties were violated” error. Why is that? Let’s look at the trace:
It looks like Rincewind is refusing to tell the coordinator of his order (ashamed of his order, no doubt). Recall the stuttering steps I’ve mentioned previously? This is them showing up to cause prevent us problems. Who says that your guests have to place an order at some point? Peer pressure may not be a strong enough factor to force them to divulge their order to you, and this is merely TLA+ alerting us to this fact.
This wasn’t a problem in earlier chapters, because we were only verifying the absence of deadlocks in our ordering process. We only verified that something can always happen, not that it actually does happen at some point in the future. It’s the same story with invariants: Stuttering steps do not modify any variables, so when the invariant holds in any state, it will still hold after an arbitrary number of stuttering steps.
Not so with our
EventuallyCorrect operator: What we’re verifying is neither deadlock-freeness, nor invariants (safety), but “productivity” (liveness): We want to test that after a finite amount of time, we’ve successfully ordered our pizzas. With stutter steps, this might not be the case. TLA+ can insert an arbitrary number of stutter steps, which means that we might be waiting an infinite time to receive pizza orders from guests.
How do we tell TLA+ that our guests are susceptible to peer pressure (yes, even the Luggage) and will eventually place an order? We can add assertions about fairness. There’s two notions of fairness in TLA+, but I’ll skip the details, as liveness and fairness are complicated enough topics in and of themselves. First, let’s see our spec with added fairness. For readability reasons, we’ve introduced a list called
vars for all defined variables. The “double chevron” syntax that you might have noticed earlier is TLA+’s way of expressing sequences (or, as we in the business like to call them, “lists”).
We’ve told TLA+ that if the operators
TellCoordinator are available for a
person, they will get executed at some point in the future. We still allow insertion of stuttering
steps, but now we know that it will be a finite number of stuttering steps. All guests will
eventually decide on a pizza, and all guests will eventually tell the coordinator of their choice.
Note that we didn’t make
ChangeMind(p) fair: We don’t really care whether people change their mind
or not, and it isn’t required for the completion of our pizza order.
Let’s try verification and see what the problem is now.
It’s Rincewind again, the silly bugger. It looks like he’s constantly changing his mind! And why wouldn’t he? You didn’t say “Let’s order the correct pizzas in a finite amount of time!”, you said “Let’s order pizza!”. It may sound like we’re back at the beginning, but that isn’t entirely true. By reformulating our “order is complete” operator with a temporal property, we’ve introduced the concept of time. The first few specs didn’t model malicious guests simply refusing to order. Our reformulation alerted us to this possibility, and we took steps to mitigate it.
Let’s Get It Done: The Final Spec
In an effort to bring this to a satisfying finish, and to prevent Rincewind from stalling our pizza party with his indecisiveness, let’s design a new mechanism for kicking people from the ordering process if they change their minds too often. Let’s first design the mechanism in more detail.
If, during the ordering step, a person has caused the pizza order to fail by changing their mind
after submitting their order for more than three times, they don’t get a pizza at all and the order
is completed without them. We can model this by storing the number of times someone has interfered
in the order, and ignoring their order in the
OrderCorrect operator if their interference count is equal to the maximum number of interference.
There’s several things we need to do: Tell TLA+ about integers, store the number of times a person has interfered in the ordering process, and change our
OrderCorrect operator to check whether noninterfering people have their order correct. Let’s get started.
First, explaining about integers, and introducing a variable for storing the number of interferences. We also don’t need the
mind_change variable any more, since people can change their mind as often as they want, they’ll just get ejected if they do it at an inopportune moment. I’m just writing down the changed lines here, the full code can be found at the Link.
So far, nothing new. However, updating the interference is a bit more tricky: When restarting, we want to increment the interference counter (if it’s smaller than
max_interfered) if the person was holding up the order. Otherwise, the old interference counter is retained. I’ve defined a helper operator to make this easier to read. First, let’s have a look at that helper operator that calculates a new
interfered map and our rewritten
This implements our interference calculation. If the ordering for a person is incorrect, that person interfered in our order, and we make a big tally mark on the “might not get pizza” list. If they’ve already interfered for the maximum amount of times, we don’t need to do anything, they’re already out. Otherwise, just keep the original interference count (the person was good this round). Now, we calculate a new
interfered map in restart.
mind_changed is gone because we don’t need it any longer. People are free to change their mind as often as they want, as long as they don’t interfere with the order too often.
Finally, besides removing
mind_changed and adding
interfered to the other
UNCHANGED calls, there’s now only the
OrderCorrect to write. You can try your hand at this, it’s not too complicated.
You get one paragraph of thinking time.
Okay, here’s the solution:
The order for a person is correct if either they’re being ignored for interfering, or they’re not interfering and the order you’ve got corresponds to what they want. Additionally, they must have ordered some pizza.
You can now verify this. Due to the introduction of integers, this takes a bit longer (about ten seconds on my laptop). This is also why I’ve bounded the interference value in
CalculateInterference: If we didn’t, TLA+ would have to verify all possible values of
interfered, which might take a while longer.
And this is it, the final spec. We’ve devised a pizza ordering scheme that allows people to change their mind (with possible repercussions), and made sure that we eventually order the correct pizzas. I’ve got millions of ideas on how to extend this spec (Maybe allow people to “make good” their interference? Getting your order to finish might be tricky then…), but we’ll stop here.
One… More… Spec…: Further Reading
If you’re interested in writing TLA+ specs now, there’s some good resources for such a niche topic. First and probably foremost is Hillel Wayne’s Learn TLA, which starts at about the same place we do, but focuses more on PlusCal, which is a language that translates to TLA+, and allows expressing some more abstract concepts such as processes and algorithms.
TLA+ has many more elements than what we’ve covered in this whirlwind tour: More structured data types, externalizing constants into
.cfg files, specs that satisfy other specs, and so on. If Learn TLA doesn’t cover it, you might find it in Lamport’s book about TLA+, Specifying Systems. It’s much heavier than Learn TLA, but it’s information straight from the professor’s mouth.
I hope you’ve learned something from this article. Maybe next time you’re writing a scheduling algorithm or a work queue, reach for TLA+ and try and specify the behavior before putting key to IDE. You might be surprised at what edge cases it uncovers for you.
Fun implementation detail: It‘s always him because he’s the first element of the people set, and TLA+ does the state search that way. ↩