This is a follow-up to our first post on testing.
What makes a test a good (or bad) one? Obviously, each test has a purpose to fulfill. In our last post we talked about some of them: prevent regression, verify correctness, document behavior. As writing tests requires some effort we want to gain as much benefit as possible in return. On the other hand, we don’t want the tests to hinder the evolution of our system. How do we achieve this?
There’s a saying that you should take care of your test code the same way and to the same degree as your production code.
That is true: test code is part of your codebase and comparable in size, so why would you treat your tests like second-class citizens?
For production code we have various sets of rules that try to guide us toward better design like Don’t Repeat Yourself, You Ain’t Gonna Need It, the SOLID principles, Tell, Don’t Ask, aim for high cohesion and low coupling, etc.
Do we have something similar, but specific to the tests?
It turns out that we do.
Kent Beck, (re-)discoverer of Test Driven Development and a proponent of the test && commit || rollback method, created a set of principles that should guide us when writing tests.
They state that each test should be:
- sensitive to changes in behavior
- insensitive to changes in structure
- cheap to write, read, and maintain.
We would like to explicitly add another one here: focused. It can be derived from the principles Kent wrote, but we tend to forget about it too often. Let’s take a closer look at each of those properties.
Tests should be fast. This almost goes without saying: nobody wants to wait for their tests to finish. The longer the tests take, the higher the chance that we as developers get distracted while waiting for them. Each test execution gives us feedback about what we just added or changed in the production code. The sooner we get that feedback, the better. Longer execution times might also lead to executing the tests less often. First, we would be talking about minutes, then about hours, maybe even days. The longer the period between test executions, the more damage we might have done to the production code in the meantime. It also becomes increasingly difficult to find the problem when the tests turn red.
Of course, the bigger our system grows, the more tests we will have, and the longer it will take to execute them all. But we do not always need to run all the tests. We can limit ourselves to those tests that are directly related to the production code we’re working on. If we run the complete suite only every now and then, that should still be OK: we would still catch problems early enough.
One of the worst things that can happen to a test is that it becomes “flaky”. Most of the time it is green, but sometimes it fails for no obvious reason. If we run it again, it will most likely pass. If we have such a test, can we trust it? Does it help us detect undesired situations? Maybe we just learn to ignore it and release the application even if it is red (“Nah, it just fails sometimes, nothing important”). But how will we find out if it started failing for a good reason? Would we react and do something about it? How long would it take for us to notice that the test is not “flaky” anymore, but plainly red?
Also, chances are that such tests will get either deleted or disabled in the testing framework. After all, they’re just annoyances and sometimes block the release (“We’ll have a look at it later and activate it again”). But by doing so you create a hole in your test suite and allow bugs to creep in. You can no longer say “all tests are green, release it!”.
No, we need each test to tell us precisely whether we have a problem in our software or not. We must be 100% sure that whenever at least one test has failed, there’s an issue somewhere and we should fix it immediately. Ignoring it should be out of the question.
When we shift our focus from a single test to the complete suite of tests of one component or even of our overall system, the tests should provide confidence that everything is working as expected. If all the tests are green then it should be fine to release and deploy the system. This requires that we have written all the necessary tests in a way that they effectively prevent any critical bugs. If, after the deployment, it turns out that something is broken, that’s a clear sign that there’s at least one test missing or not effective enough.
Sensitive to changes in behavior
The behavior of a system is what’s important. The end user is not interested in how a feature is implemented. They want to be sure that it behaves as desired. So, our tests should focus on this behavior and ensure that it is correct and stable. If the behavior changes, we have to adapt the tests to reflect the new behavior. On the other hand, if the behavior of a component/system does not change, there should be no reason to change the corresponding tests.
Insensitive to changes in structure
The previous statement sounds quite simple and clear: if the behavior does not change, the test should not change either. But what about structural changes? Let’s say we completely refactor our component, throw away internal classes, and introduce new ones. As long as we do not change the behavior of the component, the tests should pass without being adapted.
This is probably the most common issue with our tests: they’re coupled not only to the behavior of our software but also to its implementation details. But, as stated already, the important thing is that your application solves the business problem. How it solves it is not important. If your tests depend on the “how”, you lose the ability to refactor. By definition, refactoring is changing the structure of your software without affecting its behavior. Tests coupled to the structure will stand in your way: they will fail not only when the solution is wrong, but also when the solution is right but achieved differently. Of course, we can adapt the tests along with the refactoring, but in this case we lose the guarantee that our refactoring really did not change the behavior.
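To illustrate the point, here is a minimal sketch in plain Java (no JUnit or Mockito, so that it stays self-contained). The Subtotal interface and both implementations are hypothetical and not part of the example project. The check looks only at input and output, so it passes unchanged before and after a refactoring from a loop to a stream.

```java
import java.math.BigDecimal;
import java.util.List;

class StructureInsensitiveExample {

    // Hypothetical behavior under test: the subtotal of a list of line prices.
    interface Subtotal {
        BigDecimal of(List<BigDecimal> linePrices);
    }

    // First implementation: an explicit loop.
    static final Subtotal LOOP = prices -> {
        BigDecimal sum = BigDecimal.ZERO;
        for (BigDecimal price : prices) {
            sum = sum.add(price);
        }
        return sum;
    };

    // Refactored implementation: same behavior, different internal structure.
    static final Subtotal STREAM =
            prices -> prices.stream().reduce(BigDecimal.ZERO, BigDecimal::add);

    // The check depends only on input and output, not on how the sum is computed.
    static void subtotalIsSumOfLinePrices(Subtotal subtotal) {
        BigDecimal result = subtotal.of(List.of(new BigDecimal("9.95"), new BigDecimal("22.50")));
        if (result.compareTo(new BigDecimal("32.45")) != 0) {
            throw new AssertionError("expected 32.45 but was " + result);
        }
    }

    public static void main(String[] args) {
        subtotalIsSumOfLinePrices(LOOP);
        subtotalIsSumOfLinePrices(STREAM);
    }
}
```

A check that instead inspected which internal classes or methods were used would have to change with every such refactoring, even though the behavior stays the same.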
Cheap to write, read and maintain
Tests should help us, not slow us down. If a test is difficult to write, it will probably be skipped. Yet we established already that it is a good idea to have automated tests. But if the costs are too high, it might honestly be a good idea to avoid them. You would need to balance this with some manual testing later on. Depending on the context, this could be a better approach. But the actual solution is to write our system in such a way that it is easy to test automatically. We’ll dive deep into that topic some other time.
Tests also need to be easy to read. If a test fails we should immediately see what it is about, find the production code it is testing, and fix the problem. If the tests are not that readable, we will be wasting time trying to understand what they’re testing. In this case, they also won’t serve well as documentation.
Changing them also needs to be easy and fast. It would be bad if any change in the behavior of our application required us to spend more time on tests than on the production code. Again, if using a tool that should be helping us takes more time than it saves, it is of little help.
As mentioned already, each test should be focused: it should be about one thing, and it should be obvious which one. This goes hand in hand with better readability and insensitivity to changes in structure. In the test, we should see only the details important to a certain desired behavior. Everything else should be hidden and abstracted away.
That doesn’t mean that there only has to be a single assert per test. Multiple asserts are perfectly fine as long as they are verifying a single concept. For example, asserting that a function returning a list indeed returned a list that is not null, has a length of 1, and that it contains a particular element sounds to us like a reasonable thing to do.
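In code, such a group of assertions might look like the following sketch. It is plain Java without a test framework, and the articleNames function is a hypothetical stand-in for the function under test.

```java
import java.util.ArrayList;
import java.util.List;

class SingleConceptExample {

    // Hypothetical function under test: returns the names of all articles in a cart.
    static List<String> articleNames(List<String> cart) {
        return new ArrayList<>(cart);
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        List<String> names = articleNames(List.of("keyboard"));

        // Three assertions, one concept: the result contains exactly the added article.
        check(names != null, "result must not be null");
        check(names.size() == 1, "result must contain exactly one element");
        check(names.contains("keyboard"), "result must contain the added article");
    }
}
```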
Although all the points above are true and we should be striving towards them, it is usually not possible to get all of them at the same time. We will have to accept trade-offs. Therefore, we need to decide which of the principles are more and which are less important in the context of our test.
For example, imagine you want to test your overall system as a black box, not caring about its internal structure. To do this you write some end-2-end tests that do involve network calls. Those tests will be harder to write and will need a little longer to finish. They might even fail sometimes because of some glitch in the network or some other application that was not available for a short time.
Even a much more isolated test can take some time when you, for example, are testing some complex algorithm that just requires a few seconds to produce a result.
If it’s important in your context that your complete suite of tests runs within just a couple of seconds, then you will probably have to write more isolated tests focusing on smaller parts of your system. By doing this, you let your tests drill down into the internals of your system, which violates the ‘insensitive to changes in structure’ principle. That’s not necessarily a big problem as long as you are sure that you only rely on those parts of the structure that are very unlikely to change in the future.
As with many of the decisions that we as software developers have to make: it depends. We just need to bear in mind that each sacrifice here will cost us somehow, so each such decision needs to be a conscious one.
Basic structure of a test
We’ve spent quite a while talking about the theory. Time to get our hands dirty and look at some code. Let’s assume we’re working on some webshop application and we’re looking at the following test for one of our central components, the shopping cart:
This example is written in Java using the JUnit 5 test framework as well as the Mockito library for stubbing and verifying the invocation of dependencies. However, the programming language and libraries are not relevant for what we want to explain. We hope that our choice will not prevent anyone from understanding our examples. You can find the complete example project on GitHub.
What do you think about this test?
What does it test?
Well, the class is named ShoppingCartTest and the test method is named add, so it seems to test adding an item to the shopping cart, right?
But what exactly does it test?
That’s somehow hidden within the vast amount of code.
If you had to change something in the shopping cart and this test turned red because of your changes, what would you do? Revert your changes, move your story back to the todo column of your scrum board, go home and hope that tomorrow somebody else will pull it? What if all your teammates did the same before you? Let’s accept the challenge and try to convert it into something more readable.
If you look at the test a little closer you will find out that it seems to follow some structure. Although we tried our best to make it as bad as possible (yes, that’s all we were able to manage) it’s still visible and every test follows it. It’s the AAA structure: arrange, act, assert. You might also know it as given, when, then. We will come to this form a little later.
The first couple of lines create some objects and values. That’s the arrange phase. It initializes the component under test (the ShoppingCart), sets up some desired state on which the test will be based, and creates some test data (here two articles to add).
Next follows the act phase. It triggers the behavior that should be tested, usually by calling the corresponding method of the component. In our case, these are the two add statements. They are somehow split by some other arrange statements, so we actually have an arrange-act-arrange-act-assert structure here.
After the second add statement, a couple of assertEquals statements follow. This is the assert phase of the test. It verifies whether the action triggered in the act phase produced the expected outcomes. Those can be direct, when the function returns a value. They can also be indirect, when the state of the system changes.
For the sake of completeness, there can also be a final annihilate phase, which is used to clean up resources after the test has been executed. In our example we do not initialize anything that has to be cleaned up afterward, so we can skip this phase.
Every test should follow this structure. Sometimes, no explicit arrange phase is needed. Sometimes, act and assert are combined in one line of code. There might even be tests that do not contain any assert because all they expect is that no exception is thrown. But overall, those are the three (or four) different phases that are needed to form a test.
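As a minimal illustration of the three phases, here is a sketch in plain Java without any test framework. The Cart class is a hypothetical, heavily simplified stand-in and not the ShoppingCart discussed in this post.

```java
import java.math.BigDecimal;

class ArrangeActAssertExample {

    // Hypothetical, heavily simplified cart; not the ShoppingCart from the example project.
    static class Cart {
        private BigDecimal subtotal = BigDecimal.ZERO;

        void add(BigDecimal unitPrice, int quantity) {
            subtotal = subtotal.add(unitPrice.multiply(BigDecimal.valueOf(quantity)));
        }

        BigDecimal subtotal() {
            return subtotal;
        }
    }

    public static void main(String[] args) {
        // arrange: create the component under test and the test data
        Cart cart = new Cart();
        BigDecimal unitPrice = new BigDecimal("7.5");

        // act: trigger the behavior that should be tested
        cart.add(unitPrice, 3);

        // assert: verify the (indirect) outcome, here a state change of the cart
        if (cart.subtotal().compareTo(new BigDecimal("22.5")) != 0) {
            throw new AssertionError("expected 22.5 but was " + cart.subtotal());
        }
    }
}
```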
Let’s take a first step to improve our test by moving some lines of code to clearly separate those three phases and make them immediately visible. We can also improve the order of the arrange statements by grouping corresponding statements.
It’s already much better, isn’t it? Just by adding a few empty lines and comments we improved the readability and are now able to understand much better (and faster) what this test does. To make it even more explicit, we also adjusted the name of the test (method) to clearly state what it tests. But we can do even more.
Hide irrelevant details
One thing that disturbs us is that this function is quite long, more than 30 lines of code. Reading and understanding it will take time. Another problem is that the implementation mixes core aspects of the test with necessary but uninteresting stuff. It will take even more time to tell one from the other. It might even be impossible to do. How should you tell if a particular line of code adds an important detail to the test or is just required by a component you need?
The first thing we can do to reduce the size of the method and thereby improve readability is to pull the initialization and wiring of the ShoppingCart and its dependencies up as a generic fixture of the test class. Chances are that there will be more test methods in this class that require the same fixture. Having the dependencies defined as class members also has another advantage which we will see in a moment.
In our example, we use the Mockito JUnit 5 Extension to be able to create and inject mocks via some annotations. That’s just a kind of syntactic sugar and not everybody likes it. Initializing the mocks explicitly is also fine.
The next thing that catches our eye is the block of verify statements. That’s the Mockito way of verifying that a certain method of a mock was called. So, these statements test the implementation (the internal structure) of our ShoppingCart, and nothing else. As explained before, we don’t want to do this. We can just remove those statements. However, we still need to stub the desired responses of those mocks in the arrange phase to run our test. So, it still relies on the internal structure. We will come back to this point later.
Having done this, let’s look at the different groups of arrange statements, one by one, to see what we can improve here.
The first group creates our two articles to add. We mock them, so there’s not much we can simplify. We could also initialize them at class level, just as we did with the dependencies. But that would not make it much better, so we keep it as it is for the moment.
The second group sets up a desired state of one of the dependencies: the ShoppingCart seems to check the availability of the articles before it adds them. The test assumes that both articles are available, otherwise it would expect an InsufficientUnitsInStockException to be thrown, as the signature of the test method indicates. In the act phase, we add one unit of article 1 and three units of article 2. It should also be fine if more units are available. So, the exact number of available articles is not relevant, at least not for this test. We can hide those irrelevant details and move them into a helper method.
The helper method not only hides the concrete number of units available, it also hides the technical details of how to configure the Mockito mock. By having the mocks as class members we can directly access them without passing them as arguments.
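A helper of this kind could look roughly like the following sketch. It uses a hand-written stub instead of Mockito to stay self-contained. The Stock interface, its availableUnits method, and the helper name givenArticleIsAvailable are assumptions made for illustration, not the names used in the example project.

```java
import java.util.HashMap;
import java.util.Map;

class AvailabilityHelperExample {

    // Hypothetical dependency of the cart; name and method are assumptions.
    interface Stock {
        int availableUnits(String articleId);
    }

    // Hand-written stub standing in for a mock created by a mocking library.
    static class StockStub implements Stock {
        private final Map<String, Integer> units = new HashMap<>();

        void set(String articleId, int available) {
            units.put(articleId, available);
        }

        @Override
        public int availableUnits(String articleId) {
            // Articles that were never configured count as out of stock.
            return units.getOrDefault(articleId, 0);
        }
    }

    // Shared fixture, comparable to the mocks held as class members.
    static final StockStub stock = new StockStub();

    // The test only states that the article is available, not how many units exist
    // or how the stub has to be configured.
    static void givenArticleIsAvailable(String articleId) {
        stock.set(articleId, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        givenArticleIsAvailable("article-1");

        if (stock.availableUnits("article-1") < 3) {
            throw new AssertionError("article-1 should be available in any quantity");
        }
    }
}
```

Because the stub is a class member, the helper needs no extra arguments, which keeps the call in the test down to a single readable line.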
As already said, the test does not expect an InsufficientUnitsInStockException to be thrown. Probably, there are some other tests that verify the behavior of the ShoppingCart if we try to add unavailable articles. In our test, the concrete exception type is completely irrelevant. By changing the test method signature to throw a generic Exception we can also hide this irrelevant detail that might distract from the actual test case. This also removes an (unnecessary) binding to an implementation detail which might change in the future.
The next two groups of arrange statements seem to depend on each other. The first group/statement configures the CurrentUser to return the customer status GOLD. The next statements configure the PriceCalculator to return the values 9.95 and 7.5 as calculated prices for the two articles if the customer status is GOLD. It looks like the concrete customer status is also not relevant for the test. All that the test depends on is that the PriceCalculator returns the defined prices, whatever customer status the current user has.
We can extract another helper method to set up the prices for each article.
The helper method uses the Mockito argument matcher any(CustomerStatus.class) to represent the ‘whatever customer status’ part. It also hides the technical details of creating the BigDecimal instance that is returned, so the test itself can be more concise and readable.
Now, setting up the CurrentUser to return a customer status is no longer relevant for our test. However, it still has to be done to make the test run (to be honest, we could configure the PriceCalculator mock to accept a nullable(CustomerStatus.class) and everything would be fine, but that would be cheating).
We can add a fixture method annotated with @BeforeEach to our test case and let JUnit ensure that this setup is executed before every single test in this class.
We use the status REGULAR instead of GOLD because it’s not relevant which status we use, and the special status GOLD would suggest the opposite.
The last group of arrange statements configures the ShippingCalculator to return the shipping amount of 3.5 if the subtotal amount is 9.95 or 32.45. Why those two amounts? Well, the first one is the price of the first article. We add this article with a quantity of one, so that should also be the subtotal amount of the ShoppingCart at this point in time. Knowing this, the second amount has to be the (expected) subtotal amount of the ShoppingCart after adding the second article with a quantity of three.
It turns out that only the second amount is relevant to the test. If the ShippingCalculator returned any other amount after adding the first article, nobody would care. At least not our test. It only verifies the shipping amount after adding both articles. But removing the first statement leads to a nasty NullPointerException because the ShoppingCart expects some BigDecimal value after adding the first article.
All in all, this seems to be very similar to the arrange statements we already improved. Hiding the technical details in another helper method seems to be a good idea. In this case, the concrete value of 3.5 is relevant for the test because it is expected in the asserts. So we define the value in the test and pass it to the helper method.
The helper method does not care about the concrete subtotal amount. It uses the any(BigDecimal.class) matcher and configures the ShippingCalculator to return the given shipping amount for whatever subtotal amount is provided.
After all these refactorings our test method has improved even more:
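As a rough sketch of the shape such a test can take, the following plain-Java example replaces the Mockito mocks with simple hand-written stand-ins. The Cart class and the helper names givenPrice and givenShippingAmount are hypothetical and not taken from the example project; only the amounts come from the text.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

class RefactoredCartTestExample {

    // Hypothetical, simplified cart: prices and shipping come from injected collaborators.
    static class Cart {
        private final Map<String, BigDecimal> prices;
        private final UnaryOperator<BigDecimal> shipping;
        private BigDecimal subtotal = BigDecimal.ZERO;

        Cart(Map<String, BigDecimal> prices, UnaryOperator<BigDecimal> shipping) {
            this.prices = prices;
            this.shipping = shipping;
        }

        void add(String article, int quantity) {
            subtotal = subtotal.add(prices.get(article).multiply(BigDecimal.valueOf(quantity)));
        }

        BigDecimal subtotal() { return subtotal; }
        BigDecimal shippingAmount() { return shipping.apply(subtotal); }
        BigDecimal total() { return subtotal.add(shippingAmount()); }
    }

    // Shared fixture, comparable to the mocks held as class members.
    static final Map<String, BigDecimal> prices = new HashMap<>();
    static UnaryOperator<BigDecimal> shipping = subtotal -> BigDecimal.ZERO;

    // Helper: hides how the price lookup is configured and how the BigDecimal is created.
    static void givenPrice(String article, String price) {
        prices.put(article, new BigDecimal(price));
    }

    // Helper: ignores the subtotal it receives, similar in spirit to an "any" matcher.
    static void givenShippingAmount(String amount) {
        BigDecimal fixed = new BigDecimal(amount);
        shipping = subtotal -> fixed;
    }

    static void assertAmount(String expected, BigDecimal actual) {
        if (actual.compareTo(new BigDecimal(expected)) != 0) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }

    public static void main(String[] args) {
        // arrange
        givenPrice("article-1", "9.95");
        givenPrice("article-2", "7.5");
        givenShippingAmount("3.5");
        Cart cart = new Cart(prices, shipping);

        // act
        cart.add("article-1", 1);
        cart.add("article-2", 3);

        // assert
        assertAmount("32.45", cart.subtotal());
        assertAmount("3.5", cart.shippingAmount());
        assertAmount("35.95", cart.total());
    }
}
```

Only the values that matter for the expected amounts remain visible in the test body; everything else lives in the fixture and the helpers.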
Now, what about setting up all the mocks? We moved all those statements to helper methods, but they still exist and make our test sensitive to structural changes, right? Is there a way to fix this? The answer is quite obvious: don’t mock those internal components, use their real implementations instead. Of course, that would mean that you would test those components as well and the unit under test becomes bigger. This often leads to a more complex arrange phase because those components might depend on others as well, and so on. Implementing and executing those tests might take more time because of the more complex setup. So, yes, you can do it, but this comes with some costs. As we said before, it’s a trade-off between the different principles.
By extracting the setup of those mocks into helper methods, we already took one step towards reducing the coupling of the tests to the implementation. There will usually be a couple of tests that use the same helper methods, so whenever we change the internal structure and break one of these methods, we only have to fix it in one place, and not in every single test. We could go one step further and bundle all the helper methods to set up one component into a facade that can be used by all test classes that need to mock this component. However, as long as we want to test components in an isolated way, we will not be able to completely remove the coupling. We should therefore think twice about how small (or big) the units that we want to test should be. We will dive deeper into this topic in one of our next posts.
Make it a Specification
Compared to where we started with our test we could already be quite satisfied with what we achieved. But, there’s almost always a way to make it even better.
In our last post, we said that tests can also serve as documentation or a specification of the system under test. We also noted that we think it’s possible to write those tests in a way that even non-developers, like the stakeholders of our system, would be able to read and understand them. Let us show you what we meant.
First, let’s have a second look at the name of the test. We already changed it to addTwoArticles, but that mainly describes the act phase and you still have to look at the implementation to see what exactly it does. A stakeholder is probably also (maybe even more) interested in the expected outcome (covered by the assert phase). The outcome might be different depending on the concrete scenario that is tested (mainly defined by the arrange phase).
In a specification you would probably find a sentence like the following to describe the tested behavior:
If two articles with different prices and quantities are added to an empty shopping cart, then the subtotal amount and total amount of the shopping cart are calculated.
That’s quite long and does not even cover all that is tested (what about the availability or the shipping amount?). This might be a sign that our test covers too much behavior and we should split it into two or even more separate tests. In our example, working with cheap and fast mocks, that would probably be a good idea. In the case of an end-2-end test, where setup and execution often are much more expensive, it could have been a conscious decision to bundle all this in one test. Anyway, we can at least make the test name a little more concrete:
Named like this, it refers to our unit under test, the ShoppingCart, already specified by the name of the test class, and describes one concrete part of its behavior (calculating the amounts) in a given situation (adding two articles). This style of naming test methods originates from the behavior-driven development (BDD) methodology introduced by Dan North. This methodology focuses on describing and verifying the behavior (not the implementation) of a system, using a kind of ubiquitous language that is shared by business and technical people.
Above, we use the snake_case notation because we find it more readable. However, not every developer likes underscores in method names. Other forms (e.g. CamelCase) are also fine, as is using comments or other language features (in languages like Kotlin or Groovy you can have method names that contain spaces, just to mention it ;) ). Naming tests does not differ much from naming anything else in software development: it’s a field of its own. We will come back to this topic in another post.
What we could also adopt from BDD is another terminology for the structural phases of a test. Instead of arrange/act/assert, BDD uses the terms given/when/then, which allow formulating tests (specifications) in a more natural language. In our example, this could be:
given an article with price 9.95, available in stock
and an article with price 7.5, available in stock
and a shipping amount of 3.5
when I add the first article with quantity 1
and I add the second article with quantity 3
then the subtotal amount of the shopping cart is 32.45
and the shipping amount of the shopping cart is 3.5
and the total amount of the shopping cart is 35.95
That reads pretty well, doesn’t it? But we still have to use programming syntax to implement our test, so how does this help? One option would be to use BDD tools like JBehave or Cucumber to map this plain text specification to the different parts of the test implementation that can be executed. But that’s not the only way we can go. We can get quite close to this format by developing a kind of DSL using techniques like a fluent API and the builder pattern. We don’t want to go too much into the details in this post, but here’s what our test could also look like:
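A heavily simplified sketch of such a fluent API, in plain Java and without any assertion library, might look like this. All names (Scenario, anArticle, iAdd, and so on) are hypothetical, and this is not the DSL from the example project. To keep the sketch short, the Scenario class also does the arithmetic itself; a real DSL would drive the actual ShoppingCart and its collaborators instead.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

class CartSpecificationExample {

    // Hypothetical fluent test DSL; every method returns the scenario to allow chaining.
    static class Scenario {
        private final Map<String, BigDecimal> prices = new HashMap<>();
        private BigDecimal shipping = BigDecimal.ZERO;
        private BigDecimal subtotal = BigDecimal.ZERO;

        static Scenario given() { return new Scenario(); }

        Scenario anArticle(String name, String price) {
            prices.put(name, new BigDecimal(price));
            return this;
        }

        Scenario aShippingAmountOf(String amount) {
            shipping = new BigDecimal(amount);
            return this;
        }

        // when() and then() only mark the phases to make the chain read like a sentence.
        Scenario when() { return this; }

        Scenario iAdd(String name, int quantity) {
            subtotal = subtotal.add(prices.get(name).multiply(BigDecimal.valueOf(quantity)));
            return this;
        }

        Scenario then() { return this; }

        Scenario subtotalIs(String expected) { return check("subtotal", expected, subtotal); }
        Scenario shippingAmountIs(String expected) { return check("shipping amount", expected, shipping); }
        Scenario totalIs(String expected) { return check("total", expected, subtotal.add(shipping)); }

        private Scenario check(String what, String expected, BigDecimal actual) {
            if (actual.compareTo(new BigDecimal(expected)) != 0) {
                throw new AssertionError(what + ": expected " + expected + " but was " + actual);
            }
            return this;
        }
    }

    public static void main(String[] args) {
        Scenario.given()
                .anArticle("first", "9.95")
                .anArticle("second", "7.5")
                .aShippingAmountOf("3.5")
            .when()
                .iAdd("first", 1)
                .iAdd("second", 3)
            .then()
                .subtotalIs("32.45")
                .shippingAmountIs("3.5")
                .totalIs("35.95");
    }
}
```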
Written like this, even non-technical people should be able to understand what happens. As you can see, we can do almost the same in the then-block by adding some custom asserts (we use AssertJ in this example). If you’re interested in the implementation of this DSL, you will also find it in the complete sources on GitHub.
Just to make it clear: we don’t say that every test has to be written that way to make it ‘good’. We just wanted to show how it could be written to make it clear and understandable. It might look like doing this for each test would be a big overhead. Our experience shows that, after some trials (and of course also errors), the evolving project-specific test framework helps in writing new tests very easily and quickly. In non-trivial projects, the effort usually pays off after some time. But, as usual, think about what makes sense in your context. We would like to encourage you to try it yourself and share your experience.
In this post, we started by mentioning principles that should help us create good tests, like being fast or insensitive to structural changes. We briefly discussed why those principles are important and how they help the overall software.
After that, we presented a way of structuring the tests so that they become much more readable and understandable. Such tests clearly say what they’re about and hide all the things that are not relevant. They’re also much shorter so working with them takes much less time. With some extra care, they can become so good and clear that even non-technical people can read them.
That’s not everything we can do to improve tests. Some topics we merely mentioned but didn’t provide many details. For example, creating test data deserves a separate post, and we will probably find time to write one soon.
Yet, in our next post we will first tackle a much more difficult and important topic: test granularity, that is, how small or big the part of the system that we want to test should be. Until next time!
Header photo by Robina Weermeijer on Unsplash