Over the years I have seen many people struggle with writing useful and readable test cases. I have also seen a lot of existing test code that bears more resemblance to a bowl of spaghetti than it helps to ensure software quality. While some say that having any automated tests is better than having none, well-structured test code is vital to achieving most of the benefits claimed by the TDD community. As structuring tests seems to be a real problem for many, this post collects some personal advice on how to create a clean test code base that helps beyond technically checking program correctness.

The benefits of well-structured tests

Why does the structure of test code actually matter? Why should one bother with achieving clean test cases if a convoluted test function in the end verifies the same aspects of the code? There are (at least) two good reasons that explain why investing work in the structure of test cases is important.

Tests are a means of communication

Although reliably verifying that code performs and continues to perform the intended function is probably the primary reason for writing automated tests, well-structured tests can serve more purposes for developers, most of them boiling down to communication. The controversial Uncle Bob Martin has coined a famous quote in this regard [Martin2009]:

Indeed, the ratio of time spent reading versus writing is well over 10:1. We are constantly reading old code as part of the effort to write new code. …​ so making it easy to read makes it easier to write.

(Unit) tests play an important role in guiding a reading developer through a code base. If written with care, they tell stories about the requirements of the system under test, its architecture, and noteworthy corner cases to consider. The test code gives a second perspective on the system and frees readers from analyzing the actual implementation to understand its behavior and requirements. This is only possible if the author of the test cases intended the tests to convey this meaning and to fulfill the goal of guiding others through the code. To my understanding, communicating through test cases is one of the most important aspects of creating useful test code. Similar to code comments, there is always more knowledge about a system than its direct implementation tells, and tests play an important role in communicating this additional information. Of course, as already noted by Eric Evans [Evans2004], any code should always do its best to tell the correct story:

The assertions in a test are rigorous, but the story told by variable names and the organization of the code is not. Good programming style keeps this connection as direct as possible, but it is still an exercise in self-discipline. It takes fastidiousness to write code that doesn’t just do the right thing but also says the right thing.

Efficiently communicating requirements and telling the parts of the system's story that are not or cannot be conveyed by the production code is one of the most important aspects of tests, apart from their function as automated verification.

Tests as a tool for debugging

Another useful benefit of well-written tests is that they can help find the source of bugs, especially the ones introduced during refactorings or changes to existing code. A badly written test case that fails cannot provide more information than "this huge mess of code doesn’t work anymore after your changes". With nicely arranged single-purpose test cases, however, the message of a failing test case is much more granular and helps the developer by saying "most things work, only this special case isn’t handled properly anymore after your refactoring". Such a message makes it much easier to correct the introduced bug and usually avoids a lot of additional debugging work.

Preconditions for good tests

Although this might sound surprising, the most important techniques for achieving well-structured tests are not directly related to the test code. Instead, the key to being able to write good tests at all lies in the structure of the tested code. Highly coupled code, mixing multiple concerns, is as hard to test as it is hard to comprehend or to maintain. Applying well-known design patterns and techniques to the tested code base is therefore the primary means of enabling good tests. In this sense, I can very much confirm the well-known claim of the TDD community that TDD leads to a better software architecture. However, this only works if some basics of good design are known. While software architecture and code structure fill books and decades of discussions, and a comprehensive treatment is out of scope here, I will highlight two techniques that are of crucial importance for being able to test at all. Surprisingly, most miscarried testing attempts I have seen in the past were the result of ignoring these basic software engineering principles: abstraction and dependency injection.

Abstraction

A key principle of software engineering is being able to reason about a specific problem without being distracted by lower-level programming details. If I implement a complex algorithm for allocating parcels to carriers, I don’t want to think about network transport errors while finding out the capacity of each carrier in the course of this algorithm. Therefore, a common technique is to abstract away the ugly and unimportant details behind a class or interface. Introducing an appropriate abstraction for "find out the carrier capacity" lifts reasoning in the capacity planning algorithm to a single level of …​ abstraction and removes the lower-level networking details from the reasoning process. By following this route, I am – in turn – able to write tests that focus on a single level of abstraction without mixing business problems with infrastructure concerns. Moreover, I can test business (capacity planning) and infrastructure (network communication) concerns in distinct tests. This makes both the implementations and the tests easier to formulate and to understand.

Apart from simplifying reasoning, abstraction brings the most value if it is technically realized using interface thinking, as already described in 1995 in the classic Gang of Four book [GammaEtAl1995] with their first principle of reusable object-oriented design:

Program to an interface, not an implementation.

That means our abstraction for "find out the carrier capacity" is an interface resembling the level of reasoning required for the capacity planning algorithm and the actual realization with the ugly details of network programming will then be an implementation of this interface:

// The abstraction at the level of the algorithm.
// We are only concerned about requesting capacities for carriers.
public interface CapacityRequestStrategy {
    public int getCapacity(String carrierId);
}

// One implementation of the abstraction
public class NetworkedCapacityRequestStrategy implements CapacityRequestStrategy {
    public int getCapacity(String carrierId) {
        // all the ugly HTTP details here
    }
}

// The algorithm implementation is free from HTTP details
public class CapacityPlanner {
    private CapacityRequestStrategy requestStrategy = new NetworkedCapacityRequestStrategy();

    public void planParcels() {
        while (parcelsRemain) {
            for (String carrier : availableCarriers) {
                // No networking details here!
                final int capacity = requestStrategy.getCapacity(carrier);
                // do some magic to distribute parcels
                // ...
            }
        }
    }
}

Introducing such abstractions avoids having to deal with multiple concerns in the same unit under test and therefore also makes it possible to formulate test cases reflecting the different concerns. Consequently, tests become easier, because concerns are separated and each test only deals with a single concern and fewer cases to consider.

Dependency injection

Given the code shown above, one might ask how to actually write a test for the capacity planning algorithm that doesn’t mix abstraction levels. CapacityPlanner still depends on the concrete network-based CapacityRequestStrategy implementation. We would have to employ fancy things like stubbing the HTTP API used to determine capacities to actually test this class, thereby again mixing abstraction levels. Yuck…​ Fortunately, the cure for this issue is pretty simple: dependency injection. Instead of directly instantiating the NetworkedCapacityRequestStrategy inside the CapacityPlanner, let someone else provide an appropriate instance to the planner by passing it to the planner’s constructor:

public class CapacityPlanner {
    private CapacityRequestStrategy requestStrategy;

    public CapacityPlanner(CapacityRequestStrategy requestStrategy) {
        this.requestStrategy = requestStrategy;
    }

    public void planParcels() {
        // ...
    }
}

Enabling dependency injection on a tested unit opens up the opportunity to install a test double [Fowler2006] inside the automated tests that never has to deal with the complexity of networking:

// This is a test double for the production CapacityRequestStrategy
class ConstantCapacityStrategy implements CapacityRequestStrategy {
    private int capacity;

    public ConstantCapacityStrategy(int capacity) {
        this.capacity = capacity;
    }

    public int getCapacity(String carrierId) {
        return this.capacity;
    }
}

// This test can now be written without having to think about HTTP
class CapacityPlannerTest {
    @Test
    public void rejectsParcelsIfNoCapacityRemains() {
        CapacityPlanner planner = new CapacityPlanner(new ConstantCapacityStrategy(0));

        assertThrows(
            IllegalStateException.class, // hypothetical exception type, elided in the original
            () -> planner.planParcels());
    }
}

Installing test doubles is a real pain without dependency injection, because mocking would be necessary, which is pretty error-prone and usually depends on low-level programming language constructs, thereby bloating test cases with technical details. Moreover, without an appropriate abstraction, the installed test double would again leak details into the algorithm discussions. If an HttpClient were injected instead of an abstraction, I could still install a double, but providing the appropriate behavior in the double would violate the abstraction level suitable for the test and the implementation by resorting to networking again.

Now that the preconditions for being able to write useful tests are met, I can outline my recommendations on how to write the tests themselves.

Guidelines for writing test cases

Apart from the fact that writing good tests is much easier if the code is structured well, the following sections describe a few guidelines that I would recommend following when actually writing the test cases.

Verify one aspect per test case

Often, one can find test cases that look like this:

# floating point equality should be handled properly in real code
def test_it_works():
    assert divide(1.0, 1.0) == 1.0
    assert divide(2.0, 1.0) == 2.0
    assert divide(1.0, 2.0) == 0.5
    with pytest.raises(ValueError):
        divide(1.0, 0.0)

This test fails as a debugging aid, because the only feedback I get when something is broken is that…​ something is broken. The feedback at the level of failing test cases isn’t more specific, because there is only one test case that either fails or succeeds as a whole. Wouldn’t it be much better if the test execution directly reported that everything works apart from handling division by zero?

To avoid this trap it is much better to write one test case (function) per tested condition. A much better version with the same assertions could look like:

def test_same_numerator_and_denominator_is_one():
    assert divide(1.0, 1.0) == 1.0

def test_numerator_higher_than_denominator_is_above_one():
    assert divide(2.0, 1.0) == 2.0

def test_numerator_lower_than_denominator_is_below_one():
    assert divide(1.0, 2.0) == 0.5

def test_division_by_zero_is_rejected():
    with pytest.raises(ValueError):
        assert divide(1.0, 0.0)

Now, in case my implementation of divide does the math correctly but I just messed up the exception handling, the test results immediately tell this story and I know where to start debugging:

test_same_numerator_and_denominator_is_one PASSED                [ 25%]
test_numerator_higher_than_denominator_is_above_one PASSED       [ 50%]
test_numerator_lower_than_denominator_is_below_one PASSED        [ 75%]
test_division_by_zero_is_rejected FAILED                         [100%]

Besides providing valuable debugging aids, building test cases per tested aspect also helps to communicate the requirements on the tested code effectively. I can now use the test case names to communicate what I require from my code to fulfill its technical and, more importantly, business value. Without such test cases, these requirements are often only implicitly represented in the code base. The test code therefore provides additional documentation and explanations that would otherwise be missing, but unlike comments or external documentation, it cannot silently become outdated.

So, in case some of the following conditions match a test function or method, the test case should probably be split:

  • Many assertions on different properties exist.
  • It’s hard to describe what is tested in a short sentence.
  • Assertions are validated conditionally (if foo: assert …​). This is a telltale sign that multiple requirements are tested in a single test case.

Use test case names to express readable requirements

As a follow up on the previous rule of using test cases to verify individual requirements, another important aspect for the effective communication of requirements is that they are actually readable as natural language from the code. Humans are much better at understanding natural language than they are at reading highly abbreviated code.

Imagine the division by zero example from above were tested like this:

def test_exception():
    with pytest.raises(ValueError):
        assert divide(1.0, 0.0)

If this test fails, would you know which requirement is currently unmet in case the test report shows the following?

test_exception FAILED                                            [100%]

Probably not.

Therefore, use test case names to effectively communicate the imposed requirement on the system under test with natural language. A good test case name includes a verb and can be read as a sentence clearly expressing the verified requirement such as in:

def test_division_by_zero_is_rejected():
    # ...

Some languages such as Kotlin allow making this even more readable by supporting (close to) arbitrary characters in method names:

fun `division by zero is rejected`() {
    // ...
}

Especially when testing the actual business logic of a system, such a way of naming is of real value, because business experts or the product owner can then understand whether the developers have realized the correct requirements by browsing through the test case names (guided by the developers). In case a requirement was forgotten during implementation, this should become apparent to business experts, because they can spot the gap in the natural language specification of what is tested. Therefore, merely by naming test cases correctly, a large step towards acceptance testing with value for non-technical stakeholders can be taken. Of course, the next developer of the system – or your future self – will also value expressive test cases that explain the system and its requirements clearly.

Test your business, not someone else’s

From time to time I’ve stumbled across a pretty interesting testing pattern. Instead of verifying their own requirements, tests were written against framework and language features. For instance, one case in Python looked close to this one:

def my_function(a_param: str) -> int:
    if not isinstance(a_param, str):
        raise ValueError("unsupported type")
    return 42

@pytest.mark.parametrize(
    "param_value",
    [None, 42, 0.0, re.compile(r'.*'), ...])
def test_my_function_rejects_other_types(param_value) -> None:
    with pytest.raises(ValueError):
        my_function(param_value)
The affected project had decided to use Python type hints, and mypy was used strictly. Therefore, all tooling was set up to write Python close to what a statically typed language would look and feel like. Yet, the test authors somehow repeated in unit tests a lot of what the tooling was already enforcing. Apart from the question of where to stop in this specific case (there are infinitely many more types to test here), this approach largely increases the amount of test code to maintain and the test runtime while creating close to no benefit. You should generally trust the tools you select to an appropriate level, or you probably shouldn’t use them. Doing someone else’s work by verifying their implementation shifts a large burden onto your own code base that will eventually result in more maintenance work without ever ensuring that your code actually continues to meet its own requirements.

One can say that the previous example at least exercises your own code and verifies that a single conditional works as expected. Even worse are situations like the following one.

class TestMyNetworkedServiceAdapter:
    def test_requests_works(self):
        with pytest.raises(ConnectionError):
            # body elided in the original; presumably a direct call
            # into the requests library, exercising only third-party code
            ...
Such experiments – most likely used to understand the functioning of an upstream library – remain in the test code surprisingly often. This is really nothing more than code bloat and doesn’t help your own project at all. So just avoid this.

Of course, there are times when some bug or peculiarity of a used library is causing trouble for your code. But you have probably noticed this because of some missed requirement for your own code. Therefore, whenever possible, try to find a test case that reproduces issues with used tools through special cases of calling your own code. That way these special case tests contribute to the set of requirements imposed by your tests on the system under test and they remain valid even if you later decide to completely drop the buggy dependency. Not having to change test code alongside production code changes is always a good thing and reduces necessary work during refactorings.
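To make this concrete with a small sketch (all names here are hypothetical and not from the original example): suppose an upstream parsing library misbehaves on empty input. Rather than pinning the library’s behavior in a dedicated test, the requirement can be formulated against your own wrapper, so the test remains valid even if the dependency is later replaced:

```python
def load_carrier_ids(raw: str) -> list[str]:
    """Our own code, wrapping a (hypothetically buggy) upstream parser."""
    if not raw.strip():
        # Guard against the upstream quirk without exposing it to callers.
        return []
    return [part.strip() for part in raw.split(",")]


def test_an_empty_configuration_yields_no_carriers():
    # Encodes our requirement, not the library's behavior.
    assert load_carrier_ids("") == []
    assert load_carrier_ids("   ") == []
```

Even after dropping the buggy dependency, this test still documents and enforces a sensible requirement of our own code.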

Clarify requirements with syntax and features, don’t dilute them

Some test frameworks provide versatile features for writing tests in concise ways. Especially pytest has accumulated an enormous ecosystem of plugins for solving various (repetitive) tasks through syntactic sugar. The general recommendation with fancy tooling is to use it as a means of improving the documentation quality of the test code, not for the sake of applying every plugin that is available. For example, despite using the common feature of parametrizing test cases to reduce the amount of test code, the following test function still weakens the documentation capabilities of the tests:

@pytest.mark.parametrize(
    ["numerator", "denominator", "expected_value", "expect_exception"],
    [
        (1.0, 1.0, 1.0, False),
        (2.0, 1.0, 2.0, False),
        (1.0, 2.0, 0.5, False),
        (1.0, 0.0, 0.0, True),
    ])
def test_divide(numerator, denominator, expected_value, expect_exception):
    if expect_exception:
        with pytest.raises(ValueError):
            divide(numerator, denominator)
    else:
        assert divide(numerator, denominator) == expected_value

While maintaining less code is always valuable, less code with higher complexity and lower self-documentation abilities is probably not worth achieving. This version of the tests is still better than the initial test_it_works version, because the test runner reports success and failure for each parameter combination individually and debugging is easier, but the requirements are diluted. Therefore, use such fancy features for the purpose of making the requirement descriptions stronger, not weaker.

A good example where parametrization is beneficial is to increase coverage within a single requirement:

@pytest.mark.parametrize(
    ["numerator", "denominator"],
    [
        pytest.param(1.0, 1.0, id="basic case"),
        pytest.param(2.0, 2.0, id="ne 1.0 works"),
        pytest.param(-1.0, -1.0, id="both values negative"),
        pytest.param(1.5, 1.5, id="fractional"),
    ])
def test_same_numerator_and_denominator_is_one(numerator, denominator):
    assert divide(numerator, denominator) == 1.0

By using named parameters we can even increase the ability of the test cases to explain their exact requirements.

What to test and what not to test

While the aforementioned guidelines mostly focused on the design of individual test cases, there is also the question of which tests to write and what the targets of these tests should be. Tests are an extremely valuable tool that greatly helps in the development process, and not writing tests at all is rarely a good option for any software project. Yet, test code is as important as the actual production code, and every test that is written adds to the size of the code base and increases maintenance effort. Fortunately, implementing software is a creative process with a lot of freedom, and whenever we want to realize a new requirement or fix a bug, we can decide where and how to test it, weighing the benefits of tests against the drawbacks of adding more code. In this regard, [PercivalGregory2020] provide an interesting perspective on this problem:

Every line of code that we put in a test is like a blob of glue, holding the system in a particular shape. The more low-level tests we have, the harder it will be to change things.

Based on this idea they propose to favor higher-level unit tests when possible, and to drop down to testing individual units for specific problems. They also call this "testing in high and low gear". In the end, the test code will be a larger collection of loosely coupled tests using higher-level abstractions such as DDD application services, enhanced with a set of individual unit tests for covering and gaining confidence in complex or tricky cases. The low-level tests are highly coupled to the implementation code and therefore prone to changes alongside refactorings, but they will be relatively few. On the other end, the higher-level tests are also more likely to contribute to the aforementioned aspect of documenting business requirements ([PercivalGregory2020], p. 74). The proper functioning of a getter method is close to irrelevant to the business perspective, but whether I can pay in money and then request the final amount (indirectly through that getter) is a lot more relevant.

This perspective has a close relationship to the distinction between solitary and sociable unit tests [Fowler2014]. Testing on the higher level, including (parts of) the object graph below the high-level entrypoint, will result in a sociable unit test. The tested unit will include (most) parts of its production object graph and socially interacts with these real objects instead of being in solitude created via test doubles. Testing on high gear therefore prefers sociable unit tests.

What would testing on the high level look like? Here’s a simplified example code base:

class Currency:
    def __init__(self, euros: int, cents: int) -> None:
        ...  # assign members

    def subtract(self, value: "Currency") -> "Currency":
        ...  # the actual implementation goes here

class BankAccount:
    def __init__(self, balance: Currency) -> None:
        self.balance = balance

    def pay_out(self, desired: Currency) -> None:
        new_balance = self.balance.subtract(desired)
        if new_balance.is_negative():
            raise ValueError()
        self.balance = new_balance

Testing in low gear would mean:

class TestCurrency:
    def test_subtract_works(self) -> None:
        assert Currency(3, 0).subtract(Currency(2, 50)) == Currency(0, 50)

Using an explicit subtract method is cumbersome. Python provides operator overloading, and I could easily implement addition and subtraction operators for the Currency class. But my test for Currency is highly coupled to the specifics of how this class is implemented and would need changes to reflect the new operators during this refactoring, thereby adding work to the refactoring task.
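For illustration, the operator-based refactoring might look like the following sketch (the internal representation as a total in cents is my assumption; the original elides the member layout):

```python
class Currency:
    def __init__(self, euros: int, cents: int) -> None:
        # Assumed representation: a single total in cents.
        self.total_cents = euros * 100 + cents

    def __sub__(self, other: "Currency") -> "Currency":
        diff = self.total_cents - other.total_cents
        # Floor division and modulo keep euros/cents consistent,
        # also for negative results.
        return Currency(diff // 100, diff % 100)

    def is_negative(self) -> bool:
        return self.total_cents < 0

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Currency) and self.total_cents == other.total_cents
```

The low-gear TestCurrency from above would now break, because subtract no longer exists, even though the behavior is still correct.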

The high gear approach instead would look like this:

class TestBankAccount:
    def test_paying_out_works_if_balance_is_high_enough(self) -> None:
        account = BankAccount(Currency(3, 0))

        account.pay_out(Currency(2, 50))

        assert account.balance == Currency(0, 50)

In high gear I have simply left out the detailed test for the Currency class, assuming that its proper functionality can be assured when my higher-level tests succeed. Code coverage is a good hint to find out if high-level tests cover my lower-level units sufficiently or not. This high-gear test is not coupled to the exact set of methods available on the Currency class. Refactoring subtract to an operator would be invisible to this test and it could simply remain as is, causing less work during the refactoring.

The downside of this approach is that when something fails, I lose the exact feedback where the failure comes from. Fortunately, testing as performed by developers is not a form of black-box testing. I can try to judge from the underlying code structure and complexity if detailed feedback will be required or not. If things are complex, add some low-gear tests to get targeted requirements and debugging feedback. If the internal structure of a higher-level unit is quite simple and high-level testing does not hinder development and debugging much, then avoid adding low-level tests that do not add much benefit.

An important note on this procedure is that it mainly applies to testing the business logic of your code base. Even in high-gear sociable unit tests you should provide test doubles for persistence and IO. Otherwise, the runtimes of your unit tests will start to increase and the benefit of instant feedback will be lost. Moreover, network protocols in particular have many interesting error conditions that are hard to trigger from a high-level business perspective. For these things you are better off isolating them in solitary narrow integration tests [Fowler2018]. There, you can easily construct all kinds of intricate failure situations, which is often necessary to ensure the proper functioning of such adapters to external systems.
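As a sketch of how this can look (the repository interface is hypothetical and not part of the example above): a hand-written in-memory fake keeps the business-level tests fast, while only the narrow integration tests for the real database-backed adapter ever touch actual IO:

```python
class InMemoryAccountRepository:
    """Fake replacing a (hypothetical) database-backed repository in unit tests."""

    def __init__(self) -> None:
        self._accounts: dict = {}

    def store(self, account_id: str, account) -> None:
        self._accounts[account_id] = account

    def load(self, account_id: str):
        return self._accounts[account_id]
```

High-gear sociable tests construct their object graph with this fake injected, so no test run ever waits for a database.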

How to provide test doubles: stubs and mocks

Dependency injection enables us to swap out the production collaborators of a tested unit with an implementation tailored to the specific test. These replacements are called test doubles [Fowler2006], and different techniques exist for creating them. The most prominent ones are stubs and mocks. The term stub is used with slightly varying meanings in different places. The common use of the word doesn’t distinguish clearly between stubs and fakes as defined in [Fowler2006]: any kind of implementation, whether providing canned answers to expected calls or a minimalistic implementation of the full protocol, is often blurred under the name stub.

On the other end of the spectrum are mocks. These test doubles are usually created through special mocking libraries in a declarative way by expressing the expected calls and potential answers to them. The second aspect lets mocks also act as stubs. However, as already outlined in [Fowler2007], there is still an important difference: because exact call sequences are used to define their behavior, mocks are coupled more strongly to the exact way the tested unit interacts with the test double than fakes would be. Mocks primarily verify whether the protocol spoken between the tested unit and the mock works as expected. When using stubs, the exact call sequences are of lower importance, because only the outcome is verified, not the steps towards this outcome. Therefore, stubs allow greater freedom for refactoring without having to change the test code, whereas mocks usually exhibit more coupling with the production code ([PercivalGregory2020], p. 51). Moreover, stubs (fakes) are often easier to comprehend. I can give meaningful names to my stubs, and their implementations are usually simple code that is easy to read. In contrast, mock declarations (when using typical mocking frameworks) are often harder to read:

mock = create_autospec(SomeStrategy)
mock.get_name.return_value = "static-error"
mock.compute_stuff.side_effect = ValueError()

That’s ok’ish to read, but a manual stub implementation (or fake implementation, to be precise) is easier to comprehend in most cases:

class RaisingStrategyFake(SomeStrategy):
    def get_name(self):
        return "static-error"

    def compute_stuff(self):
        raise ValueError()

5 instead of 3 code lines. An acceptable price given the gained clarity.

Another problem with mocks is that the declarations of desired functionality and expected calls are usually repeated (with slight variations depending on the actual test case) throughout the test code. Refactorings therefore often result in huge change sets in the test code, because mocking code has to be adapted close to everywhere.

So what is the recommendation here? First, if your code base is testable and uses correct abstractions with dependency injection, then using stubs is an easy task. This avoids one of the primary reasons for using mocks in dynamic languages such as Python, where mocks and monkey patching are used to overcome such design deficiencies. Therefore, my general recommendation – in line with [PercivalGregory2020] – is to use stubs whenever possible and when correct abstractions exist. If these abstractions are lacking, it might be time to add them.

However, mock-style testing also has its role. Especially at the boundaries of a system, where interactions with third-party libraries or systems exist (i.e. ports and adapters), using mocks as a means of exercising and triggering specific boundary conditions is easier. Moreover, for protocol adapters, expecting certain calls is a natural way of thinking and an essential aspect of verifying that the protocol is obeyed by the implementation. Such a requirement naturally maps to mock-style behavior verification ("When a new product is POSTed, the createProduct method on my ProductManagementService must be called.", "When requesting an HTTP resource, first open a channel, second, initiate TLS, …​"). Therefore, using mocks with their focus on the spoken protocol is a good idea whenever the actual interaction protocol is what matters. This pretty much leads to the testing strategy outlined in [Richardson2018], p. 309.
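A small sketch of such protocol-focused behavior verification in Python (the adapter function and service names are made up for illustration):

```python
from unittest.mock import MagicMock


def handle_product_post(payload: dict, service) -> None:
    """Hypothetical HTTP adapter translating a request payload into a service call."""
    service.create_product(payload["name"])


def test_posting_a_product_invokes_the_application_service():
    service = MagicMock()

    handle_product_post({"name": "gadget"}, service)

    # Behavior verification: the interaction itself is the requirement.
    service.create_product.assert_called_once_with("gadget")
```

The assertion fails if the adapter forgets the call, calls it twice, or passes the wrong arguments – exactly the protocol aspects that matter here.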

Conclusion

Hopefully, the explanations in this post have at least shown that it is important to think about how to write good test cases. Well-structured test cases provide many benefits and once the mechanics needed to achieve them have become part of your coding muscle memory, the initial overhead of writing clean test code becomes negligible. Ultimately, clean test code is something many people including yourself will value sooner or later and should never be neglected.

References

  • [Evans2004] Evans, Eric. Domain-Driven Design: Tackling Complexity in the Heart of Software. Boston: Addison-Wesley, 2004.
  • [Fowler2006] Fowler, Martin. “TestDouble.”, January 17, 2006.
  • [Fowler2007] Fowler, Martin. “Mocks Aren’t Stubs.”, January 2, 2007.
  • [Fowler2014] Fowler, Martin. “UnitTest.”, May 05, 2014.
  • [Fowler2018] Fowler, Martin. “IntegrationTest.”, Jan. 16, 2018.
  • [GammaEtAl1995] Gamma, Erich, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional Computing Series. Boston, Mass.: Addison-Wesley, 1995.
  • [Martin2009] Martin, Robert C. Clean Code: A Handbook of Agile Software Craftsmanship. Upper Saddle River, NJ: Prentice Hall, 2009.
  • [PercivalGregory2020] Percival, Harry and Gregory, Bob. Architecture patterns with Python: enabling test-driven development, domain-driven design, and event-driven microservices. Sebastopol, CA: O’Reilly, 2020.
  • [Richardson2018] Richardson, Chris. Microservices Patterns: With Examples in Java. Shelter Island, NY: Manning Publications, 2018.