Lessons from Testing Distributed Systems

Testing distributed systems is brutal. There's a near-infinite space of potential problems arising from different failures happening across different processes with different timings.

In this post, we want to share the lessons and best practices we've learned from testing the correctness, at scale, of a particular type of distributed system: a durable workflows library. To be truly robust, a testing strategy for a distributed system can’t solely rely on unit and integration tests (although those remain useful). Instead, tests must explore the space of possible failures, either mathematically or through carefully induced randomness. We'll first survey some of the most prominent techniques in the industry, then discuss our testing strategy in detail.

Approaches for Testing At Scale

Here are a few of the most popular ways to rigorously test the correctness of distributed systems:

Formal Verification: This approach involves writing a formal specification in a language like TLA+ and proving that it satisfies desired properties. It’s mathematically rigorous and has been used successfully at scale by companies like Amazon (see this paper or this blog post from Marc Brooker).

Unfortunately, formal verification remains prohibitively difficult to adopt. Maintaining both formal specs and code implementations adds significant development overhead, and there's no good way to formally model interactions with complex real-world dependencies like Postgres (which is core to our durable workflows library). Moreover, formal verification only goes so far: even if you prove your spec is correct, there's no guarantee your code correctly implements it.

Deterministic Simulation Testing: This relatively new strategy involves building a testing harness that can run an entire system deterministically on a single thread (for more detail, see this blog post from Phil Eaton). That way, it is possible to fully control event ordering and perform reproducible fault injections. Deterministic simulation is challenging to implement, but recent work has made it approachable (for example, using the Antithesis deterministic hypervisor).

One major challenge in adopting deterministic simulation testing is dealing with external dependencies. You can’t directly test how your system integrates with other systems; instead, you have to mock your dependencies. However, mocking complex dependencies like Postgres will almost certainly miss some behaviors and obscure critical bugs, making this strategy problematic for any system that depends on them. That said, there’s been some recent work (particularly from Antithesis) on better simulation of external dependencies, so this is a space worth watching.

Chaos Testing: This approach tests a system by running complex real workloads while randomly injecting faults such as server crashes, network delays, disconnections, or latency spikes. Some prominent examples of chaos testing at scale include Netflix’s Simian Army and the famous Jepsen tests for databases.

To test durable workflows at scale, we heavily rely on chaos testing because it stresses the system under realistic conditions and directly tests the critical interactions between our systems and Postgres.

How Chaos Tests Work

Our chaos test suite launches hundreds of distributed processes and subjects them to complex workloads that exercise various durable workflows features, including workflow execution, queues, notifications, and more. While these workloads run, the test suite randomly injects failures to simulate real-world issues (a minimal fault-injection sketch follows the list), including:

  • Random process crashes and restarts.
  • Unexpected database disconnections and crashes.
  • Rolling out new application versions mid-flight, with updated code running alongside older versions.
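
To make this concrete, below is a minimal sketch of the simplest of these fault injectors: one that randomly crashes and restarts worker processes. The worker.py entry point and the process count are hypothetical; the real suite manages far more processes and also injects database disconnections and rolling version upgrades.

```python
import random
import subprocess
import time

# Hypothetical worker entry point standing in for a real durable-workflows process.
WORKER_CMD = ["python", "worker.py"]
NUM_WORKERS = 8

def run_crash_injector(duration_s: float = 300.0) -> None:
    """Randomly crash and restart worker processes for duration_s seconds."""
    workers = [subprocess.Popen(WORKER_CMD) for _ in range(NUM_WORKERS)]
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        time.sleep(random.uniform(1.0, 10.0))           # wait a random interval
        victim = random.randrange(len(workers))
        workers[victim].kill()                          # simulate a hard crash
        workers[victim].wait()
        workers[victim] = subprocess.Popen(WORKER_CMD)  # restart to exercise recovery
    for w in workers:
        w.terminate()
```

Using a hard kill rather than a graceful shutdown matters here: it exercises the recovery paths that only run after an abrupt crash.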

The chaos tests verify that durable workflows maintain key properties despite unpredictable failures. Some of the core invariants we validate include:

  • Workflow durability. Every submitted workflow eventually completes, despite process failures, restarts, and database disconnections.
  • Queue reliability. All enqueued workflows eventually dequeue and complete, despite process failures, restarts, and database disconnections.
  • Queue flow control. Processes always respect queue concurrency and flow control limits, never running more workflows than allowed, even if workflows are repeatedly interrupted and restarted.
  • Message delivery. If a notification message is successfully sent to a workflow, the workflow always receives the message, even if repeatedly interrupted and restarted.
  • Version correctness. Workflows are never dequeued or recovered to processes running different application versions, even when processes running multiple different application versions are concurrently active.
  • Timeout enforcement. Workflow timeouts are always respected, even if a workflow is repeatedly interrupted and restarted across different processes.

To make chaos tests robust, it's critical to design them to validate only these system-wide invariants, not test-specific assertions. For example, a conventional integration test might send an HTTP request to a server to start a workflow, validate that it returns a 200 response, then later send another HTTP request to check if the workflow completed and validate that it also returns a 200 response. However, a chaos test making such checks would have a high false positive rate, as servers might crash mid-request.

Instead of checking whether individual requests succeed or fail, a chaos test should validate the underlying invariant: that a successfully started workflow eventually completes. For example, the chaos test could request that a server start a workflow with a specific ID, retrying many times until the request succeeds. Then the test could poll the server to check the workflow's status until either the server confirms the workflow has succeeded or a sufficiently long timeout has passed. Such tests might use a utility like retry_until_success, which retries the assertion of an invariant hundreds of times until it either passes or times out. Checks like this ensure tests only fail when an invariant is broken, not due to transient issues.
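
As a concrete illustration, here's a minimal sketch of what such a check might look like. The retry_until_success helper, the HTTP endpoints, and the status values are hypothetical stand-ins, not the actual test suite's API:

```python
import time
import uuid

import requests  # assumed HTTP client; the endpoints below are hypothetical

SERVER = "http://localhost:8000"

def retry_until_success(assertion, attempts: int = 300, delay_s: float = 1.0) -> None:
    """Retry an invariant assertion until it passes or we run out of attempts."""
    for _ in range(attempts - 1):
        try:
            assertion()
            return
        except Exception:
            time.sleep(delay_s)  # transient failures (e.g., a crashed server) are expected
    assertion()  # final attempt: let any failure propagate so the test fails

def test_workflow_durability() -> None:
    workflow_id = str(uuid.uuid4())

    # Starting the workflow may fail transiently if the target server crashes
    # mid-request, so retry until the request is accepted.
    def start() -> None:
        resp = requests.post(f"{SERVER}/workflows/{workflow_id}/start", timeout=5)
        assert resp.status_code == 200

    retry_until_success(start)

    # The invariant under test: a successfully started workflow eventually
    # completes, no matter how many crashes happen in between.
    def completed() -> None:
        resp = requests.get(f"{SERVER}/workflows/{workflow_id}/status", timeout=5)
        assert resp.status_code == 200 and resp.json()["status"] == "SUCCESS"

    retry_until_success(completed)
```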

Anatomy of a Bug

Rigorous chaos testing has helped uncover several subtle and hard-to-reproduce bugs. Here's one concrete example.

The symptom: Occasionally, a workflow would get stuck while waiting for a notification. The notification was successfully sent and recorded in the database, but the workflow never woke up and consumed it.

The investigation: Digging deeper, we noticed this freeze only happened immediately after a database disconnection. The workflow was automatically recovered, but sometimes would still hang. The issue was rare (most of the time the workflow resumed as expected), but it pointed us toward a possible race condition involving database disconnects while receiving a notification.

The root cause: The issue was in how the recv() function handled concurrent executions of the same workflow. To avoid duplicate execution, it used a map of condition variables to coordinate across concurrent executions. When two instances of the same workflow attempted to receive the same notification, one would acquire the condition variable and wait for the notification, while the other would detect the conflict and wait for the first to finish.

The problematic code followed a pattern like this (shown as a simplified reconstruction, not the exact library code):
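
```python
import threading

# Simplified, hypothetical reconstruction of the pattern described above;
# the real implementation differs in its details.
condition_variables: dict[tuple[str, str], threading.Condition] = {}

def recv(c, workflow_id: str, topic: str, timeout_seconds: float):
    """Wait for a notification sent to `workflow_id` on `topic`."""
    key = (workflow_id, topic)
    existing = condition_variables.get(key)
    if existing is not None:
        # Another execution of the same workflow is already waiting for this
        # notification: wait for it to finish instead of receiving twice.
        with existing:
            existing.wait()
        return None

    cv = threading.Condition()
    condition_variables[key] = cv
    with cv:
        # BUG: if the database disconnects here, c.execute() raises and the
        # function exits early without removing `cv` from the map. A recovered
        # execution then finds the stale entry and blocks forever above.
        c.execute(
            "SELECT message FROM notifications WHERE workflow_id = %s AND topic = %s",
            (workflow_id, topic),
        )
        row = c.fetchone()
        if row is None:
            cv.wait(timeout=timeout_seconds)  # woken when the notification arrives
            c.execute(
                "SELECT message FROM notifications WHERE workflow_id = %s AND topic = %s",
                (workflow_id, topic),
            )
            row = c.fetchone()
    del condition_variables[key]  # never reached if the query above raised
    return row[0] if row else None
```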

The bug only occurred if a database disconnection happened precisely while this code was running: the SQL query (c.execute()) would throw an exception, causing the function to exit early without releasing or removing the condition variable from the map. When the workflow later recovered, it would find the lingering condition variable, assume another execution was still active, and block indefinitely, waiting for the original (now-failed) execution to finish.

The fix: Once we understood the problem, the solution was straightforward: wrap the entire code block in a try-finally to ensure the condition variable is always cleaned up, even if an error occurs (again shown as a simplified sketch):
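
```python
# Same simplified sketch as above, with cleanup moved into a finally block.
def recv(c, workflow_id: str, topic: str, timeout_seconds: float):
    key = (workflow_id, topic)
    existing = condition_variables.get(key)
    if existing is not None:
        with existing:
            existing.wait()
        return None

    cv = threading.Condition()
    condition_variables[key] = cv
    try:
        with cv:
            c.execute(
                "SELECT message FROM notifications WHERE workflow_id = %s AND topic = %s",
                (workflow_id, topic),
            )
            row = c.fetchone()
            if row is None:
                cv.wait(timeout=timeout_seconds)
                c.execute(
                    "SELECT message FROM notifications WHERE workflow_id = %s AND topic = %s",
                    (workflow_id, topic),
                )
                row = c.fetchone()
    finally:
        # Always remove the condition variable, even if the query raises,
        # so a recovered execution never blocks on a stale map entry.
        del condition_variables[key]
    return row[0] if row else None
```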

This is exactly the kind of bug that chaos testing is designed to catch—subtle, hard to reproduce, and nearly impossible to find with unit or integration tests alone, as it only occurs when a database disconnection happens at a precise moment. The original fix can be found in this PR.

Learn More about Durable Execution

If you're building reliable systems or just enjoy uncovering weird edge cases, we’d love to hear from you. At DBOS, our goal is to make durable workflows as lightweight and easy to work with as possible. Check it out: