Why Stateful Distributed Applications are Hard to Build: Reliability

Introduction

Stateful applications are everywhere, but they're surprisingly hard to get right, especially when you really care about the consistency, integrity, or security of your data. For clarity, when we say "stateful applications," we mean applications that both read from and modify some kind of persistent state, usually stored in a backend data store (these are sometimes called OLTP applications).

In this series of posts, we'll talk about some of the problems we see developers face when building stateful cloud applications and how existing tools and infrastructure can and can't help.

Because stateful applications modify data, making them reliable is hard. If a service is interrupted, you can't just restart it—you have to figure out what requests it was handling and resume them from where they left off. For example, say you restart a server hosting the backend for an online store. The server is likely processing several customer orders in various stages of completion. Some orders have just arrived, some are processing payment, others are paid and ready for fulfillment. For each of these orders, you have to figure out what stage they were in and resume their execution, so that each order runs to completion (if a customer pays for something, you always fulfill their order) and each of their stages executes exactly once (so you don't charge a customer twice in one order, or oversell a product).

Statefulness and atomicity

In technical terms, stateful systems need a property called atomicity. Their operations must be all-or-nothing, either fully completing exactly once or aborting and fully rolling back. The traditional way to make an operation atomic is to write it as a transaction in a relational database system like PostgreSQL (you may recognize atomicity as the A in ACID). This works great if your application is a monolith, storing all its state in one database and minimally interacting with the outside world.

However, fewer and fewer applications fit this description. Development is moving to a more distributed model, where applications span multiple services, each with their own database, and also rely on external APIs. You can't wrap all these distributed operations in one database transaction!

Let's look at what it takes to get atomicity for a simple modern application. Here's simplified code for the checkout flow of an e-commerce site that uses Stripe's Checkout API for payments. It needs two HTTP handlers. The first handler receives a checkout request, reserves inventory for the order, creates a Stripe checkout session, and sends its URL back to the browser so Stripe can handle payment processing:

// Checkout endpoint receives checkout requests from end users.
app.post('/api/checkout', async (req: Request, res: Response) => {
    // Retrieve shopping cart info and create an order in the database.
    const order = await createOrder(req);
    // Reserve inventory if enough is available.
    if (await reserveInventory(order)) {
        const stripe_session = await stripe.checkout.session.create({
            /* Stripe API inputs */
        });
        // Redirect to the Stripe payment page.
        res.redirect(303, stripe_session.url);
    } 
    // Return error to the user.
    res.status(500).send('Not enough inventory!');
});

The second handler receives an event from Stripe once payment processing has either succeeded or failed. If payment succeeded, it starts the process of fulfilling the order, otherwise it cancels the order and returns the reserved inventory:

// Webhook endpoint receives notifications from Stripe.
app.post('/api/stripe_webhook', async (req: Request, res: Response) => {
    // Verify the event came from Stripe.
    const event = stripe.webhooks.constructEvent(req.body, /* Stripe API inputs */);
    if (event.type === 'checkout.session.completed') {
        // If Stripe payment is completed, fulfill the order.
        await fulfillOrder(event);
    } else if (event.type === 'checkout.session.expired') {
        // Otherwise, mark the order as failed and return reserved inventory.
        await undoOrder(event);
    }
});

It's clear this checkout flow needs to be atomic. Either you reserve inventory and submit payment exactly once, the payment succeeds, and you fulfill the order; or the payment is canceled or fails, any reserved inventory is released, and you don't fulfill the order. Any other outcome, such as payment being submitted twice or payment succeeding but the order not being fulfilled, is clearly bad. However, there are many ways this could go wrong: in particular, the server could be restarted at any point during either handler's execution or the Stripe notification could fail to deliver. Either of these could leave the application in an inconsistent state.

So, how do you make this checkout flow work properly? You can't just wrap it in a database transaction because a critical operation—the Stripe payment—is happening externally.

Solutions

Solution #1 - Code a state machine?

One possible solution is to rewrite this application as a state machine. You can define every stage (creating an order, reserving inventory...) in the checkout flow as a different state and record state transitions in the database. Then, to handle a service interruption, you can have a new server look up the state of the interrupted server's orders and pick up where it left off. To handle Stripe notifications not arriving, you can set up a background process that checks for orders stuck in the "waiting for notification" state and ping Stripe to figure out what happened.

Solution #2 - Use a workflow engine?

State machines are a powerful and effective abstraction, but they're hard to implement: you need to build a robust executor and design your program carefully so that each stage is atomic. As an alternative, the emerging class of reliable workflow or ALTP engines, like AWS Step Functions or Azure Durable Functions, guarantee your programs always run to completion. However, because they treat each stage of your program as a black box, they can't guarantee your operations execute exactly once. Thus, while recovering a failed program they may execute individual operations like allocating inventory or submitting payments multiple times.

Solution #3 - Try DBOS Cloud early access and let us know what you think

Reliability in stateful applications is a problem we care a lot about at DBOS. Building on research prototypes from our academic project, DBOS stores the status of an in-progress program in the database so it can automatically resume any interrupted program from where it left off. By updating program state inside of application transactions, we can guarantee operations execute once and only once. We'll expand on how this works in a future blog post, but if you're interested in learning more, please check out the open-source DBOS Transact framework for guaranteed exactly-once Python and TypeScript code execution or use the DBOS Cloud serverless computing platform for free.