Making AI Agents Stateful and Fault-tolerant on Google Cloud Run

Serverless platforms like Google Cloud Run appeal to teams who want to ship fast without maintaining infrastructure. However, Cloud Run is stateless, which makes it a challenging deployment platform for agentic AI applications: AI agents execute multi-step workflows, are inherently stateful, and can be error-prone in distributed environments.

Agentic workflows require statefulness

AI agents execute workflows that are composed of steps, including both LLM and tool calls. Often, tool calls have side effects like writing to a database or calling an external service. If an agent is interrupted while processing a request, the interruption can leave the system in an inconsistent state with poor business outcomes, like a customer support agent promising to process a refund but not following through. Further, many agentic workflows are long running (e.g., onboarding workflows) and bursty (e.g., when your agents provide customer support during Black Friday).

That’s why reliable agents need a way to handle both interruptions and shifting load patterns. Each step an agent executes should be recorded (so state can be recovered following a failure) and retried automatically, and agents should scale under strict concurrency policies.
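To make the record-and-retry idea concrete, here is a minimal sketch of checkpointed step execution. It is not the DBOS API: the `run_step` helper, the `step_log` table, and the refund functions are hypothetical names, and an in-memory SQLite database stands in for a durable store such as Postgres.

```python
import sqlite3
import time

# In-memory SQLite stands in for a durable store such as Postgres (assumption).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE step_log ("
    "workflow_id TEXT, step_name TEXT, output TEXT, "
    "PRIMARY KEY (workflow_id, step_name))"
)

def run_step(workflow_id, step_name, fn, max_attempts=3, backoff_s=0.0):
    """Run fn at most once per (workflow, step); afterwards replay the recorded output."""
    row = db.execute(
        "SELECT output FROM step_log WHERE workflow_id = ? AND step_name = ?",
        (workflow_id, step_name),
    ).fetchone()
    if row is not None:
        return row[0]  # step already completed: skip the side effect
    for attempt in range(1, max_attempts + 1):
        try:
            output = fn()
            break
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure
            time.sleep(backoff_s * attempt)
    db.execute(
        "INSERT INTO step_log VALUES (?, ?, ?)", (workflow_id, step_name, output)
    )
    db.commit()
    return output

calls = {"promise": 0, "refund": 0}

def promise_refund():
    calls["promise"] += 1
    return "promised"

def issue_refund():
    calls["refund"] += 1
    if calls["refund"] == 1:
        raise RuntimeError("transient payment API error")  # fails once, then succeeds
    return "refunded"

def refund_workflow(workflow_id):
    return [
        run_step(workflow_id, "promise_refund", promise_refund),
        run_step(workflow_id, "issue_refund", issue_refund),
    ]

first = refund_workflow("wf-1")     # runs both steps, retrying the refund once
replayed = refund_workflow("wf-1")  # simulates recovery: nothing is re-executed
```

Re-running the workflow with the same ID returns the recorded outputs without repeating the side effects. In this sketch a crash between a side effect and its log write would cause that one step to run again on recovery, so steps with external side effects are best made idempotent.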

Why it’s hard to run AI agents on Cloud Run

As a stateless, serverless platform, Cloud Run, on its own, is not a good fit for stateful AI agents.

First, Cloud Run has tight timeouts (up to 1 hour), after which instances are terminated. Cloud Run does not persist an instance's in-memory state when it is terminated, so all work in progress is lost.

Second, Cloud Run can rapidly scale to many concurrent instances, which can overwhelm external resources like your database or trigger API rate-limit failures.

Scaling production-ready AI agents on Cloud Run requires durable workflows that can automatically resume where they left off and be precisely scaled.

Making serverless Cloud Run stateful with DBOS

DBOS is an open source library that stores workflow and queuing state in a Postgres database, making your code durable and observable by default, no matter where it runs. Specifically, DBOS provides the state management and scaling controls that Cloud Run lacks. When an instance times out, DBOS automatically resumes the workflows it was running on a new instance. DBOS durable queues can be finely tuned for concurrency management and rate limiting.
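As an illustration of how a database-backed queue can enforce both a concurrency cap and a rate limit across instances, consider the sketch below. It is a stdlib-only toy, not DBOS's implementation: the `queue` table, the `claim` function, and its parameters are hypothetical, and in-memory SQLite stands in for the shared Postgres database.

```python
import sqlite3

# Shared queue table; every instance would read and write the same rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, status TEXT, started_at REAL)")

def enqueue(n):
    for _ in range(n):
        db.execute("INSERT INTO queue (status) VALUES ('ENQUEUED')")
    db.commit()

def claim(now, concurrency, limit, period):
    """Start as many tasks as the concurrency cap and the rate limit both allow."""
    running = db.execute(
        "SELECT COUNT(*) FROM queue WHERE status = 'RUNNING'"
    ).fetchone()[0]
    # Tasks started within the last `period` seconds count against the rate limit.
    recent = db.execute(
        "SELECT COUNT(*) FROM queue WHERE started_at > ?", (now - period,)
    ).fetchone()[0]
    budget = max(0, min(concurrency - running, limit - recent))
    ids = [
        row[0]
        for row in db.execute(
            "SELECT id FROM queue WHERE status = 'ENQUEUED' ORDER BY id LIMIT ?",
            (budget,),
        )
    ]
    for task_id in ids:
        db.execute(
            "UPDATE queue SET status = 'RUNNING', started_at = ? WHERE id = ?",
            (now, task_id),
        )
    db.commit()
    return ids

def complete(task_id):
    db.execute("UPDATE queue SET status = 'DONE' WHERE id = ?", (task_id,))
    db.commit()

enqueue(10)
batch_1 = claim(now=0, concurrency=3, limit=4, period=10)   # capped by concurrency
for task_id in batch_1:
    complete(task_id)
batch_2 = claim(now=1, concurrency=3, limit=4, period=10)   # capped by rate limit
for task_id in batch_2:
    complete(task_id)
batch_3 = claim(now=5, concurrency=3, limit=4, period=10)   # rate limit exhausted
batch_4 = claim(now=11, concurrency=3, limit=4, period=10)  # window has passed
```

Because the counters live in the database rather than in process memory, the limits hold no matter how many instances are polling. With multiple real workers the claim would also have to be atomic (for example, a single transaction with row locking), which this single-connection sketch does not show.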

Because scaling also involves tuning the min/max number of instances your Cloud Run deployment should have, you can leverage DBOS load observability APIs to dynamically adjust these parameters, from within your application, using a DBOS scheduled workflow.
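The sizing decision inside such a scheduled workflow could look roughly like the function below. This is a sketch under assumptions: `desired_instances` and its parameters are hypothetical, the queue counts would come from whatever load metrics you read, and applying the result to the Cloud Run service is not shown.

```python
import math

def desired_instances(queued, running, per_instance_concurrency,
                      min_instances, max_instances):
    """Size the fleet so the current backlog fits, clamped to the allowed range."""
    backlog = queued + running
    needed = math.ceil(backlog / per_instance_concurrency)
    return max(min_instances, min(max_instances, needed))

idle = desired_instances(0, 0, 10, min_instances=1, max_instances=20)
busy = desired_instances(95, 5, 10, min_instances=1, max_instances=20)
burst = desired_instances(1000, 0, 10, min_instances=1, max_instances=20)
```

Clamping to a maximum keeps a burst from scaling the service past what the database or downstream APIs can absorb.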

Furthermore, you can use DBOS to safely update agentic code by managing agent versions or dynamically patching agent code. This is useful because long-lived agents often need to keep executing a specific version of the code while the code base itself evolves rapidly, resulting in many agents running different code versions. DBOS versions can match Cloud Run revisions and consequently work in tandem with Cloud Run's gradual rollout capabilities.

Finally, you can leverage DBOS Conductor, an out-of-band agent control plane, to monitor and manage agentic workflows and queues.
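One way to picture version pinning: each workflow is tagged with the code version that started it, and an instance only resumes interrupted workflows that carry its own version. The sketch below assumes that model. `PendingWorkflow`, `recoverable`, and the revision names are hypothetical; `K_REVISION` is the environment variable Cloud Run sets to the name of the running revision.

```python
import os
from dataclasses import dataclass

@dataclass
class PendingWorkflow:
    workflow_id: str
    app_version: str  # version of the code that started this workflow

def current_version():
    # Cloud Run sets K_REVISION to the revision name; "local" is a dev fallback.
    return os.environ.get("K_REVISION", "local")

def recoverable(pending, version):
    """Return the IDs of interrupted workflows this code version may resume."""
    return [w.workflow_id for w in pending if w.app_version == version]

pending = [
    PendingWorkflow("wf-1", "agent-00012"),
    PendingWorkflow("wf-2", "agent-00013"),
    PendingWorkflow("wf-3", "agent-00012"),
]
old_revision = recoverable(pending, "agent-00012")
new_revision = recoverable(pending, "agent-00013")
```

During a gradual rollout both revisions serve traffic, so each keeps draining the workflows it started while new workflows land on whichever revision receives the request.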

How Dosu runs DBOS workflows on Cloud Run

For the engineers building Dosu, AI-native knowledge infrastructure for teams and agents, Cloud Run offered cost-efficient, scalable infrastructure. However, as they ramped up their service in production, reliability and scalability became first-class concerns. They needed a way to make task queues and agent workflows resilient to failures at scale.

When Dosu looked at the incumbent in the space, Temporal, they deemed it too complicated, the opposite of the lightweight approach that serverless embodies. The Dosu team learned of DBOS on Hacker News and gave it a try.

Dosu runs tens of thousands of agentic workflows per hour. AI agents are composed of many steps, which process development project assets, respond to user questions about the project, and so on. Partial execution leads to inconsistent states, creating troubleshooting and API cost overhead for Dosu. Further, some of Dosu’s workflows can be very long running, e.g., onboarding workflows, and bursty, e.g., when your Claude-enabled team pushes thousands of PRs daily.

The Dosu team was already all-in on Postgres and was able to get DBOS running in a single day with their existing infrastructure. In fact, the only infrastructure change required was to enable instance-based billing rather than the default request-based billing in Cloud Run, which allows DBOS workflows to run in the background while there aren’t any active HTTP requests.

Dosu quickly migrated all their agentic workflows to DBOS with zero downtime. They were glad to find that their workload, which consists of occasional bursts of CPU and HTTP requests, scaled well to meet demand under the default Cloud Run auto-scaling policy.

Once everything was migrated, the next step was optimization and monitoring tooling. They created a dedicated Cloud SQL Postgres instance to further increase the throughput of DBOS workflows and provide connection headroom for scale-outs. Lastly, they built custom monitoring dashboards in Grafana by querying the DBOS tables directly.
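A dashboard panel of this kind typically boils down to an aggregation query over a workflow status table. The sketch below shows the shape of such a query; the `workflow_status` table, its columns, and the sample rows are hypothetical stand-ins rather than the actual DBOS schema, and in-memory SQLite replaces Postgres.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical, simplified status table (not the real DBOS schema).
db.execute("CREATE TABLE workflow_status (workflow_id TEXT, name TEXT, status TEXT)")
db.executemany(
    "INSERT INTO workflow_status VALUES (?, ?, ?)",
    [
        ("wf-1", "answer_question", "SUCCESS"),
        ("wf-2", "answer_question", "SUCCESS"),
        ("wf-3", "answer_question", "ERROR"),
        ("wf-4", "onboard_repo", "PENDING"),
    ],
)

def status_counts(conn):
    """Count workflows per (name, status), the kind of series a dashboard would plot."""
    rows = conn.execute(
        "SELECT name, status, COUNT(*) FROM workflow_status "
        "GROUP BY name, status ORDER BY name, status"
    )
    return {(name, status): count for name, status, count in rows}

counts = status_counts(db)
```

Because workflow state already lives in Postgres, a query like this needs no extra instrumentation in the application.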

Learn more

Today, Dosu is used by over 50,000 software projects, including rapidly growing open-source standouts like BetterAuth, Poetry, LlamaIndex, Apache Airflow, and Zod. 

If you want to learn more about DBOS and how to deploy to Cloud Run, check out the DBOS docs.
