
Soria Analytics provides a unified data-intelligence platform for the healthcare-services industry. The company aggregates and normalizes hundreds of public and commercial data sources — from PDFs and spreadsheets to APIs and regulatory releases — into continuously updated, analytics-ready datasets. With real-time alerts, natural-language search, and full data lineage, Soria enables analysts, investors, and operators to access durable insights without manual data collection or cleanup.
Soria continuously monitors 300+ data sources, representing decades of changing file formats, shifting schemas, and thousands of healthcare-company subsidiaries. These workflows must run reliably and in parallel. When a government dataset updates, Soria wants to detect the change, pull all new files, clean them, map evolving schemas, and load the results into BigQuery — without manual intervention or brittle one-off scripts.
Soria’s ingestion pipelines scrape hundreds of government sources, parse inconsistent file formats, reconcile decades of schema drift, and fan out into hundreds of parallel mapping and cleaning tasks. Initially, the team tried Celery to enqueue this work, but it lacked the workflow primitives needed to model multi-step, long-running pipelines. As the platform grew, chaining tasks together became fragile and difficult to reason about.
“Trying to build our ingestion engine on Celery got ugly fast once we needed multi-step, highly parallelized workflows.” — Cameron Spiller, CTO, Soria Analytics
Healthcare data changes unpredictably — columns are renamed, formats shift, new regulatory fields appear. When something failed, Celery only showed the broken task, not the workflow it belonged to. The team couldn’t easily trace lineage or understand where failures originated, making debugging slow and operationally expensive.
While Soria built most of its product with lightweight, modern infrastructure, Celery required its own ecosystem: a persistent Redis or RabbitMQ cluster, dedicated workers, autoscaling logic, and separate monitoring. Maintaining two parallel infrastructures introduced unnecessary friction for a small team focused on speed and reliability.
Soria enhanced its ingestion pipeline code with the open source DBOS Transact library, gaining instant, end-to-end workflow durability and visibility. DBOS let the team orchestrate scraping, cleaning, mapping, and ingestion as natural Python workflows.
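Here is a minimal sketch of what such a pipeline can look like with DBOS Transact. The step and workflow names (`scrape_source`, `load_to_bigquery`, etc.) and the configuration values are illustrative stand-ins, not Soria’s actual code:

```python
from dbos import DBOS, DBOSConfig

# Hypothetical config: DBOS stores workflow state in an existing Postgres database.
config: DBOSConfig = {
    "name": "ingestion-pipeline",
    "database_url": "postgresql://localhost:5432/ingestion",
}
DBOS(config=config)

@DBOS.step(retries_allowed=True, max_attempts=3)
def scrape_source(source_url: str) -> list[str]:
    # Download any newly published files; retried automatically on failure.
    return [f"{source_url}/latest.csv"]  # placeholder for real scraping logic

@DBOS.step()
def clean_files(files: list[str]) -> list[str]:
    return files  # placeholder for parsing and cleanup

@DBOS.step()
def map_schema(files: list[str]) -> list[str]:
    return files  # placeholder for schema mapping

@DBOS.step()
def load_to_bigquery(files: list[str]) -> None:
    pass  # placeholder for the BigQuery load

@DBOS.workflow()
def ingest(source_url: str) -> None:
    # If the process crashes mid-run, DBOS resumes from the last
    # completed step rather than restarting the whole workflow.
    files = scrape_source(source_url)
    cleaned = clean_files(files)
    mapped = map_schema(cleaned)
    load_to_bigquery(mapped)

if __name__ == "__main__":
    DBOS.launch()
    ingest("https://example.gov/dataset")
```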
Switching to the DBOS durable workflow orchestration library eliminated the need for Celery’s Redis, worker clusters, and dedicated orchestrator. The team simply deploys Python code; DBOS handles concurrency, retries, and durability automatically, storing workflow state in their existing Postgres database. CI/CD became faster, rollbacks became trivial, and no one on the team has had to think about orchestration plumbing in months.
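The fan-out pattern described earlier maps naturally onto DBOS queues. A short sketch, assuming a hypothetical queue name and concurrency limit rather than Soria’s actual settings:

```python
from dbos import DBOS, Queue

# Hypothetical queue; `concurrency` caps how many mapping tasks run at once.
mapping_queue = Queue("file_mapping", concurrency=50)

@DBOS.step()
def map_one_file(path: str) -> str:
    return path  # placeholder for per-file mapping and cleaning

@DBOS.workflow()
def fan_out_mapping(paths: list[str]) -> list[str]:
    # Enqueue every file, then durably await all results.
    # If the process restarts, already-completed tasks are not re-run.
    handles = [mapping_queue.enqueue(map_one_file, p) for p in paths]
    return [h.get_result() for h in handles]
```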
Using DBOS Conductor, Soria gained true workflow-level observability, something the team struggled to get from Celery. Engineers can now monitor workflow state, inspect stuck jobs, and understand failures in context, dramatically reducing debugging time and increasing reliability.
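Conductor surfaces this state in a hosted console; the same information can also be inspected programmatically. A small sketch, assuming a hypothetical workflow ID:

```python
from dbos import DBOS

# Hypothetical workflow ID; retrieve_workflow returns a handle to a
# running or completed workflow whose status can be inspected.
handle = DBOS.retrieve_workflow("example-workflow-id")
status = handle.get_status()
print(status.status)  # e.g. PENDING, SUCCESS, or ERROR
```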
“We’re a tiny team, and DBOS let us move fast without running more infrastructure. It gave us durable orchestration and real visibility, with almost no overhead.” — Cameron Spiller, CTO, Soria Analytics
Discover why teams are turning to DBOS for reliable, observable applications.
Add a few annotations to your program to make it resilient to any failure.