I've also been struggling to wrap my head around the durable execution hype and whether my workload would benefit; maybe after sleeping on this post it will be clearer.
A bit off-topic, but I recently switched from Celery to Hatchet. I haven't even fully explored everything it can do, but the change has already made me a fan. Overall it simplified my stack and made several features easier to implement.
Some takeaways from my experience:
1. Streaming — My application provides real-time log streaming to users (similar to GitHub Actions or AWS CodeBuild). With Celery, I had to roll my own solution using Supabase Realtime. Hatchet’s streaming is straightforward: my frontend now connects to a simple SSE endpoint in my API that forwards the Hatchet stream (a rough sketch of that forwarding endpoint is below this list).
2. Dynamic cron scheduling — Celery requires a third-party tool like RedBeat for user-defined schedules. Hatchet supports this natively.
3. Logs — Hatchet isolates logs per task out of the box, which is much easier to work with.
4. Worker affinity — Hatchet’s key-value tags on workers and workflows allow dynamic task assignment based on worker capabilities. For example, a customer requiring 10 Gbps networking can have tasks routed to workers tagged {'network_speed': 10}. This would have required custom setup in Celery.
5. Cancellation — Celery has no graceful way to cancel in-flight tasks without risking termination of the entire worker process (Celery docs note that terminate=True is a “last resort” that sends SIGTERM to the worker). Hatchet handles cancellation more cleanly.
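To make (1) concrete, here's a rough sketch of the forwarding endpoint in FastAPI. `subscribe_to_run_logs` is a hypothetical stand-in for whatever call your task platform exposes to subscribe to a run's log stream (it is not Hatchet's actual API), so treat the names as placeholders.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def subscribe_to_run_logs(run_id: str):
    """Hypothetical stand-in for the task platform's stream-subscription call."""
    yield f"log line for {run_id}"

@app.get("/runs/{run_id}/logs")
async def stream_logs(run_id: str):
    async def event_stream():
        # Each chunk from the upstream stream becomes one SSE "data:" event.
        async for line in subscribe_to_run_logs(run_id):
            yield f"data: {line}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

The browser side is then just an `EventSource` pointed at this route.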
> This task is not easily idempotent; it involves writing a ton of intermediate state and queries to determine that a step should not be repeated
The problem with durable execution is that your entire workflow still needs to be idempotent. Consider that each workflow is divided into a sequence of steps that amount to: 1) do work, 2) record the fact that the work was done. If 2) never happens because the worker falls over, you must repeat 1). Therefore, for each step, "doing work" happens at least once. Given that steps compose, and each executes at least once, it follows that the entire workflow executes at least once. Because execution is at least once rather than exactly once, everything you write in a durable execution engine must be idempotent.
At that point, the only thing the durable execution engine is buying you is an optimization against re-running some slow tasks. That may be valuable in itself. However, it doesn't change anything about good practices for writing async worker tasks.
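To make the at-least-once argument concrete, here is a minimal sketch of the "do work, then record it" shape of a step; the storage helpers are made-up stand-ins for whatever the engine actually persists.

```python
# Sketch of a durable step: check the record, do the work, write the record.
# The in-memory "database" below is just a stand-in for real persistence.
_completed: set[str] = set()

def is_completed(step_id: str) -> bool:
    return step_id in _completed

def record_completed(step_id: str) -> None:
    _completed.add(step_id)

def do_work(step_id: str) -> None:
    print(f"side effects for {step_id} happen here")

def run_step(step_id: str) -> None:
    if is_completed(step_id):   # replay path: skip steps already recorded
        return
    do_work(step_id)            # 1) do the work (side effects happen here)
    # A crash at this point loses the record but not the side effects,
    # so the next attempt repeats do_work(): every step is at-least-once.
    record_completed(step_id)   # 2) record the fact that the work was done
```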
I think a lot of the original Temporal/Cadence authors were motivated by working on event-driven systems with retries. Those systems exhibited complex failure scenarios that they could not reasonably account for without slapping on more supervisor systems. Durable execution gives you a consistent vantage point from which to think about failures.
I agree determinism/idempotency and the complexities of these systems are a tough pill to swallow. They certainly need to be suited to the task.
> that your entire workflow still needs to be idempotent
If you just mean the workflow logic, then as the article mentions it has to be deterministic, which implies idempotency; but that's fine because workflow logic doesn't have side effects. The side-effecting functions invoked from a workflow (what Temporal dubs "activities") of course _should_ be idempotent so they can be retried upon failure, as is the case for all retryable code, but this is not a requirement. These side-effecting functions can be configured at the callsite to have at-most-once semantics.
In addition to many other things like observability, the value of durable execution is persisted advanced logic like loops, try/catch, concurrent async ops, sleeping, etc., and making all of that crash-proof (i.e. it resumes from where it left off on another machine).
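For example, a rough Temporal Python sketch of the callsite configuration described above; the workflow and activity names are made up, and the exact parameters should be checked against the SDK docs rather than taken as authoritative.

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def charge_card(order_id: str) -> str:
    # The side-effecting call lives in an activity, not in workflow logic.
    return f"charged {order_id}"


@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        # maximum_attempts=1 disables retries, giving this side effect
        # at-most-once behaviour, configured at the callsite.
        return await workflow.execute_activity(
            charge_card,
            order_id,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=1),
        )
```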
> The problem with durable execution is that your entire workflow still needs to be idempotent.
Yes, but what that means depends on your durability framework. For example, the one that my company makes can use the same database for both durability and application data, so updates to application data can be wrapped in the same database transaction as the durability update. This means "the work" isn't done unless "recording the work" is also done. It also means they can be undone together.
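A minimal, framework-agnostic sketch of that idea, using sqlite3 only to show the shape (the tables and names are made up): the application write and the "this step completed" record commit in one transaction, so replaying the step is a no-op.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
with conn:
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute("CREATE TABLE step_completions (step_id TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO orders VALUES ('order-1', 'pending')")

def run_step(step_id: str, order_id: str) -> None:
    with conn:  # one transaction: both writes commit together, or neither does
        already_done = conn.execute(
            "SELECT 1 FROM step_completions WHERE step_id = ?", (step_id,)
        ).fetchone()
        if already_done:
            return  # replay after a crash is a no-op
        # "the work": the application-data update
        conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))
        # "recording the work": the durability record, in the same transaction
        conn.execute("INSERT INTO step_completions (step_id) VALUES (?)", (step_id,))

run_step("charge:order-1", "order-1")
run_step("charge:order-1", "order-1")  # second call sees the record and does nothing
```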
A lot of work can be wrapped inside a database transaction, but not everything. You're always going to want to interact with external APIs eventually.
Yes of course. External calls still need to be idempotent. But the point is some frameworks allow you to make some, or even most, of your work safe for durable execution by default.
> That's just another way of saying that the step in question is idempotent.
No, it's different. Idempotent would mean that it can be replayed with no effect. What I'm saying is that this guarantees exactly-once execution, taking advantage of the database transactions that make multiple data updates idempotent together.
For me the main issue with these systems is that they're still seen as a special case of backend execution. I think the real value is just admitting that every POST/PUT should kick off a durable execution, but that doesn't seem to match the design, which treats these workflows as quite heavy and expensive and bases its pricing on them.
What we need is an opinionated framework that doesn't allow you to do anything except durable workflows, so your junior devs stop doing two POSTs in a row thinking things will be OK.
The "constraining functions to only be durable" idea is really interesting to me and would solve the main gotcha of the article.
It'd be an interesting experiment to take memory snapshots after each step in a workflow, which an API like Firecracker might support, but likely adds even more overhead than current engines in terms of expense and storage. I think some durable execution engines have experimented with this type of system before, but I can't find a source now - perhaps someone has a link to one of these.
There's also been some work, for example in the Temporal Python SDK, to override the asyncio event loop so that regular calls like `sleep` work as durable calls instead, to reduce the risk to developers. I'm not sure how well this generalizes.
OK, I'm not an expert here (you most likely are), but just my two cents on your response: I would very much argue against making this magic.
e.g.:
> take memory snapshots after each step in a workflow
Don't do this. Just give people explicit boundaries for where their snapshots occur and what is snapshotted, so they have control over both durability and performance. Make it clear to people that everything should be in the chain of command of the snapshotting framework: e.g. no file-local or global variables. This is already how people program web services, but somehow nobody leans into it.
The thing is, if you want people to understand durability but you also hide it from them, the framework will actually be much more complicated to understand and work with.
The real golden ticket, I think, is to build readable, intuitive abstractions around durability, not to hide it behind normal-looking code.
Just to continue the idea: you wouldn't be constraining or tagging functions; you would relinquish control to a system that closely guards how you produce side effects. E.g. doing a raw HTTP request from a task is prohibited, not intercepted.
This would look like a handler taking an IO token that provides a memoizing get_or_execute function, plus utilities for calling these handlers, correct?
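For illustration, a purely hypothetical sketch of what I mean (not any existing framework's API): an IO token whose memoizing get_or_execute is the only sanctioned way for a handler to produce side effects.

```python
# Hypothetical "IO token": side effects are memoized by key, so a replayed
# handler gets the recorded results instead of re-running the effects.
from typing import Any, Callable, Dict


class IOToken:
    def __init__(self, journal: Dict[str, Any]):
        self._journal = journal  # persisted durably in a real system

    def get_or_execute(self, key: str, fn: Callable[[], Any]) -> Any:
        if key in self._journal:        # replay: return the recorded result
            return self._journal[key]
        result = fn()                   # first execution: do the side effect
        self._journal[key] = result     # record it before handing it back
        return result


def handler(io: IOToken, user_id: str) -> str:
    # Raw HTTP calls are off-limits; everything goes through the token.
    profile = io.get_or_execute(f"fetch-profile:{user_id}",
                                lambda: {"id": user_id, "plan": "pro"})
    return profile["plan"]


# Utility for calling handlers: load/persist the journal around the call.
journal: Dict[str, Any] = {}
print(handler(IOToken(journal), "u123"))   # executes the side effect
print(handler(IOToken(journal), "u123"))   # replays from the journal
```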
I still don't get it. External API calls aren't deterministic. You're still left with the choice between making them idempotent or potentially performing an action twice (or more!), and I don't see how durable execution helps you.
Those sorts of flaky or non-deterministic steps are written as activities, not as part of the deterministic workflow. The orchestrator will retry the non-deterministic activity until it gets a usable output (expected error, success) and record the activity's output. If the workflow replays (i.e. after a worker crash), that recorded output of the activity will be returned instead of executing the activity again.
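A toy sketch of that behaviour, with made-up names: on replay the recorded output is returned; otherwise the flaky activity is retried until it yields a usable result, which is then recorded.

```python
import random

# Toy orchestrator loop. `history` stands in for the engine's event history.
history: dict[str, object] = {}

def flaky_activity() -> int:
    if random.random() < 0.5:
        raise RuntimeError("transient failure")
    return 42

def run_activity(activity_id: str, max_attempts: int = 5) -> object:
    if activity_id in history:           # workflow replay after a crash:
        return history[activity_id]      # return the recorded output, don't re-run
    for _ in range(max_attempts):
        try:
            result = flaky_activity()    # non-deterministic work lives here
        except RuntimeError:
            continue                     # retry on expected/transient errors
        history[activity_id] = result    # record the output for future replays
        return result
    raise RuntimeError("activity failed after retries")

print(run_activity("step-1"))
print(run_activity("step-1"))  # replay path: same recorded value, no re-execution
```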
Helpful read, thanks for sharing! We have been (slowly) working on some fairness related queueing features over in pgmq for a bit too https://github.com/pgmq/pgmq/pull/442. It does get complicated on Postgres IMO.
> Please do refute. I'm genuinely interested in this problem as I deal with it daily.
Sure, I'll bite. Task-level idempotency is not the problem that durable execution platforms are solving. The core problem is the complexity that arises when one part of your async job becomes distributed: the two common ones are distributed runtime (compute) and distributed application state.
Let's just take the application state side. If your entire async job can be modeled as a single database transaction, you don't need a durable execution platform; you need a task queue with retries. Our argument at Hatchet is that this covers many (perhaps most) async workloads, which is why the durable task queue is the primary entrypoint to Hatchet, and durable execution is only a feature for more complex workloads.
But once you start to distribute your application state - for example, different teams building microservices which don't share the same database - you have a new set of problems. The most difficult edge case here is not the happy path with multiple successful writes; it's distributed rollbacks: a downstream step fails and you need to undo the upstream step in a different system. In these systems, you usually introduce an "orchestrator" task which catches failures and figures out how to unwind the upstream work in the right way.
It turns out these orchestrator functions are hard to build, because there are many failure scenarios. So this is why durable execution platforms place some constraints on the orchestrator function, like determinism, to reduce the number of failure scenarios to something that's easy to reason about.
There are scenarios other than distributed rollbacks that lead to durable execution; it turns out to be a useful and flexible model for program state. But this is a common one.
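For illustration, a stripped-down sketch of the kind of orchestrator function being described, with made-up service calls: each successful upstream step registers a compensation, and a downstream failure triggers the compensations in reverse order. A durable execution engine's job is to make this function's progress (including the rollback) survive crashes.

```python
# Toy saga-style orchestrator across two services. The service calls and
# compensations are placeholders; real ones would be RPC/HTTP calls.
def reserve_inventory(order_id: str) -> None:
    print(f"inventory reserved for {order_id}")

def release_inventory(order_id: str) -> None:
    print(f"inventory released for {order_id}")   # compensation for the above

def charge_payment(order_id: str) -> None:
    raise RuntimeError("payment service is down")  # simulate a downstream failure

def orchestrate_order(order_id: str) -> None:
    compensations = []  # undo actions for every step that has already succeeded
    try:
        reserve_inventory(order_id)
        compensations.append(lambda: release_inventory(order_id))
        charge_payment(order_id)
    except Exception:
        # Distributed rollback: undo completed upstream steps, most recent first.
        for undo in reversed(compensations):
            undo()
        raise

try:
    orchestrate_order("order-42")
except RuntimeError as exc:
    print(f"order failed and was rolled back: {exc}")
```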
Hah, well, I'll avoid _talking to_ vendors; more specifically, I'll avoid talking to salespeople selling a technical product until we're pretty deep into the product. I do tend not to use vendors that don't have a good self-serve path or a mechanism to get my technical questions answered.
> What we need is an opinionated framework that doesn't allow you to do anything except durable workflows
Please steal my startup.