How to think about durable execution

(hatchet.run)

85 points | by abelanger 7 days ago

11 comments

  • coreylane 2 hours ago
    I've also been struggling to wrap my head around the durable execution hype and whether my workload would benefit; maybe after sleeping on this post it will be clearer.

    A bit off-topic, but I recently switched from Celery to Hatchet. I haven't even fully explored everything it can do, but the change has already made me a fan. Overall it has simplified my stack and made several features easier to implement.

    Some takeaways from my experience:

    1. Streaming — My application provides real-time log streaming to users (similar to GitHub Actions or AWS CodeBuild). With Celery, I had to roll my own solution using Supabase Realtime. Hatchet’s streaming is straightforward: my frontend now connects to a simple SSE endpoint in my API that forwards the Hatchet stream (see the sketch at the end of this list).

    2. Dynamic cron scheduling — Celery requires a third-party tool like RedBeat for user-defined schedules. Hatchet supports this natively.

    3. Logs — Hatchet isolates logs per task out of the box, which is much easier to work with.

    4. Worker affinity — Hatchet’s key-value tags on workers and workflows allow dynamic task assignment based on worker capabilities. For example, a customer requiring 10 Gbps networking can have tasks routed to workers tagged {'network_speed': 10}. This would have required custom setup in Celery.

    5. Cancellation — Celery has no graceful way to cancel in-flight tasks without risking termination of the entire worker process (Celery docs note that terminate=True is a “last resort” that sends SIGTERM to the worker). Hatchet handles cancellation more cleanly.
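
    To make the streaming point (item 1) concrete, here's a rough sketch of the forwarding pattern in Python with FastAPI. The `subscribe_to_run_stream` helper is a hypothetical stand-in for whatever stream-subscription call your Hatchet SDK version exposes; the rest is standard SSE plumbing.

      from fastapi import FastAPI
      from fastapi.responses import StreamingResponse

      app = FastAPI()

      async def subscribe_to_run_stream(run_id: str):
          # Hypothetical placeholder: real code would subscribe to the Hatchet
          # run's stream and yield chunks as the task emits them.
          yield "log line 1"
          yield "log line 2"

      @app.get("/runs/{run_id}/logs")
      async def stream_logs(run_id: str):
          async def event_source():
              # Re-emit each chunk from the task's stream as a server-sent event.
              async for chunk in subscribe_to_run_stream(run_id):
                  yield f"data: {chunk}\n\n"

          return StreamingResponse(event_source(), media_type="text/event-stream")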

  • teeray 5 hours ago
    > This task is not easily idempotent; it involves writing a ton of intermediate state and queries to determine that a step should not be repeated

    The problem with durable execution is that your entire workflow still needs to be idempotent. Consider that each workflow is divided into a sequence of steps that amount to: 1) do work, 2) record the fact that the work was done. If 2) never happens because the worker falls over, you must repeat 1). Therefore, for each step, "doing work" happens at least once. Given that steps compose and each executes at least once, it follows that the entire workflow executes at least once. Because it doesn't execute exactly once, everything you write in a durable execution engine must be idempotent.

    At that point, the only thing the durable execution engine is buying you is an optimization against re-running some slow tasks. That may be valuable in itself. However, it doesn't change anything about the good practices for writing async worker tasks.
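
    A bare-bones illustration of that step shape (all names made up):

      def run_step(step_id, already_done, do_work, mark_done):
          # The at-least-once argument in miniature: if the process dies after
          # do_work() but before mark_done(), recovery sees the step as not
          # done and runs do_work() again, so do_work() must tolerate repeats.
          if already_done(step_id):
              return
          do_work()            # 1) do work (may end up running more than once)
          mark_done(step_id)   # 2) record the fact that the work was done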

    • maxmcd 5 hours ago
      I think a lot of the original Temporal/Cadence authors were motivated by working on event-driven systems with retries. Those systems exhibited complex failure scenarios that they could not reasonably account for without slapping on more supervisor systems. Durable execution gives you a consistent viewpoint from which to think about failures.

      I agree that determinism/idempotency and the complexities of these systems are a tough pill to swallow. They certainly need to be suited to the task.

    • kodablah 5 hours ago
      > that your entire workflow still needs to be idempotent

      If you just mean the workflow logic: as the article mentions, it has to be deterministic, which implies idempotency, but that's fine because workflow logic doesn't have side effects. The side-effecting functions invoked from a workflow (what Temporal dubs "activities") of course _should_ be idempotent so they can be retried upon failure, as is the case for all retryable code, but this is not a requirement: these side-effecting functions can be configured at the call site to have at-most-once semantics.

      In addition to many other things like observability, the value of durable execution is that advanced logic (loops, try/catch, concurrent async ops, sleeping, etc.) is persisted and made crash-proof, i.e. it resumes from where it left off on another machine.
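
      A minimal sketch of what that looks like with the Temporal Python SDK (the activity body is a stand-in for a real external call):

        from datetime import timedelta
        from temporalio import activity, workflow
        from temporalio.common import RetryPolicy

        @activity.defn
        async def charge_card(order_id: str) -> str:
            # Side-effecting code lives here; a real implementation would call
            # an external payment API.
            return f"charged:{order_id}"

        @workflow.defn
        class OrderWorkflow:
            @workflow.run
            async def run(self, order_id: str) -> str:
                # Deterministic orchestration only: no I/O in the workflow itself.
                # maximum_attempts=1 gives this side effect at-most-once semantics;
                # the default policy would retry it, which is why activities
                # *should* be idempotent but don't have to be.
                return await workflow.execute_activity(
                    charge_card,
                    order_id,
                    start_to_close_timeout=timedelta(seconds=30),
                    retry_policy=RetryPolicy(maximum_attempts=1),
                )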

    • jedberg 4 hours ago
      > The problem with durable execution is that your entire workflow still needs to be idempotent.

      Yes, but what that means depends on your durability framework. For example, the one that my company makes can use the same database for both durability and application data, so updates to application data can be wrapped in the same database transaction as the durability update. This means "the work" isn't done unless "recording the work" is also done. It also means they can be undone together.
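
      A rough sketch of that pattern with plain psycopg (table names and schema are made up, and the framework in question may do this differently under the hood):

        import psycopg

        def complete_step(conn: psycopg.Connection, run_id: str, step: int, order_id: str):
            # The application write and the durability record commit (or roll
            # back) together, because they live in the same database and the
            # same transaction.
            with conn.transaction():
                conn.execute(
                    "UPDATE orders SET status = 'paid' WHERE id = %s",
                    (order_id,),
                )
                conn.execute(
                    "INSERT INTO workflow_steps (run_id, step, output) VALUES (%s, %s, %s)",
                    (run_id, step, "paid"),
                )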

      • nightpool 4 hours ago
        A lot of work can be wrapped inside a database transaction, but never everything. You're always going to want to interact with external APIs eventually.
        • jedberg 3 hours ago
          Yes of course. External calls still need to be idempotent. But the point is some frameworks allow you to make some, or even most, of your work safe for durable execution by default.
      • teeray 2 hours ago
        > This means "the work" isn't done unless "recording the work" is also done. It also means they can be undone together.

        That's just another way of saying that the step in question is idempotent.

        • jedberg 1 hour ago
          No, it's different. Idempotent would mean the step can be replayed with no additional effect. What I'm saying is that this guarantees exactly-once execution, by taking advantage of database transactions to make multiple data updates atomic together.
  • dminor 5 hours ago
    We recently started using DBOS for durable execution - it's much easier to integrate than Temporal and it Just Uses Postgres(tm), which is nice.
    • smurda 3 hours ago
      I've been using Restate for durable execution. The TypeScript SDK was easy to use.
    • MasterJJ 4 hours ago
      Have you compared it with LittleHorse.io? Seems like that would be durable and easier for workflows, retries, etc.
    • nzoschke 4 hours ago
      Another happy DBOS user here. It slides right into our existing Postgres usage and has a simple Go SDK.
  • vouwfietsman 6 hours ago
    For me the main issue with these systems is that durable execution is still treated as a special case of backend execution. I think the real value is in admitting that every POST/PUT should kick off a durable execution, but that doesn't seem to match the design, which treats these workflows as quite heavy and expensive and bases its pricing on them.

    What we need is an opinionated framework that doesn't allow you to do anything except durable workflows, so your junior devs stop doing two POSTs in a row thinking things will be OK.

    • abelanger 4 hours ago
      The "constraining functions to only be durable" idea is really interesting to me and would solve the main gotcha of the article.

      It'd be an interesting experiment to take memory snapshots after each step in a workflow, which an API like Firecracker's might support, but that likely adds even more overhead than current engines in terms of expense and storage. I think some durable execution engines have experimented with this type of system before, but I can't find a source now - perhaps someone has a link to one of these.

      There's also been some work, for example in the Temporal Python SDK, to override the asyncio event loop so that regular calls like `sleep` work as durable calls instead, which reduces the risk to developers. I'm not sure how well this generalizes.
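
      As I understand that mechanism, a minimal sketch with the Temporal Python SDK reads like ordinary asyncio code, with the SDK's custom event loop backing the sleep with a durable timer:

        import asyncio
        from temporalio import workflow

        @workflow.defn
        class ReminderWorkflow:
            @workflow.run
            async def run(self) -> None:
                # Inside a workflow this is not an in-process sleep: the SDK's
                # event loop records a durable timer, so if the worker crashes
                # mid-wait, the remaining sleep resumes on another worker
                # during replay.
                await asyncio.sleep(24 * 60 * 60)
                # ...then send the reminder via an activity here...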

      • vouwfietsman 3 hours ago
        OK, I'm not an expert here (you most likely are), but just my 2 cents on your response: I would very much argue against making this magic. E.g.:

        > take memory snapshots after each step in a workflow

        Don't do this. Just give people explicit boundaries for where their snapshots occur, and what is snapshotted, so they have control over both durability and performance. Make it clear to people that everything should be in the chain of command of the snapshotting framework: e.g. no file-local or global variables. This is already how people program web services, but somehow nobody leans into it.

        The thing is, if you want people to understand durability but you also hide it from them, the framework will actually be much more complicated to understand and work with.

        The real golden ticket I think is to make readable intuitive abstractions around durability, not hide it behind normal-looking code.

        Please steal my startup.

        • vouwfietsman 3 hours ago
          Just to continue the idea: you wouldn't be constraining or tagging functions, you would relinquish control to a system that closely guards how you produce side effects. E.g. doing a raw HTTP request from a task is prohibited, not intercepted.
    • pests 2 hours ago
      Doesn't Google have a similar type of system for stuff like this? I recall an old engineering blog post or similar that detailed how they handled this at scale.
    • Kinrany 2 hours ago
      This would look like a handler taking an IO token that provides a memoizing get_or_execute function, plus utilities for calling these handlers, correct?
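
      Something like this sketch, maybe (entirely hypothetical API, with an in-memory dict standing in for durable storage):

        from typing import Awaitable, Callable, Dict, Tuple

        class IO:
            # Hypothetical token passed to a handler: each named side effect is
            # memoized per run, so a replayed handler gets the recorded result
            # instead of re-running the effect.
            def __init__(self, run_id: str, store: Dict[Tuple[str, str], object]):
                self.run_id = run_id
                self.store = store

            async def get_or_execute(self, name: str, effect: Callable[[], Awaitable]):
                key = (self.run_id, name)
                if key not in self.store:
                    self.store[key] = await effect()
                return self.store[key]

        async def charge_card(user_id: str) -> str:
            return f"charge-{user_id}"   # stand-in for a real payment call

        async def send_receipt(user_id: str, charge: str) -> None:
            pass                         # stand-in for a real email call

        async def handler(io: IO, user_id: str):
            charge = await io.get_or_execute("charge", lambda: charge_card(user_id))
            await io.get_or_execute("email", lambda: send_receipt(user_id, charge))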
  • strken 5 hours ago
    I still don't get it. External API calls aren't deterministic. You're still left with the choice between making them idempotent or potentially performing an action twice (or more!), and I don't see how durable execution helps you.
    • jdpedrie 5 hours ago
      Those sorts of flaky or non-deterministic steps are written as activities, not as part of the deterministic workflow. The orchestrator will retry the non-deterministic activity until it gets a usable output (an expected error or a success) and record the activity's output. If the workflow replays (e.g. after a worker crash), the recorded output of the activity will be returned instead of executing the activity again.
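
      A toy version of that record-and-replay behavior (nothing Temporal-specific, just the shape of it):

        class Replayer:
            # On the first run an activity is executed and its output appended
            # to the history; on replay the recorded outputs are returned in
            # order, so the activity is not executed again.
            def __init__(self, history: list):
                self.history = history
                self.position = 0

            async def execute_activity(self, fn, *args):
                if self.position < len(self.history):
                    result = self.history[self.position]  # replay: reuse recorded output
                else:
                    result = await fn(*args)              # first run: execute the activity
                    self.history.append(result)           # ...and record its output
                self.position += 1
                return result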
  • chuckhend 6 hours ago
    Helpful read, thanks for sharing! We have been (slowly) working on some fairness-related queueing features over in pgmq for a bit too: https://github.com/pgmq/pgmq/pull/442. It does get complicated on Postgres IMO.
  • willcodeforfoo 5 hours ago
    Unrelated, but gorgeous website!
  • maxmcd 5 hours ago
    Great article for demystifying durable execution: https://lucumr.pocoo.org/2025/11/3/absurd-workflows/
  • cammil 3 hours ago
    I don't see how the frameworks solve anything more than organising some tasks in a sequence. The underlying tasks still have to be idempotent.

    Please do refute. I'm genuinely interested in this problem as I deal with it daily.

    • abelanger 3 hours ago
      Sure, I'll bite. Task-level idempotency is not the problem that durable execution platforms are solving. The core problem is the complexity that arises when one part of your async job becomes distributed: the two common forms are distributed runtime (compute) and distributed application state.

      Let's just take the application-state side. If your entire async job can be modeled as a single database transaction, you don't need a durable execution platform, you need a task queue with retries. Our argument at Hatchet is that this covers many (perhaps most) async workloads, which is why the durable task queue is the primary entrypoint to Hatchet and durable execution is only a feature for more complex workloads.

      But once you start to distribute your application state - for example, different teams building microservices which don't share the same database - you have a new set of problems. The most difficult edge case here is not the happy path with multiple successful writes; it's distributed rollbacks: a downstream step fails and you need to undo the upstream step in a different system. In these systems, you usually introduce an "orchestrator" task which catches failures and figures out how to roll the system back in the right way.

      It turns out these orchestrator functions are hard to build, because the failure scenarios are many. This is why durable execution platforms place some constraints on the orchestrator function, like determinism, to reduce the number of failure scenarios to something that's easy to reason about.

      There are scenarios other than distributed rollbacks that lead to durable execution; it turns out to be a useful and flexible model for program state. But this one is common.
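
      As a sketch of the distributed-rollback shape (the service clients here are hypothetical stand-ins for other teams' APIs):

        import asyncio

        class StubService:
            # Hypothetical stand-in for another team's service; each call would
            # really be a network request to a system with its own database.
            async def call(self, action: str, order_id: str) -> None:
                print(f"{action}({order_id})")

        inventory, payments, shipping = StubService(), StubService(), StubService()

        async def place_order(order_id: str) -> None:
            # Toy orchestrator: when a downstream step fails, upstream steps in
            # other systems are undone with explicit compensating calls. A
            # durable execution engine's job is to make this function itself
            # survive crashes, so the compensation list isn't lost halfway through.
            compensations = []
            try:
                await inventory.call("reserve", order_id)
                compensations.append(lambda: inventory.call("release", order_id))

                await payments.call("charge", order_id)
                compensations.append(lambda: payments.call("refund", order_id))

                await shipping.call("schedule", order_id)
            except Exception:
                for undo in reversed(compensations):
                    await undo()
                raise

        if __name__ == "__main__":
            asyncio.run(place_order("order-123"))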

  • phrotoma 7 hours ago
    Ballsy for a founder to come right out and (roughly) say "yeah anyway fuck vendors" on the corporate blog. Points for honesty.
    • abelanger 7 hours ago
      Hah, well, I'll avoid _talking to_ vendors; more specifically, I'll avoid talking to salespeople selling a technical product until we're pretty deep into the product. I do tend not to use vendors that don't have a good self-serve path or a mechanism to get my technical questions answered.
  • immibis 7 hours ago
    Isn't durable execution just another one of these frameworks that promises to make everything easy if you reorganise all your code into the framework?