lelandbatey 3 days ago

While this does talk about durable queues, this post really hinges on a pivot to evangelizing durable workflows. It makes it sound like they're synonymous, when they're very much not. Specifically, this sentence bringing up workflows for the first time (emphasis mine):

> "Durable queues were rare when I was at Reddit, but they’re more and more popular now. Essentially, they work by combining task queues with durable workflows, helping you reliably orchestrate workflows of many parallel tasks."

That makes it sound like task queues + durable workflows = durable queues, but that's not true at all. A durable queue is literally a queue that doesn't drop messages e.g. during an unexpected shutdown. That's all. Durable workflows are a pretty different thing. A durable queue could be used just like a normal queue, and while you can't build a durable workflow on a normal queue (or at least, it would be a huge pain), a durable queue makes it vastly simpler to build a durable workflow engine. I think the article talks about durable workflows because this is DBOS, a company looking to sell durable workflow services, but also because durable workflows are considered by many to be a kind of "holy grail" of big-business applications, since they seem to let you write code that's kind of "always running", where the state in memory is persisted to a DB invisibly so that you have to think less about CRUD. The killer app of durable workflows seems to me to be writing orchestration code for really long-running processes which have to do lots of distributed stuff, as it lets you write mostly normal-looking code which does things like "wait for this thing to finish, even if that thing will be finished in a week", which is a pretty cool thing to see.

What are durable workflows? On the technical side, I'd describe durable workflows as a system of cooperative multitasking where you serialize your state/inputs/outputs to a durable store at each yield/suspension point. Since you're tracking state at yield points and not at the individual instruction level, the workflow engine tracks work state less granularly than traditional single-process computing. Due to this coarser tracking of units of execution, I think of durable workflows more like async runtimes which serialize their progress. The hidden downside is that you have to write odd-looking code to fit into that custom async runtime. For example, since the unit of execution is coarse, you have to assume the code between checkpoints could be run multiple times if e.g. a worker gets halfway through executing the next "chunk" but shuts down unexpectedly before finishing. Thus you have to assume at-least-once execution instead of our typical "exactly once" execution when thinking about single lines of code. Additionally, while some languages are built to support custom async runtimes, even the ones which do don't have enough flexibility to provide language-level support for the extremely weird distributed async runtimes you'd need to build a durable workflow engine. Because of that, once you get down to it you're basically going to have to build your code out of callbacks that you register with the custom workflow-engine library of the provider you're using. This is the biggest wart of building on durable workflow platforms, as they pretty much all have you write code that looks like this:

    function do_thing(foo): bar {} // turns foo into bar, w/ side effects
    var result bar // result is type bar
    result = workflow.execute(do_thing, fooInput)

    // In "normal" non-durable-workflow code
    // you'd instead just call do_thing() like:
    result = do_thing(fooInput)
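
To make the at-least-once point concrete, here's a hypothetical step in Go (all names made up by me) showing why the code between checkpoints has to tolerate re-execution:

    // Hypothetical step body: the engine checkpoints the step's RESULT, not
    // individual lines. If the worker crashes after sendEmail succeeds but
    // before the return value is persisted, recovery re-runs the whole
    // function and the email goes out twice. So steps must tolerate
    // at-least-once execution: be idempotent, or dedupe on the receiving side.
    package sketch

    import "context"

    // sendEmail stands in for any non-idempotent side effect.
    func sendEmail(ctx context.Context, to string) error { return nil }

    func notifyStep(ctx context.Context) (string, error) {
        if err := sendEmail(ctx, "user@example.com"); err != nil {
            return "", err
        }
        // ...a crash right here means the checkpoint below is never recorded...
        return "sent", nil
    }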

That's another detail left out of the parent article: there are actually a ton of these durable workflow platforms, of which DBOS is only one. I think the biggest in the online space is probably Temporal (the one I'm currently using at $DAYJOB), but there are others as well. Here's a short list.

- Temporal https://temporal.io/

- DBOS https://www.dbos.dev/

- Inngest https://www.inngest.com/uses/durable-workflows

- Restate https://restate.dev/

Anyway, thanks for coming to my TED talk, I hope you've learned about this fascinating developing corner of software, and I can't wait for someone to build first-party language support for pluggable durable execution runtimes into languages we like. Then we can get rid of callback nonsense and start a whole NEW hype cycle around this technology!

jedberg 3 days ago | parent | next [-]

> I think the biggest in the online space is probably Temporal (the one I'm currently using at $DAYJOB), but there are others as well.

The reason none of the others were mentioned is that they all work very differently from DBOS. All of the others require an external durability coordinator, and require you to rewrite your application to work around how they operate.

DBOS is a library that does its durability work in process and uses the application database to store the durability state. This means the latency is much smaller, and the reliability is much higher because there aren't extra moving parts in the critical path that can go down.

Here is a page about this difference: https://docs.dbos.dev/architecture
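
To make that concrete, here's a toy sketch of the in-process idea; this is not DBOS's actual schema or API, just the general shape of checkpointing into the application's own database:

    // Toy illustration only: checkpoint each step's output into the app's own
    // Postgres and reuse it on recovery, so no external coordinator sits in
    // the critical path.
    package sketch

    import "database/sql"

    // runStepOnce executes fn only if no checkpoint exists for (workflowID, step).
    func runStepOnce(db *sql.DB, workflowID string, step int, fn func() (string, error)) (string, error) {
        var out string
        err := db.QueryRow(
            `SELECT output FROM step_checkpoints WHERE workflow_id = $1 AND step = $2`,
            workflowID, step).Scan(&out)
        if err == nil {
            return out, nil // step finished before a crash; reuse its result
        }
        if err != sql.ErrNoRows {
            return "", err
        }
        out, err = fn()
        if err != nil {
            return "", err
        }
        _, err = db.Exec(
            `INSERT INTO step_checkpoints (workflow_id, step, output) VALUES ($1, $2, $3)`,
            workflowID, step, out)
        return out, err
    }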

lelandbatey 3 days ago | parent [-]

That's a very appealing approach from a developer ergonomics perspective; it'd be very nice to only have to deploy your own application and not also deploy a coordinator.

You mention that you don't have to rewrite your application to work around how DBOS operates. That seems somewhat true, but I think DBOS still requires folks to rewrite their code around a custom runtime. Looking at the Python code on your home page, it seems like you're leveraging Python's decorators to make the "glue code" less prominent (registering functions with the async executor, telling the async system to invoke certain registered functions), but the glue code is still there. If I go look at the DBOS library for Golang[1] for example, since Golang doesn't have decorators in the same way Python does, we still have to write code in the kind of "manual callback" style I mentioned:

    // code is massively paraphrased for brevity, err checks removed
    func workflow(dbosCtx dbos.DBOSContext, _ string) (string, error) {
        _, _ = dbos.RunAsStep(dbosCtx, func(ctx context.Context) (string, error) { return stepOne(ctx) })
        return dbos.RunAsStep(dbosCtx, func(ctx context.Context) (string, error) { return stepTwo(ctx) })
    }
    func main() {
        // Initialize a DBOS context
        dctx, err := dbos.NewDBOSContext(dbos.Config{ DatabaseURL: "...", AppName: "myapp", })
        // Register a workflow
        dbos.RegisterWorkflow(dctx, workflow)

        // Launch DBOS
        err = dctx.Launch()
        defer dctx.Cancel()

        // Run a durable workflow and get its result
        handle, err := dbos.RunWorkflow(dctx, workflow, "")
        res, err := handle.GetResult()
        fmt.Println("Workflow result:", res)
    }

I don't think that's a bad thing though; I think that's a good thing. I feel like positioning DBOS as a _library_ is an excellent choice; it's a huge ergonomics improvement. The choices so far seem like you're trying to make DBOS easy to adopt via appropriate amounts of convenience features, but not so much automagic that we-the-devs can't reason about what's going on. With developer reasoning in mind, I have some more questions for you!

In the architecture page you linked[2], you talk about versioning. Versioning with durable workflows is one of those super-annoying things which affect the entire paradigm, albeit only once you've already adopted the tech and start having to change/evolve/maintain workflows. In that doc, you say that with DBOS each application will only work on workflows started by application versions which match the current application version. For completing long-running workflows, the page says:

> To safely recover workflows started on an older version of your code, you should start a process running that code version.

Since the killer apps of durable workflows are, as I mentioned, typically long-running jobs, do you have any products/advice/documentation for this pattern of running multiple application versions, and how one might approach implementing this practice? If we're writing code which takes a week to complete and may exit and recover many times before finally completing, do you have advice on how to keep each version deployed till all the work for a version is completed? Looking at Temporal, their Worker versioning scheme offers ways for users to look this information up in Temporal, but not much guidance on actually implementing the pattern. Looking at the DBOS docs about versioning, I see information about getting this information via e.g. Conductor, but I do not see any info about actually implementing multiple-concurrent-worker-version deployment (which Temporal calls "rainbow deployments"). Is version management something y'all are thinking about improving the ergonomics of, in the same way you improved ergonomics by bringing the executor in-process?

Speaking of versioning, how does DBOS handle bugfix versions? Say you deploy a version A, but A has a bug in it. You would like to make the fix and deploy it as version B, then, ideally, run the remaining workflows for version A using the code in version B. It seems like "version forking"[3] is the only way to do this, but it also seems like a special operation that cannot be done via a code change; it must be done via the Conductor administration UI. Is there no way to do in-code version patching[4] like is done in Temporal?

Finally, what are the limits to usage of DBOS? As in, where does DBOS start to fall down? Are there guidelines on the maximum number of steps in a workflow before things start to get tricky? What about the maximum serialized size of the workflow/step parameters? I've been unable to find any of that information on your website.

Thanks for making such an interesting piece of technology, and thanks for answering questions!

[1] - https://github.com/dbos-inc/dbos-transact-golang

[2] - https://docs.dbos.dev/architecture

[3] - https://docs.dbos.dev/production/self-hosting/workflow-manag...

[4] - https://docs.temporal.io/develop/go/versioning#patching

qianli_cs 3 days ago | parent [-]

Those are great questions!

For versioning, we recommend keeping each version running until all workflows on that version are done. It's similar to a blue-green deployment: each process is tagged with one version, and all workflows in it share that version. You can list pending/enqueued workflows on the old version (UI or list_workflow programmatic API), and once that list drains, you can shut down the old processes. DBOS Cloud automates this, and we'll add more guidance for self-hosting.
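
In Go, that drain check might look roughly like this (the function and option names here are illustrative, not necessarily the exact API; check the docs for the real calls):

    // Illustrative sketch only: ListWorkflows, WithAppVersion, WithStatus, and
    // StatusPending are placeholder names, not necessarily the real Go API.
    func drainOldVersion(dctx dbos.DBOSContext, oldVersion string) error {
        for {
            pending, err := dbos.ListWorkflows(dctx,
                dbos.WithAppVersion(oldVersion),     // workflows pinned to the old code
                dbos.WithStatus(dbos.StatusPending)) // still running or enqueued
            if err != nil {
                return err
            }
            if len(pending) == 0 {
                return nil // drained: safe to shut down the old processes
            }
            time.Sleep(30 * time.Second) // poll until the old version finishes its work
        }
    }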

For bugfixes, DBOS supports programmatic forking and other workflow management tools [1]. We deliberately don't support code patching because it's fragile and hard to test. For example, patches can pile up on long-running workflows and make debugging painful.
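
As a hypothetical sketch of that fix-forward flow (the names ForkWorkflow, WithStartStep, and WithForkVersion are illustrative; see [1] for the real API):

    // Hypothetical sketch: re-run a buggy workflow from a known-good step on
    // the patched code version. Names are illustrative, not the exact API.
    handle, err := dbos.ForkWorkflow(dctx, buggyWorkflowID,
        dbos.WithStartStep(3),             // resume from the last good checkpoint
        dbos.WithForkVersion("v2-bugfix")) // run the remaining steps on fixed code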

The main limit is the database (whose size you control). DBOS writes workflow inputs, step outputs, and workflow outputs to it. There's no step limit beyond disk space. Postgres/SQLite allow up to 1 GB per field, but keeping inputs/outputs under ~2 MB helps performance. We'll add clearer guidelines to the docs.

Thanks again for all the thoughtful questions!

[1] https://docs.dbos.dev/python/reference/contexts#fork_workflo...

qianli_cs 3 days ago | parent | prev | next [-]

Thanks for sharing your insights! You nailed the key tradeoffs of most durable workflow systems. The callback-style programming model is exactly the pain point we aim to solve with DBOS.

Instead of forcing you into a custom async runtime, DBOS lets you keep writing normal functions (this is an example in Python):

    @DBOS.workflow()
    def do_thing(foo):
        return bar

    # You can still call the workflow function like this:
    result = do_thing(fooInput)

Under the hood, DBOS checkpoints inputs/outputs so it can recover after failure, but you don't have to restructure your code around callbacks. In Python and Java we use decorators/annotations so registration feels natural, while in Go/TypeScript there's a lightweight one-time registration step. Either way, you keep the synchronous call style you'd expect.

On top of that, DBOS also supports running workflows asynchronously or through queues, so you can start with a simple function call and later scale out to async/queued execution without changing your code. That's what the article was leading into.
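
In Go, the queued version might look roughly like this (paraphrasing the Go SDK; check the docs for the exact names):

    // Roughly how the same workflow scales out through a queue in Go; the
    // queue helpers here are paraphrased and may not match the SDK exactly.
    queue := dbos.NewWorkflowQueue(dctx, "report-queue") // durable, DB-backed queue
    handle, err := dbos.RunWorkflow(dctx, workflow, "input",
        dbos.WithQueue(queue.Name)) // enqueue instead of running inline
    res, err := handle.GetResult()  // block until the queued run completes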

lelandbatey 3 days ago | parent [-]

I think your use of Python decorators is a big usability improvement, though my point stands that the glue is still there. You mention that in Go there's "a lightweight one-time registration step", but it seems like in addition to calling the registration steps, you also have to use `dbos.RunAsStep()` when calling sub-steps of a workflow, which is almost identical to the Temporal Golang SDK, which has you call `workflow.ExecuteActivity()`.
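
For reference, the equivalent glue in Temporal's Go SDK looks roughly like this (paraphrased; exact options vary):

    // Rough shape of the same pattern in Temporal's Go SDK: activities are
    // invoked through the workflow context rather than called directly.
    package sketch

    import (
        "time"

        "go.temporal.io/sdk/workflow"
    )

    // MyActivity would be registered with a Temporal worker elsewhere.
    func MyActivity(input string) (string, error) { return input, nil }

    func MyWorkflow(ctx workflow.Context, input string) (string, error) {
        ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
            StartToCloseTimeout: time.Minute, // activities must set a timeout
        })
        var result string
        err := workflow.ExecuteActivity(ctx, MyActivity, input).Get(ctx, &result)
        return result, err
    }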

Can you explain what makes DBOS better to use in Golang vs Temporal?

hmaxdml 3 days ago | parent | next [-]

👋 Hey there, I'm working on the Go library and just wanted to confirm your suspicion:

"since Golang doesn't have decorators in the same way Python does, we still have to have code doing the kind of "manual callback" style I mentioned"

That's exactly right, specifically for steps. We considered other ways to wrap the workflow calls (so you don't have to do dbos.RunWorkflow(yourFunction)), but they got in the way of providing compile-time type checking.

As Qian said, under the hood the Golang SDK is an embedded orchestration package that just requires Postgres to automate state management.

For example, check the RunWorkflow implementation: https://github.com/dbos-inc/dbos-transact-golang/blob/0afae2...

It does all the durability logic in-line with your code and doesn't rely on an external service.

Thanks for taking the time to share your insights! This was one of the most interesting HN comments I've seen in a while :)

qianli_cs 3 days ago | parent | prev [-]

The main advantage is the same architectural benefit DBOS provides in other languages: you only need to deploy your application, so there's no separate coordinator to run. All functionality (checkpointing, durable queues, notification/signaling, etc) is built directly into the Go package on top of the database.

vjerancrnjak 3 days ago | parent | prev | next [-]

Any programming language with an effect system could do that as well.

Recently discussed examples include OCaml's effect system and the Flix programming language.

dirkc 3 days ago | parent | prev [-]

Thanks. I found that very informative!

I also now have the dreadful notion of debugging a non-deterministic deadlock or race condition in a workflow that takes a week to run!

jedberg 3 days ago | parent [-]

The good news is that with durable queues and workflows, you get all the observability you need to make debugging even long-running workflows pretty straightforward!

Also, check out the sibling comment for more information about durability.