lelandbatey 3 days ago
While this does talk about durable queues, this post really hinges on a pivot to evangelizing durable workflows. It makes it sound like they're synonymous, when they're very much not. Specifically, this sentence bringing up workflows for the first time (emphasis mine):

> "Durable queues were rare when I was at Reddit, but they’re more and more popular now. Essentially, they work by combining task queues with durable workflows, helping you reliably orchestrate workflows of many parallel tasks."

That makes it sound like task queues + durable workflows = durable queues, but that's not true at all. A durable queue is literally a queue that doesn't drop messages, e.g. during an unexpected shutdown. That's all. Durable workflows are a pretty different thing. A durable queue can be used just like a normal queue, but while you can't build a durable workflow on a normal queue (or at least, it would be a huge pain), a durable queue makes it vastly simpler to build a durable workflow engine.

I think the article talks about durable workflows because this is DBOS, a company looking to sell durable workflow services, but also because durable workflows are considered by many to be a kind of "holy grail" of big-business applications, as they seem to let you write code that's kind of "always running", where the state in memory is persisted to a DB invisibly so that you have to think less about CRUD. The killer app of durable workflows seems to me to be writing orchestration code for really long-running processes which have to do lots of distributed stuff, as it allows you to write mostly normal-looking code which does things like "wait for this thing to finish, even if that thing will be finished in a week", which is a pretty cool thing to see.

What are durable workflows?
On the technical side, I'd describe durable workflows as being more like a system of cooperative multitasking where you serialize your state/inputs/outputs to a durable store at each yield/suspension point. Since you're tracking state at yield points and not at the individual instruction level, the workflow engine tracks work state less granularly than traditional single-process computing. Due to the coarser tracking of units of execution, I think of durable workflows more like async runtimes which serialize their progress.

The hidden downside to durable workflows, then, is that you have to write odd-looking code to fit into that custom async runtime. For example, since the unit of execution is coarse, you have to assume the code between the checkpoints could potentially be run multiple times if e.g. a worker only gets halfway through executing the next "chunk" but shuts down unexpectedly before finishing. Thus you have to assume at-least-once execution instead of our typical "exactly once" execution when thinking about single lines of code.

Additionally, while some languages are built to support custom async runtimes, even those don't have enough flexibility to allow language-level support for the extremely weird distributed async runtimes you'd need to build a durable workflow engine. Because of that, once you get down to it you're basically going to have to build your code out of callbacks that you register with the custom workflow-engine library of the provider you're using. This is the biggest wart of building on durable workflow platforms, as they pretty much all have you write code that looks like this:
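A minimal sketch of that callback-registration style, assuming a hypothetical `ToyEngine` (not the API of Temporal, DBOS, or any other real provider) with a plain dict standing in for the durable store. The names `register_step`, `run_step`, `charge`, and `ship` are all invented for illustration. It also shows the at-least-once behavior: a step that crashes before its output is checkpointed runs again on replay, while a step that was already checkpointed does not.

```python
import json


class ToyEngine:
    """Hypothetical workflow engine; a dict stands in for the durable store."""

    def __init__(self, store):
        self.store = store   # would be a database in a real engine
        self.steps = {}      # step name -> registered callback

    def register_step(self, name, fn):
        self.steps[name] = fn

    def run_step(self, workflow_id, name, *args):
        key = f"{workflow_id}:{name}:{json.dumps(args)}"
        if key in self.store:
            # Checkpoint hit: return the saved output instead of re-running.
            return json.loads(self.store[key])
        # If the worker dies in here, nothing is saved and the step runs again.
        result = self.steps[name](*args)
        self.store[key] = json.dumps(result)
        return result


calls = {"charge": 0, "ship": 0}


def charge(order_id):
    calls["charge"] += 1
    return {"order": order_id, "charged": True}


def ship(order_id):
    calls["ship"] += 1
    if calls["ship"] == 1:
        raise RuntimeError("worker died mid-step")  # simulated crash
    return {"order": order_id, "shipped": True}


def order_workflow(engine, workflow_id, order_id):
    # The workflow body only calls steps through the engine.
    engine.run_step(workflow_id, "charge", order_id)
    return engine.run_step(workflow_id, "ship", order_id)


store = {}
engine = ToyEngine(store)
engine.register_step("charge", charge)
engine.register_step("ship", ship)

try:
    order_workflow(engine, "wf-1", 42)
except RuntimeError:
    pass  # the worker crashed; a restart replays the workflow from the top

result = order_workflow(engine, "wf-1", 42)
```

After the replay, `charge` has run once (its checkpoint was reused) and `ship` has run twice, which is why step bodies need to be idempotent or otherwise safe to repeat.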
That's another detail left out of the parent article: there are actually a ton of these durable workflow platforms, of which DBOS is only one. I think the biggest in the online space is probably Temporal (the one I'm currently using at $DAYJOB), but there's others as well. Here's a short list:

- Temporal https://temporal.io/
- DBOS https://www.dbos.dev/
- Inngest https://www.inngest.com/uses/durable-workflows
- Restate https://restate.dev/

Anyway, thanks for coming to my TED talk, I hope you've learned about this fascinating developing corner of software, and I can't wait for someone to build first-party language support for pluggable durable execution runtimes into languages we like. Then we can get rid of callback nonsense and start a whole NEW hype cycle around this technology!
jedberg 3 days ago
> I think the biggest in the online space is probably Temporal (the one I'm currently using at $DAYJOB), but there's others as well.

The reason none of the others were mentioned is that they all work very differently than DBOS. All of those others require an external durability coordinator, and require you to rewrite your application to work around how they operate. DBOS is a library that does its durability work in process and uses the application database to store the durability state. This means the latency is much lower, and the reliability is much higher, because there aren't extra moving parts in the critical path that can go down.

Here is a page about this difference: https://docs.dbos.dev/architecture
qianli_cs 3 days ago
Thanks for sharing your insights! You nailed the key tradeoffs of most durable workflow systems. The callback-style programming model is exactly the pain point we aim to solve with DBOS. Instead of forcing you into a custom async runtime, DBOS lets you keep writing normal functions (this is an example in Python):
Under the hood, DBOS checkpoints inputs/outputs so it can recover after failure, but you don't have to restructure your code around callbacks. In Python and Java we use decorators/annotations so registration feels natural, while in Go/TypeScript there's a lightweight one-time registration step. Either way, you keep the synchronous call style you'd expect.

On top of that, DBOS also supports running workflows asynchronously or through queues, so you can start with a simple function call and later scale out to async/queued execution without changing your code. That's what the article was leading into.
vjerancrnjak 3 days ago
Any programming language with an effect system could do that as well. More recently discussed are OCaml's effect system and the Flix programming language.
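To make the idea concrete, here is a rough approximation using Python generators, which are not a real effect system but can mimic the shape of one: the workflow "performs" an effect by yielding a request, and a handler decides how to satisfy it. The handler below (`run_durably`, a hypothetical name) journals every result and replays from the journal after a restart. The effect names and the `perform` function are invented for illustration.

```python
def transfer(amount):
    # Each `yield` performs an effect and suspends until the handler resumes
    # the workflow with a result.
    balance = yield ("read_balance",)
    if balance < amount:
        return "insufficient funds"
    yield ("debit", amount)
    return "ok"


def run_durably(gen, journal, perform):
    """Drive a generator, replaying journaled results before running new effects."""
    position = 0
    try:
        effect = next(gen)
        while True:
            if position < len(journal):
                result = journal[position]   # replay: skip the real side effect
            else:
                result = perform(effect)     # first time: run it and record it
                journal.append(result)
            position += 1
            effect = gen.send(result)
    except StopIteration as done:
        return done.value


performed = []


def perform(effect):
    performed.append(effect)
    if effect[0] == "read_balance":
        return 100
    return None


journal = []  # stand-in for a durable store
first = run_durably(transfer(30), journal, perform)
# Simulated restart: a fresh generator, but the journal survived.
second = run_durably(transfer(30), journal, perform)
```

On the second run no effect is performed again; the workflow is fed the journaled results and reaches the same return value. This only works if the workflow yields the same effects in the same order on replay, which is the usual determinism requirement for this kind of system.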
dirkc 3 days ago
Thanks. I found that very informative! I also now have the dreadful notion of debugging a non-deterministic deadlock or race condition in a workflow that takes a week to run!