Show HN: Hatchet v1 – A task orchestration platform built on Postgres(github.com)
212 points by abelanger a day ago | 67 comments

Hey HN - this is Alexander from Hatchet. We’re building an open-source platform for managing background tasks, using Postgres as the underlying database.

Just over a year ago, we launched Hatchet as a distributed task queue built on top of Postgres with a 100% MIT license (https://news.ycombinator.com/item?id=39643136). The feedback and response we got from the HN community was overwhelming. In the first month after launching, we processed about 20k tasks on the platform — today, we’re processing over 20k tasks per minute (>1 billion per month).

Scaling up this quickly was difficult — every task in Hatchet corresponds to at minimum 5 Postgres transactions and we would see bursts on Hatchet Cloud instances to over 5k tasks/second, which corresponds to roughly 25k transactions/second. As it turns out, a simple Postgres queue utilizing FOR UPDATE SKIP LOCKED doesn’t cut it at this scale. After provisioning the largest instance type that CloudSQL offers, we even discussed potentially moving some load off of Postgres in favor of something trendy like Clickhouse + Kafka.
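For readers who haven't seen the pattern, a simple queue of that kind looks roughly like the sketch below. This is only an illustration: the `tasks` table, its columns, and the `dequeue` helper are hypothetical (not Hatchet's schema or code), and `conn` is assumed to be a psycopg-style DB-API connection.

```python
from typing import Any, List, Tuple

# Hypothetical table: tasks(id BIGINT, status TEXT, payload JSONB).
DEQUEUE_SQL = """
WITH next_tasks AS (
    SELECT id
    FROM tasks
    WHERE status = 'queued'
    ORDER BY id
    LIMIT %s
    FOR UPDATE SKIP LOCKED      -- skip rows another worker has already locked
)
UPDATE tasks
SET status = 'running'
FROM next_tasks
WHERE tasks.id = next_tasks.id
RETURNING tasks.id, tasks.payload
"""


def dequeue(conn: Any, limit: int) -> List[Tuple[Any, ...]]:
    """Claim up to `limit` queued tasks and mark them as running."""
    with conn.cursor() as cur:
        cur.execute(DEQUEUE_SQL, (limit,))
        rows = cur.fetchall()
    # Committing releases the row locks; the status column keeps the claim.
    conn.commit()
    return rows
```

Each worker would call `dequeue(conn, n)` in a polling loop, so every poll is at least one transaction against the same table.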

But we doubled down on Postgres, and spent about 6 months learning how to operate Postgres databases at scale and reading the Postgres manual and several other resources [0] during commutes and at night. We stuck with Postgres for two reasons:

1. We wanted to make Hatchet as portable and easy to administer as possible, and felt that implementing our own storage engine specifically on Hatchet Cloud would be disingenuous at best, and in the worst case, would take our focus away from the open source community.

2. More importantly, Postgres is general-purpose, which is what makes it both great and hard to scale for some types of workloads. This is also what allows us to offer a general-purpose orchestration platform: we heavily utilize Postgres features like transactions, SKIP LOCKED, recursive queries, triggers, COPY FROM, and much more.

Which brings us to today. We’re announcing a full rewrite of the Hatchet engine — still built on Postgres — together with our task orchestration layer which is built on top of our underlying queue. To be more specific, we’re launching:

1. DAG-based workflows that support a much wider array of conditions, including sleep conditions, event-based triggering, and conditional execution based on parent output data [1].

2. Durable execution: this refers to a function’s ability to recover from failure by caching intermediate results and automatically replaying them on a retry. We call a function with this ability a durable task. We also support durable sleep and durable events, which you can read more about here [2].

3. Queue features such as key-based concurrency queues (for implementing fair queueing), rate limiting, sticky assignment, and worker affinity.

4. Improved performance across every dimension we’ve tested, which we attribute to six improvements to the Hatchet architecture: range-based partitioning of time series tables, hash-based partitioning of task events (for updating task statuses), separating our monitoring tables from our queue, buffered reads and writes, switching all high-volume tables to use identity columns, and aggressive use of Postgres triggers.
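To make the durable execution idea in item 2 concrete, here is a minimal sketch of caching intermediate results and replaying them on a retry. It is not the Hatchet SDK: `DurableContext`, `step`, and `run_with_retries` are hypothetical names, and the log is an in-memory dict where a real engine would persist it.

```python
from typing import Any, Callable, Dict, List


class DurableContext:
    """Caches step results by name so a retried task can replay them."""

    def __init__(self, log: Dict[str, Any]) -> None:
        self.log = log  # assumption: a real system would persist this log

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.log:        # already completed on an earlier attempt
            return self.log[name]   # replay the cached result, skip the work
        result = fn()
        self.log[name] = result
        return result


calls = {"charge": 0, "email": 0}


def charge() -> str:
    calls["charge"] += 1
    return "charge-ok"


def send_email() -> str:
    calls["email"] += 1
    if calls["email"] == 1:
        raise RuntimeError("transient failure")  # fails on the first attempt only
    return "email-ok"


def task(ctx: DurableContext) -> List[str]:
    return [ctx.step("charge", charge), ctx.step("email", send_email)]


def run_with_retries(fn: Callable[[DurableContext], Any],
                     log: Dict[str, Any], max_attempts: int = 3) -> Any:
    for _ in range(max_attempts):
        try:
            return fn(DurableContext(log))
        except RuntimeError:
            continue  # retry; completed steps are replayed from the log
    raise RuntimeError("out of retries")


log: Dict[str, Any] = {}
result = run_with_retries(task, log)
print(result, calls)  # ['charge-ok', 'email-ok'] {'charge': 1, 'email': 2}
```

The point of the sketch is that `charge` runs exactly once even though the task as a whole ran twice.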

We've also removed RabbitMQ as a required dependency for self-hosting.

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.

[0] https://www.postgresql.org/docs/

[1] https://docs.hatchet.run/home/conditional-workflows

[2] https://docs.hatchet.run/home/durable-execution

drdaeman 2 hours ago | parent | next [-]

Looks nice at first glance, congrats on the launch! May I ask a few questions, please?

- Does it support durable tasks that should essentially run forever and produce an endless "stream" of events, self-healing in case of intermittent failures? Or would those be a better fit for some different kind of orchestrator?

- Where and how are task inputs and outputs stored? Are there any conveniences to make passing "weird" (that is, not some simple and reasonably-small JSON-encoded objects) things around easier (like Dagster's I/O managers) or is it all out of scope for Hatchet?

- Assuming that I can get ballpark estimates for the desirable number of tasks, their average input and output sizes, and my PostgreSQL instance's size and I/O metrics, can I somehow make a reasonable guesstimate on how many tasks per second the whole system can put through safely?

I'm currently in search of the Holy Grail (haha), evaluating all sorts of tools (Temporal, Dagster, Prefect, Faust, now looking at Hatchet) to find something that I would like the most. My project is a synchronization+processing system that has a bunch of dynamically-defined workflows that continuously work with external services (stores), look for updates (determine new, updated, or deleted products) and spawn product-level workflows to process those updates (standardize store-specific data into a unified shape, match against the canonical product catalog, etc.). Surely, this kind of pipeline can be built on nearly anything - I'm just trying to get the gist of how each of those systems feels to work with, what it's actually good at, what the gotchas and limitations are, and which tool would allow me to have the least amount of boilerplate.

Thanks!

followben 14 hours ago | parent | prev | next [-]

How does this compare to other pg-backed python job runners like Procrastinate [0] or Chancy [1]?

[0] https://github.com/procrastinate-org/procrastinate/

[1] https://github.com/TkTech/chancy

gabrielruttner 10 hours ago | parent | next [-]

Gabe here, one of the hatchet founders. I'm not very familiar with these runners, so someone please correct me if I missed something.

These look like great projects to get something running quickly, but likely will experience many of the challenges Alexander mentioned under load. They look quite similar to our initial implementation using FOR UPDATE and maintaining direct connections from workers to PostgreSQL instead of a central orchestrator (a separate issue that deserves its own post).

One of the reasons for this decision is to performantly support more complex scheduling requirements and durable execution patterns -- things like dynamic concurrency [0] or rate limits [1], which can be quite tricky to implement on a worker-pull model where there will likely be contention on these orchestration tables.

They also appear to be pure queues to run individual tasks in Python only. We've been working hard on our py, ts, and go sdks.

I'm excited to see how these projects approach these problems over time!

[0] https://docs.hatchet.run/home/concurrency

[1] https://docs.hatchet.run/home/rate-limits

TkTech 4 hours ago | parent [-]

Chancy dev here.

I've intentionally chosen simple over performance when the choice is there. Chancy still happily handles millions of jobs and workflows a day with dynamic concurrency and global rate limits, even in low-resource environments. But it would never scale horizontally to the same level you could achieve with RabbitMQ, and it's not meant for massive multi-tenant cloud hosting. It's just not the project's goal.

Chancy's aim is to be the low dependency, low infrastructure option that's "good enough" for the vast majority of projects. It has 1 required package dependency (the postgres driver) and 1 required infrastructure dependency (postgres) while bundling everything inside a single ASGI-embeddable process (no need for separate processes like flower or beat). It's used in many of my self-hosted projects, and in a couple of commercial projects to add ETL workflows, rate limiting, and observability to projects that were previously on Celery. Going from Celery to Chancy is typically just replacing your `delay()/apply_async()` with `push()` and swapping `@shared_task()` with `@job()`.

If you have hundreds of employees and need to run hundreds of millions of jobs a day, it's never going to be the right choice - go with something like Hatchet. Chancy's for teams of one to dozens that need a simple option while still getting things like global rate limits and workflows.

wcrossbow 13 hours ago | parent | prev | next [-]

Another good one is pgqueuer https://github.com/janbjorge/pgqueuer

INTPenis 14 hours ago | parent | prev [-]

Celery also has a postgres backend, but maybe it's not as well integrated.

igor47 13 hours ago | parent [-]

It's just a results backend, you still have to run rabbitmq or redis as a broker

stephen 7 hours ago | parent | prev | next [-]

Do queue operations (enqueue a job & mark this job as complete) happen in the same transaction as my business logic?

Imo that's the killer feature of database-based queues, because it dramatically simplifies reasoning about retries, i.e. "did my endpoint logic _and_ my background operation enqueue both atomically commit, or atomically fail"?
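A small sketch of what that atomicity looks like. Assumptions: `sqlite3` stands in for Postgres so the example runs anywhere, and the `orders`/`jobs` tables and `create_order` are made up for illustration.

```python
import sqlite3
from typing import Tuple


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT NOT NULL, "
        "order_id INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def create_order(conn: sqlite3.Connection, item: str,
                 fail_before_enqueue: bool = False) -> None:
    # "with conn" commits on success and rolls back if an exception escapes,
    # so the business write and the enqueue succeed or fail together.
    with conn:
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        if fail_before_enqueue:
            raise RuntimeError("simulated crash between write and enqueue")
        conn.execute(
            "INSERT INTO jobs (kind, order_id) VALUES (?, ?)",
            ("send_receipt", cur.lastrowid),
        )


def counts(conn: sqlite3.Connection) -> Tuple[int, int]:
    orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    jobs = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    return orders, jobs


conn = make_db()
create_order(conn, "book")
try:
    create_order(conn, "pen", fail_before_enqueue=True)
except RuntimeError:
    pass
print(counts(conn))  # (1, 1): the failed order left neither a row nor a job
```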

Same thing for performing jobs: if my worker's business logic commits, but the job later retries (b/c marking the job as complete is a separate transaction), then oof, that's annoying.

And I might as well be using SQS at that point.

williamdclt an hour ago | parent | next [-]

My understanding is that hatchet isn’t just a queue, it’s a workflow orchestrator: you can use it as a queue but it’s kind of like using a computer as a calculator: it works but indeed it’d likely be simpler to use a calculator.

On your point of using transactions for idempotency: you’re right that it’s a great advantage of a db-based queue, but I’d be wary about taking it as a holy grail for a few reasons:

- it locks you into using a db-based queue. If for any reason you don’t want to anymore (eg you’re reaching scalability issues) it’ll be very difficult to switch to another queue system as you’re relying on transactions for idempotency.

- you only get transactional idempotency for db operations. Any other side effect won’t be automatically idempotent: external API calls, sending messages to other queues, writing files…

- if you decide to move some of your domain to another service, you lose transactional idempotency (it’s now two databases)

- relying on transactionality means you’re not resilient to having duplicate tasks in the queue (duplicate publishing). That can easily happen: bug of the publisher, two users triggering an action concurrently… it’s quite often a very normal thing to trigger the same action multiple times

So I’d avoid having my tasks rely on transactionality for idempotency, your system is much more resilient if you don’t
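One common way to get that resilience is an idempotency key per logical action, so a duplicate delivery becomes a no-op. A minimal sketch follows; the class and key format are hypothetical, and the seen-keys store is an in-memory dict where a real system would need a durable one.

```python
from typing import Any, Callable, Dict, List


class IdempotentHandler:
    """Runs the wrapped handler at most once per idempotency key."""

    def __init__(self, handler: Callable[[Any], Any]) -> None:
        self._handler = handler
        self._results: Dict[str, Any] = {}  # assumption: durable in practice

    def handle(self, key: str, payload: Any) -> Any:
        if key in self._results:       # duplicate task: return the old result
            return self._results[key]
        result = self._handler(payload)
        self._results[key] = result
        return result


sent: List[str] = []


def send_welcome_email(user: str) -> str:
    sent.append(user)  # the side effect we must not repeat
    return f"sent:{user}"


handler = IdempotentHandler(send_welcome_email)
first = handler.handle("welcome:alice", "alice")
second = handler.handle("welcome:alice", "alice")  # duplicate publish
print(sent)  # ['alice']
```

Note this only narrows the window: a crash between the side effect and recording the key can still produce a repeat, so the side effect itself should tolerate that where possible.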

lyu07282 an hour ago | parent | prev [-]

Just no, your tasks should be idempotent. Distributed transactions are stupid.

williamdclt an hour ago | parent [-]

They’re not talking about distributed transactions: it’s not about a task being published and consumed atomically, it’s about it being consumed and executed atomically.

lyu07282 16 minutes ago | parent [-]

The workers aren't talking to postgres directly, that's why you would need distributed transactions.

nik736 6 hours ago | parent | prev | next [-]

The readme assumes users with darkmode outweigh users without (the logo is white, invisible without darkmode). Would be interesting to see stats from Github for this!

diarrhea a day ago | parent | prev | next [-]

This is very exciting stuff.

I’m curious: When you say FOR UPDATE SKIP LOCKED does not scale to 25k queries/s, did you observe a threshold at which it became untenable for you?

I’m also curious about the two points of:

- buffered reads and writes

- switching all high-volume tables to use identity columns

What do you mean by these? Were those (part of) the solution to scale FOR UPDATE SKIP LOCKED up to your needs?

abelanger a day ago | parent [-]

I'm not sure of the exact threshold, but the pathological case seemed to be (1) many tasks in the backlog, (2) many workers, (3) workers long-polling the task tables at approximately the same time. This would consistently lead to very high spikes in CPU and result in a runaway deterioration on the database, since high CPU leads to slower queries and more contention, which leads to higher connection overhead, which leads to higher CPU, and so on. There are a few threads online which documented very similar behavior, for example: https://postgrespro.com/list/thread-id/2505440.

Those other points are mostly unrelated to the core queue, and more related to helper tables for monitoring, tracking task statuses, etc. But it was important to optimize these tables because unrelated spikes on other tables in the database could start getting us into a deteriorated state as well.

To be more specific about the solutions here:

> buffered reads and writes

To run a task through the system, we need to write the task itself, write the queue item for that particular retry of the task, write an event that the task has been queued, started, completed | failed, etc. Generally one task will correspond to many writes along the way, not all of which need to be extremely latency sensitive. So we started buffering items coming from our internal queues and flushing them once every 10ms, which helped considerably.
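The general shape of such a buffer, as a sketch only: `WriteBuffer` and its parameters are hypothetical names, not Hatchet's implementation, and the clock is injectable so the example is deterministic. In a real service a background timer would call `tick()`.

```python
import time
from typing import Any, Callable, List


class WriteBuffer:
    """Collects items and hands them to flush_fn in batches."""

    def __init__(self, flush_fn: Callable[[List[Any]], None],
                 max_items: int = 1000, max_wait_s: float = 0.010,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush_fn = flush_fn
        self._max_items = max_items
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._items: List[Any] = []
        self._first_at = 0.0

    def add(self, item: Any) -> None:
        if not self._items:
            self._first_at = self._clock()  # start of the current batch window
        self._items.append(item)
        self.tick()

    def tick(self) -> None:
        """Flush if the batch is full or has waited long enough."""
        if not self._items:
            return
        waited = self._clock() - self._first_at
        if len(self._items) >= self._max_items or waited >= self._max_wait_s:
            self.flush()

    def flush(self) -> None:
        if not self._items:
            return
        batch, self._items = self._items, []
        self._flush_fn(batch)  # e.g. one multi-row INSERT or COPY


now = [0.0]                      # fake clock for the demo
batches: List[List[str]] = []
buf = WriteBuffer(batches.append, max_items=3, max_wait_s=0.010,
                  clock=lambda: now[0])
buf.add("queued")
buf.add("started")
now[0] = 0.011                   # 11ms later
buf.tick()
print(batches)  # [['queued', 'started']]
```

The trade is up to `max_wait_s` of extra latency per write in exchange for far fewer round trips and transactions.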

> switching all high-volume tables to use identity columns

We originally had combined some of our workflow tables with our monitoring tables -- this table was called `WorkflowRun` and it was used for both concurrency queues and queried when serving the API. This table used a UUID as the primary key, because we wanted UUIDs over the API instead of auto-incrementing IDs. The UUIDs caused some headaches down the line when trying to delete batches of data and prevent index bloat.

chaz6 6 hours ago | parent | next [-]

Out of interest, did you try changing the value of commit_delay? This parameter allows multiple transactions to be written together under heavy load.

nyrikki 2 hours ago | parent [-]

IMHO, this type of issue is more often caused by blowing through the multixact cache, or by the query planner reverting to SEQSCAN due to the number of locks, or by mxact ID exhaustion, etc. It is most likely not a WAL flush problem that commit_delay would help with.

From the above link:[1]

> I found that performing extremely frequent vacuum analyze (every 30 minutes) helps a small amount but this is not that helpful so problems are still very apparent.

> The queue table itself fits in RAM (with 2M hugepages) and during the wait, all the performance counters drop to almost 0 - no disk read or write (semi-expected due to the table fitting in memory) with 100% buffer hit rate in pg_top and row read around 100/s which is much smaller than expected.

Bullet points 2 and 3 from here [2] are what first came to mind, due to the 100% buffer hit rate.

Note that vacuuming every 30min provided "minor improvements" but the worst case of:

    25000 tps * 60 sec * 30 min * 250 rows == 11,250,000,000 IDs (assuming worst case every client locking conflicting rows)

Even:

    25000 tps * 60 sec * 30 min == 45,000,000

is only two orders of magnitude away from blowing through the 32-bit transaction IDs:

    45,000,000
    4,294,967,296

But XID exhaustion is not as hidden as the MXID exhaustion and will block all writes, while the harder to see MXID exhaustion will only block some writes.
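The arithmetic above can be checked in a few lines. The constants are the figures from this comment; reading "250 rows" as rows locked per transaction is an assumption.

```python
import math

TPS = 25_000                 # transactions per second
WINDOW_SECONDS = 30 * 60     # time between vacuums: 30 minutes
ROWS_LOCKED_PER_TX = 250     # assumed meaning of "250rows"

transactions = TPS * WINDOW_SECONDS
worst_case_ids = transactions * ROWS_LOCKED_PER_TX
xid_space = 2 ** 32          # size of a 32-bit ID space

ratio = xid_space / transactions
orders_of_magnitude = math.log10(ratio)

print(f"{transactions:,}")            # 45,000,000
print(f"{worst_case_ids:,}")          # 11,250,000,000
print(f"{xid_space:,}")               # 4,294,967,296
print(f"{ratio:.1f}")                 # 95.4
print(f"{orders_of_magnitude:.2f}")   # 1.98
```

So the plain transaction count stays about two orders of magnitude under the 32-bit space per vacuum window, while the worst-case figure with 250 rows each exceeds it.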

IMHO, if I were writing this, and knowing that you are writing an orchestration platform, replacing the long-lived transactions with just a status column would be better; row-level locks are writing to the row anyway, actually twice.

    tuple lock -> write row lock to xmax column -> release tuple lock.

Long-lived transactions are always problematic for scaling, and that status column would allow for more recovery options etc...

But to be honest, popping off the left of a red-black tree like the Linux scheduler does is probably so much better than fighting this IMHO.

This opinion assumes I am reading this right from the linked issue [1]:

> SELECT FOR UPDATE SKIP LOCKED executes and the select processes wait for multiple minutes (10-20 minutes) before completing

There is an undocumented function, pg_get_multixact_members() [3], that can help troubleshoot this. As many people are using hosted Postgres, the tools to look into the above problems can be limited.

It does appear that Amazon documents a bit about the above here [4].

[1] https://postgrespro.com/list/thread-id/2505440

[2] https://www.postgresql.org/docs/current/routine-vacuuming.ht...

[3] https://doxygen.postgresql.org/multixact_8c.html#adf3c97f22b...

[4] https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...

diarrhea 15 hours ago | parent | prev [-]

Thank you! Very insightful, especially the forum link and the observation around UUIDs bloating indexes.

rohan_ 4 hours ago | parent | prev | next [-]

How close to Postgres does this need to be? Like could you host this on Aurora DSQL and have unlimited scalability?

Or how would you scale this to support thousands of events per second?

ko_pivot 3 hours ago | parent [-]

I’m not the OP but DSQL has such limited Postgres compatibility that it is very unlikely to be compatible.

lysecret 15 hours ago | parent | prev | next [-]

This is awesome and I will take a closer look! One question: We ran into issues with using Postgres as a message queue with messages that need to be toasted/have large payloads (50mb+).

The only fix we could find was using unlogged tables and a full vacuum on a schedule. We aren’t big Postgres experts, but since you are, I was wondering whether you have fixed this issue and whether this framework works well for large payloads.

igor47 13 hours ago | parent [-]

Don't put them in the queue. Put the large payload into an object store like s3/gcs and put a reference into the db or queue
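A sketch of that pattern, with stand-ins throughout: `InMemoryObjectStore` plays the role of s3/gcs, a plain list plays the role of the queue, and the function names are made up.

```python
import hashlib
import json
from typing import Dict, List


class InMemoryObjectStore:
    """Stand-in for an object store such as s3/gcs."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()  # content-addressed key
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def enqueue_thin(queue: List[str], store: InMemoryObjectStore,
                 payload: bytes) -> None:
    """Upload the payload, then enqueue only a small reference to it."""
    ref = store.put(payload)
    queue.append(json.dumps({"payload_ref": ref, "size": len(payload)}))


def dequeue_full(queue: List[str], store: InMemoryObjectStore) -> bytes:
    """Pop a message and resolve its reference back into the payload."""
    message = json.loads(queue.pop(0))
    return store.get(message["payload_ref"])


store = InMemoryObjectStore()
queue: List[str] = []
big_payload = b"x" * (1024 * 1024)  # 1 MiB for the demo

enqueue_thin(queue, store, big_payload)
message_size = len(queue[0])        # around a hundred bytes, not a megabyte
restored = dequeue_full(queue, store)
```

The queue row stays small no matter how large the payload is, so the database never has to TOAST it.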

szvsw 11 hours ago | parent [-]

Yep - this is also the official recommended method by Hatchet, also sometimes called payload thinning.

bluelightning2k 5 hours ago | parent | prev | next [-]

Is this Python only?

More importantly: can this be used to run untrusted jobs? E.g. user-supplied or AI supplied code?

pkiv 4 hours ago | parent | prev | next [-]

Congrats on the launch guys!

kianN 18 hours ago | parent | prev | next [-]

Congratulations on the v1 launch! I’ve been tinkering with hatchet for almost a year, deployed it in production about 6 months ago.

The open source support and QuickStart are excellent. The engineering work put into the system is very noticeable!

morsecodist 18 hours ago | parent | prev | next [-]

This is great timing. I am in the process of designing an event/workflow driven application and nothing I looked at felt quite right for my use case. This feels really promising. Temporal was close but it just didn't feel like the perfect fit. I like the open source license a lot; it gives me more confidence designing an application around it. The conditionals are also great. I have been looking for something just like CEL and despite my research I had never heard of it. It is exactly how I want my expressions implemented; I was on the verge of trying to build something like this myself.

anentropic 6 hours ago | parent | prev | next [-]

Quick feedback:

Would love to see some sort of architecture overview in the docs

The top-level docs have a section on "Deploying workers" but I think there are more components than that?

It's cool there's a Helm chart but the docs don't really say what resources it would deploy

https://docs.hatchet.run/self-hosting/docker-compose

...shows four different Hatchet services plus, unexpectedly, both a Postgres server and RabbitMQ. Can't see anywhere that describes what each one of those does

Also in much of the docs it's not very clear where the boundary between Hatchet Cloud and Hatchet the self-hostable OSS part lies

gabrielruttner 5 hours ago | parent [-]

Thanks for this feedback, we'll add some details and an architecture diagram.

The simplest way to run hatchet is with `hatchet-lite`[0] which bundles all internal services. For most deployments we recommend running these components separately hence the multiple services in the helm chart [1]. RabbitMQ is now an optional dependency which is used for internal-service messages for higher throughput deployments [2].

Your workers are always run as a separate process.

[0] https://docs.hatchet.run/self-hosting/hatchet-lite

[1] https://docs.hatchet.run/self-hosting/improving-performance#...

[2] https://hatchet.run/launch-week-01/pg-only-mode

edit: missed your last question -- currently self-host includes everything in cloud except managed workers

latchkey 17 hours ago | parent | prev | next [-]

Cool project. Every time one of these projects comes up, I'm always somewhat disappointed it isn't an open source / postgres version of GCP Cloud Tasks.

All I ever want is a queue where I submit a message and then it hits an HTTP endpoint with that message as POST. It is such a better system than dedicated long running worker listeners, because then you can just scale your HTTP workers as needed. Pairs extremely well with autoscaling Cloud Functions, but could be anything really.

I also find that DAGs tend to get ugly really fast because it generally involves logic. I'd prefer that logic to not be tied into the queue implementation because it becomes harder to unit test. Much easier to reason about if you have the HTTP endpoint create a new task, if it needs to.

abelanger 10 hours ago | parent | next [-]

We actually have support for that, we just haven't migrated the doc over to v1 yet: https://v0-docs.hatchet.run/home/features/webhooks. We'll send a POST request for each task.

> It is such a better system than dedicated long running worker listeners, because then you can just scale your HTTP workers as needed.

This depends on the use-case - with long running listeners, you get the benefit of reusing caches, database connections, and disk, and from a pricing perspective, if your task spends a lot of time waiting for i/o operations (or waiting for an event), you don't get billed separately for CPU time. A long-running worker can handle thousands of concurrently running functions on cheap hardware.

> I also find that DAGs tend to get ugly really fast because it generally involves logic. I'd prefer that logic to not be tied into the queue implementation because it becomes harder to unit test. Much easier to reason about if you have the HTTP endpoint create a new task, if it needs to.

We usually recommend that DAGs which require too much logic (particularly fanout to a dynamic number of workflows) should be implemented as a durable task instead.

latchkey 4 hours ago | parent | next [-]

Thanks for your response. Webhooks are literally the last thing documented, which says to me that it isn't a focus for your product at all.

I used to work for a company that used long running listeners. They would, more often than not, get into a state where (for example) they needed to upgrade some code but had all these long running jobs (some would go for 24 hours!) that, if stopped, would screw everything up down the line, because restarting them would take so long to finish that it would impact customer-facing data. Just like DAGs, it sounds good on paper, but it is a terrible design pattern that will eventually bite you in the ass.

The better solution is to divide and conquer. Break things up into smaller units of work and then submit more messages to the queue. This way, you can break at any point and you won't lose hours' worth of work. The way to force this on developers is to set constraints on how long things can execute for. Make them think about what they are building and build idempotency into things.
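A toy version of that idea: each message covers a range, the handler does a bounded slice of it, and the remainder goes back on the queue as a new message. The message shape and `handle_chunk` are hypothetical, and a `deque` stands in for the real queue.

```python
from collections import deque
from typing import Deque, Dict, List


def handle_chunk(message: Dict[str, int], queue: Deque[Dict[str, int]],
                 chunk_size: int = 100) -> List[int]:
    """Process at most chunk_size items, then re-enqueue the remainder."""
    start, end = message["start"], message["end"]
    stop = min(start + chunk_size, end)
    done = list(range(start, stop))  # stand-in for the real per-item work
    if stop < end:
        queue.append({"start": stop, "end": end})  # continue in a new message
    return done


queue: Deque[Dict[str, int]] = deque([{"start": 0, "end": 250}])
messages_handled = 0
items_done: List[int] = []

while queue:
    items_done.extend(handle_chunk(queue.popleft(), queue))
    messages_handled += 1

print(messages_handled, len(items_done))  # 3 250
```

Stopping the worker between any two messages loses at most one chunk of work, since progress lives in the queue rather than in a long-running process.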

The fact that you're building a system that supports all these footguns seems terrifying. "Usually recommend" is undesirable; people will always find ways to use things in ways you don't expect. I'd much rather work with a more constrained system than one trying to be all things to all people. Cloud Tasks does a really good job of just doing one thing well.

gabrielruttner 9 hours ago | parent | prev [-]

Admittedly webhook workers aren't exactly this, since we send multiple tasks to the same endpoint, whereas I believe you can register one endpoint per task with Cloud Tasks. That said, this is not a large change.

latchkey 4 hours ago | parent [-]

I use a router on my end, so it would always be one endpoint anyway. The problem with Cloud Tasks, is that the more individual tasks you create, the more time it takes to deploy. Better to hide all that behind a single router.

jsmeaton 10 hours ago | parent | prev | next [-]

Cloudtasks are excellent and I’ve been wanting something similar for years.

I’ve been occasionally hacking away at a proof of concept built on riverqueue but have eased off for a while due to performance issues obvious with non-partitioned tables and just general laziness.

https://github.com/jarshwah/dispatchr if curious but it doesn’t actually work yet.

bgentry 2 hours ago | parent [-]

Developer of River here ( https://riverqueue.com ). I'm curious if you ran into actual performance limitations based on specific testing and use cases, or if it's more of a hypothetical concern. Modern Postgres running on modern hardware and with well-written software can handle many thousands or tens of thousands of jobs per second (even without partitioning), albeit that depends on your workload, your tuning / autovacuum settings, and your job retention time.

lysecret 14 hours ago | parent | prev | next [-]

Yeah, I also like this system. The only problem I was facing with it was that HTTP reads would lead to timeouts/lost connections. And task queues specifically have a 30 min execution limit. But I really like how it separates the queueing logic from the whole application/execution graph. Task queues are one of my favourite pieces of cloud infrastructure.

igor47 13 hours ago | parent | prev [-]

How do you deal with cloud tasks in dev/test?

latchkey 4 hours ago | parent | next [-]

Great question.

I built my own super simple router abstraction. Message comes in, goes into my router, which sends it to the right handler.

I only test the handler itself, without any need for the higher level tasks. This also means that I'm only thinly tied to GCP Tasks and can migrate to another system by just changing the router.
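Something along these lines, as a sketch. The `Router` class, the message shape, and the `resize_image` handler are all invented for illustration.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Any]


class Router:
    """Maps a message's type to the function that handles it."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def route(self, task_type: str) -> Callable[[Handler], Handler]:
        def register(fn: Handler) -> Handler:
            self._handlers[task_type] = fn
            return fn  # returned unchanged, so it stays directly testable
        return register

    def dispatch(self, message: Dict[str, Any]) -> Any:
        task_type = message["type"]
        if task_type not in self._handlers:
            raise LookupError(f"no handler for task type {task_type!r}")
        return self._handlers[task_type](message["payload"])


router = Router()


@router.route("resize_image")
def resize_image(payload: Dict[str, Any]) -> str:
    return f"resized {payload['path']} to {payload['width']}px"


# A unit test calls the handler directly, with no queue involved:
print(resize_image({"path": "a.png", "width": 64}))
# The single HTTP endpoint would call dispatch() with the decoded POST body:
print(router.dispatch({"type": "resize_image",
                       "payload": {"path": "a.png", "width": 64}}))
```

Swapping queue providers then only touches whatever feeds `dispatch()`.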

jerrygenser 10 hours ago | parent | prev [-]

What we did was mock it to make the http request blocking.

Alternatively you can use ngrok (or similar) and a test task queue that is calling your service running on localhost tunneled via ngrok.

avan1 21 hours ago | parent | prev | next [-]

Don't want to steal your topic but I had written a lightweight task runner to learn GoLang [0]. Would be great to have your and others' comments. It works only as a Go library.

[0] https://github.com/oneapplab/lq

P.S: far from being an alternative to the Hatchet product

abelanger 20 hours ago | parent [-]

Nice! I haven't looked closely, but some initial questions/comments:

1. Are you ordering the jobs by any parameter? I don't see an ORDER BY in this clause: https://github.com/oneapplab/lq/blob/8c9f8af577f9e0112767eef...

2. I see you're using a UUID for the primary key on the jobs, I think you'd be better served by an auto-inc primary key (bigserial or identity columns in Postgres) which will be slightly more performant. This won't matter for small datasets.

3. I see you have an index on `queue`, which is good, but no index on the rest of the parameters in the processor query, which might be problematic when you have many reserved jobs.

4. Since this is an in-process queue, it would be awesome to allow the tx to be passed to the `Create` method here: https://github.com/oneapplab/lq/blob/8c9f8af577f9e0112767eef... -- so you can create the job in the same tx when you're performing a data write.

avan1 10 hours ago | parent | next [-]

Thanks a lot for the review, which was much more than I requested. I noted all 4 of your comments to apply to the package. Thanks again. Also, we currently have a Laravel backend with Laravel + Redis + Horizon [0] + Supervisor as a queue runner in production, and it's working fine for us, but it would be great to be able to access Hatchet from PHP as well, which we might switch to in the future. Another thing: since you mentioned handling large workloads, do you recommend Hatchet as a Kafka or Rabbit message queue alternative for microservice communication?

[0] https://laravel.com/docs/12.x/horizon

someone13 16 hours ago | parent | prev [-]

I just want to say how cool it is to see you doing a non-trivial review of someone else’s thing here

lysecret 14 hours ago | parent | prev | next [-]

I would appreciate a comparison to cloud tasks in your docs.

themanmaran a day ago | parent | prev | next [-]

How does queue observability work in hatchet? I've used pg as a queueing system before, and that was one of my favorite aspects. Just run a few SQL queries to have a dashboard for latency/throughput/etc.

But that requires you to keep the job history around, which at scale starts to impact performance.

abelanger a day ago | parent [-]

Yeah, part of this rewrite was separating our monitoring tables from all of our queue tables to avoid problems like table bloat.

At one point we considered partitioning on the status of a queue item (basically active | inactive) and aggressively running autovac on the active queue items. Then all indexes for monitoring can be on the inactive partitioned tables.

But there were two reasons we ended up going with separate tables:

1. We started to become concerned about partitioning _both_ by time range and by status, because time range partitioning is incredibly useful for discarding data after a certain amount of time

2. If necessary, we wanted our monitoring tables to be able to run on a completely separate database from our queue tables. So we actually store them as completely independent schemas to allow this to be possible (https://github.com/hatchet-dev/hatchet/blob/main/sql/schema/... vs https://github.com/hatchet-dev/hatchet/blob/main/sql/schema/...)

So to answer the question -- you can query both active queues and a full history of queued tasks up to your retention period, and we've optimized the separate tables for the two different query patterns.
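For anyone wondering why time-range partitioning helps with retention: expired data is removed by dropping a whole partition rather than deleting rows one by one, which in Postgres avoids leaving dead tuples behind for vacuum. A toy model of that follows; `DailyPartitions` is a hypothetical in-memory stand-in, not our schema.

```python
from datetime import date, timedelta
from typing import Dict, List


class DailyPartitions:
    """Toy model of a table range-partitioned by day."""

    def __init__(self) -> None:
        self.partitions: Dict[date, List[dict]] = {}

    def insert(self, day: date, row: dict) -> None:
        self.partitions.setdefault(day, []).append(row)

    def drop_older_than(self, cutoff: date) -> List[date]:
        """Drop every partition strictly before cutoff, whole."""
        expired = sorted(d for d in self.partitions if d < cutoff)
        for d in expired:
            del self.partitions[d]  # analogous to dropping one partition
        return expired


table = DailyPartitions()
today = date(2025, 1, 10)
for age in range(5):  # one row per day for the last five days
    table.insert(today - timedelta(days=age), {"task": f"t{age}"})

dropped = table.drop_older_than(today - timedelta(days=2))
print(dropped)                   # [datetime.date(2025, 1, 6), datetime.date(2025, 1, 7)]
print(sorted(table.partitions))  # Jan 8, 9 and 10 remain
```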

szvsw 10 hours ago | parent | prev | next [-]

I’ve been using Hatchet since the summer, and really do love it over celery. I’ve been using Hatchet for academic research experiments with embarrassingly parallel tasks - ie thousands of simultaneous tasks just with different inputs, each CPU bound and on the order of 10s-2min, totaling in the millions of tasks per experiment - and it’s been going great. I think the team is putting together a very promising product. Switching from a roll-my-own SQS+AWS batch system to Hatchet has made my research life so much better. Though part of that also probably comes from the forced improvements you get when re-designing a system a second time.

Although there was support for pydantic validation in v0, now that the v1 SDK has arrived, I would definitely say that the #1 distinguishing feature (at least from a dx perspective) for anyone thinking of switching from Celery or working on a greenfield project is the type safety that comes with the first class pydantic support in v1. That is a huge boon in my opinion.

Another big boon for me was the combo of both Python and TypeScript SDKs - being able to integrate things into frontend demos without having to set up a separate Python API is great.

There are a couple rough edges around asyncio/single worker concurrency IMO - for instance, choosing between 100 workers each with capacity for 8 concurrent task runs vs 800 workers each with capacity for 1 concurrent task run. In Celery it’s a little bit easier to launch a worker node which uses separate processes to handle its concurrent tasks, whereas right now with Hatchet, that’s not possible as far as I am aware, due to how asyncio is used to handle the concurrent task runs which a single worker may be processing. If most of your work is IO bound or already asyncio friendly, this does not really affect you and you can safely use eg a worker with 8x task run capacity, but if you are CPU bound there might be some cases where you would prefer the full process isolation and feel more assured that you are maximally utilizing all your compute in a given node, and right now the best way to do that is only through horizontal scaling or 1x task workers I think. Generally, if you do not have a great mental model already of how Python handles asyncio, threads, pools, etc, the right way to think about this stuff can be a little confusing IMO, but the docs on this from Hatchet have improved. In the future though, I’d love to see an option to launch a Python worker with capacity for multiple simultaneous task runs in separate processes, even if it’s just a thin wrapper around launching separate workers under the hood.

There are also a couple of rough edges in the dashboard right now, but the team has been fixing them, and coming from celery/flower or SQS, it’s already such an improved dashboard/monitoring experience that I can’t complain!

It’s hard to describe, but there is just something fun about working with Hatchet for me, compared to Celery or my previous SQS system. Almost all of the design decisions just align with what I would desire, and feel natural.

hyuuu a day ago | parent | prev | next [-]

I have been looking for something like this; the closest I could find by googling was Celery workflow. I think you should do better marketing, I didn't even realize that Hatchet existed!

programmarchy 3 hours ago | parent | prev | next [-]

Wow, this looks awesome. Been using Temporal, but this fits so perfectly into my stack (Postgres, Pydantic), and the first-class support for DAG workflows is chef's kiss. Going to take a stab at porting over some of my workflows.

digdugdirk a day ago | parent | prev | next [-]

Interesting! How does it compare with DBOS? I noticed it's not in the readme comparisons, and they seem to be trying to solve a similar problem.

KraftyOne 21 hours ago | parent | next [-]

(DBOS co-founder here) From a DBOS perspective, the biggest differences are that DBOS runs in-process instead of on an external server, and DBOS lets you write workflows as code instead of explicit DAGs. I'm less familiar with Hatchet, but here's a blog post comparing DBOS with Temporal, which also uses external orchestration for durable execution: https://www.dbos.dev/blog/durable-execution-coding-compariso...

abelanger 20 hours ago | parent [-]

> and DBOS lets you write workflows as code instead of explicit DAGs

To clarify, Hatchet supports both DAGs and workflows as code: see https://docs.hatchet.run/home/child-spawning and https://docs.hatchet.run/home/durable-execution

abelanger a day ago | parent | prev [-]

Yep, durable execution-wise we're targeting a very similar use-case with a very different philosophy on whether the orchestrator (the part of the durable execution engine which invokes tasks) should run in-process or as a separate service.

There's a lot to go into here, but generally speaking, running an orchestrator as a separate service is easier from a Postgres scaling perspective: it's easier to buffer writes to the database, manage connection overhead, export aggregate metrics, and horizontally scale the different components of the orchestrator. Our original v0 engine was architected in a very similar way to an in-process task queue, where each worker polls a tasks table in Postgres. This broke down for us as we increased volume.
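As a rough illustration of what "buffering writes" can mean, here is a minimal sketch of a batching buffer. This is not Hatchet's implementation: `WriteBuffer`, the `task_events` table and `demo` are hypothetical, and an in-memory SQLite database stands in for Postgres so the snippet runs on its own. The idea is simply that a central service can collect many small writes and commit them in one transaction per batch, rather than each worker opening its own transaction per row.

```python
import sqlite3


class WriteBuffer:
    """Collects rows in memory and writes them in one transaction per batch."""

    def __init__(self, conn: sqlite3.Connection, batch_size: int = 100) -> None:
        self.conn = conn
        self.batch_size = batch_size
        self.pending: list[tuple[str, str]] = []
        self.flushes = 0  # number of transactions actually issued

    def add(self, task_id: str, status: str) -> None:
        self.pending.append((task_id, status))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO task_events (task_id, status) VALUES (?, ?)",
                self.pending,
            )
        self.pending.clear()
        self.flushes += 1


def demo() -> tuple[int, int]:
    conn = sqlite3.connect(":memory:")  # stand-in for a Postgres connection
    conn.execute("CREATE TABLE task_events (task_id TEXT, status TEXT)")
    buf = WriteBuffer(conn, batch_size=100)
    for i in range(250):
        buf.add(f"task-{i}", "completed")
    buf.flush()  # write the final partial batch
    rows = conn.execute("SELECT COUNT(*) FROM task_events").fetchone()[0]
    return rows, buf.flushes


if __name__ == "__main__":
    print(demo())
```

In this sketch 250 events are written with 3 transactions instead of 250. A real service would presumably also flush on a timer so small batches do not wait indefinitely.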

Outside of durable execution, we're more of a general-purpose orchestration platform -- lots of our features target use-cases where you either want to run a single task or define your tasks as a DAG (directed acyclic graph) instead of using durable execution. Durable execution has a lot of footguns if used incorrectly, and DAGs are executed in a durable way by default, so for many use-cases it's a better option.
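For readers unfamiliar with the DAG model, here is a minimal sketch of running tasks in dependency order with Python's standard-library `graphlib`. It is not the Hatchet SDK; `run_dag` and the task names (`fetch`, `parse`, `embed`, `store`) are hypothetical. Each task receives the outputs of its parents, and a task only runs after everything it depends on has finished.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps):
    """Run tasks in dependency order.

    tasks: name -> callable taking a dict of parent outputs
    deps:  name -> set of parent task names
    """
    results = {}
    order = []
    for name in TopologicalSorter(deps).static_order():
        parent_outputs = {p: results[p] for p in deps.get(name, ())}
        results[name] = tasks[name](parent_outputs)
        order.append(name)
    return order, results


# Hypothetical pipeline: fetch -> parse -> embed -> store (store also needs parse).
tasks = {
    "fetch": lambda _: "hello world",
    "parse": lambda p: p["fetch"].split(),
    "embed": lambda p: [len(word) for word in p["parse"]],
    "store": lambda p: dict(zip(p["parse"], p["embed"])),
}
deps = {
    "parse": {"fetch"},
    "embed": {"parse"},
    "store": {"parse", "embed"},
}

if __name__ == "__main__":
    order, results = run_dag(tasks, deps)
    print(order)
    print(results["store"])
```

Because the graph is declared up front, an orchestrator can persist each task's output and resume from the last completed node, which is roughly why a DAG gets durability without the replay constraints of durable execution.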

darkteflon a day ago | parent [-]

Hatchet looks very cool! As an interested dilettante in this space, I’d love to read a comparison with Dagster.

Re DBOS: I understood that part of the value proposition there is bundling transactions into logical units that can all be undone if a critical step in the workflow fails - the example given in their docs being a failed payment flow. Does Hatchet have a solution for those scenarios?

abelanger a day ago | parent [-]

Re DBOS - yep, this is exactly what the child spawning feature is meant for: https://docs.hatchet.run/home/child-spawning

The core idea being that you write the "parent" task as a durable task, and you invoke subtasks which represent logical units of work. If any given subtask fails, you can wrap it in a `try...catch` and gracefully recover.
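A minimal sketch of that pattern in plain Python (where `try...catch` is spelled `try...except`). This does not use the Hatchet SDK and has no durability on its own; `checkout`, `reserve_inventory`, `charge_card`, `release_inventory` and `PaymentFailed` are hypothetical names standing in for a parent task and its subtasks in a failed payment flow.

```python
import asyncio


class PaymentFailed(Exception):
    """Raised by the (hypothetical) payment subtask."""


async def reserve_inventory(order_id: str, log: list[str]) -> None:
    log.append(f"reserved:{order_id}")


async def charge_card(order_id: str, log: list[str], fail: bool) -> None:
    if fail:
        raise PaymentFailed(order_id)
    log.append(f"charged:{order_id}")


async def release_inventory(order_id: str, log: list[str]) -> None:
    # Compensating step: undo the earlier reservation.
    log.append(f"released:{order_id}")


async def checkout(order_id: str, log: list[str], fail_payment: bool) -> str:
    """Parent task: each awaited call represents one logical unit of work."""
    await reserve_inventory(order_id, log)
    try:
        await charge_card(order_id, log, fail=fail_payment)
    except PaymentFailed:
        await release_inventory(order_id, log)  # recover gracefully
        return "cancelled"
    return "completed"


def demo(fail_payment: bool) -> tuple[str, list[str]]:
    log: list[str] = []
    status = asyncio.run(checkout("order-1", log, fail_payment))
    return status, log


if __name__ == "__main__":
    print(demo(fail_payment=False))
    print(demo(fail_payment=True))
```

Note that the undo here is an explicit compensating step written by the author of the parent task, not an automatic rollback of a database transaction.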

I'm not as familiar with DBOS, but in Hatchet a durable parent task and child task map directly to Temporal workflows and activities. Admittedly this pattern should be documented in the "Durable execution" section of our docs as well.

Re Dagster - Dagster is much more oriented towards data engineering, while Hatchet is oriented more towards application engineers. As a result tools like Dagster/Airflow/Prefect are more focused on data integrations, whereas we focus more on throughput/latency and primitives that work well with your application. Perhaps there's more overlap now that AI applications are more ubiquitous? (with more data pipelines making their way into the application layer)

darkteflon a day ago | parent [-]

Perfect - great answer and very helpful, thanks.

wilted-iris a day ago | parent | prev | next [-]

This looks very cool! I see a lot of Python in the docs; is it usable in other languages?

abelanger a day ago | parent [-]

Thanks! There are SDKs for Python, TypeScript and Go. We've gotten a lot of requests for other SDKs which we're tracking here: https://github.com/hatchet-dev/hatchet/discussions/436

throwaway9w4 a day ago | parent [-]

Is there any documentation of the api, so that someone can call it directly without going through the sdk?

abelanger a day ago | parent [-]

We use gRPC on our workers. All API specs can be found here: https://github.com/hatchet-dev/hatchet/tree/main/api-contrac...

However, the SDKs are very tightly integrated with the runtime in each language, which, together with the gRPC worker protocol, makes it more difficult to call the APIs directly.

bomewish a day ago | parent | prev | next [-]

Why not fix all the broken doc links and make sure you have the full sdk spec down first, ready to go? Then drop it all at once, when it’s actually ready. That’s better and more respectful of users. I love the product and want y’all to succeed but this came off as extremely unprofessional.

abelanger a day ago | parent [-]

Really appreciate the candid feedback, and glad to hear you like the product. We ran a broken links checker against our docs, but it's possible we missed something. Is there anywhere you're seeing a broken link?

Re SDK specs -- I assume you mean full SDK API references? We're nearly at the point where those will be published, and I agree that they would be incredibly useful.

krainboltgreene 16 hours ago | parent | prev | next [-]

A lot of these tools show off what a full success backlog looks like. In reality I care significantly more about what failure looks like, debugging, etc.

lysecret 14 hours ago | parent [-]

Ha this is a really good point! I worked with so many different kinds of observability approaches and always fell back to traced logs. This might be part of the reason.

revskill 16 hours ago | parent | prev [-]

Confusing docs, as there is no self-hosted setup for Postgres.

abelanger 10 hours ago | parent [-]

Hey there - all of our self-hosting docs show you how to set up Postgres: https://docs.hatchet.run/self-hosting

Would love to hear more about what you found confusing!