atombender | 3 days ago
At that scale, JetStream should work well. In my experience, JetStream's main performance weakness is the per-stream/per-consumer overhead: too many and NATS ends up running too hot due to all the state updates and Raft traffic. (Each stream is a Raft group, but so is each consumer.)

As for tens of TB in a single stream: I've not personally stored that much data in a stream, but I don't see why it wouldn't handle it.

Note that JetStream has a maximum message size of 1MB (because JetStream uses NATS for its client/server protocol, which has that limit), which was a real problem for one use case I had. Redpanda has essentially no upper limit.

Note also that the number of NATS servers isn't the same as the replication factor. You can have 3 servers and a replication factor of 2 if you want, which allows more flexibility. Both consumers and streams have their own replication factors.

The other option I have considered in the past is EMQX, which is a clustered MQTT system written in Erlang. It looks nice, but I've never used it in production, and it's one of those projects that nobody seems to be talking about, at least not in my part of the industry.
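To make the replication point concrete, here is a rough sketch using the Go client (nats.go). The stream/consumer names, subjects and the 2-replica settings are purely illustrative, and setting a per-consumer replication factor needs a reasonably recent NATS server and client:

    package main

    import (
        "log"

        "github.com/nats-io/nats.go"
    )

    func main() {
        // Connect to any server in a 3-node NATS cluster.
        nc, err := nats.Connect("nats://nats-1:4222")
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Drain()

        js, err := nc.JetStream()
        if err != nil {
            log.Fatal(err)
        }

        // A stream is its own Raft group; Replicas is independent of cluster
        // size, so a 3-server cluster can run a 2-replica stream.
        if _, err := js.AddStream(&nats.StreamConfig{
            Name:       "EVENTS", // illustrative name
            Subjects:   []string{"events.>"},
            Storage:    nats.FileStorage,
            Replicas:   2,
            MaxMsgSize: 1 << 20, // messages over ~1MB are rejected
        }); err != nil {
            log.Fatal(err)
        }

        // Each durable consumer is another Raft group with its own
        // replication factor.
        if _, err := js.AddConsumer("EVENTS", &nats.ConsumerConfig{
            Durable:   "workers", // illustrative name
            AckPolicy: nats.AckExplicitPolicy,
            Replicas:  2,
        }); err != nil {
            log.Fatal(err)
        }
    }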
pdimitar | 3 days ago | parent
Well, I've worked mainly with Elixir for the last 10-ish years (with a lot of Rust and some Golang here and there), so EMQX would likely be right up my alley. Do you have any other recommendations? The time is right for us and I'll soon start evaluating. I only have NATS JetStream and MQTT on my radar so far. Kafka I already used and rejected for the reasons above ("entire batch or nothing / later").

As for data, I meant tens of terabytes of traffic on busy days, sorry. Most of the time it's a few hundred gigs. (Our area is prone to spikes, and business hours matter a lot.) And again, that's total traffic. I don't think we'd ever have more than 10-30GB stored in our queue system. Our background workers aggressively work through the backlog and chew data into manageable (and much smaller) chunks 24/7.

And as one of the seniors, I am extremely vigilant about payload sizes. I had to settle on JSON for now, but I push back, hard, on any and all extra data; anything and everything that can be loaded from the DB or even caches is delegated as such with various IDs. This also helps us with e.g. background jobs that are no longer relevant because a certain entity's state has moved too far forward due to user interaction and the enriching job no longer needs to run; when you have only references in your message payload, this enables and even forces the background job to load data exactly at the time it runs and not assume a potentially outdated state. (There's a rough sketch of this at the end of the comment.)

Anyhow, I got chatty. :) Thank you. If you have other recommendations, I am willing to sacrifice a little weekend time to give them cursory research. Again, the utmost priority is 100% durability (as much as that is even possible, of course), and mega ultra speed is not of the essence. We'll never have even 100 consumers per stream; I haven't ever seen more than 30 in our OTel tool dashboard.

EDIT: I should also say that our app does not have huge internal traffic; it's a lot (Python wrappers around AI / OCR / others are one group of examples) but not huge. As such, our priorities for a message queue are just "be super reliable, be able to handle an okay beating, and never lose stuff", really. It's not like e.g. finance, where you might have dozens of Kafka clusters and workers that hand off data from one Kafka queue to another with a ton of processing in the meantime. We are very far from that.
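For illustration, roughly what I mean by reference-only payloads (sketched in Go for brevity; the "order" entity, field names and the state check are all made up, and our real code is Elixir and more involved):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Reference-only payload: IDs instead of embedded entity data.
    type EnrichJob struct {
        OrderID     string `json:"order_id"`
        TenantID    string `json:"tenant_id"`
        RequestedAt int64  `json:"requested_at"`
    }

    // loadOrderState stands in for a DB or cache lookup done at run time.
    func loadOrderState(orderID string) string {
        // ... fetch the entity's current state from the database ...
        return "shipped"
    }

    func handle(payload []byte) error {
        var job EnrichJob
        if err := json.Unmarshal(payload, &job); err != nil {
            return err
        }

        // Because the payload carries only references, the worker always
        // sees the entity's current state; if the entity has already moved
        // past the point where enrichment matters, the job is simply skipped.
        if loadOrderState(job.OrderID) == "shipped" {
            fmt.Println("state moved on, skipping enrichment for", job.OrderID)
            return nil
        }

        // ... do the actual enrichment here ...
        return nil
    }

    func main() {
        _ = handle([]byte(`{"order_id":"o-123","tenant_id":"t-1","requested_at":1700000000}`))
    }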