atombender | 3 days ago
At that scale, JetStream should work well. In my experience, JetStream's main performance weakness is the per-stream/per-consumer overhead: too many and NATS ends up running too hot due to all the state updates and Raft traffic. (Each stream is a Raft group, but so is each consumer.)

As for tens of TB in a single stream: I've not personally stored that much data in a stream, but I don't see why it wouldn't handle it.

Note that JetStream has a maximum message size of 1MB (because JetStream uses NATS for its client/server protocol, which has that limit), which was a real problem for one use case I had. Redpanda has essentially no upper limit.

Note also that the number of NATS servers isn't the same as the replication factor. You can have 3 servers and a replication factor of 2 if you want, which allows more flexibility. Both consumers and streams have their own replication factors.

The other option I have considered in the past is EMQX, which is a clustered MQTT system written in Erlang. It looks nice, but I've never used it in production, and it's one of those projects that nobody seems to be talking about, at least not in my part of the industry.
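To make the replication point concrete, here is a rough sketch using the Go client (nats.go). The stream/consumer names, subjects and the 2-replica settings are purely illustrative, and setting a per-consumer replication factor needs a reasonably recent NATS server and client:

    package main

    import (
        "log"

        "github.com/nats-io/nats.go"
    )

    func main() {
        // Connect to any server in a 3-node NATS cluster.
        nc, err := nats.Connect("nats://nats-1:4222")
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Drain()

        js, err := nc.JetStream()
        if err != nil {
            log.Fatal(err)
        }

        // A stream is its own Raft group; Replicas is independent of cluster
        // size, so a 3-server cluster can run a 2-replica stream.
        if _, err := js.AddStream(&nats.StreamConfig{
            Name:       "EVENTS", // illustrative name
            Subjects:   []string{"events.>"},
            Storage:    nats.FileStorage,
            Replicas:   2,
            MaxMsgSize: 1 << 20, // messages over ~1MB are rejected
        }); err != nil {
            log.Fatal(err)
        }

        // Each durable consumer is another Raft group with its own
        // replication factor.
        if _, err := js.AddConsumer("EVENTS", &nats.ConsumerConfig{
            Durable:   "workers", // illustrative name
            AckPolicy: nats.AckExplicitPolicy,
            Replicas:  2,
        }); err != nil {
            log.Fatal(err)
        }
    }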
pdimitar | 3 days ago | parent
Well, I've worked mainly with Elixir for the last 10-ish years (with a lot of Rust and some Golang here and there), so EMQX would likely be right up my alley. Do you have any other recommendations? The time is right for us and I'll soon start evaluating. I only have NATS JetStream and MQTT on my radar so far. Kafka I already used and rejected for the reasons above ("entire batch or nothing / later").

As for data, I meant tens of terabytes of traffic on busy days, sorry. Most of the time it's a few hundred gigs. (Our area is prone to spikes, and business hours matter a lot.) And again, that's total traffic. I don't think we'd ever have more than 10-30GB stored in our queue system. Our background workers aggressively work through the backlog and chew data into manageable (and much smaller) chunks 24/7.

And as one of the seniors, I am extremely vigilant about payload sizes. I had to settle on JSON for now, but I push back, hard, on any and all extra data; anything and everything that can be loaded from the DB or even caches is delegated as such with various IDs. This also helps us with e.g. background jobs that are no longer relevant because a certain entity's state has moved too far forward due to user interaction and the enriching job no longer needs to run; when you have only references in your message payload, this enables and even forces the background job to load data exactly at the time it runs and not assume a potentially outdated state. (There's a rough sketch of this at the end of the comment.)

Anyhow, I got chatty. :) Thank you. If you have other recommendations, I am willing to sacrifice a little weekend time to give them cursory research. Again, the utmost priority is 100% durability (as much as that is even possible, of course), and mega ultra speed is not of the essence. We'll never have even 100 consumers per stream; I haven't ever seen more than 30 in our OTel tool dashboard.

EDIT: I should also say that our app does not have huge internal traffic; it's a lot (Python wrappers around AI / OCR / others are one group of examples) but not huge. As such, our priorities for a message queue are just "be super reliable, be able to handle an okay beating, and never lose stuff", really. It's not like e.g. finance, where you might have dozens of Kafka clusters and workers that hand off data from one Kafka queue to another with a ton of processing in the meantime. We are very far from that.
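For illustration, roughly what I mean by reference-only payloads (sketched in Go for brevity; the "order" entity, field names and the state check are all made up, and our real code is Elixir and more involved):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Reference-only payload: IDs instead of embedded entity data.
    type EnrichJob struct {
        OrderID     string `json:"order_id"`
        TenantID    string `json:"tenant_id"`
        RequestedAt int64  `json:"requested_at"`
    }

    // loadOrderState stands in for a DB or cache lookup done at run time.
    func loadOrderState(orderID string) string {
        // ... fetch the entity's current state from the database ...
        return "shipped"
    }

    func handle(payload []byte) error {
        var job EnrichJob
        if err := json.Unmarshal(payload, &job); err != nil {
            return err
        }

        // Because the payload carries only references, the worker always
        // sees the entity's current state; if the entity has already moved
        // past the point where enrichment matters, the job is simply skipped.
        if loadOrderState(job.OrderID) == "shipped" {
            fmt.Println("state moved on, skipping enrichment for", job.OrderID)
            return nil
        }

        // ... do the actual enrichment here ...
        return nil
    }

    func main() {
        _ = handle([]byte(`{"order_id":"o-123","tenant_id":"t-1","requested_at":1700000000}`))
    }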