Does anyone use https://nats.io here? I have heard good things about it. I would love to hear about the comparisons between nats.io and kafka

▲

atombender 4 days ago | parent | next [-]

NATS is very good. It's important to distinguish between core NATS and Jetstream, however.

Core NATS is an ephemeral message broker. Clients tell the server what subjects they want messages about, producers publish. NATS handles the routing. If nobody is listening, messages go nowhere. It's very nice for situations where lots of clients come and go. It's not reliable; it sheds messages when consumers get slow. No durability, so when a consumer disconnects, it will miss messages sent in its absence. But this means it's very lightweight. Subjects are just wildcard paths, so you can have billions of them, which means RPC is trivial: Send out a message and tell the receiver to post a reply to a randomly generated subject, then listen to that subject for the answer.

NATS organizes brokers into clusters, and clusters can form hub/spoke topologies where messages are routed between clusters by interest, so it's very scalable; if your cluster doesn't scale to the number of consumers, you can add another cluster that consumes the first cluster, and now you have two hubs/spokes. In short, NATS is a great "message router". You can build all sorts of semantics on top of it: RPC, cache invalidation channels, "actor" style processes, traditional pub/sub, logging, the sky is the limit.

Jetstream is a different technology that is built on NATS. With Jetstream, you can create streams, which are ordered sequences of messages. A stream is durable and can have settings like maximum retention by age and size. Streams are replicated, with each stream being a Raft group. Consumers follow from a position. In many ways it's like Kafka and Redpanda, but "on steroids", superficially similar but just a lot richer.

For example, Kafka is very strict about the topic being a sequence of messages that must be consumed exactly sequentially. If the client wants to subscribe to a subset of events, it must either filter client-side, or you have some intermediary that filters and writes to a topic that the consumer then consumes. With NATS, you can ask the server to filter.

Unlike Kafka, you can also nack messages; the server keeps track of what consumers have seen. Nacking means you lose ordering, as the nacked messages come back later. Jetstream also supports a Kafka-like strictly ordered mode. Unlike Kafka, clients can choose the routing behaviour, including worker style routing and deterministic partitioning.

Unlike Kafka's rigid networking model (consumers are assigned partitions and they consume the topic and that's it), as with NATS, you can set up complex topologies where streams get gatewayed and replicated. For example, you can streams in multiple regions, with replication, so that consumers only need to connect to the local region's hub.

While NATS/Jetstream has a lot of flexibility, I feel like they've compromised a bit on performance and scalability. Jetstream clusters don't scale to many servers (they recommend max 3, I think) and large numbers of consumers can make the server run really hot. I would also say that they made a mistake adopting nacking into the consuming model. The big simplification Kafka makes is that topics are strictly sequential, both for producing and consuming. This keeps the server simpler and forces the client to deal with unprocessable messages. Jetstream doesn't allow durable consumers to be strictly ordered; what the SDK calls an "ordered consumer" is just an ephemeral consumer. Furthermore, ephemeral consumers don't really exist. Every consumer will create server-side state. In our testing, we found that having more than a few thousand consumers is a really bad idea. (The newest SDK now offers a "direct fetch" API where you can consume a stream by position without registering a server-side consumer, but I've not yet tried it.)

Lastly, the mechanics of the server replication and connectivity is rather mysterious, and it's hard to understand when something goes wrong. And with all the different concepts — leaf nodes, leaf clusters, replicas, mirrors, clusters, gateways, accounts, domains, and so on — it's not easy to understand the best way to design a topology. The Kafka network model, by comparison, is very simple and straightforward, even if it's a lot less flexible. With Kafka, you can still build hub/spoke topologies yourself by reading from topics and writing to other topics, and while it's something you need to set up yourself, it's less magical, and easier to control and understand.

Where I work, we have used NATS extensively with great success. We also adopted Jetstream for some applications, but we've soured on it a bit, for the above reasons, and now use Redpanda (which is Kafka-compatible) instead. I still think JS is a great fit for certain types of apps, but I would definitely evaluate the requirements carefully first. Jetstream is different enough that it's definitely not just a "better Kafka".

▲

shikhar 4 days ago | parent | next [-]

> Jetstream clusters don't scale to many servers (they recommend max 3, I think)

Jetstream is even more limited than most Kafkas on number of streams https://github.com/nats-io/nats-server/discussions/5128#disc...

▲

pdimitar 3 days ago | parent | prev [-]

That's an amazing analysis, thank you so much!

What are your impressions of Red panda?

We're particularly interested in NATS' feature of working with individual messages and have been bitten by Kafka's "either process the entire batch or put it back for later processing", which doesn't work for our needs.

Interested if Redpanda is doing better than either.

▲

atombender 3 days ago | parent [-]

Redpanda is fantastic, but it has the exact same message semantics as Kafka. They don't even have their own client; you connect using the Kafka protocol. Very happy with it, but it does have the same "whole batch or nothing" approach.

NATS/Jetstream is amazing if it fits your use case and you don't need extreme scalability. As I said before, it offers a lot more flexibility. You can process a stream sequentially but also nack messages, so you get the best of both worlds. It has deduping (new messages for the same subject will mark older ones as deleted) and lots of other convenience goodies.

▲

pdimitar 3 days ago | parent [-]

Thank you so much again. Yes, we are not Google scale, our main priority is durability and scalability but only up to a (I'd say fairly modest) point. I.e. be able to have one beefy NATS server do it all and only add a second one when things start getting bad. Even 3 servers we'd see as a strategic defeat + we have data but again, very far from Google scale.

We've looked at Redis streams but me and a few others are skeptical as Redis is not known for good durability practices (talking about the past; I've no idea if they pivoted well in the last years) and sadly none of us has any experience with MQTT -- though we heard tons of praise on that one.

But our story is: some tens of terabytes of data, no more than a few tens of millions of events / messages a day, aggressive folding of data in multiple relational DBs, and a very dynamic and DB-heavy UI (I will soon finish my Elixir<=>Rust SQLite3 wrapper so we're likely going to start sharding the DB-intensive customer data to separate SQLite3 databases and I'm looking forward to spearheading this effort; off-topic). For our needs NATS Jetstream sounds like the exactly perfect fit, though time will tell.

I still have the nagging feeling of missing out on still having not tried MQTT though...

▲

atombender 3 days ago | parent [-]

At that scale, Jetstream should work well. In my experience, Jetstream's main performance weakness is the per-stream/consumer overhead: Too many and NATS ends up running too hot due to all the state updates and Raft traffic. (Each stream is a Raft group, but so is each consumer.)

If its tens of TB in a stream, then I've not personally stored that much data in a stream, but I don't see why it wouldn't handle it. Note that Jetstream has a maximum message size of 1MB (this is because Jetstream uses NATS for its client/server protocol, which has that limit), which was a real problem for one use case I had. Redpanda has essentially no upper limit.

Note that number of NATS servers isn't the same as the replication factor. You can have 3 servers and a replication factor of 2 if you want, which allows more flexibility. Both consumers and streams have their own replication factors.

The other option I have considered in the past is EMQX, which is a clustered MQTT system written in Erlang. It looks nice, but I've never used it in production, and it's one of those projects that nobody seems to be talking about, at least not in my part of the industry.

▲

pdimitar 3 days ago | parent [-]

Well I work mainly with Elixir in the last 10-ish years (with a lot of Rust and some Golang here and there) so EMQX would likely be right up my alley.

Do you have any other recommendations? The time is right for us and I'll soon start evaluating. I only have NATS Jestream and MQTT on my radar so far.

Kafka I already used and rejected for the reasons above ("entire batch or nothing / later").

As for data, I meant tens of terabytes of traffic on busy days, sorry. Most of the time it's a few hundred gigs. (Our area is prone to spikes and the business hours matter a lot.) And again, that's total traffic. I don't think we'd have more than 10-30GB stored in our queue system, ever. Our background workers aggressively work through the backlog and chew data into manageable (and much smaller chunks) 24/7.

And as one of the seniors I am extremely vigilant of payload sizes. I had to settle on JSON for now but I push back, hard, on any and all extra data; anything and everything that can be loaded from DB or even caches is delegated as such with various IDs -- this also helps us with e.g. background jobs that are no longer relevant as certain entity's state moved too far forward due to user interaction and the enriching job no longer needs to run; when you have only references in your message payload, this enables and even forces the background job to load data exactly at the time of its run and not assume a potentially outdated state.

Anyhow, I got chatty. :)

Thank you. If you have other recommendations, I am willing to sacrifice a little weekend time to give them a cursory research. Again, utmost priority is 100% durability (as much as that is even possible of course) and mega ultra speed is not of the essence. We'll never have even 100 consumers per stream; I haven't ever seen more than 30 in our OTel tool dashboard.

EDIT: I should also say that our app does not have huge internal traffic; it's a lot (Python wrappers around AI / OCR / others is one group of examples) but not huge. As such, our priorities for a message queue are just "be super reliable and be able to handle an okay beating and never lose stuff" really. It's not like in e.g. finance where you might have dozens of Kafka clusters and workers that hand off data from one Kafka queue to another with a ton of processing in the meantime. We are very far from that.

	▲	atombender 3 days ago \| parent [-]
		Those are the two I can think of. Jetstream is written in Go and the Go SDK is very mature, and has all the support one needs to create streams and consumers; never used it from Elixir, though. EMQX's Go support looks less good (though since it's MQTT you can use any MQTT client). Regarding data reliability, I've never lost production data with Jetstream. But I've had some odd behaviour locally where everything has just disappeared suddenly. I would be seriously anxious if I had TBs of stream data I couldn't afford to lose, and no way to regenerate it easily. It's possible to set up a consumer that backs up everything to (say) cloud storage, just in case. You can use Benthos to set up such a pipeline. I think I'd be less anxious with Kafka or Redpanda because of their reputation in being very solid. Going back to the "whole batch or nothing", I do see this as a good thing myself. It means you are always processing in exact order. If you have to reject something, the "right" approach is an explicit dead-letter topic — you can still consume that one from the same consumer. But it makes the handling very explicit. With Jetstream, you do have an ordered stream, but the broker also tracks acid/nacks, which adds complexity. You get nacks even if you never do it manually; all messages have a configurable ack deadline, and if your consumer is too slow, the message will be automatically bounced. (The ack delay also means if a client crashes, the message will sit in the broker for up to the ack delay before it gets delivered to another consumer.) But of course, this is super convenient, too. You can write simpler clients, and the complicated stuff is handled by the broker. But having written a lot of these pipelines, my philosophy these days is that — at least for "this must not be allowed to fail" processing, I prefer something that is explicit and simpler and less magical, even if it's a bit less convenient to write code for it. Just my 2 cents! This is getting a bit long. Please do reach out (my email is in my profile) if you want to chat more!

▲

nchmy 4 days ago | parent | prev | next [-]

I dont have kafka experience, but nats is absolutely amazing. Just a complete pleasure to use, in every way.

https://www.synadia.com/blog/nats-and-kafka-compared

▲

dijit 4 days ago | parent | prev | next [-]

I got really pissed off with their field CTO for essentially trying to pull the wool over my eyes regarding performance and reliability.

Essentially their base product (NATs) has a lot of performance but trades it off for reliability. So they add Jetstream to NATs to get reliability, but use the performance numbers of pure NATs.

I got burned by MongoDB for doing this to me, I won’t work with any technology that is marketed in such a disingenuous way again.

▲

AtlasBarfed 4 days ago | parent | next [-]

Don't implement any distributive technology until aphyr has put it through the paces, and even then... Pilot

	▲	munksbeer 3 days ago \| parent [-]
		https://aphyr.com/about "Unavailable Due to the UK Online Safety Act" :(

▲

nchmy 4 days ago | parent | prev [-]

You mean Jetstream?

Can you point to where they are using core NATS numbers to describe Jetstream?

▲

dijit 4 days ago | parent [-]

Yes, I meant Jetstream (I even typed it but second guessed myself, my mistake) I’m typing these when I get a moment as I’m at a wedding- so I apologise.

The issue in the docs was that there are no available Jetstream numbers, so I talked over a video call to the field CTO, who cited the base NATs numbers to me, and when I pressed him on if it was with Jetstream he said that it was without: so I asked for them with Jetstream enabled and he cited the same numbers back to me. Even when I pressed him again that “you just said those numbers are without Jetstream” he said that it was not an issue.

So, I got a bit miffed after the call ended, we spent about 45 minutes on the call and this was the main reason to have the call in the first place so I am a bit bent about it. Maybe its better now, this was a year ago.

▲

jeremyjh 4 days ago | parent [-]

This doesn’t really support your position as far as most readers are concerned - it sounds like a disconnect. If they didn’t do this in any ad copy or public docs it’s not really in Mongo territory.

▲

dijit 4 days ago | parent | next [-]

I don’t really care.

I’m telling you why I am skeptical of any tech that intentionally obfuscates trade-offs, I’m not making a comparison on which of these is worse; and I don’t really care if people take my anecdote seriously either: because they should make their own conclusions.

However it might help people go in to a topic about performance and reliability from a more informed position.

	▲	nchmy 3 days ago \| parent [-]
		I don't doubt your experience. But I think it might have been more just that guy, than NATS in general. The other day i was listening to a podcast with their ceo from maybe 6 months ago, and he talked quite openly about how jetstream and consumers add considerable drag compared to normal pubsub. And, more generally, how users unexpectedly use and abuse nats, and how they've been able to improve things as a result.

▲

zaphirplane 4 days ago | parent | prev [-]

It’s deceptive if true, why are you trying to spin it as it’s ok cause the deception were not published

	▲	jeremyjh 3 days ago \| parent [-]
		Its not ok if there was deception, but it sounds just as likely its a communication disconnect in their call. We only have one side of it.

▲

sea-gold 4 days ago | parent | prev [-]

There is a good comparison between NATS, Kakfa, and others here: https://docs.nats.io/nats-concepts/overview/compare-nats

	▲	rockwotj 3 days ago \| parent \| next [-]
		Maybe needs a neutral party comparison :) The delivery guarantees section alone doesn’t make me trust it. You can do at least once or at most once with kafka. Exactly once is mostly a lie, it depends on the downstream system: unless going back to the same system, the best you can do is at least once with idempotancy
	▲	chatmasta 3 days ago \| parent \| prev [-]
		It’s on the NATS website and “nats” appears in the URL three times, so maybe this isn’t the most objective source.