mrweasel 4 days ago

I previously helped clients set up and run Kafka clusters. Why they'd need Kafka was always our first question, never got a good answer from a single one of them. That's not to say that Kafka isn't useful, it is, in the right setting, but that setting is never "I need a queue". If you need a queue, great, go get RabbitMQ, ZMQ, Redis, SQS, named pipes, pretty much anything but Kafka. It's not that Kafka can't do it, but you are making things harder than they need to be.

Joeri 4 days ago | parent | next [-]

Kafka isn’t a queue, it’s a distributed log. A partitioned topic can take very large volumes of message writes, persist them indefinitely, deliver them to any subscriber in-order and at-least-once (even for subscribers added after the message was published), and do all of that distributed and HA.

If you need all those things, there just are not a lot of options.

diarrhea 3 days ago | parent | next [-]

Perhaps the best terse summary of Kafka I have come across yet.

holografix 3 days ago | parent | prev | next [-]

Finally, a valuable answer. Thank you.

HarHarVeryFunny 3 days ago | parent | prev | next [-]

Why do you say log rather than just publish and subscribe?

exitb 3 days ago | parent | next [-]

Clients don’t have to subscribe to latest messages, but rather can request any available offset range.

halifaxbeard 3 days ago | parent | prev | next [-]

The log stays on Kafka for replay until your per-log retention settings delete it.
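
To make the "log, not pub/sub" point concrete, here is a minimal in-memory sketch of the idea described above: records are addressed by offset, readers can ask for any retained offset range, and retention (not consumption) is what removes data. `TopicLog` and its methods are hypothetical names for illustration only, not Kafka's API, and a real partition is of course persistent and replicated.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy single-partition log: append-only records addressed by offset."""
    base_offset: int = 0          # offset of the oldest retained record
    records: list = field(default_factory=list)

    def append(self, value):
        """Append a record and return the offset it was assigned."""
        self.records.append(value)
        return self.base_offset + len(self.records) - 1

    def read(self, start, end):
        """Return (offset, value) pairs for retained offsets in [start, end)."""
        start = max(start, self.base_offset)
        end = min(end, self.base_offset + len(self.records))
        return [(o, self.records[o - self.base_offset]) for o in range(start, end)]

    def truncate_before(self, offset):
        """Stand-in for retention: drop every record with an offset below `offset`."""
        drop = max(0, min(offset - self.base_offset, len(self.records)))
        self.records = self.records[drop:]
        self.base_offset += drop


log = TopicLog()
for event in ["a", "b", "c", "d"]:
    log.append(event)

# Reading never removes anything; each reader just picks the offsets it wants.
late_subscriber = log.read(0, 4)     # joins late, still sees everything
replay = log.read(1, 3)              # any available offset range

log.truncate_before(2)               # retention deletes the oldest records
after_retention = log.read(0, 4)     # only offsets 2 and 3 are left
```

Offsets stay stable after truncation in this sketch, which is why a reader can keep its own position without coordinating with anyone else.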

HarHarVeryFunny 3 days ago | parent | prev [-]

The way people choose to use feedback on HN never fails to surprise me - we've got a generally intelligent user base here, but the most common type of feedback voting isn't because something is wrong but rather a childish "I don't like it - I want to suppress this comment".

In this case it's something different - this was an honest question, and received two useful replies, so why downvote?! The mental model of people using Kafka is useful to know - in this case the published data being more log-like than stream-like since it's retained per a TTL policy, with each "subscriber" having their own controllable read index.

MYEUHD 3 days ago | parent [-]

> Please don't comment about the voting on comments. It never does any good, and it makes boring reading.

https://news.ycombinator.com/newsguidelines.html#comments

HarHarVeryFunny 3 days ago | parent [-]

Same goes for metacomments about voting.

edem 3 days ago | parent | prev | next [-]

What do you think about Temporal?

cyberpunk 3 days ago | parent [-]

Okay for small numbers of high-value jobs (e.g. Uber trips or food deliveries), prohibitively expensive for anything you need even a few k/sec of.

mannycalavera42 3 days ago | parent | prev [-]

HERO

EdwardDiego 4 days ago | parent | prev | next [-]

I'd started using it at v0.8 at a previous adtech company because my problem was "We generate terabytes of events a day we need to process and aggregate and bill on, how the hell do we move this data around reliably?"

The data team I'd inherited had started with NFS and shell scripts, before a brief detour into GlusterFS after NFS proved to be, well, NFS. GlusterFS was no better.

Using S3 was better, but we still hit data loss problems (on our end, not S3's, to be clear), which isn't great when you need to bill on some of that data.

Then I heard about Kafka, bought a copy of I <3 Logs, and decided that maybe Kafka was worth the complexity, and boom, it was. No more data loss, and a happier business management.

I was headhunted for my current company for my Kafka experience. First thing I realised when I looked at the product was - "Ah, we don't need Kafka for this."

But the VP responsible was insistent. So now I spend a lot of time doing education on how to use Kafka properly.

And the very first thing I start with is "Kafka is not a queue. It's a big dumb pipe that does very smart things to move data efficiently and with minimal risk of data loss - and the smartest thing it does is choosing to be very dumb.

Want to synchronously know if your message was consumed? Kafka don't care. You need a queue."

Gee101 16 hours ago | parent | next [-]

Have you found a good Head for Kafka to easily query the Topics using a SQL like language? Especially something that can infer table schema from the Schema Registry.

edem 3 days ago | parent | prev [-]

Do you have a blog somewhere? Where do I learn how to use Kafka properly? I like the idea behind it, but its quirks...not so much.

slau 4 days ago | parent | prev | next [-]

ZMQ is not a managed queue. It’s a networking library.

kachapopopow 3 days ago | parent | prev | next [-]

I do not recommend Redis (janky implementation, subscribers drop randomly, Java libraries are a crime against humanity) or RabbitMQ (memory issues). ZMQ is not a message queue, named pipes are not reliable, and what the hell is SQS?

prerok 3 days ago | parent | next [-]

SQS is the Simple Queue Service in AWS. It's OK and guarantees at-least-once delivery, but I am not sure how useful it is for large volumes of messages (by this I just mean that we use it for low-volume messages and I don't have experience using it at larger volumes).

stickfigure 3 days ago | parent [-]

SQS is fantastic at exceptionally high total volumes of messages - you probably can't saturate it. But it's not great for streaming a list of ordered messages. SQS has a FIFO mode but performance will never be what you can get out of Kafka.

Also, SQS isn't pub/sub. Kafka and SQS really have very different use cases.
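
A toy contrast of the two delivery models, as a sketch only: plain Python containers stand in for the services, the worker and subscriber names are made up, and the round-robin hand-off is a simplification (a real queue service makes no such promise and may redeliver a message).

```python
from collections import deque

messages = ["m1", "m2", "m3", "m4"]

# Queue semantics: each message is handed to one of the competing workers
# and is gone from the queue once taken.
queue = deque(messages)
queue_received = {"worker_a": [], "worker_b": []}
workers = list(queue_received)
turn = 0
while queue:
    queue_received[workers[turn % len(workers)]].append(queue.popleft())
    turn += 1

# Log semantics: messages stay put and every subscriber reads all of them,
# in order, from its own offset.
log = list(messages)
offsets = {"billing": 0, "analytics": 0}
log_received = {name: [] for name in offsets}
for name in offsets:
    while offsets[name] < len(log):
        log_received[name].append(log[offsets[name]])
        offsets[name] += 1
```

With the queue, the work is split between the workers; with the log, both subscribers end up with the full ordered list and the data is still there afterwards.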

prerok 2 days ago | parent [-]

Agreed, I was just trying to answer the parent's question as to what it is.

gigatexal 3 days ago | parent | prev [-]

How have you had so many issues with Redis? We used it at a previous place and it was basically bulletproof. That being said, we didn’t use Java but Python. Idk.

kachapopopow 3 days ago | parent [-]

lettuce - doesn't reconnect properly during Redis restarts (1/10 chance)

jedis - subscriptions drop and stop receiving for no reason

However, my latest wrapper for jedis does seem to be holding up and I haven't had too many issues, but I have very robust checking for dropped connections.
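
For readers wondering what "checking for dropped connections" can look like, here is one possible shape, sketched in Python rather than Java. Everything in it is an assumption for illustration: `connect` is a hypothetical factory returning an object with a `get_message()` method that returns a message or `None`, and the silence timeout only makes sense if the publisher sends something (for example a heartbeat) at least that often. It is not the wrapper described above.

```python
import time


class ResilientSubscriber:
    """Re-subscribes when the connection errors or goes silent for too long."""

    def __init__(self, connect, max_silence, clock=time.monotonic):
        self._connect = connect          # hypothetical factory for a subscription
        self._max_silence = max_silence  # seconds of silence tolerated
        self._clock = clock
        self._sub = connect()
        self._last_seen = clock()
        self.reconnects = 0

    def _reconnect(self):
        """Replace the subscription and restart the silence timer."""
        self._sub = self._connect()
        self._last_seen = self._clock()
        self.reconnects += 1

    def poll(self):
        """Return the next message or None, reconnecting if the subscription looks dead."""
        try:
            message = self._sub.get_message()
        except ConnectionError:
            self._reconnect()
            return None
        if message is not None:
            self._last_seen = self._clock()
        elif self._clock() - self._last_seen > self._max_silence:
            # No error was raised, but nothing has arrived for too long:
            # treat it as a silently dropped subscription.
            self._reconnect()
        return message
```

The silence check is the part that covers the "stop receiving for no reason" case, where no exception is ever raised.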

HarHarVeryFunny 3 days ago | parent | prev | next [-]

IMO, recommending RabbitMQ depends on what language you are using and how well suited the available client libraries are to your use case.

I used RabbitMQ a few years back on a C++ project, and at the time (has anything changed?) the best supported C++ client library seemed to be AMQP-CPP, which isn't thread-safe, so we had to write an application interface layer to address this in a performant way.

In our case we wanted to migrate a large legacy system from CORBA (point to point) to a more flexible bus-based architecture, so we also had to implement a CORBA-like RPC layer on top of Rabbit, including support for synchronous delivery failure detection, which required more infrastructure to be built on top of AMQP-CPP. In the end the migration was successful, but it felt like we were fighting AMQP-CPP a lot of the way.

SJC_Hacker 2 days ago | parent [-]

Out of curiosity, what was the issue with just wrapping the AMQP-CPP pub/sub calls in a mutex?

HarHarVeryFunny 11 hours ago | parent [-]

I'm a bit hazy on the full details because it was a few years ago, but basically it gets more complicated because you subscribe by installing an async callback which needs to ack/nak messages, needing locking, and will be called from the context of the network event loop that also needs locking. If you do any real work in the message processing callback then you'll be blocking the event loop, so the callback has to defer processing by queuing C++ lambdas capturing the context, and running those in a thread pool.
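
The pattern described above (keep the callback cheap, defer the real work to a thread pool, and take a lock around every call into the non-thread-safe client) can be sketched roughly like this. It is in Python rather than C++ for brevity, `FakeChannel` is a hypothetical stand-in and not AMQP-CPP's API, and in a real setup the event loop thread would need to take the same channel lock too.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeChannel:
    """Hypothetical stand-in for a client channel that is not thread-safe."""

    def __init__(self):
        self.acked = []

    def ack(self, tag):
        self.acked.append(tag)


channel = FakeChannel()
channel_lock = threading.Lock()       # guards every call into the channel
pool = ThreadPoolExecutor(max_workers=4)
results = []
results_lock = threading.Lock()


def process(tag, body):
    """Slow work runs on a pool thread, never on the event loop thread."""
    outcome = body.upper()            # placeholder for real processing
    with results_lock:
        results.append(outcome)
    with channel_lock:                # ack only while holding the channel lock
        channel.ack(tag)


def on_message(tag, body):
    """Called from the event loop: capture the context and return immediately."""
    return pool.submit(process, tag, body)


# Simulate the event loop delivering three messages.
futures = [on_message(tag, body) for tag, body in enumerate(["a", "b", "c"])]
for future in futures:
    future.result()                   # wait so the example finishes deterministically
pool.shutdown()
```

A single mutex around publish/subscribe calls does not cover this on its own, because the acks are issued later from pool threads while the event loop may be touching the same channel.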

adev_ 4 days ago | parent | prev | next [-]

> Why they'd need Kafka was always our first question, never got a good answer from a single one of them

"To follow the hype train, Bro" is often the real answer.

> If you need a queue, great, go get RabbitMQ, ZMQ, Redis, SQS, named pipes, pretty anything but Kafka.

Or just freaking MQTT.

MQTT has been battle-proven for 25 years, is simple, and does the job perfectly if you do not ship GBs of blobs through your messaging system (which you should not do anyway).

atomicnumber3 4 days ago | parent | next [-]

It's resume-driven development. It honestly can make sense for both company and employee.

Companies get standard tech stacks people are happy to work with, because working with them gets people experience with tech stacks that are standard at many companies. It's a virtuous cycle.

And sure, even if you need just a specific thing, it's often better to go slightly overkill for something that's got millions of Stack Overflow solutions for common issues figured out, vs. picking some niche thing where you are now 1 of like six total people in the entire world using it in prod.

Obviously the dose makes the poison and don't use kafka for your small internal app thing and don't use k8s where docker will do, but also, probably use k8s if you need more than docker instead of using some weird other thing nobody will know about.

laughing_man 3 days ago | parent [-]

That's what happened where I worked. The people making the tech decisions were worried they weren't "keeping up" and committed us all-in on kafka. That decision cost the company millions.

adev_ 3 days ago | parent [-]

> That decision cost the company millions.

And 5 years later the person responsible for the decision left the company with a giant pile of mess behind him/her.

But let's see things positively: he can now add "Kafka at scale" on the CV.

laughing_man 3 days ago | parent [-]

That is exactly what happened.

munksbeer 3 days ago | parent | prev | next [-]

> Or just freaking MQTT.

Disclaimer: I'm a dev and I'm not very familiar with the actual maintenance of kafka clusters. But we run the aws managed service version (MSK), and it seems to just pretty much work.

We send terabytes of data through Kafka asynchronously, because of its HA properties and persistent log, allowing consumers to consume in their own time and put the data where it needs to be. So imagine, many apps across our entire stack have the same basic requirement: publish a lot of data which people want to analyse somewhere later. Kafka gives us a single mechanism to do that.

So now my question. I've never used MQTT before. What are the benefits of using MQTT in our setup vs using kafka?

cowanon2222 3 days ago | parent [-]

I use MQTT daily. I'm not sure why the commenter suggested it; it is orthogonal to queueing or log streams.

MQTT is a publish/subscribe protocol for large-scale distributed messaging, often used in small embedded devices or factories. It is made for efficient transfer of small, often byte-sized payloads of IoT device data. It does not replace Kafka or RabbitMQ - messages should be read off of the MQTT broker as quickly as possible. (I know this from experience - MQTT brokers get bogged down rapidly if there are too many messages "in flight".)

A very common pattern is to use MQTT for communications, and then Kafka or RabbitMQ for large-scale queuing of those messages for downstream applications.

adev_ 3 days ago | parent | next [-]

> it is orthogonal to queueing or log streams.

That is currently the problem.

A lot of the Kafka usage I have seen in the wild is not for log streams or queueing but is deployed as a simple pub/sub messaging service because "why not".

munksbeer 3 days ago | parent | prev [-]

Thank you.

mdaniel 3 days ago | parent | prev | next [-]

I presume one will want to use https://github.com/eclipse-mosquitto/mosquitto if going that route because I seem to recall the "mainstream" MQTT project doing a rugpull but since I'm not deeply in that community, I don't have substantiating links handy

physicles 3 days ago | parent | prev [-]

MQTT and Kafka solve different problems. At my current company, we use both.

Kafka isn’t a queue. It’s overkill to use it as one.

Kafka is a great place to persist data for minutes, hours or days before it’s processed. It fully decouples producers and consumers. It’s also stupidly complex and very hard to operate reliably in an HA configuration.

MQTT is good for when data needs to leave or enter your cloud, but persistence is bolted on (at least it is in mosquitto), so a crash means lost data even though you got a PUBACK.

selkin 3 days ago | parent | prev | next [-]

KIP-932[0] adds queue semantics for Kafka. You may still want to use another queue though: as always, no one size fits all.

[0] https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A...

hvb2 3 days ago | parent | next [-]

I CAN drive a Ferrari to the grocery store.

Sure, it can do it. But it's not efficient or what it's good at.

EdwardDiego 3 days ago | parent | prev [-]

I'm really unsure where the drive for this is coming from tbh. (cough CFLT share price since IPO, big enterprise customers cough) If this was so desirable, everyone would've jumped ship to Pulsar already.

GoblinSlayer 2 days ago | parent | prev [-]

I think our salesmen were happier to sell kafka, because it's enterprisey. Redis is better? Well, now we use kafka and redis.