physicles 3 days ago
Totally agree with this. I'll add that replaying your data needs special tooling to 1) find the correct offsets on each topic, and 2) spin up whatever daemon will consume that data out-of-band from normal processing, and shut it down when it finishes.

I don't remember where I read this, but someone made the observation that writing a stream processing system is about 3x harder than writing a batch system, exactly for all the reasons you mentioned.

I'm looking at replacing some of our Kafka usage with a ClickHouse table that's ordered and partitioned by insertion time, because if I want to do stuff with that data stream, at least I can do a damn SQL query.
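A rough sketch of the ClickHouse side (table and column names made up):

    CREATE TABLE events
    (
        inserted_at DateTime DEFAULT now(),
        topic       String,
        payload     String
    )
    ENGINE = MergeTree
    PARTITION BY toDate(inserted_at)
    ORDER BY inserted_at;

    -- "Replaying" a window of the stream is just a range scan in insertion order:
    SELECT topic, payload
    FROM events
    WHERE inserted_at >= '2024-01-01 00:00:00'
      AND inserted_at <  '2024-01-02 00:00:00'
    ORDER BY inserted_at;

Versus Kafka, where the same replay means calling offsetsForTimes() for every partition, seeking a throwaway consumer to those offsets, and tearing it all down afterwards.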
fifilura 3 days ago
Yes I'll happily extend that to 10x more difficult. At least compared to building a batched pipeline with SQL. I think you should really think hard whether you really need a streaming pipeline. And even if you find that you do, it may be worthwhile to make a batched pipeline as your first implementation. I did exactly what you describe in my previous job. In the beginning with reluctance from our architects who wanted to keep banging the dead horse and did not understand the power of SQL "SQL is not real programming, engineers write java" (ok maybe I deserve a straw-man yellow card here, they don't deserve all of that). But I think they understood after a while. With AWS Athena and Airflow. Good luck, consider me your distant moral support. |