majormajor 4 days ago

What off the shelf tools in 2012 would you propose, exactly?

tomrod 4 days ago | parent | next [-]

Sounds like MQTT?

majormajor 4 days ago | parent [-]

MQTT wouldn't give you the persistence or the decoupling of fast and slow consumers.
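To make that concrete, here's a rough sketch (kafka-python; the topic, group names and broker address are made up) of the decoupling of fast and slow consumers mentioned above: each consumer group keeps its own offsets against the same retained topic, so a slow consumer can lag without holding anyone else back.

    # Sketch only: two consumer groups reading the same retained topic at different speeds.
    # Topic/group names are hypothetical; assumes a Kafka broker on localhost and kafka-python installed.
    from kafka import KafkaConsumer

    def make_consumer(group_id):
        return KafkaConsumer(
            "site-activity",                    # hypothetical topic holding change events
            bootstrap_servers="localhost:9092",
            group_id=group_id,                  # each group commits its own offsets
            auto_offset_reset="earliest",       # a new or lagging group can read retained history
        )

    fast = make_consumer("newsfeed-service")    # keeps up with the head of the log
    slow = make_consumer("warehouse-ingest")    # can fall hours behind without blocking anyone

    for msg in slow:
        print("warehouse ingest sees:", msg.value)   # slow consumption never back-pressures producers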

zug_zug 4 days ago | parent | prev [-]

Make it less event-orchestrated and use a db. It’s just a social network for recruiters; it’s not as complicated as they like to pretend.

You don’t need push; it’s just a performance optimization that almost never justifies using a whole new tool.

majormajor 4 days ago | parent | next [-]

So solve "ETLs into a data warehouse are hard to make low-latency and hard to manage in a large org" by... just hypothetically better "off the shelf tools"? Or by "you don't need low latency because you're 'just' a recruiting tool, so who cares how quickly you can get insights into your business"?

Go back to the article; it wasn't about event-sourcing or replacing a DB for application code.

inkyoto 3 days ago | parent | prev | next [-]

> It’s just a social network for recruiters; it’s not as complicated as they like to pretend.

Dismissing this as «just a social network» understates the real constraints: enormous scale, global privacy rules, graph queries, near-real-time feeds and abuse controls. Periodic DB queries can work at small scale, but at high volume they either arrive late or create bursts that starve the primary. Capturing changes once and pushing them through a distributed transaction log such as Kafka evens out load, improves data timeliness and lets multiple consumers process events safely and independently. It does add operational duties – schema contracts, idempotency and retention – yet those are well-understood trade-offs. The question is not push versus pull in the abstract, but which approach meets the timeliness, fan-out and reliability requirements at hand.

> You don’t need push; it’s just a performance optimization that almost never justifies using a whole new tool.

It is not about drama but about fit for purpose at scale.

Pull can work well for modest workloads or narrow deltas, especially with DB features such as incremental materialised views or change tables. At large scale, periodic querying becomes costly and late: you either poll frequently and hammer the primary, or poll infrequently and accept stale data. Even with cursoring and jitter, polls create bursty load and poor tail latencies.
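For a concrete picture of the pull side, a watermark-style polling loop looks roughly like this (a sketch only; the table, column and function names are assumptions, and SQLite stands in for the primary):

    # Sketch of the pull approach: poll with a cursor/watermark plus jitter.
    # Table and column names are hypothetical; SQLite used purely for illustration.
    import random
    import sqlite3
    import time

    def handle(payload):
        pass  # placeholder for downstream processing

    def poll_changes(conn, last_seen_id, batch_size=500):
        rows = conn.execute(
            "SELECT id, payload FROM member_activity WHERE id > ? ORDER BY id LIMIT ?",
            (last_seen_id, batch_size),
        ).fetchall()
        return rows, (rows[-1][0] if rows else last_seen_id)

    conn = sqlite3.connect("app.db")
    cursor_pos = 0
    while True:
        rows, cursor_pos = poll_changes(conn, cursor_pos)
        for _id, payload in rows:
            handle(payload)
        # jitter spreads the bursts out, but every extra consumer is another poller hitting the primary
        time.sleep(5 + random.random())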

Push via change data capture into a distributed log such as Kafka addresses such pain points. The log decouples producers from consumers, smooths load, improves timeliness and lets multiple processors scale independently and replay for backfills. It also keeps the OLTP database focused on transactions rather than fan-out reads.
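The replay point in particular is hard to reproduce with polling; with a retained log a brand-new consumer can simply start from the beginning. A minimal sketch, again with kafka-python and made-up topic/group names:

    # Sketch: backfilling a new consumer by replaying the retained log from the start.
    # Topic and group names are hypothetical.
    from kafka import KafkaConsumer, TopicPartition

    def rebuild_feature(event_bytes):
        pass  # placeholder: idempotent apply into the new derived store

    topic = "profile-changes"
    consumer = KafkaConsumer(
        bootstrap_servers="localhost:9092",
        group_id="ml-feature-backfill",
        enable_auto_commit=False,
    )
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    consumer.assign(partitions)
    consumer.seek_to_beginning(*partitions)    # start from the oldest retained event, no OLTP table scan needed

    for msg in consumer:
        rebuild_feature(msg.value)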

This is not free: push introduces operational work and design care – schema contracts, per-partition-only ordering, duplicate delivery and idempotency, back-pressure and retention governance including data-protection deletes. The usual mitigations are the outbox pattern, idempotent consumers, DLQs and documented data contracts. The data-processing complexity now lives in each consumer rather than in a central engine (e.g. a DB).
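To illustrate two of those mitigations, an outbox write plus an idempotent consumer might look like this (a sketch only; the table names and event shape are assumptions, not anyone's actual design):

    # Sketch of the outbox pattern: the state change and the event row commit in one transaction,
    # and a separate relay ships outbox rows to the log. The consumer de-duplicates by event id.
    # Table names are hypothetical; SQLite used purely for illustration.
    import json
    import sqlite3
    import uuid

    def update_title(conn, member_id, new_title):
        event_id = str(uuid.uuid4())
        with conn:  # one transaction: either both rows commit or neither does
            conn.execute("UPDATE members SET title = ? WHERE id = ?", (new_title, member_id))
            conn.execute(
                "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                (event_id, json.dumps({"member_id": member_id, "title": new_title})),
            )

    def apply_event(conn, event_id, payload):
        with conn:
            seen = conn.execute(
                "SELECT 1 FROM processed_events WHERE event_id = ?", (event_id,)
            ).fetchone()
            if seen:
                return  # duplicate from at-least-once delivery, safe to skip
            conn.execute("INSERT INTO processed_events (event_id) VALUES (?)", (event_id,))
            # ...apply the actual side effect here...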

Compute–storage separation in modern databases raises single-cluster ceilings for storage and read scale, yet it does not solve single-writer limits or multi-region active-active writes. For heavy write fan-out and near-real-time propagation, a CDC-to-log pipeline remains the safer bet.

To sum it up, both pull and push are valid – engineering is all about assessing the specific use case and its trade-offs. For small or bounded scopes, a well-designed pull loop is simpler. As scale, fan-out and freshness requirements grow, push delivers better timeliness, correctness and operability.

sebastialonso 4 days ago | parent | prev | next [-]

The only correct answer to the question asked is "I don't know the context, I need more information". Anything else is being a bad engineer.

AtlasBarfed 4 days ago | parent | prev [-]

Your solution to a queue and publish subscribe problem is to use a database?

mrkeen 3 days ago | parent [-]

Adding onto this.

> LinkedIn used site activity data (e.g. someone liked this, someone posted this) for many things - tracking fraud/abuse, matching jobs to users, training ML models, basic features of the website (e.g. who viewed your profile, the newsfeed), warehouse ingestion for offline analysis/reporting, etc.

Who controls the database? Is the fraud/abuse team responsible for the migrations? Does the ML team tell the Newsfeed team to stop doing so many writes because it's slowing things down?