djoldman 3 days ago

> In 2010, LinkedIn had 90 million members. Today, we serve over 1.2 billion members on LinkedIn. Unsurprisingly, this increase has created some challenges over the years, making it difficult to keep up with the rapid growth in the number, volume, and complexity of Kafka use cases. Supporting these use-cases meant running Kafka at a scale of over 32T records/day at 17 PB/day on 400K topics distributed across 10K+ machines within 150 clusters.

https://www.linkedin.com/blog/engineering/infrastructure/int...

enether 3 days ago | parent [-]

~197 GB/s ... nice.

I believe these companies save literally every ounce of data they can find. Once you have the infra and teams for it, it seems easy to make a case for storing something.

Similarly, Uber has shared that they push 89 GB/s through Kafka, around 7.7 PB/day. People always ask me: what is a taxi/food-delivery app storing that adds up to so much data?
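
A quick back-of-the-envelope check on those throughput numbers (just a sketch, assuming SI units and a uniform load over 24 hours):

    # throughput conversions, assuming 1 PB = 1e15 bytes and uniform load
    SECONDS_PER_DAY = 86_400

    linkedin_pb_per_day = 17
    print(linkedin_pb_per_day * 1e15 / SECONDS_PER_DAY / 1e9)  # ~196.8 GB/s

    uber_gb_per_s = 89
    print(uber_gb_per_s * 1e9 * SECONDS_PER_DAY / 1e15)        # ~7.7 PB/day

Both figures line up once you keep the units straight.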

alt227 3 days ago | parent [-]

The highest-capacity HDD available today is 30 TB.

17 PB works out to about 567 of those drives... being totally filled... per day.
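
Rough math (assuming SI units, i.e. 1 PB = 1,000 TB):

    # drives per day if every byte of the quoted 17 PB/day were written once
    pb_per_day = 17
    drive_capacity_tb = 30
    print(pb_per_day * 1_000 / drive_capacity_tb)  # ~566.7 drives/day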

I was hoping somebody would come and say this is a simple spelling error or something.

The cost of the drives alone seems astronomical, let alone the logistics of the data center keeping up with storing that much data.

EDIT: I have just realised that they are probably only processing data at this rate, rather than storing it. Can anyone confirm whether they store all the logs they process?

enether 3 days ago | parent [-]

I would assume storage varies greatly. I know LinkedIn quoted an average read fanout ratio of 5.5x in Kafka, meaning each byte written was read 5.5 times. Assuming that still holds, and that the 17 PB/day figure is combined read and write traffic, we ought to divide by 6.5 to get the daily write volume.

That comes out to ~87 disks a day. Assuming a 7-day retention period (this is on the high side), it's not unthinkable to have a 600-1800 disk deployment (accounting for replication copies).
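
A sketch of that estimate, assuming the 17 PB/day figure is combined read+write traffic (1x write + 5.5x reads = 6.5x):

    # size a hypothetical deployment from the quoted figures
    total_drives_per_day = 17e15 / 30e12               # ~567 x 30 TB drives/day of traffic
    write_drives_per_day = total_drives_per_day / 6.5  # ~87 drives/day of actual writes

    retention_days = 7
    for replication_factor in (1, 3):
        print(round(write_drives_per_day * retention_days * replication_factor))  # ~610 and ~1831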

lossolo 3 days ago | parent [-]

> That comes out to ~87 disks a day. Assuming a 7-day retention period (this is on the high side), it's not unthinkable to have a 600-1800 disk deployment (accounting for replication copies).

Yep. A whole week can easily be stored in 1-2 racks.