cubefox 3 hours ago

I'm confused, that doesn't make sense to me:

> They largely come from hyperscalers who want hard drives for their AI data centers, for example to store training data on them.

What type of training data? LLMs need relatively little of that. For example, DeepSeek-V3 [1], still a relatively large model:

> We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens

At 2 bytes per token, that's 29.6 terabytes. That's basically nothing compared to the amount of 4K content that is uploaded to YouTube every day.
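A quick sanity check of that back-of-envelope figure, as a sketch in Python (the 2 bytes/token average is the assumption from above, not a measured value):

    # Rough size of an LLM pre-training corpus on disk.
    tokens = 14.8e12         # DeepSeek-V3 pre-training tokens [1]
    bytes_per_token = 2      # assumed average for UTF-8 text
    corpus_tb = tokens * bytes_per_token / 1e12
    print(f"{corpus_tb:.1f} TB")  # -> 29.6 TB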

1: https://arxiv.org/html/2412.19437v1

citrin_ru 5 minutes ago | parent | next [-]

A few random thoughts:

Many new data centers are being built and filled with servers. Most servers have at least 2 HDDs (a mirrored pair) for the OS. I would not be surprised if, at that scale, even 2 HDDs per server could cause an HDD shortage (a sketch at the end of this comment puts illustrative numbers on this).

There are likely models being trained on 4K video, and that has to be stored somewhere too.

Even things like logs and metrics can consume petabytes for a large (and complex) cluster. And the less mature the software, the more logs you need to debug it in production.

In the AI race, investment is, if not unlimited, at least abundant. Under such conditions, optimizing hardware usage is a waste of time and velocity is the only thing that matters.
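To put illustrative numbers on the first and third points, a hypothetical sketch (the fleet size and log rate are made-up assumptions, not reported figures):

    servers = 1_000_000           # assumed fleet across new data centers
    os_drives = servers * 2       # mirrored OS pair per server, as above
    print(f"OS drives alone: {os_drives:,}")  # -> 2,000,000 HDDs

    # Assume each server emits ~100 MB of logs/metrics per day.
    log_gb_per_server_day = 0.1
    fleet_log_pb_per_year = servers * log_gb_per_server_day * 365 / 1e6
    print(f"~{fleet_log_pb_per_year:.0f} PB of logs per year")  # -> ~36 PB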

Jach 2 hours ago | parent | prev | next [-]

You may have answered your own question: they want to train models on video and other media.

greatgib 2 hours ago | parent | prev [-]

Honestly, this looks highly suspicious to me. OK, they might need some big storage, on the order of petabytes. But how can that be anywhere near in proportion to the capacity already needed for everything that is hard-drive hungry: every cloud service, every storage service, all the storage for private photos/videos/media produced every day, and all consumer hardware like computers...

GPUs I understand, but hard drives look excessive. It's as if tomorrow there were a shortage of computer cabling because AI data centers need some.

zozbot234 2 hours ago | parent [-]

If you're building for future training needs and not just present ones, it makes more sense. Scaling laws say the more data you have, the smarter and more knowledgeable your AI model gets in the end. So that extra storage can be quite valuable.
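For the flavor of the scaling laws being invoked here, a sketch of the Chinchilla-style parametric loss fit (Hoffmann et al. 2022); the constants are the approximate published fits, used purely for illustration:

    # L(N, D) = E + A/N^alpha + B/D^beta: loss falls as data D grows.
    def loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
        return E + A / n_params**alpha + B / n_tokens**beta

    # Fixed 70B-parameter model, increasing data:
    for d in (1e12, 1e13, 1e14):
        print(f"{d:.0e} tokens -> loss ~ {loss(7e10, d):.3f}")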

coffeebeqn 2 hours ago | parent [-]

If you’re building a text-only model then the storage needs are modest, but once you get to things like video they explode by orders of magnitude.
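A sketch of why video blows things up, with illustrative assumptions for bitrate and corpus size (neither comes from the thread):

    text_corpus_tb = 29.6                # the DeepSeek-V3 figure upthread

    # Assume 4K video at ~15 Mbit/s, i.e. ~6.75 GB per hour.
    gb_per_hour = 15e6 * 3600 / 8 / 1e9
    video_hours = 10_000_000             # assumed: 10M hours of footage
    video_corpus_tb = video_hours * gb_per_hour / 1e3

    print(f"video corpus: ~{video_corpus_tb:,.0f} TB")  # -> ~67,500 TB
    print(f"vs text: ~{video_corpus_tb / text_corpus_tb:,.0f}x larger")  # ~2,280x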