Tuna-Fish, a day ago:
Time for my daily "HBF is coming" comment. The next step for models is to put the weights on flash, connected with a very wide interface to the accelerator. The first users will be datacenters, but it should trickle down to consumer hardware eventually. A single 512 GB stack is expected to cost about $200 and provide 1.6 TB/s of reads. You still need some fast DRAM for the KV cache and for activations, but the weights should be sitting on flash.
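A quick back-of-envelope check on the figures in this comment: if each decoded token requires streaming all active weights from flash, then read bandwidth caps single-stream decode throughput. The model size and bytes-per-parameter below are illustrative assumptions, not from the comment; only the 1.6 TB/s figure is.

```python
def max_tokens_per_s(active_weight_bytes: float, read_bw_bytes_per_s: float) -> float:
    """Upper bound on decode tokens/s when weight reads dominate."""
    return read_bw_bytes_per_s / active_weight_bytes

HBF_BW = 1.6e12         # 1.6 TB/s per stack, figure from the comment
DENSE_500B_FP8 = 500e9  # hypothetical dense 500B-param model at 1 byte/param

print(max_tokens_per_s(DENSE_500B_FP8, HBF_BW))  # 3.2 tokens/s upper bound
```

This is why sparsity matters for the HBF argument: a mixture-of-experts model that activates only a fraction of its parameters per token raises this bound proportionally.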
zozbot234, a day ago (reply):
Reading from flash is too power-intensive compared to DRAM, which is why flash offload isn't used in the data center today. Flash is also prone to wearing out quickly, so ephemeral data like the KV cache can't really be stashed there. Unless your model has an unprecedented level of sparsity, I just don't see how HBF could ever be useful.
nickpsecurity, a day ago (reply):
You're thinking in a provably useful direction: