Tuna-Fish, a day ago:
Time for my daily "HBF is coming" comment. The next step for models is to put the weights on flash, connected with a very wide interface to the accelerator. The first users will be datacenters, but it should trickle down to consumer hardware eventually. A single 512 GB stack is expected to cost about $200 and provide 1.6 TB/s of reads. You still need some fast DRAM for the KV cache and for activations, but the weights should be sitting on flash.
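A quick back-of-envelope check on the figures in this comment: if each decoded token requires streaming all active weights from flash, then read bandwidth caps single-stream decode throughput. The model size and bytes-per-parameter below are illustrative assumptions, not from the comment; only the 1.6 TB/s figure is.

```python
def max_tokens_per_s(active_weight_bytes: float, read_bw_bytes_per_s: float) -> float:
    """Upper bound on decode tokens/s when weight reads dominate."""
    return read_bw_bytes_per_s / active_weight_bytes

HBF_BW = 1.6e12         # 1.6 TB/s per stack, figure from the comment
DENSE_500B_FP8 = 500e9  # hypothetical dense 500B-param model at 1 byte/param

print(max_tokens_per_s(DENSE_500B_FP8, HBF_BW))  # 3.2 tokens/s upper bound
```

This is why sparsity matters for the HBF argument: a mixture-of-experts model that activates only a fraction of its parameters per token raises this bound proportionally.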
zozbot234, a day ago (reply):
Reading from flash is too power-intensive compared to DRAM, which is why flash offload isn't used in the data center today. Flash is also prone to wearing out quickly, so ephemeral data like the KV cache can't really be stashed there. Unless your model has an unprecedented level of sparsity, I just don't see how HBF could ever be useful.
nickpsecurity, a day ago (reply):
You're thinking in a provably useful direction: