jurmous 4 hours ago

We don’t need 200 GB of RAM on a phone to run big models. Just 200 GB of storage, thanks to Apple’s “LLM in a flash” research.

See: https://x.com/danveloper/status/2034353876753592372

adrian_b 2 hours ago | parent [-]

Yes, I agree that this is the right solution. For a locally-hosted model I value the quality of the output more than the speed with which it is produced, so I prefer the models as they were originally trained, without further quantization.

While that paper praises Apple’s advantage in SSD speed, which allows decent inference performance with huge models, SSD speeds equal to or greater than that can nowadays be achieved in any desktop PC with dual PCIe 5.0 SSDs, or even with one PCIe 5.0 and one PCIe 4.0 SSD.
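As a rough sanity check on what SSD bandwidth buys you, here is a back-of-envelope sketch. The figures are illustrative assumptions, not benchmarks: a ~200 GB set of weights and two PCIe 5.0 NVMe drives at roughly 14 GB/s sequential read each. It gives an upper bound for a dense model where every weight is streamed once per token; techniques like the sparsity tricks in the “LLM in a flash” paper aim to read far less than that per token.

```python
# Back-of-envelope decode-speed bound when all weights are streamed from SSD
# once per token (dense model, no caching). All numbers are assumptions.

def tokens_per_second(model_bytes: float, ssd_bandwidth_bytes: float) -> float:
    """Upper bound on tokens/s if each token requires one full pass
    over the weights read from SSD."""
    return ssd_bandwidth_bytes / model_bytes

model = 200e9            # assumed: ~200 GB of weights
bandwidth = 2 * 14e9     # assumed: two PCIe 5.0 NVMe drives, ~14 GB/s each

print(f"{tokens_per_second(model, bandwidth):.2f} tokens/s")  # prints 0.14 tokens/s
```

The point of the arithmetic: a dense full-weight pass per token is slow, which is exactly why sharing each pass across a batch of requests (or only touching the active weights) matters so much.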

Because I had independently reached the same conclusion, as I presume many others have, a week ago I started modifying llama.cpp to make optimal use of weights stored on SSDs, while also batching many tasks so that they share each pass over the SSDs. I assume that in the following months we will see more projects in this direction, so locally hosting very large models will become easier and more widespread, avoiding the high risks associated with external providers, like the recent enshittification of Claude Code.
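The batching idea above can be sketched in a few lines. This is a hypothetical toy, not llama.cpp code: stream each layer’s weights off the SSD once, then apply that layer to every queued request before moving on, so the whole batch shares one pass over the disk. The `load_layer` callback and the scalar “weights” stand in for the real disk read and matmul.

```python
from typing import Callable, List

def run_batched(load_layer: Callable[[int], float],
                num_layers: int,
                activations: List[float]) -> List[float]:
    """Apply every layer to the whole batch before loading the next layer,
    so the cost of streaming a layer off the SSD is amortized over the batch."""
    for layer_idx in range(num_layers):
        w = load_layer(layer_idx)                   # one SSD read per layer, not per request
        activations = [a * w for a in activations]  # stand-in for the real layer kernel
    return activations

# Toy stand-in for reading a layer's weights from disk, counting the reads.
reads: List[int] = []
def load_layer(i: int) -> float:
    reads.append(i)
    return 1.0 + i

batch = [1.0, 2.0, 3.0]          # three concurrent requests
out = run_batched(load_layer, num_layers=4, activations=batch)
assert len(reads) == 4           # each layer streamed once, shared by all three requests
```

Without batching, the same three requests would have cost twelve layer reads instead of four; the SSD traffic per request drops linearly with batch size, at the price of latency for each individual request.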