| ▲ | mlsu 3 hours ago |
Models on the phone are never going to make sense. If you're loading gigabytes of model weights into memory, you're also pushing gigabytes through the compute for every inference. No matter how you slice it, no matter how dense you make the chips, that's going to cost a lot of energy. It's too energy intensive, simple as that.

"On device" inference (for large LLMs, I mean) is a total red herring. You basically never want to do it unless you have unique privacy considerations and a power cable attached to the wall. For a phone, maybe you'd want a very small model (like 3B, something in that size range) for Siri-like capabilities. But each query/response is going to cost you around 0.5% of your battery, and that just isn't tenable for the way these models are being used.

Try this for yourself: load a 7B model on your laptop and talk to it for 30 minutes. These things suck energy like a vacuum, even the shitty models. A network round trip gets you hundreds of tokens from a SOTA model and costs 1 joule. By contrast, a single forward pass (one token) of a shitty 7B model costs 1 joule all on its own. It's just not tenable.
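If you want to sanity-check that 0.5% figure, here's a rough back-of-envelope in Python. The ~1 J/token number is the one above; the battery capacity and response length are my own assumptions, not measurements:

    battery_wh = 15.0                     # assumed phone battery, ~15 Wh
    battery_j = battery_wh * 3600         # = 54,000 J

    j_per_token_local = 1.0               # ~1 J per forward pass of a 7B model (claim above)
    tokens_per_response = 270             # assumed length of a Siri-style answer

    local_response_j = j_per_token_local * tokens_per_response   # 270 J
    print(f"on-device: {100 * local_response_j / battery_j:.2f}% of battery per response")

    network_round_trip_j = 1.0            # ~1 J buys hundreds of tokens from a SOTA model (claim above)
    print(f"on-device vs. network: ~{local_response_j / network_round_trip_j:.0f}x the energy")

With those numbers you get 0.50% of the battery per response, and roughly 270x the energy of just shipping the query over the network.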
| ▲ | russellbeattie 2 hours ago | parent |
Huh, I hadn't thought of battery limitations. Good call. My initial reaction is that bigger/better batteries, hyper-fast recharge times, and more efficient processors might address this issue, but I need to learn more about it. That said, power consumption is one of the reasons I think pushing this stuff to the edge is the only real path for AI in terms of a business model. It basically spreads the load and passes the cost of power to the end user, rather than trying to figure out how to pay for it at the data center level.