Aurornis | 5 hours ago
It wasn't considered impossible. There are examples of large MoE LLMs running on small hardware all over the internet, like giant models on a Raspberry Pi 5. It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top-spec iPhone Pro, the output is 100x slower than what you get from hosted services.
zozbot234 | 5 hours ago | parent
If the bottleneck is storage bandwidth, that's not "slow". It's only slow if you insist on interactive speeds; the point of this is that you can run cheap inference in bulk on very low-end hardware.
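
A rough sketch of the arithmetic behind that claim (all numbers below are illustrative assumptions, not measurements from the article): if the active expert weights have to be streamed from flash for every token, the storage read rate caps single-stream speed, but one weight read can be amortized across a whole batch of sequences, which is what makes bulk throughput reasonable even when interactive latency is bad.

    # Back-of-envelope estimate, not a real benchmark. All values are assumed.
    active_params   = 3e9    # active (routed) parameters per token for a hypothetical MoE
    bytes_per_param = 0.5    # ~4-bit quantization
    ssd_bandwidth   = 2.5e9  # bytes/sec sustained read for NVMe-class storage (assumed)

    bytes_per_token   = active_params * bytes_per_param          # 1.5 GB streamed per token
    tokens_per_sec    = ssd_bandwidth / bytes_per_token           # ~1.7 tok/s single stream
    print(f"single stream: ~{tokens_per_sec:.1f} tok/s")

    # Batching: one pass over the streamed weights can serve many sequences,
    # so bulk throughput scales roughly with batch size until compute binds
    # (ignoring that different sequences may route to different experts).
    batch = 32
    print(f"batched bulk:  ~{tokens_per_sec * batch:.0f} tok/s aggregate")

So a single interactive session looks painfully slow, but the same storage-bound setup churning through a queue of offline jobs can still produce a useful aggregate rate.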