qrios 5 days ago
Current "good enough" models like Mistral Small require GPUs like the RTX 6000 to achieve user-friendly response times. The model quality is good enough, especially for narrow-scope tasks like summarization and classification. If Moore's Law holds for a few more years, a mobile device will be able to run it on-device in around 8 years (Apple's A11: 410 GFLOPS vs. RTX 6000: 16 TFLOPS [1]). This is under the assumption that we don't see any significant optimization in the meantime. Looking back over the last eight years, the probability of no progress on the software side is near zero. For a breakthrough in the consumer market, running LLM on-device with today's capabilities requires solving one key topic: "JIT learning" [2]. We can see some progress here [3, 4]. Perhaps the transformer architecture is not the best for this requirement, but it is hard to argue that it is impossible for Generative AI. Due to today's technical limitations, we don't have real personal assistants. This could be the Mac for Apple in the AI era. [1] https://gadgetversus.com/graphics-card/apple-a11-bionic-gpu-... [2] Increasing context size is not a valid option for my scenario as it also increases the computation demand linear. [2] https://arxiv.org/abs/2311.06668 [3] https://arxiv.org/abs/2305.18466 [Edit: decimal separator mess] | |||||||||||||||||
buyucu 5 days ago
Inference is getting cheaper by the minute: hardware is getting cheaper, and smarter ideas like latent attention are spreading.
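A hedged sketch of what the latent-attention idea buys you in memory (dimensions, weight names, and the resulting ratio are illustrative assumptions, not any real model's configuration): cache a small latent vector per token instead of full per-head K/V, and re-expand it when attention is computed.

    import numpy as np

    d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64
    rng = np.random.default_rng(0)

    W_down = rng.standard_normal((d_model, d_latent)) * 0.02           # compress to latent
    W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # expand latent to K
    W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # expand latent to V

    def cache_token(x):
        # Store only d_latent floats per token instead of 2 * n_heads * d_head for full K/V.
        return x @ W_down

    def expand_kv(latents):
        # Recover per-head keys/values from the cached latents at attention time.
        return latents @ W_up_k, latents @ W_up_v

    tokens = rng.standard_normal((16, d_model))   # hidden states for 16 generated tokens
    latent_cache = cache_token(tokens)            # (16, 128) is what stays in the cache
    k, v = expand_kv(latent_cache)                # (16, 512) each, rebuilt on demand
    print(latent_cache.shape, k.shape, v.shape)   # ~8x less cache memory per token here

The point is that inference memory traffic, not just raw FLOPS, is where much of the cheapening comes from.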
mingus88 5 days ago
Apple's answer to that is Private Cloud Compute.