qrios 5 days ago

Current "good enough" models like Mistral Small require GPUs like the RTX 6000 to achieve user-friendly response times. The model quality is good enough, especially for narrow-scope tasks like summarization and classification. If Moore's Law holds for a few more years, a mobile device will be able to run it on-device in around 8 years (Apple's A11: 410 GFLOPS vs. RTX 6000: 16 TFLOPS [1]).

This is under the assumption that we don't see any significant optimization in the meantime. Looking back over the last eight years, the probability of no progress on the software side is near zero.

For a breakthrough in the consumer market, running LLMs on-device with today's capabilities requires solving one key problem: "JIT learning" [2]. We can see some progress here [3, 4]. Perhaps the transformer architecture is not the best fit for this requirement, but it is hard to argue that it is impossible for Generative AI.
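To make the idea concrete, here is a minimal Python/PyTorch sketch of one possible shape of "JIT learning": a frozen base model plus a small per-user adapter updated on the day's data. All names, sizes, and the toy data are illustrative assumptions, not taken from [3] or [4]:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Stand-in for a frozen on-device base model (weights never change).
    base = nn.Sequential(nn.Embedding(32000, 256), nn.Flatten(1), nn.Linear(256 * 16, 256))
    for p in base.parameters():
        p.requires_grad = False

    # Small trainable adapter that absorbs the day's data.
    adapter = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 256))
    opt = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

    # Toy "today's data": batches of 16-token snippets and target embeddings.
    tokens = torch.randint(0, 32000, (8, 16))
    targets = torch.randn(8, 256)

    for _ in range(100):                      # short on-device update, e.g. overnight
        h = base(tokens)                      # frozen features
        loss = ((h + adapter(h) - targets) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    adapter_mb = sum(p.numel() for p in adapter.parameters()) * 4 / 1e6
    print(f"adapter size: {adapter_mb:.2f} MByte")   # only this part grows per day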

Due to today's technical limitations, we don't have real personal assistants yet. This could be Apple's Mac of the AI era.

[1] https://gadgetversus.com/graphics-card/apple-a11-bionic-gpu-...

[2] Increasing the context size is not a valid option for my scenario, since it also increases the compute demand linearly.

[3] https://arxiv.org/abs/2311.06668

[4] https://arxiv.org/abs/2305.18466

[Edit: decimal separator mess]

buyucu 5 days ago

Inference is getting cheaper by the minute: hardware is getting cheaper, and smarter ideas like latent attention are spreading.
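To illustrate why something like latent attention helps, a rough KV-cache back-of-envelope; all numbers are made-up but plausible, not from any specific model:

    layers, heads, head_dim = 60, 128, 128
    seq_len, bytes_per_val = 32_000, 2          # fp16 cache

    # Classic multi-head attention: store full K and V per head, per layer, per token.
    mha_cache = 2 * layers * heads * head_dim * seq_len * bytes_per_val

    # Latent attention: store one small latent vector per token, per layer,
    # from which K/V are reconstructed on the fly.
    latent_dim = 512
    mla_cache = layers * latent_dim * seq_len * bytes_per_val

    print(f"MHA cache: {mha_cache / 1e9:.1f} GByte")   # roughly 126 GByte
    print(f"MLA cache: {mla_cache / 1e9:.1f} GByte")   # roughly 2 GByte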

mingus88 5 days ago

Apple’s answer to that is Private Cloud Compute

qrios 5 days ago

Isn't "Private Cloud Compute" just a marketing term for a somewhat restricted security architecture? A real personal-assistant LLM would mean having the realtime data available in hot memory (to make sure it can give instant responses).

Audio, video, screen recordings, etc. from a single customer could be something between 1 and 10 GByte per day on average. After training, that might translate into something like 3 MByte of additional model size per day, i.e. roughly 1 GByte per user per year. With 1 billion active users you would therefore need on the order of 1 billion GByte of additional storage (again on hot storage, like expensive GPU memory). The total memory of all the GPUs NVIDIA sells is not even close to 400 million GByte (NVIDIA shipped about 3.8 million data-center GPUs in 2023).
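As a quick sanity check, the same numbers in Python (the ~100 GByte of memory per GPU is my assumption; everything else is the rounded figures above):

    users = 1e9                      # active users
    model_growth_mb_per_day = 3      # extra model size per user per day
    days = 365

    per_user_gb_per_year = model_growth_mb_per_day * days / 1e3   # ~1.1 GByte
    total_gb = per_user_gb_per_year * users                       # ~1.1 billion GByte

    # NVIDIA shipped ~3.8 million data-center GPUs in 2023; assume ~100 GByte each.
    nvidia_dc_gpu_memory_gb = 3.8e6 * 100                          # ~380 million GByte

    print(f"hot memory needed:          ~{total_gb:.2e} GByte")
    print(f"2023 DC GPU memory shipped: ~{nvidia_dc_gpu_memory_gb:.2e} GByte")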

therealpygon 4 days ago

That assumes you are trying to compress every bit of information ingested at all times. Once we have a document we can reference it; we don’t need a video of the steps that created it. If we know we drove to work and nothing significant happened, we don’t need to store every detail of the drive.

When distilled, most people’s days contain very few genuinely new facts, decisions, and changes to context.