Someone 4 days ago

I guess that hardware doesn’t make things faster (yet?). If it did, I’d expect them to have mentioned it in https://machinelearning.apple.com/research/core-ml-on-device.... That post is updated for Sequoia and says:

“This technical post details how to optimize and deploy an LLM to Apple silicon, achieving the performance required for real time use cases. In this example we use Llama-3.1-8B-Instruct, a popular mid-size LLM, and we show how using Apple’s Core ML framework and the optimizations described here, this model can be run locally on a Mac with M1 Max with about ~33 tokens/s decoding speed. While this post focuses on a particular Llama model, the principles outlined here apply generally to other transformer-based LLMs of different sizes.”
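For reference, the workflow that post describes comes down to converting the model to an .mlpackage with coremltools and then driving it with predict() calls. A rough benchmark loop might look like the sketch below; the package filename and the "input_ids"/"logits" tensor names are assumptions (they depend on how the model was exported), and this naive loop skips the stateful KV-cache handling the post actually relies on for its ~33 tokens/s number.

    import time
    import numpy as np
    import coremltools as ct

    # Load the converted model package (hypothetical filename).
    model = ct.models.MLModel("Llama-3.1-8B-Instruct.mlpackage")

    tokens = [1]  # hypothetical start-of-sequence token id
    n_new = 64
    start = time.perf_counter()
    for _ in range(n_new):
        # Feed the whole sequence back each step; the real pipeline keeps a
        # KV cache via Core ML stateful models instead of re-running the prefix.
        out = model.predict({"input_ids": np.array([tokens], dtype=np.int32)})
        tokens.append(int(out["logits"][0, -1].argmax()))
    elapsed = time.perf_counter() - start
    print(f"{n_new / elapsed:.1f} tokens/s decode")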

cma 3 days ago

If it uses a lot less power it could still be a win for some use cases: while on battery you might still want to run transformer-based speech-to-text, RTX Voice-style microphone denoising, or image generation/infill in photo editing programs. And for something like RTX Voice-style denoising during multiplayer gaming, you might want the GPU left free to run the game, even if the game still takes some memory-bandwidth hit from the model running alongside it.
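One way to keep the GPU free in that scenario is to restrict the model to CPU + Neural Engine at load time, which coremltools supports directly. A minimal sketch, assuming the speech-to-text model has already been converted to Core ML; the package filename and the "audio" input name are placeholders:

    import numpy as np
    import coremltools as ct

    # Pin the model to CPU + Neural Engine so no work is scheduled on the GPU.
    stt = ct.models.MLModel(
        "speech-to-text.mlpackage",              # hypothetical converted model
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )

    audio = np.zeros((1, 16000 * 30), dtype=np.float32)  # 30 s of 16 kHz audio
    result = stt.predict({"audio": audio})                # "audio" is a placeholder input name

The same knob exists on the Swift side via MLModelConfiguration.computeUnits = .cpuAndNeuralEngine, which is what an app shipping this would more likely use.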