| ▲ | Aurornis 3 hours ago | |
Most people won’t be dumping 100K tokens into it at once, but I agree that all of the prefill time that adds up during a session becomes a lot to account for. This is also a problem for all of the Mac local LLMs. Macs are a great way to get a lot of high bandwidth memory, but their compute is very far behind current gen dedicated GPUs. Some of the expensive Mac Studio setups allow you to run very large models with usable tokens/s, but you can be waiting a long time for it to get to the point of generating those tokens. | ||