timschmidt 19 hours ago
llamafile contains specific optimizations for prompt processing that use AVX512 to deal with just this issue: https://justine.lol/matmul/ (about a 10x speedup over llama.cpp at the time). Somewhere between 8 and 192 cores I'm sure there's enough AVX512 to get the job done. And with that, we've managed to reinvent Intel's Larrabee / Knights concept. Sadly, as far as I know the highly optimized AVX512 kernels in llamafile don't support these exotic float formats yet. Yes, energy efficiency per query will be terrible compared to a hyperscaler's. But privacy will be perfect, and flexibility will be higher than with other options, since running on the CPU is almost always possible, even with new algorithms and experimental models.
ein0p 19 hours ago | parent
At 192 cores you're way better off buying a Mac Studio, though.