The Neural Engine is optimized for power efficiency, not raw performance.
Look for Apple to add matmul acceleration to the GPU instead. That's how to truly speed up local LLMs.