I have researched for quite a bit and so far the fastest runtime is the oMLX one. But there's a caveat: ttft on MLX on M4 Pro is enormous. On M5 Pro it has been greatly sped up.