It's very slow currently; I added the benchmarks to the README. To go faster, it needs inference kernels that are faster than the current float32-only ones.