Show HN: IEEE-754-Conformant FP64 on Metal (Apple Silicon) (github.com)
1 point by guyfischman 6 hours ago | 1 comment
guyfischman 6 hours ago
Bit-exact, software-emulated FP64 on Metal, 5–11× faster than hardware FP64 on the CPU.

I was learning about RandomX and wanted to play with the algorithm on a Mac, then discovered that Metal has no FP64 math. I further discovered that this has been a frustration for a lot of people in ML, science, and gaming, so I went down a rabbit hole.

The naive implementation had ~10% of the throughput of hardware CPU FP64 on the same machine. After obsessively squeezing every bit of juice out of the GPU, the final version is 5–11× faster than a 14-thread CPU hardware-FP64 baseline on arithmetic, and 10–35× faster on conversions and comparisons (M4 Pro, 20 GPU cores).

I hope you find this useful.
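For anyone wondering what "software-emulated" means here, below is a minimal sketch of one of the simpler operations, an FP64 less-than comparison done purely with integer operations on the 64-bit patterns. It is not taken from the linked repository. The names `f64_to_bits`, `soft_is_nan`, and `soft_lt` are hypothetical, and Python stands in for the unsigned 64-bit integer arithmetic a GPU kernel would use.

```python
import struct

# Bit-field masks for the IEEE 754 binary64 layout:
# 1 sign bit, 11 exponent bits, 52 fraction bits.
ABS_MASK = 0x7FFFFFFFFFFFFFFF
SIGN_BIT = 0x8000000000000000
EXP_MASK = 0x7FF0000000000000
FRAC_MASK = 0x000FFFFFFFFFFFFF


def f64_to_bits(x: float) -> int:
    """Reinterpret a Python float (binary64) as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def soft_is_nan(bits: int) -> bool:
    """NaN: exponent field all ones and a non-zero fraction."""
    return (bits & EXP_MASK) == EXP_MASK and (bits & FRAC_MASK) != 0


def soft_lt(a_bits: int, b_bits: int) -> bool:
    """Return a < b under IEEE 754 rules, using only integer operations."""
    # Any comparison involving NaN is false.
    if soft_is_nan(a_bits) or soft_is_nan(b_bits):
        return False
    # +0.0 and -0.0 compare equal, so neither is less than the other.
    if ((a_bits | b_bits) & ABS_MASK) == 0:
        return False
    a_neg = (a_bits & SIGN_BIT) != 0
    b_neg = (b_bits & SIGN_BIT) != 0
    # Different signs: the negative operand is the smaller one.
    if a_neg != b_neg:
        return a_neg
    # Same sign: the encoding is sign-magnitude, so for negatives the
    # larger bit pattern is the smaller value.
    if a_neg:
        return a_bits > b_bits
    return a_bits < b_bits
```

Comparisons are the easy case because the format orders like a sign-magnitude integer. Arithmetic additionally needs exponent alignment, normalization, and correct rounding, which is presumably where most of the optimization effort goes.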