flohofwoe 3 hours ago

> 1000x in AVX512+days of thought compared to the naive version written in a python loop

Out of this 1000x speedup you get 100x just by not using Python though ;)

Also IIRC the main problem specifically with AVX512 was that mainstream CPUs simply didn't have it, so a smart compiler isn't of much use when the output code only runs on a handful of devices.