	archipelago123 3 hours ago
		The fact that nobody cared to optimize kernels for these hardware platforms proves Nvidia's CUDA moat, especially now that squeezing out performance has become so important for serving inference. The hardware ISA is broken => nobody knows how to program the hardware => unoptimized kernels => nobody will use your hardware. Also, bad baselines present opportunities for LLMs to optimize against. Indeed, the kernel that achieved a 17x speedup seems to be a conv1d, which AWS could not care less about optimizing.