taqpos 4 hours ago

This post unintentionally highlights exactly why NVIDIA is untouchable. If you need a farm of H100s running GPT-5 just to figure out how to program Amazon's Trainium chip efficiently, the hardware abstraction is fundamentally broken.

CobbledSteel 4 hours ago | parent [-]

I'd argue the logic goes the other way: if all it takes to get highly performant kernels is to rent a GPU farm, that undercuts the years and millions of engineering hours it took to build NVIDIA's software infrastructure. High hopes for the smaller players now.

archipelago123 3 hours ago | parent [-]

The fact that nobody bothered to optimize kernels for these hardware platforms proves Nvidia's CUDA moat, especially now that squeezing out performance has become so important for serving inference. Hardware ISA is hard to target => nobody knows how to program the hardware => unoptimized kernels => nobody will use your hardware. Bad baselines also leave easy wins for LLMs to find. Indeed, the kernel that achieved a 17x speedup seems to be a conv1d, which AWS evidently never prioritized optimizing.
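For intuition on why a weak baseline makes a 17x speedup plausible, here's a minimal sketch in plain NumPy (illustrative only, not Trainium or NKI code): the same correlation-style conv1d written as a naive Python loop versus a vectorized form. The function names and sizes are made up for the example; the point is just that when the baseline leaves the hardware idle, large speedups are low-hanging fruit.

```python
import numpy as np

def conv1d_naive(x, w):
    # Straightforward double loop: the kind of unoptimized
    # baseline that leaves a big speedup on the table.
    n, k = len(x), len(w)
    out = np.zeros(n - k + 1)
    for i in range(n - k + 1):
        for j in range(k):
            out[i] += x[i + j] * w[j]
    return out

def conv1d_vectorized(x, w):
    # Same math over a strided view of x, so the inner loops
    # run in optimized native code instead of the interpreter.
    windows = np.lib.stride_tricks.sliding_window_view(x, len(w))
    return windows @ w

x = np.random.rand(10_000)
w = np.random.rand(64)
assert np.allclose(conv1d_naive(x, w), conv1d_vectorized(x, w))
```

On real accelerators the analogous gap comes from tiling, memory layout, and engine utilization rather than interpreter overhead, but the dynamic is the same: the optimizer (human or LLM) is mostly being measured against how bad the starting point was.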