matll 4 hours ago

As someone who spent the better part of last year trying to hand-tune kernels for a niche accelerator (not Trainium, but similar vibe), this honestly looks like a dream.

The hardest part of this work isn't coming up with the math; it's the mental overhead of managing the scratchpad memory and async DMA calls without stepping on your own toes. You spend 3 days debugging a race condition just to find out you got a 2% speedup.
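
For the uninitiated, the pattern that eats those 3 days looks roughly like the toy sketch below: a double-buffered tile loop, with a single-worker thread pool standing in for the async DMA engine (load_tile and compute_tile are made-up stand-ins, not any real vendor API). On real hardware the DMA writes the buffer in place, so forgetting the equivalent of that .result() is exactly the silent race I mean.

    import concurrent.futures as cf

    # Toy double-buffering sketch; the thread pool plays the DMA engine.
    NUM_TILES = 8
    TILE = list(range(1024))

    def load_tile(i):        # stand-in for a DMA from HBM into scratchpad
        return [x * i for x in TILE]

    def compute_tile(buf):   # stand-in for the actual kernel math
        return sum(buf)

    with cf.ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(load_tile, 0)    # prefetch tile 0
        for i in range(NUM_TILES):
            cur = pending.result()            # the 'dma_wait'; skip it and you race
            if i + 1 < NUM_TILES:             # overlap the next load with compute
                pending = dma.submit(load_tile, i + 1)
            compute_tile(cur)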

If this tool can actually handle the 'grunt work' of generating the tiling logic and memory moves based on a high-level plan, that’s a game changer. I don't even care about the 17x number as much as I care about the '0 to 1' speed. Getting any performant kernel running on new hardware usually takes weeks; if this cuts it down to a few hours of LLM churning, that's huge for the industry.

simonw 3 hours ago

Optimization work sounds like it might be a really good fit for coding agents. If you can provide a robust test which "proves" the implementation works, the actual work of increasing its performance is the kind of thing a coding agent could run in a loop, testing each optimization to see whether the tests still pass and it runs faster.
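
A minimal sketch of that loop, assuming a pytest suite, a bench.py timing script, and a candidate_patches() generator fed by the model (all three are hypothetical names, not anything from the article):

    import subprocess, time

    def tests_pass():
        return subprocess.run(["pytest", "-q"]).returncode == 0

    def bench_seconds():
        start = time.perf_counter()
        subprocess.run(["python", "bench.py"], check=True)
        return time.perf_counter() - start

    def git_apply(patch, revert=False):
        args = ["git", "apply"] + (["-R"] if revert else [])
        subprocess.run(args, input=patch, text=True, check=True)

    def candidate_patches():
        yield from []  # stand-in for diffs proposed by the model

    best = bench_seconds()
    for patch in candidate_patches():
        git_apply(patch)
        if tests_pass() and (t := bench_seconds()) < best:
            best = t                       # keep: still correct, and faster
        else:
            git_apply(patch, revert=True)  # discard regressions and breakage

The hard part is making bench_seconds stable enough that a 2% win is actually measurable.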

whynotmaybe 3 hours ago

But we might end up with "works on my infrastructure" optimizations that would be hard to reproduce.

Like that research that evolved an FPGA where some unconnected parts were crucial for the expected behaviour.

https://www.eetimes.com/whatever-happened-to-evolvable-hardw...

mholm 3 hours ago

Making a few diverse hardware environments available for testing would mitigate this. Many companies wouldn't have any issue with infrastructure-specific optimizations either. (Part of) DeepSeek's big advantage over its Chinese competitors was its intelligent use of the hardware, after all.
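
Concretely, the acceptance check in a loop like simonw's could just require the win to hold on every box; the host names and the remote bench.py below are made up for illustration:

    import subprocess

    HOSTS = ["a100-box", "h100-box", "trn1-box"]  # made-up fleet

    def bench_on(host):
        # assumes bench.py prints its runtime in seconds to stdout
        out = subprocess.run(["ssh", host, "python", "bench.py"],
                             capture_output=True, text=True, check=True)
        return float(out.stdout)

    def keeps_improvement(baseline):
        # reject "works on my infrastructure" wins: must be faster everywhere
        return all(bench_on(h) < baseline[h] for h in HOSTS)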