matll 4 hours ago
As someone who spent the better part of last year trying to hand-tune kernels for a niche accelerator (not Trainium, but a similar vibe), this honestly looks like a dream. The hardest part of this work isn't coming up with the math; it's the mental overhead of managing the scratchpad memory and async DMA calls without stepping on your own toes. You spend three days debugging a race condition just to find out you got a 2% speedup. If this tool can actually handle the grunt work of generating the tiling logic and memory moves from a high-level plan, that's a game changer.

I don't even care about the 17x number as much as I care about the 0-to-1 speed. Getting any performant kernel running on new hardware usually takes weeks. If this cuts it down to a few hours of LLM churning, that's huge for the industry.
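To make the "stepping on your own toes" part concrete: the bug class I kept hitting looks like the toy double-buffering loop below. This is just a Python sketch; dma_start/dma_wait/compute_tile/hbm_tile are made-up stand-ins for whatever your accelerator SDK actually exposes, not any real API.

    TILE_BYTES, NUM_TILES = 4096, 64

    # All of these are hypothetical stubs standing in for a real SDK:
    def scratchpad_alloc(n): return bytearray(n)            # on-chip buffer
    def dma_start(src, dst): dst[:] = src                   # async HBM -> scratchpad copy
    def dma_wait(buf): pass                                 # completion fence
    def hbm_tile(i): return bytes([i % 256]) * TILE_BYTES   # i-th input tile
    def compute_tile(buf): pass                             # the actual kernel math

    # Ping-pong between two scratchpad buffers so the next load
    # overlaps the current compute.
    bufs = [scratchpad_alloc(TILE_BYTES) for _ in range(2)]
    dma_start(hbm_tile(0), bufs[0])              # prefetch the first tile
    for i in range(NUM_TILES):
        cur, nxt = bufs[i % 2], bufs[(i + 1) % 2]
        if i + 1 < NUM_TILES:
            dma_start(hbm_tile(i + 1), nxt)      # kick off the next load early
        dma_wait(cur)   # drop this fence and you silently compute on a half-loaded tile
        compute_tile(cur)

The 2%-speedup-after-three-days failure mode is almost always one of those fences being in slightly the wrong place.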
simonw 3 hours ago
Optimization work sounds like it might be a really good fit for coding agents. If you can provide a robust test suite that "proves" the implementation works, the actual work of increasing its performance is the kind of thing a coding agent could run in a loop: try an optimization, check that the tests still pass, and measure whether it runs faster.
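Something like this skeleton, where every file name and the propose/revert steps are placeholders for your actual test suite, benchmark target, and agent tooling:

    import shutil
    import subprocess
    import time

    def tests_pass():
        # Correctness gate: a patch is only considered if this passes.
        return subprocess.run(["pytest", "tests/"]).returncode == 0

    def benchmark(runs=5):
        # Best-of-N wall-clock timing to damp scheduling noise.
        best = float("inf")
        for _ in range(runs):
            t0 = time.perf_counter()
            subprocess.run(["python", "run_kernel.py"], check=True)
            best = min(best, time.perf_counter() - t0)
        return best

    def propose_patch(): pass   # placeholder: agent rewrites kernel.py in place
    def revert_patch(): pass    # placeholder: e.g. `git checkout -- kernel.py`

    baseline = benchmark()
    for _ in range(20):                      # fixed budget of attempts
        propose_patch()
        t = benchmark() if tests_pass() else float("inf")
        if t < baseline:
            shutil.copy("kernel.py", "best_kernel.py")   # keep the winner
            baseline = t
        else:
            revert_patch()                   # broken or slower: roll back

The nice property is that the loop itself is fully mechanical: the agent only needs to be creative inside propose_patch, and the harness makes it impossible to "win" by breaking correctness.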