▲ AI Is Writing Its Own Kernels, and They Are 17x Faster (adrs-ucb.notion.site)
62 points by accheng 3 hours ago | 50 comments
▲ matll 2 hours ago | parent | next [-]
As someone who spent the better part of last year trying to hand-tune kernels for a niche accelerator (not Trainium, but a similar vibe), this honestly looks like a dream. The hardest part of this work isn't coming up with the math; it's the mental overhead of managing the scratchpad memory and async DMA calls without stepping on your own toes. You spend three days debugging a race condition just to find out you got a 2% speedup. If this tool can actually handle the grunt work of generating the tiling logic and memory moves from a high-level plan, that's a game changer. I don't even care about the 17x number as much as I care about the '0 to 1' speed: getting any performant kernel running on new hardware usually takes weeks. If this cuts it down to a few hours of LLM churning, that's huge for the industry.
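To make the "grunt work" concrete, here is a minimal sketch of the tiling and buffer bookkeeping the comment is describing: a tiled vector add that stages data through two alternating "scratchpad" tiles, the way double-buffered DMA pipelines do. The tile size, buffer-swap pattern, and names are purely illustrative, not any real accelerator API.

```python
TILE = 4  # illustrative scratchpad tile size

def tiled_add(a, b):
    """Add two equal-length lists tile by tile, alternating scratch buffers."""
    assert len(a) == len(b)
    out = [0] * len(a)
    scratch = [[0] * TILE, [0] * TILE]  # two tiles for double buffering
    buf = 0
    for base in range(0, len(a), TILE):
        n = min(TILE, len(a) - base)           # handle the ragged last tile
        scratch[buf][:n] = a[base:base + n]    # "DMA in" to the active tile
        for i in range(n):                     # compute on the scratch tile
            scratch[buf][i] += b[base + i]
        out[base:base + n] = scratch[buf][:n]  # "DMA out" the result
        buf ^= 1  # swap buffers so the next load could overlap in hardware
    return out
```

Even in this toy form, the ragged-tile handling and buffer swapping are exactly the kind of error-prone indexing that eats those three days of debugging.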
▲ cadamsdotcom 22 minutes ago | parent | prev | next [-]
Correction: Charles Hong, Sahil Bhatia, Alvin Cheung, Yakun Sophia Shao, and the ADRS team are USING AI to write kernels. “AI” is not writing its own anything. It is doing what humans tell it to do.
▲ comrade1234 an hour ago | parent | prev | next [-]
This is completely believable and you should invest in this technology.
▲ UncleOxidant an hour ago | parent | prev | next [-]
Was in a startup where we were trying to do this (our tagline was "using AI to make AI run faster and more efficiently"), but we ran out of funding at the end of '22 :( We were just a little early, I think.
▲ jryio an hour ago | parent | prev | next [-]
Chris Lattner of Apple's Swift and Tesla fame is running a company entirely predicated on this, but at the deterministic language-design level rather than the inference level. If a beam search with an iterative plan-and-execute phase is more effective than better tooling in a deterministic programming language, then this will clearly take the lead.
▲ karek an hour ago | parent | prev | next [-]
usually i scroll past these 'LLM optimizes code' posts bc 99% of them are just finding basic peephole optimizations that -O3 would've caught anyway. but looking at the conv1d example in the blog, this is actually doing real architectural changes. the 'dropout' on the optimization menu is a pretty neat hack. kinda reminds me of how i work when im stuck... 'ok, what if i don't unroll this loop, what else can i do?'. forces the search out of local minima. nice to see an AI tool designed around verification (the simulator loop) rather than just hoping the llm guesses right on the first shot.
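The "dropout on the optimization menu" trick can be sketched in a few lines: during search, each expansion randomly hides some menu entries, so the search cannot keep greedily picking the same locally best transform. The menu entries, scoring function, and parameters below are all made up for illustration; they are not from the post.

```python
import random

MENU = ["unroll", "tile", "vectorize", "fuse"]  # hypothetical optimization menu

def score(plan):
    """Toy score where 'tile' + 'vectorize' together beat greedy 'unroll'."""
    s = len(plan)
    if "tile" in plan and "vectorize" in plan:
        s += 10
    if plan[:1] == ["unroll"]:
        s -= 1  # the locally tempting choice is a trap
    return s

def search(steps=3, beam=2, dropout=0.5, seed=0):
    rng = random.Random(seed)
    frontier = [[]]  # start from the empty plan
    for _ in range(steps):
        candidates = []
        for plan in frontier:
            # Dropout: each menu entry is randomly masked this expansion.
            visible = [op for op in MENU
                       if op not in plan and rng.random() >= dropout]
            candidates += [plan + [op] for op in visible]
        if candidates:
            frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]
```

Because "unroll" sometimes vanishes from the menu, runs with different seeds explore plans a pure greedy search would never try, which is the local-minima escape the commenter describes.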
▲ quc1k 2 hours ago | parent | prev | next [-]
I really appreciate the focus on interpretability. Usually, super-optimizers give you a blob of assembly that runs fast but is impossible to debug or maintain. By forcing the model to output a natural-language 'Plan' first, you essentially get documentation for free. If the code breaks later, you can look at the plan to understand why the loop was unrolled or why the memory was laid out that way. That makes this actually usable in a production CI/CD pipeline, unlike most black-box ML optimizations.
▲ measurablefunc an hour ago | parent | prev | next [-]
I wonder if this type of work can be applied towards translating kernels between GPU vendors, e.g. CUDA → AMD. Does anyone know if that's possible or whether that kind of problem is AGI-complete?
▲ dataeaa an hour ago | parent | prev | next [-]
Crazy that it beat the hand-tuned Amazon kernels. really shows how early we still are with these software stacks. what are the risks of using these kinds of tools though? Did you get any tricky/silent bugs you had to manually fix?
▲ mavt6 an hour ago | parent | prev | next [-]
Love the concept of using AI to make the hardware run AI faster. feels like we're finally closing the loop on this stuff!
▲ pos456 2 hours ago | parent | prev | next [-]
Calling beam search 'AI' is doing a lot of heavy lifting here. This is just superoptimization with a very expensive heuristic function.
▲ bvcasd an hour ago | parent | prev | next [-]
having an agent that looks at the error + the ISA spec and tries a fix automatically is worth its weight in gold. turns a frustrating 'read the docs for 2 hours' session into a 5-min background task. that's the kind of QoL stuff that actually gets devs to adopt this. how close is this to being used in production?
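The error-driven repair loop this comment praises is simple to sketch: compile, and on failure feed the compiler error plus the relevant spec text back to a model and retry. `compile_kernel` and `ask_model` below are stand-in callables for illustration, not any real toolchain or API.

```python
def repair_loop(source, spec, ask_model, compile_kernel, max_tries=5):
    """Compile `source`; on failure, ask the model for a fix given error + spec."""
    for _ in range(max_tries):
        ok, err = compile_kernel(source)   # -> (success: bool, error: str)
        if ok:
            return source
        # Feed the failure and the ISA spec excerpt back for another attempt.
        source = ask_model(err, spec, source)
    raise RuntimeError("could not fix kernel within retry budget")
```

The retry budget matters in practice: without it, a model that keeps producing the same broken fix would loop forever instead of escalating to a human.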
▲ dfdsfds 40 minutes ago | parent | prev | next [-]
Very impressive results! Will be curious to see how correctness is guaranteed and what kind of failures are normal from the LLM-generated code.
▲ maven5t 42 minutes ago | parent | prev | next [-]
tried using NKI a few months ago and the docs were rough. having the LLM just figure it out from the ISA spec is honestly genius
▲ pakt1 2 hours ago | parent | prev | next [-]
Trainium has always been a black box to me compared to GPUs. Seeing an automated tool reverse-engineer the best way to use the VectorEngine vs the TensorEngine is fascinating. It reveals just how much performance is left on the table by standard compilers.
▲ dksgmlwo 2 hours ago | parent | prev | next [-]
Fascinating. Having worked as a kernel engineer before, I know how impactful it is to reduce the initial exploration overhead. It can save a huge amount of the grunt work engineers typically have to do.
▲ taqpos 3 hours ago | parent | prev | next [-]
This post unintentionally highlights exactly why NVIDIA is untouchable. If you need a farm of H100s running GPT-5 just to figure out how to program Amazon's Trainium chip efficiently, the hardware abstraction is fundamentally broken.
▲ yrh 2 hours ago | parent | prev | next [-]
Interesting read. I think the more "whitebox" approach, with a laid-out menu to choose from, makes the resulting kernel more trustworthy, although it does raise the question of whether stepping outside the predefined optimization steps from time to time might yield insights.
▲ qat321 3 hours ago | parent | prev | next [-]
I wonder if these results extend beyond AWS Trainium?
▲ bgwalter an hour ago | parent | prev | next [-]
So, Trainium is an architecture that requires brute force to write software for. Maybe if we invest $100 trillion in data centers, we can rewrite the Linux kernel in Malbolge.
▲ melissapan 2 hours ago | parent | prev | next [-]
ADRS <> Compiler: what if your “compiler” could think?
▲ igorpcosta 2 hours ago | parent | prev | next [-]
Very interesting research on this, keen to collab with you folks. I've been building a few experiments for old GTX GPUs to extend their lifetime while matching token performance for Smol models. igor [] autohand.ai, let's chat.
▲ chanwutk 2 hours ago | parent | prev | next [-]
Very interesting read!
▲ 2 hours ago | parent | prev | next [-]
[deleted]
▲ maltese669 2 hours ago | parent | prev | next [-]
ngl letting AI fiddle with the kernel sounds scary but the results are really impressive
▲ incin1234 2 hours ago | parent | prev | next [-]
[dead]
▲ syngrog66 an hour ago | parent | prev [-]
AI has told me that Biden was preparing for his upcoming debate with Trump. It told me that in May 2025. AI has told me it's not raining in my city and that in fact there was a 0% chance of it that day. As I was looking out my open front door watching a heavy downpour.