AI Is Writing Its Own Kernels, and They Are 17x Faster (adrs-ucb.notion.site)
62 points by accheng 3 hours ago | 50 comments
matll 2 hours ago | parent | next [-]

As someone who spent the better part of last year trying to hand-tune kernels for a niche accelerator (not Trainium, but similar vibe), this honestly looks like a dream.

The hardest part of this work isn't coming up with the math; it's the mental overhead of managing the scratchpad memory and async DMA calls without stepping on your own toes. You spend 3 days debugging a race condition just to find out you got a 2% speedup.

If this tool can actually handle the 'grunt work' of generating the tiling logic and memory moves based on a high-level plan, that’s a game changer. I don't even care about the 17x number as much as I care about the '0 to 1' speed. Getting any performant kernel running on new hardware usually takes weeks. If this cuts it down to a few hours of LLM churning, that's huge for the industry.

simonw an hour ago | parent [-]

Optimization work sounds like it might be a really good fit for coding agents. If you can provide a robust test which "proves" the implementation works, the actual work of increasing its performance is the kind of thing a coding agent could run in a loop, testing each optimization to see if the tests still pass and it runs faster.
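
A minimal sketch of that loop in Python (the helper names here are hypothetical placeholders, not from any particular tool):

    # test-gated optimization loop: keep a candidate only if it still passes
    # the test suite AND benchmarks faster than the current best
    import copy

    def optimize(kernel, propose_patch, run_tests, benchmark, budget=100):
        best, best_time = kernel, benchmark(kernel)         # baseline timing
        for _ in range(budget):
            candidate = propose_patch(copy.deepcopy(best))  # e.g. an LLM edit
            if not run_tests(candidate):                    # correctness gate
                continue
            t = benchmark(candidate)
            if t < best_time:                               # strict improvement only
                best, best_time = candidate, t
        return best, best_time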

whynotmaybe an hour ago | parent [-]

But we might end up with "works on my infrastructure" optimizations that would be hard to reproduce.

Like that research that evolved an FPGA where some unconnected parts were crucial for the expected behaviour.

https://www.eetimes.com/whatever-happened-to-evolvable-hardw...

mholm an hour ago | parent [-]

Adding a few diverse hardware environments for testing would mitigate this. Many companies wouldn't have any issue with infrastructure-specific optimizations either. (Part of) DeepSeek's big advantage over their Chinese competitors was their intelligent use of the hardware, after all.

cadamsdotcom 22 minutes ago | parent | prev | next [-]

Correction:

Charles Hong, Sahil Bhatia, Alvin Cheung, Yakun Sophia Shao, and the ADRS team ...

are USING AI to write kernels.

“AI” is not writing its own anything.

It is doing what humans say to do.

comrade1234 an hour ago | parent | prev | next [-]

This is completely believable and you should invest in this technology.

DroneBetter an hour ago | parent [-]

I can't tell whether you're trying to convince humans, parody someone who might be, or give superficial sentiment for automated traders' webscrapers to be influenced by

cornonthecobra 38 minutes ago | parent | next [-]

or they left the /s off and it's a remark about how the fine article sounds more like hype-machine emesis than legitimate, substantive research

oceansky 44 minutes ago | parent | prev [-]

I think he's just being extremely ironic, meaning the exact opposite of what he actually says.

UncleOxidant an hour ago | parent | prev | next [-]

Was in a startup where we were trying to do this (our tagline was "using AI to make AI run faster and more efficiently"). But we ran out of funding at the end of '22 :(

We were just a little early, I think.

accheng an hour ago | parent [-]

Interesting, did you have any learnings that would apply to this problem now?

jryio an hour ago | parent | prev | next [-]

Chris Lattner, of Apple Swift and Tesla fame, is running a company entirely predicated on this, but at the deterministic language-design level rather than the inference level.

https://www.modular.com/mojo

If a beam search with an iterative plan-and-execute phase is more effective than better tooling in a deterministic programming language, then this will clearly take the lead.

accheng an hour ago | parent [-]

Thanks for the link! I am not familiar with the company, but it reminds me of the whole formal methods debate in distributed systems. Sure, writing TLA+ specs is the 'correct' deterministic way to build a Raft implementation, but in reality everyone just writes messy Go/Java and patches bugs as they pop up because it's faster.

karek an hour ago | parent | prev | next [-]

usually i scroll past these 'LLM optimizes code' posts bc 99% of them are just finding basic peephole optimizations that -O3 wouldve caught anyway. but looking at the conv1d example in the blog, this is actually doing real architectural changes.

the 'dropout' on the optimization menu is a pretty neat hack. kinda reminds me how i work when im stuck... 'ok what if i dont unroll this loop, what else can i do?'. forces the search out of local minima. nice to see an AI tool designed around verification (the simulator loop) rather than just hoping the llm guesses right on the first shot.
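
if i understand the trick right, it's just randomly hiding part of the menu on each attempt so the planner can't keep reaching for the same local-optimum moves. a toy sketch (menu entries and names are made up, not the paper's actual list):

    import random

    OPT_MENU = ["unroll_loops", "tile_scratchpad", "fuse_ops",
                "vectorize", "reorder_dma", "double_buffer"]

    def sample_menu(menu=OPT_MENU, drop_prob=0.3, rng=random):
        # hide some optimizations at random so the planner is forced to
        # explore combinations it would otherwise never try
        kept = [opt for opt in menu if rng.random() > drop_prob]
        return kept or list(menu)  # never hand back an empty menu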

jryio an hour ago | parent | prev | next [-]

paper: https://arxiv.org/abs/2505.18574

quc1k 2 hours ago | parent | prev | next [-]

I really appreciate the focus on interpretability. Usually, super-optimizers give you a blob of assembly that runs fast but is impossible to debug or maintain. By forcing the model to output a natural language 'Plan' first, you essentially get documentation for free. If the code breaks later, you can look at the plan to understand why the loop was unrolled or why the memory was laid out that way. That makes this actually usable in a production CI/CD pipeline, unlike most black-box ML optimizations.

kap901 an hour ago | parent [-]

manually writing tiling logic for systolic arrays is the absolute worst. if this actually works it saves me so much headache.

measurablefunc an hour ago | parent | prev | next [-]

I wonder if this type of work can be applied towards translating kernels between GPU vendors, e.g. CUDA → AMD. Does anyone know if that's possible or whether that kind of problem is AGI-complete?

jryio an hour ago | parent | next [-]

There's a higher level of abstraction

https://www.modular.com/mojo

measurablefunc 38 minutes ago | parent [-]

So if CUDA could be ported to Mojo w/ AI then it would be basically available for any GPU/accelerator vendor. Seems like the right kind of approach towards making CUDA a non-issue.

UncleOxidant an hour ago | parent | prev [-]

It seems like it could be possible now with a bit of work. I don't think it would require AGI. Didn't AMD fund something like this and then recently decide not to pursue it further? I think that was ZLUDA: https://www.blopig.com/blog/2024/03/an-open-source-cuda-for-... There's also AMD's own HIP.

measurablefunc an hour ago | parent [-]

Very interesting.

dataeaa an hour ago | parent | prev | next [-]

Crazy that it beat the hand-tuned Amazon kernels. Really shows how early we still are with these software stacks.

What are the risks of using these kinds of tools though? Did you run into any tricky/silent bugs you had to manually fix?

mavt6 an hour ago | parent | prev | next [-]

Love the concept of using AI to make the hardware run AI faster. Feels like we're finally closing the loop on this stuff!

pos456 2 hours ago | parent | prev | next [-]

Calling beam search 'AI' is doing a lot of heavy lifting here. This is just superoptimization with a very expensive heuristic function.

jryio an hour ago | parent [-]

That's correct. However, as other commenters have noted, doing this by hand is extremely challenging for human engineers working on tensor kernels.

The expense calculation might be something like:

    expense of improvement = (time per optimization step * cost per unit time) / (speedup - 1)

The expensive heuristic function saves wall-clock time while also being cheaper per unit of time. And as the paper shows, the speedup gained per unit of time and cost is large.
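
With made-up numbers, just to illustrate: if each optimization step takes half an hour of GPU time at $10/hour and the search lands a 2x speedup, the expense of improvement works out to (0.5 * 10) / (2 - 1) = $5 per unit of relative speedup, far less than days of an engineer's time spent on the same kernel.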

greeravoctado 29 minutes ago | parent [-]

Usually the rate of overall improvement from this type of optimization is less than the Moore's-law rate of improvement, and thus not worth the investment for a company. 17x micro-benchmarks don't count. Real improvements come from architectural changes, for example MoE, speculative multi-token prediction, etc.

bvcasd an hour ago | parent | prev | next [-]

having an agent that looks at the error + the isa spec and trys a fix automatically is worth its weight in gold. turns a frustrating 'read the docs for 2 hours' session into a 5 min background task. thats the kind of QoL stuff that actually gets devs to adopt this. how close is this to being used in production?

dfdsfds 40 minutes ago | parent | prev | next [-]

Very impressive results! Will be curious to see how correctness is guaranteed and what kind of failures are normal from the LLM-generated code

maven5t 42 minutes ago | parent | prev | next [-]

tried using NKI a few months ago and the docs were rough. having the LLM just figure it out from the ISA spec is honestly genius

pakt1 2 hours ago | parent | prev | next [-]

Trainium has always been a black box to me compared to GPUs. Seeing an automated tool reverse-engineer the best way to use the VectorEngine vs the TensorEngine is fascinating. It reveals just how much performance is left on the table by standard compilers.

dksgmlwo 2 hours ago | parent | prev | next [-]

Fascinating. Having worked as a kernel engineer before, I know how impactful it is to reduce the initial exploration overhead. It can save a huge amount of the grunt work engineers typically have to do.

taqpos 3 hours ago | parent | prev | next [-]

This post unintentionally highlights exactly why NVIDIA is untouchable. If you need a farm of H100s running GPT-5 just to figure out how to program Amazon's Trainium chip efficiently, the hardware abstraction is fundamentally broken.

CobbledSteel 2 hours ago | parent [-]

I'd argue the logic goes the other way: if all it takes to get highly performant kernels is to rent a GPU farm, that seems to undercut the years and millions of engineering hours required to build the NVIDIA SW infrastructure. High hopes for smaller players now.

archipelago123 an hour ago | parent [-]

The fact that nobody cared to optimize kernels for these hardware platforms proves Nvidia's CUDA moat, especially now that squeezing performance has become so important for serving inference. Hardware ISA is broken => nobody knows how to program the hardware => unoptimized kernels => nobody will use your hardware. Also, bad baselines present opportunities for LLMs to optimize for. Indeed, the kernel that achieved a 17X speedup seems to be a conv1d, which AWS could not care less about optimizing.

yrh 2 hours ago | parent | prev | next [-]

Interesting read. I think the more "whitebox" approach with a laid-out menu to choose from makes the resulting kernel more trustworthy, although it does raise the question of whether going outside the predefined optimization steps from time to time might yield insights.

qat321 3 hours ago | parent | prev | next [-]

I wonder if these results extend beyond AWS Trainium?

charleshong 2 hours ago | parent [-]

[dead]

gfhsad 2 hours ago | parent [-]

Whenever I see '17x faster than experts,' I read 'the experts didn't actually try very hard on the baseline.'

charleshong 2 hours ago | parent [-]

Well, most of our results are not 17x. But still (IMO) solid across the board!

Also, the 17x came from a pretty obscure fusion optimization that isn't called out anywhere in the documentation (we had to run the profiler to see what was actually going on). Wouldn't be surprised if whoever within AWS wrote the kernel didn't know about that optimization.

snklt 2 hours ago | parent [-]

17x is a wild improvement regardless of the baseline. Impressive results.

bgwalter an hour ago | parent | prev | next [-]

So, Trainium is an architecture that requires brute force to write software for.

Maybe if we invest $100 trillion in data centers, we can rewrite the Linux Kernel in Malbolge.

melissapan 2 hours ago | parent | prev | next [-]

ADRS <> Compiler: what if your “compiler” could think?

igorpcosta 2 hours ago | parent | prev | next [-]

Very interesting research on this, keen to collab with you folks. I've been building a few experiments for old GTX GPUs to extend their lifetime while matching token performance for Smol. igor [] autohand.ai, let's chat.

chanwutk 2 hours ago | parent | prev | next [-]

Very interesting read!

maltese669 2 hours ago | parent | prev | next [-]

ngl letting AI fiddle with the kernel sounds scary but the results are really impressive


syngrog66 an hour ago | parent | prev [-]

AI has told me that Biden was preparing for his upcoming debate with Trump. It told me that in May 2025.

AI has told me it's not raining in my city and that in fact there was a 0% chance of it that day. As I was looking out my open front door watching a heavy downpour.

DroneBetter an hour ago | parent [-]

that is an indictment of the implementations, not of the fundamental limits of the architecture; most commercial LLMs now have web search available by default and can get both of those right, but couldn't when they were confined to the user's prompt and their training data (which was often not quite current, until recently)