CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Strictly speaking, this is very domain-specific and doesn't unlock any performance that Triton couldn't already reach (eliminating global memory round-trips via epilogue fusion is nothing new). The real takeaway is the design shift: it targets LLM-driven codegen rather than handcrafted kernels.

LLMs are still bad at low-level hardware optimizations, but really good at high-level composition. Designing compiler abstractions with a restricted, composable API so an LLM can easily glue expert-written blocks together is a smart move. I suspect this will eventually become the norm for code generators as we move to agentic development.
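
To make the "restricted, composable API" idea concrete, here is a toy sketch in plain Python. It is not CODA's actual API: every name in it (`bias_add`, `relu`, `scale`, `compose`, `gemm`) is hypothetical, and the nested loops merely stand in for an expert-written GEMM kernel. The point is only that the caller picks from a small set of epilogue steps, and each step runs on the accumulator before the single store to the output.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]
# An epilogue step maps (accumulator value, row, col) to a new value.
Epilogue = Callable[[float, int, int], float]


def bias_add(bias: Sequence[float]) -> Epilogue:
    """Hypothetical epilogue step: add a per-column bias."""
    return lambda acc, i, j: acc + bias[j]


def relu() -> Epilogue:
    """Hypothetical epilogue step: clamp negatives to zero."""
    return lambda acc, i, j: max(acc, 0.0)


def scale(alpha: float) -> Epilogue:
    """Hypothetical epilogue step: multiply by a constant."""
    return lambda acc, i, j: alpha * acc


def compose(*steps: Epilogue) -> Epilogue:
    """Chain epilogue steps left to right; no steps means identity."""
    def run(acc: float, i: int, j: int) -> float:
        for step in steps:
            acc = step(acc, i, j)
        return acc
    return run


def gemm(a: Matrix, b: Matrix, epilogue: Epilogue = compose()) -> Matrix:
    """Naive matmul standing in for the expert-written block."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            # The epilogue runs while the value is still local, so the
            # output is written once instead of being re-read per op.
            out[i][j] = epilogue(acc, i, j)
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so a @ b == a
fused = gemm(a, b, compose(bias_add([0.5, 0.5]), relu(), scale(2.0)))
print(fused)
```

The composition step is the only part an LLM would have to write here; the loop body (the "kernel") never changes, which is roughly the division of labor I mean.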

	▲	sroussey an hour ago | parent [-]
		I imagine this is already what's done when AI lays out hardware designs.

▲	maxignol an hour ago | parent | prev [-]

"LLMs can successfully author CODA kernels." That might speed up progress in this area, then.