CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs (arxiv.org)
49 points by matt_d 3 hours ago | 3 comments
rahen 2 hours ago

Strictly speaking, this is very domain-specific and doesn't enable any performance that Triton couldn't already achieve (eliminating global memory round-trips via epilogue fusion is nothing new). The real takeaway is the design shift for LLM-driven codegen rather than handcrafted kernels.

LLMs are still bad at low-level hardware optimizations, but really good at high-level composition. Designing compiler abstractions with a restricted, composable API so an LLM can easily glue expert-written blocks together is a smart move. I suspect this will eventually become the norm for codegens as we move to agentic development.

sroussey an hour ago

I imagine this is what’s already done for AI laying out hardware design.

maxignol an hour ago

"LLMs can successfully author CODA kernels." That might speed up progress in this area, then.