PaulHoule, 2 days ago:
This makes me think of ways LLMs perform better in real life than they do in evals. For instance, I often ask AI assistants what some code is trying to do in application software, where it's a matter of React, CSS, and how APIs get used. That's largely pattern matching, doesn't require deep thought, and I find LLMs often nail it. "What does this systems-oriented code do?" is a different story: now you're up against halting-problem-flavored questions, or cases where a person gets hypnotized by an almost-bubble-sort into thinking it's a bubble sort, and the LLM does too. You can certainly build code-understanding benchmarks around arbitrarily complex "whiteboard interview" code, but that doesn't measure the ability (or inability) to deal with "what is up with this API?"
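To make the "almost-bubble-sort" concrete, here is a hypothetical sketch (the function name and the specific bug are illustrative, not taken from any real codebase): it reads like the textbook early-exit bubble sort, but the no-swap check sits inside the inner loop instead of after it, so the first in-order pair it encounters ends the whole sort.

    // Hypothetical almost-bubble-sort. Looks like the standard
    // early-exit variant, but the exit check is misplaced.
    function almostBubbleSort(a: number[]): number[] {
      for (let i = 0; i < a.length - 1; i++) {
        let swapped = false;
        for (let j = 0; j < a.length - 1 - i; j++) {
          if (a[j] > a[j + 1]) {
            [a[j], a[j + 1]] = [a[j + 1], a[j]];
            swapped = true;
          }
          // Bug: this belongs after the inner loop. Placed here, one
          // in-order pair at the front aborts the entire sort.
          if (!swapped) return a;
        }
      }
      return a;
    }

    almostBubbleSort([1, 3, 2]); // returns [1, 3, 2], not [1, 2, 3]

A skim that pattern-matches on the loop shape will call this bubble sort anyway, which is exactly the trap: the question isn't "does this look like bubble sort?" but "does it actually sort?", and that's where both people and LLMs get hypnotized.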
animuchan, 2 days ago (reply):
I think what you're describing is that easy tasks are easy to perform, which is of course true. Anecdotally, a lot of the value I get from Copilot is on simple, mundane tasks.