PaulHoule, 2 days ago:
This makes me think of ways LLMs perform better in real life than they do in evals. For instance, I often ask AI assistants what some code is trying to do in application software, where it's a matter of React, CSS, and how APIs get used. That's largely pattern matching, doesn't require deep thought, and I find LLMs often nail it. "What does this systems-oriented code do?" is a different story: now you're up against halting-problem-flavored questions, or cases where a person gets hypnotized by an almost-bubble-sort into thinking it's a bubble sort, and the LLM does too. You can certainly build code-understanding benchmarks around arbitrarily complex "whiteboard interview" code, but that doesn't measure the ability (or inability) to deal with "what is up with this API?"
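To make the "almost-bubble-sort" concrete, here is a hypothetical sketch (the function name and the specific bug are illustrative, not taken from any real codebase): it reads like the textbook early-exit bubble sort, but the no-swap check sits inside the inner loop instead of after it, so the first in-order pair it encounters ends the whole sort.

    // Hypothetical almost-bubble-sort. Looks like the standard
    // early-exit variant, but the exit check is misplaced.
    function almostBubbleSort(a: number[]): number[] {
      for (let i = 0; i < a.length - 1; i++) {
        let swapped = false;
        for (let j = 0; j < a.length - 1 - i; j++) {
          if (a[j] > a[j + 1]) {
            [a[j], a[j + 1]] = [a[j + 1], a[j]];
            swapped = true;
          }
          // Bug: this belongs after the inner loop. Placed here, one
          // in-order pair at the front aborts the entire sort.
          if (!swapped) return a;
        }
      }
      return a;
    }

    almostBubbleSort([1, 3, 2]); // returns [1, 3, 2], not [1, 2, 3]

A skim that pattern-matches on the loop shape will call this bubble sort anyway, which is exactly the trap: the question isn't "does this look like bubble sort?" but "does it actually sort?", and that's where both people and LLMs get hypnotized.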
animuchan, 2 days ago (reply):
I think what you're describing is that easy tasks are easy to perform, which is of course true. Anecdotally, a lot of the value I get from Copilot is on simple, mundane tasks.