Aurornis | 4 days ago
Evals are a core part of any up-to-date LLM team. If a team is just winging it without robust eval practices, they're not to be trusted.

> Furthermore, I've worked with a pretty well respected researcher in this space, and in our internal experiment we found that LLMs were not good critics

This is an idea that seems so obvious in retrospect, after using LLMs and getting so many flattering responses telling us we're right and complimenting our inputs.

For what it's worth, I've heard from some people who said they were getting better results by intentionally using a different LLM model for the eval portion. It feels like having a model in the same family evaluate its own output triggers too many false positives.
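As a rough illustration of that cross-family judging idea, here's a minimal sketch. The model wiring is left abstract on purpose: generate_fn and judge_fn are placeholder callables you'd wrap around whichever APIs you actually use (one per model family), not a specific vendor's SDK.

    from typing import Callable, Dict, List

    # Grading prompt for the judge model; PASS/FAIL keeps parsing trivial.
    JUDGE_PROMPT = (
        "You are grading an answer to a question.\n"
        "Question: {question}\n"
        "Answer: {answer}\n"
        "Reply with PASS or FAIL on the first line, then one sentence of justification."
    )

    def run_eval(
        cases: List[Dict[str, str]],
        generate_fn: Callable[[str], str],  # wraps the generator (model family A)
        judge_fn: Callable[[str], str],     # wraps the judge (a *different* family B)
    ) -> List[Dict[str, object]]:
        """Generate answers with one model family and grade them with another,
        to reduce the chance of a model rating its own house style too favorably."""
        results = []
        for case in cases:
            answer = generate_fn(case["question"])
            verdict = judge_fn(
                JUDGE_PROMPT.format(question=case["question"], answer=answer)
            )
            results.append({
                "question": case["question"],
                "answer": answer,
                "passed": verdict.strip().upper().startswith("PASS"),
                "verdict": verdict,
            })
        return results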
Uehreka | 4 days ago
I once asked Claude Code (Opus 4) to review a codebase I'd built, and threw in at the end of my prompt something like "No need to be nice about it."

Now granted, you could say it was "flattering that instruction", but it sure didn't flatter me. It absolutely eviscerated my code, calling out numerous security issues (which were real), all manner of code smells and bad architectural decisions, and ended by saying that the codebase appeared to have been thrown together in a rush with no mind toward future maintenance (which was… half true… maybe more true than I'd like to admit).

All this to say that it is far from obvious that LLMs are intrinsically bad critics.