throwup238 7 hours ago

> Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

Agreed. I think the problem is that while they can innovate on algorithms and training efficiency, the human side of RLHF just doesn't scale, and they can't afford the massive amounts of custom data created and purchased by the frontier labs.

IIRC it was the application of RLHF that fixed a lot of the broken syntax LLMs used to generate, like unbalanced braces, and I still see those little problems in every open-source model I try. I don't think I've seen broken syntax from the frontier models, Codex or Claude, in over a year.
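For reference, the failure mode I mean is cheap to check. A toy sketch of my own (not anything from a real RLHF pipeline) of a delimiter-balance check that could serve as a binary reward signal:

    # Naive on purpose: ignores string literals and comments, so it's a
    # cheap filter rather than a real parser.
    def delimiters_balanced(code: str) -> bool:
        pairs = {")": "(", "]": "[", "}": "{"}
        stack = []
        for ch in code:
            if ch in "([{":
                stack.append(ch)
            elif ch in pairs:
                if not stack or stack.pop() != pairs[ch]:
                    return False
        return not stack

    assert delimiters_balanced("def f(x): return {x: [1, 2]}")
    assert not delimiters_balanced("def f(x: return {x: [1, 2]}")  # '(' never closed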

algorithm314 7 hours ago | parent | next

Can't they just run the output through a compiler to get feedback? Syntax errors seem like they should be easy to get right.
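Something like this is what I have in mind, at least for Python output (toy sketch; syntax_feedback is a name I made up, not any framework's API):

    import ast
    from typing import Optional

    def syntax_feedback(source: str) -> Optional[str]:
        """Return a compiler-style error message, or None if the code parses."""
        try:
            ast.parse(source)
            return None
        except SyntaxError as e:
            return f"line {e.lineno}: {e.msg}"

    # Hypothetical retry loop:
    # err = syntax_feedback(model_output)
    # if err:
    #     prompt += f"\nYour code failed to parse: {err}. Please fix it."

Granted, this only buys syntactic validity; it says nothing about whether the code does the right thing.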

NitpickLawyer 6 hours ago | parent | next

The difference is in scaling. The top US labs have an order of magnitude more compute available than the Chinese labs, and the difference on general tasks is obvious once you use them. A year ago the conventional wisdom was that open models were ~6 months behind SotA, but with the new RL paradigm I'd say the gap is growing. With less compute they have to focus on narrow tasks and resort to poor man's distillation, and that leads to models that show benchmaxxing behavior.

That being said, this model is MIT licensed, so it's a net benefit regardless of whether it's benchmaxxed.

rockinghigh 6 hours ago | parent | prev

They do. Pretty much all agentic models call linting, compilation, and testing tools as part of their flow.
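The flow is roughly this (toy sketch; `complete` stands in for whatever model API you're using, not a specific SDK, and only the compile step is shown):

    import os
    import subprocess
    import sys
    import tempfile
    from typing import Callable

    def agent_loop(complete: Callable[[str], str], task: str, max_turns: int = 3) -> str:
        prompt = task
        code = ""
        for _ in range(max_turns):
            code = complete(prompt)
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code)
                path = f.name
            # "Compile" step: byte-compile the file to surface syntax errors.
            result = subprocess.run(
                [sys.executable, "-m", "py_compile", path],
                capture_output=True, text=True,
            )
            os.unlink(path)
            if result.returncode == 0:
                break  # tool output is clean; accept this attempt
            # Feed the real tool error back into the next turn.
            prompt = f"{task}\n\nYour last attempt failed to compile:\n{result.stderr}\nFix it."
        return code

Real agents add linters and test runners as further tools in the same loop.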

ej88 7 hours ago | parent | prev

The new meta is purchasing RL environments where models can self-correct (e.g. a compiler will throw errors), now that SFT + RLHF have run into diminishing returns. Although there's still lots of demand for "real world" data from actually economically valuable tasks.
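The environments are shaped roughly like this (my own toy illustration of the idea, not an actual product):

    import os
    import py_compile
    import tempfile

    class CompileEnv:
        """Single-step RL environment where the compiler is the judge."""

        def __init__(self, task_prompt: str):
            self.task_prompt = task_prompt

        def reset(self) -> str:
            return self.task_prompt  # observation: the task description

        def step(self, submitted_code: str) -> tuple:
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(submitted_code)
                path = f.name
            try:
                py_compile.compile(path, doraise=True)
                obs, reward = "ok", 1.0   # verifiable success signal
            except py_compile.PyCompileError as e:
                obs, reward = str(e), 0.0  # error text doubles as feedback
            finally:
                os.unlink(path)
            return obs, reward, True  # done after one step

Real ones layer unit tests and task-specific checkers on top of the bare compile check.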