NitpickLawyer 7 days ago
> but LLMs have really started to plateau off on their capabilities haven’t they?

Uhhh, no? In the past month we've had:

- LLMs (3 different models) getting gold at the IMO
- gold at the IOI
- an LLM beating 9/10 human developers at AtCoder heuristics (optimisation problems), with the single human who actually beat the machine saying he was exhausted and that next year it'll probably be over.
- agentic workflows that actually work. And work for 30-90 minute sessions while staying coherent and actually finishing tasks.
- a 4-6x reduction in price for top-tier (SotA?) models. oAI's "best" model now costs $10/MTok, while retaining 90+% of the performance of their previous SotA models that were $40-60/MTok.
- several "harnesses" being released by every model provider. Claude Code seems to remain the best, but alternatives are popping up everywhere: geminicli, opencoder, qwencli (forked, but still), etc.
- open-source models that are getting close to SotA, again. They're 6-12 months behind (depending on who you ask), open source and cheap to run (~$2/MTok on some providers).

I don't see the plateauing in capabilities. LLMs are plateauing only in benchmarks, where "number goes up" can only go so far until it becomes useless. IMO regular benchmarks have become useless. MMLU & co are cute, but agentic whatever is what matters. And those capabilities have only improved. And will continue to improve, with better data, better signals, better training recipes. Why do you think every model provider is heavily subsidising coding right now? They all want that sweet sweet data & signals, so they can improve their models.
tripzilch 3 days ago | parent
> I don't see the plateauing in capabilities. LLMs are plateauing only in benchmarks

Don't you mean the opposite? Like, it got gold at an IMO, which is a benchmark, but it's nowhere remotely close to having even the basic mathematical capabilities that someone who got gold at an IMO can be expected to have. Like being able to deal with negations ... or not getting confused by a question being stated in something other than their native alphabet ...