gjadi 2 hours ago:
Have these shortcomings of LLMs been addressed by better models or by better integration with other tools? Like, are they better at coding because the models are truly better, or because the agentic loops are better designed?
NitpickLawyer an hour ago (in reply):
100% by better models. Since his talk, models have gained larger context windows (up to a usable 1M tokens), and RL (reinforcement learning) has proven remarkably effective both at picking out good traces and at teaching LLMs to backtrack and recover from earlier wrong tokens. On top of that, RLAIF (RL with AI feedback) improved earlier models, and RLVR (RL with verifiable rewards) has made them very good at both math and coding.

The harnesses have helped in training the models themselves (i.e. every good trace was "baked into" the model) and have gotten better at enabling test-time compute. But at the end of the day this all feeds back into the models, and they improve.

The simplest proof is benchmarks like Terminal-Bench and SWE-bench run with simple agents. The current top models are much better than their previous versions when put in a loop with just a "bash tool". There's a ~100 LoC harness called mini-swe-agent [1] that does exactly that (a sketch of the pattern follows below). So current models + minimal loop >> previous-gen models + human-written harnesses + lots of glue.

> Gemini 3 Pro reaches 74% on SWE-bench verified with mini-swe-agent!
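For concreteness, here is a minimal sketch of what such a loop looks like: the model emits one shell command per turn, the harness runs it and feeds the output back, until the model says it is done. This is an illustration of the pattern, not mini-swe-agent's actual code, and query_model() is a hypothetical stand-in for whatever chat-completions client you use:

    import subprocess

    def query_model(messages):
        # Hypothetical LLM call; swap in any chat-completions client.
        raise NotImplementedError

    def run_agent(task, max_steps=30):
        messages = [
            {"role": "system", "content": "Solve the task. Reply with exactly one "
                                          "shell command per turn; reply DONE when finished."},
            {"role": "user", "content": task},
        ]
        for _ in range(max_steps):
            command = query_model(messages)
            messages.append({"role": "assistant", "content": command})
            if command.strip() == "DONE":
                break
            # The entire "tool set": run the command and feed its output back.
            result = subprocess.run(command, shell=True, capture_output=True,
                                    text=True, timeout=120)
            messages.append({"role": "user", "content": result.stdout + result.stderr})
        return messages

The whole harness is the loop itself; planning, backtracking, and deciding when to stop are all left to the model, which is why scores on this kind of setup track raw model quality so closely.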