| ▲ | jasonjmcghee an hour ago |
Interesting selection of models for the "instruction count vs. accuracy" plot. Curious when that was done and why they chose those models. How well does this generation of models do — ChatGPT 5/5.1 (and codex/mini/nano variants), Gemini 3, Claude Haiku/Sonnet/Opus 4.5, recent Grok models, Kimi 2 Thinking, etc.?
|
| ▲ | alansaber an hour ago | parent [-] |
Guessing they included some smaller models just to show how accuracy drops off at smaller context sizes
| ▲ | jasonjmcghee an hour ago | parent [-] |
Sure — I was more commenting that they are all more than six months old. That sounds silly, but things have been changing fast, and instruction following is definitely an area that has been developing a lot recently. I would be surprised if accuracy still drops off that hard.
| ▲ | 0xblacklight 15 minutes ago | parent [-] |
I imagine it's highly correlated with parameter count, but the research is a few months old and frontier model architecture is pretty opaque, so it's hard to draw too many conclusions about newer models that aren't in the study beyond what I wrote in the post.
|
|