potatoman22 3 hours ago
Not to be nitpicky, but many of the 4-12B models are somewhere between GPT-3.5 and GPT-4o-mini. It's hard to find a good comparison though, because the benchmarks people score models against change so often. For reference, Sonnet 3.6 came out about a year after GPT-3.5.
nl 2 hours ago
Don't worry about being nitpicky! I'm going to out-nitpick you... Actually... I write and publish my own benchmark for this stuff. It's an agentic SQL benchmark that isn't in the training data yet, and I've found it can separate frontier models from close followers (the only models to get 100% are Opus 4.6 and GPT 5.5). The best small model I've found is a fine-tune of Opus-3.5 9B, which scores 18/25: https://sql-benchmark.nicklothian.com/?highlight=Jackrong_Qw... Haiku 4.5 scores 20/25, and Haiku is certainly better than Sonnet 3.6. GPT-3.5 scores 13/25.