_peregrine_ 7 months ago
We've already tested Opus 4 and Sonnet 4 in our SQL Generation Benchmark (https://llm-benchmark.tinybird.live/). Opus 4 beat all other models. It's good.

XCSme 7 months ago
It's weird that Opus 4 is the worst at one-shot: it requires on average two attempts to generate a valid query. If a model is really that much smarter, shouldn't it lead to better first-attempt performance? It still "thinks" beforehand, right?

stadeschuldt 7 months ago
Interestingly, both Claude-3.7-Sonnet and Claude-3.5-Sonnet rank better than Claude-Sonnet-4.

Workaccount2 7 months ago
This is a pretty interesting benchmark because it seems to break the common ordering we see with all the other benchmarks.

ineedaj0b 7 months ago
I pay for Claude premium but actually use Grok quite a bit; the 'think' function gets me where I want more often than not. Odd you don't have any xAI models listed. Sure, Grok is a terrible name, but it surprises me more often. I have not tried the $250 ChatGPT model yet though; I just don't like OpenAI's practices lately.

gkfasdfasdf 7 months ago
Just curious, how do you know your questions and the SQL aren't in the LLM training data? Looks like the benchmark questions w/SQL are online (https://ghe.clickhouse.tech/).

sagarpatil 7 months ago
Sonnet 3.7 > Sonnet 4? Interesting.

dcreater 7 months ago
How does Qwen3 do on this benchmark?

mritchie712 7 months ago
Looks like this is one-shot generation, right? I wonder how much the results would change with a more agentic flow (e.g. allow it to see an error or select * from the_table first). Sonnet seems particularly good at in-session learning (e.g. correcting its own mistakes based on a linter).

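For illustration only, here is a minimal sketch of the kind of error-feedback loop described above. It is not how the benchmark works: `generate_sql` is a hypothetical stand-in for the model call, `fake_model` is a toy that fixes a typo once it sees an error, and SQLite is used in place of the benchmark's actual database.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

# Hypothetical model interface (an assumption, not a real API): takes the
# question plus the previous error message (None on the first try) and
# returns a SQL string.
GenerateSql = Callable[[str, Optional[str]], str]


def query_with_retries(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: GenerateSql,
    max_attempts: int = 3,
) -> Tuple[List[tuple], int]:
    """Run generated SQL, feeding any database error back to the generator.

    Returns (rows, attempts_used). Raises RuntimeError if every attempt fails.
    """
    feedback: Optional[str] = None
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        sql = generate_sql(question, feedback)
        try:
            return conn.execute(sql).fetchall(), attempt
        except sqlite3.Error as exc:
            # Keep the failing query and the error text for the next attempt.
            last_error = exc
            feedback = f"Query failed: {sql!r}; error: {exc}"
    raise RuntimeError(f"no valid query after {max_attempts} attempts") from last_error


def fake_model(question: str, feedback: Optional[str]) -> str:
    # Toy stand-in for an LLM: misspells the table name on the first try,
    # then corrects it once it has seen an error.
    if feedback is None:
        return "SELECT COUNT(*) FROM the_tabel"
    return "SELECT COUNT(*) FROM the_table"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (id INTEGER)")
conn.executemany("INSERT INTO the_table VALUES (?)", [(1,), (2,), (3,)])

rows, attempts = query_with_retries(conn, "How many rows?", fake_model)
print(rows, attempts)  # prints [(3,)] 2
```

With a loop like this, "attempts" counts how many tries were needed before the database accepted a query, so a first-try success would report 1.
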
jpau 7 months ago
Interesting! Is there anything to read into needing twice the "Avg Attempts", or is this column relatively uninteresting in the overall context of the bench?

XCSme 7 months ago
That's a really useful benchmark. Could you add 4.1-mini?

jjwiseman 7 months ago
Please add GPT o3.

varunneal 7 months ago
Why is o3-mini there but not o3?

joelthelion 7 months ago
Did you try Sonnet 4?

kadushka 7 months ago
What about o3?