_peregrine_ 7 months ago

Already tested Opus 4 and Sonnet 4 in our SQL Generation Benchmark (https://llm-benchmark.tinybird.live/)

Opus 4 beat all other models. It's good.

XCSme 7 months ago | parent | next [-]

It's weird that Opus 4 is the worst at one-shot; it requires on average two attempts to generate a valid query.

If a model is really that much smarter, shouldn't it lead to better first-attempt performance? It still "thinks" beforehand, right?

riwsky 7 months ago | parent [-]

Don’t talk to Opus before it’s had its coffee. Classic high-performer failure mode.

stadeschuldt 7 months ago | parent | prev | next [-]

Interestingly, both Claude-3.7-Sonnet and Claude-3.5-Sonnet rank better than Claude-Sonnet-4.

_peregrine_ 7 months ago | parent [-]

yeah that surprised me too

Workaccount2 7 months ago | parent | prev | next [-]

This is a pretty interesting benchmark because it seems to break the common ordering we see with all the other benchmarks.

_peregrine_ 7 months ago | parent [-]

Yeah I mean SQL is pretty nuanced - one of the things we want to improve in the benchmark is how we measure "success", in the sense that multiple correct SQL results can look structurally dissimilar while semantically answering the prompt.

There are some interesting takeaways we learned after the first round: https://www.tinybird.co/blog-posts/we-graded-19-llms-on-sql-...
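
For illustration, a minimal sketch of what result-level grading can look like, comparing what queries return rather than how they're written (a hypothetical Python helper, not the benchmark's actual harness):

    import sqlite3

    def results_match(conn: sqlite3.Connection, sql_a: str, sql_b: str) -> bool:
        """Treat two queries as equivalent if they produce the same result
        set, regardless of how the SQL text is structured."""
        rows_a = conn.execute(sql_a).fetchall()
        rows_b = conn.execute(sql_b).fetchall()
        # Sort rows so incidental ORDER BY differences don't count as
        # failures (only valid when the prompt doesn't require an ordering).
        return sorted(rows_a) == sorted(rows_b)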

ineedaj0b 7 months ago | parent | prev | next [-]

I pay for Claude premium but actually use Grok quite a bit; the 'think' function usually gets me where I want more often than not. Odd you don't have any xAI models listed. Sure, Grok is a terrible name, but it surprises me more often. I haven't tried the $250 ChatGPT model yet though, just don't like OpenAI practices lately.

timmytokyo 7 months ago | parent [-]

Not saying you're wrong about "OpenAI practices", but that's kind of a strange thing to complain about right after praising an LLM that was only recently inserting claims of "white genocide" into every other response.

veidr 7 months ago | parent [-]

For real, though.

Even if you don't care about racial politics, or even good-vs-evil or legal-vs-criminal, the fact that that entire LLM got (obviously, and ineptly) tuned to the whim of one rich individual — even if he wasn't as creepy as he is — should be a deal-breaker, shouldn't it?

gkfasdfasdf 7 months ago | parent | prev | next [-]

Just curious, how do you know your questions and the SQL aren't in the LLM training data? Looks like the benchmark questions w/SQL are online (https://ghe.clickhouse.tech/).

zarathustreal 7 months ago | parent [-]

“Your model has memorized all knowledge, how do you know it’s smart?”
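
One way to probe for that, for what it's worth: build an isomorphic copy of the benchmark with renamed identifiers and see whether scores drop. A hypothetical sketch (the rename map is made up, not the benchmark's schema):

    def rename_identifiers(sql: str, renames: dict[str, str]) -> str:
        """Naive identifier renaming for a memorization probe; a real
        implementation would rewrite the parsed AST, not raw text."""
        for old, new in renames.items():
            sql = sql.replace(old, new)
        return sql

    # Apply the same renames to the schema DDL and the gold queries.
    # If accuracy drops sharply on the renamed-but-identical benchmark,
    # the original score likely reflects memorized training data.
    renames = {"github_events": "gh_activity", "actor_login": "user_name"}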

sagarpatil 7 months ago | parent | prev | next [-]

Sonnet 3.7 > Sonnet 4? Interesting.

dcreater 7 months ago | parent | prev | next [-]

How does Qwen3 do on this benchmark?

mritchie712 7 months ago | parent | prev | next [-]

Looks like this is one-shot generation, right?

I wonder how much the results would change with a more agentic flow (e.g. allow it to see an error or select * from the_table first).

Sonnet seems particularly good at in-session learning (e.g. correcting its own mistakes based on a linter).

_peregrine_ 7 months ago | parent [-]

Actually no, we allow up to 3 attempts. In fact, Opus 4 failed 36/50 tests on the first attempt, but it was REALLY good at nailing the second attempt after receiving error feedback.
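
Roughly, that loop might look like the following (llm_generate_sql is a hypothetical stand-in for the model call; the real harness presumably differs):

    import sqlite3

    MAX_ATTEMPTS = 3

    def sql_with_retries(conn, question, llm_generate_sql):
        """Ask the model for SQL; on error, feed the message back and retry."""
        error = None
        for attempt in range(1, MAX_ATTEMPTS + 1):
            sql = llm_generate_sql(question, previous_error=error)
            try:
                conn.execute(sql)  # validate against the real database
                return sql, attempt  # attempt count feeds an "Avg Attempts" stat
            except sqlite3.Error as exc:
                error = str(exc)  # becomes feedback for the next attempt
        raise RuntimeError(f"No valid query after {MAX_ATTEMPTS} attempts: {error}")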

jpau 7 months ago | parent | prev | next [-]

Interesting!

Is there anything to read into needing twice the "Avg Attempts", or is this column relatively uninteresting in the overall context of the bench?

_peregrine_ 7 months ago | parent [-]

No, it's definitely interesting. It suggests that Opus 4 actually failed to write proper syntax on the first attempt, but given feedback it absolutely nailed the 2nd attempt. My takeaway is that this is great for pair-coding workflows - less "FIX IT CLAUDE".

XCSme 7 months ago | parent | prev | next [-]

That's a really useful benchmark. Could you add 4.1-mini?

_peregrine_ 7 months ago | parent [-]

Yeah we're always looking for new models to add

jjwiseman 7 months ago | parent | prev | next [-]

Please add GPT o3.

_peregrine_ 7 months ago | parent [-]

Noted, also feel free to add an issue to the GitHub repo: https://github.com/tinybirdco/llm-benchmark

varunneal 7 months ago | parent | prev | next [-]

Why is o3-mini there but not o3?

_peregrine_ 7 months ago | parent [-]

We should definitely add o3 - probably will soon. Also looking at testing the Qwen models

joelthelion 7 months ago | parent | prev | next [-]

Did you try Sonnet 4?

vladimirralev 7 months ago | parent [-]

It's placed 10th, below claude-3.5-sonnet, GPT-4.1, and o3-mini.

_peregrine_ 7 months ago | parent | next [-]

Yeah, this was a surprising result. Of course, bear in mind that testing an LLM on SQL generation is pretty nuanced, so take everything with a grain of salt :)


kadushka 7 months ago | parent | prev [-]

what about o3?

_peregrine_ 7 months ago | parent [-]

We need to add it