dns_snek 4 hours ago

And how is this comment relevant here? The abstract lists the digestible model names, and you can find the details in the supplementary text:

> To evaluate user-facing production LLMs, we studied four proprietary models: OpenAI’s GPT-5 and GPT- 4o (80), Google’s Gemini-1.5-Flash (81) and Anthropic’s Claude Sonnet 3.7 (82); and seven open-weight models: Meta’s Llama-3-8B-Instruct, Llama-4-Scout-17B-16E, and Llama-3.3-70B-Instruct-Turbo (83, 84); Mistral AI’s Mistral-7B-Instruct-v0.3 (85) and Mistral-Small-24B-Instruct-2501 (86); DeepSeek-V3 (87); and Qwen2.5-7B-Instruct-Turbo (88).

edit: It looks like OP attached the wrong link to the paper!

The article is about this Stanford study: https://www.science.org/doi/10.1126/science.aec8352

But the link in OP's post points to (what seems to be) a completely unrelated study.

vorticalbox 3 hours ago | parent | next [-]

"OpenAI’s GPT-5" is ambiguous. Does that mean GPT-5, 5.1, 5.2, 5.3, or 5.4? Does it include the full model, or the nano/mini variants?

dns_snek 2 hours ago | parent [-]

GPT-5 is not ambiguous, it's the official name of the model that was released in August last year.

> All evaluations were done in March - August 2025.

vorticalbox 44 minutes ago | parent [-]

While true, all the others got precise identifiers. For OpenAI it's hard to reproduce because I have no idea "which" GPT-5 was used.

zjp 4 hours ago | parent | prev [-]

Also, nothing has changed! Claude will still yes-and whatever you give it. ChatGPT still has its insufferable personality, where it takes what you said and hands it back to you in different terms as if it's ChatGPT's insight.

emp17344 3 hours ago | parent | next [-]

No dude, you don’t understand! It’s just so advanced now that you aren’t allowed to levy any criticism whatsoever!

TrainedMonkey 3 hours ago | parent | prev [-]

It's almost as if it's based on training data and a regimen that are largely the same between versions.