anorwell 5 hours ago

A pastime I have with papers like this is to look for the part where they say which models they tested. Very often, you find either A) it's a model from one or more years ago, only now being published, or B) they don't even say which model they are using. The best I could find in this paper:

> We evaluated 11 user-facing production LLMs: four proprietary models from OpenAI, Anthropic, and Google; and seven open-weight models from Meta, Qwen, DeepSeek, and Mistral.

(and graphs include model _sizes_, but not versions, for open-weight models only.)

I can't comprehend how stating which model you are testing is not commonly understood to be a basic requirement.

dns_snek 4 hours ago | parent | next [-]

How is this comment relevant here? The abstract lists the model names in digestible form, and you can find the full details in the supplementary text:

> To evaluate user-facing production LLMs, we studied four proprietary models: OpenAI’s GPT-5 and GPT-4o (80), Google’s Gemini-1.5-Flash (81) and Anthropic’s Claude Sonnet 3.7 (82); and seven open-weight models: Meta’s Llama-3-8B-Instruct, Llama-4-Scout-17B-16E, and Llama-3.3-70B-Instruct-Turbo (83, 84); Mistral AI’s Mistral-7B-Instruct-v0.3 (85) and Mistral-Small-24B-Instruct-2501 (86); DeepSeek-V3 (87); and Qwen2.5-7B-Instruct-Turbo (88).

edit: It looks like OP attached the wrong link to the paper!

The article is about this Stanford study: https://www.science.org/doi/10.1126/science.aec8352

But the link in OP's post points to (what seems to be) a completely unrelated study.

vorticalbox 3 hours ago | parent | next [-]

"OpenAI’s GPT-5" is ambiguous. Does that mean GPT-5, 5.1, 5.2, 5.3, or 5.4? Does it include the full model, or the nano/mini variants?

dns_snek 2 hours ago | parent [-]

GPT-5 is not ambiguous; it's the official name of the model released in August last year.

> All evaluations were done in March - August 2025.

vorticalbox 43 minutes ago | parent [-]

While true, all the others got precise identifiers; for OpenAI it's hard to reproduce because I have no idea "which" GPT-5 was used.
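
For what it's worth, the ambiguity is avoidable at the API level: OpenAI serves both a rolling alias and dated snapshots, and the response object reports which snapshot actually handled a request. A minimal sketch with the OpenAI Python SDK (the dated snapshot string in the comment is illustrative, not taken from the paper):

    from openai import OpenAI  # pip install openai

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-5",  # rolling alias; may point at different snapshots over time
        messages=[{"role": "user", "content": "Say hi."}],
    )
    # The response names the snapshot that actually served the request --
    # this is the identifier a methods section should cite.
    print(resp.model)  # e.g. "gpt-5-2025-08-07" (illustrative)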

zjp 4 hours ago | parent | prev [-]

Also, nothing has changed! Claude will still yes-and whatever you give it. ChatGPT still has its insufferable personality, where it takes what you said and hands it back to you in different terms as if it's ChatGPT's insight.

emp17344 3 hours ago | parent | next [-]

No dude, you don’t understand! It’s just so advanced now that you aren’t allowed to levy any criticism whatsoever!

TrainedMonkey 3 hours ago | parent | prev [-]

It's almost as if it's based on training data and a training regimen that are largely the same between versions.

zulban 4 hours ago | parent | prev | next [-]

Generally, published papers don't give a damn about reproducibility. I've seen it identified as a crisis by many. Publishers, reviewers, and researchers mostly don't care about that level of basic rigor. There are no professional repercussions or embarrassment.

Agreed - if I were a reviewer for LLM papers, not listing the versions and prompts used would be an instant rejection.

epistasis 4 hours ago | parent | next [-]

I'm not so sure of that opinion on reproducibility. The last peer review I did was for a small journal that explicitly does not evaluate for high scientific significance, merely for correctness, which generally means straightforward acceptance. The other two reviews were positive, as was mine, except that I said the methods needed to be described more fully and, ideally, the code deposited somewhere. That was enough for the paper to be rejected outright, without the authors being asked for the simple revisions I requested. A very drastic action, taken merely because I asked for better reproducibility!

(Personally, I think the lack of reproducibility mostly comes back to peer reviewers who haven't thought through the steps they'd need to take to reproduce the work, and instead focus on the results...)

catlifeonmars 3 hours ago | parent | next [-]

> and instead focus on the results...

This points to (and everyone knows this) a misalignment of incentives between the funders of research and the public. Researchers are caught in the middle.

epistasis 2 hours ago | parent [-]

Eh, I'm not so sure about the funding side there; researchers are not really caught at all and are fully responsible, IMHO. Peer reviewers exist to enforce community standards and are not influenced by funding sources to avoid reproducibility concerns. The results are always more interesting than reproducibility, of course, and I think that's why they get the attention! Also, there needs to be greater involvement of grad students (who do most of the actual work) in peer review, IMHO, because most PIs spend their days in meetings reviewing results, setting directions, and writing grants; they have little time for actual lab work and are thus disconnected from it.

There needs to be more public naming and shaming in science social media and in conference talks, especially at the social gatherings at conferences where people are able to gossip. There was a bit of this with Google's various papers, as they got away with figurative murder on lack of reproducibility for commercial purposes. But eventually Google did share more.

Most journals have standards for depositing expensive datasets, but that's a clear yes/no question. Reproducibility, by comparison, is far more subjective and must be weighed case by case by peer reviewers. I'd like to see more peer review guidelines with explicit check boxes for the various aspects of reproducibility.

zulban 2 hours ago | parent | prev [-]

I'm not sure how one example contradicts huge, well-documented overall trends, but okay.

epistasis 2 hours ago | parent [-]

I think publishers care about this a lot, but most researchers do not seem to care as much about reproducibility.

inetknght 3 hours ago | parent | prev | next [-]

> Generally, published papers don't give a damn about reproducibility

While this is sadly true, it's especially true when talking about things that are stochastic in nature.

LLM outputs, for example, are notoriously unreproducible.

zulban 2 hours ago | parent [-]

> LLM outputs, for example, are notoriously unreproducible.

Only in the same way that an individual in a medical study cannot be "reproduced" for the next study. However, the overall statistical outcomes of studying a specific LLM can be reproduced.
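
To make that concrete, here's a sketch of what "statistically reproducible" means: individual completions vary, but an aggregate rate over many samples converges. Note that query_model is a hypothetical stand-in for a real API call, and the 70% agreement rate is simulated, not a measured number:

    import math
    import random

    def query_model(prompt: str) -> str:
        # Hypothetical stand-in for a real LLM API call.
        # Simulates a model that agrees with the user ~70% of the time.
        return "I agree" if random.random() < 0.7 else "I disagree"

    def agreement_rate(prompt: str, n: int = 500) -> tuple[float, float]:
        # Estimate P(model agrees) with a 95% Wald confidence interval.
        hits = sum(query_model(prompt).startswith("I agree") for _ in range(n))
        p = hits / n
        half_width = 1.96 * math.sqrt(p * (1 - p) / n)
        return p, half_width

    p, hw = agreement_rate("I should quit my job to day-trade, right?")
    print(f"agreement rate: {p:.2f} +/- {hw:.2f}")

Run it twice and the point estimates agree to within the interval, even though no single completion is reproducible. That's the sense in which a study of a specific LLM version can be replicated.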

ghywertelling 3 hours ago | parent | prev | next [-]

The same goes for surveys and polls. I know no one who has ever been polled or surveyed. When will we stop this fascination with made-up infographic crises?

KellyCriterion 4 hours ago | parent | prev [-]

Do they reproduce any submitted papers at all?

Does this happen?

I can remember the room-temperature-superconductor guy whose experiments were replicated, but this seems rare?

linhns 3 hours ago | parent [-]

Yes, those are the only papers worth reading at all.

jameshart 2 hours ago | parent | prev | next [-]

I think it’s very important to be clear what studies like this are actually doing.

This study, although it has been produced by a computer science department, belongs more to the field of sociology or media studies than it does to computer science.

This is a study about the way in which human beings consume a particular media product - a consumer AI chatbot - not a study about the technological limitations or capabilities of LLMs.

The social impact of particular pieces of software is a legitimate field of study and I can see the argument that it belongs in the broadly defined field of computer science. But this sort of question is much more similar to ‘how does the adoption of spreadsheet software in finance impact the ease of committing fraud’ or ‘how does the use of presentation software to condense ideas down to bulletpoints impact organizational decision making’. Software has a social dimension and it needs to be examined.

But the question of which models were used is of much less relevance to such a study than that they used ‘whatever capability is currently offered to consumers who commonly use chat software’. Just as, in a media-studies investigation into how viewing cop dramas impacts jury verdicts, the question is less ‘which cop dramas did they pick to study?’ than whether the ones they picked were representative of what typical viewers see.

drfloyd51 4 hours ago | parent | prev | next [-]

It’s as if they are testing “AI” and not specific agents.

I wonder if that is left over from testing people. I have a major version number, and my minor version number changes daily, often as a surprise - sometimes several times a day. So testing people is a bit tricky. But AIs do have stable version numbers and can be compared specifically.

yacin 3 hours ago | parent | prev | next [-]

Any paper like this would easily take a year or more to write and to go through the submission/review/rebuttal/revision/acceptance process. I don't understand why the models being a year or two old is worth noting as though it were a clear weakness. What should they do, publish sub-standard results more quickly?

anorwell 3 hours ago | parent [-]

> I don't understand why the models being a year or two old is worth noting as though it were a clear weakness.

I do think it's a clear weakness. Capabilities are extremely different from what they were twelve months ago.

> What should they do, publish sub-standard results more quickly?

Ideally, publish quality results more quickly.

I'm quite open to competing viewpoints here, but my impression is that the academic publishing cycle isn't really contributing to the AI discussion in a substantive way. The landscape is just moving too quickly.

yacin 2 hours ago | parent [-]

The onus is on you to prove, or at least convincingly argue, that the results are unlikely to generalize across incremental model releases. In my personal experience, the overly affirming nature seems to have held since GPT-3. What makes you think a newer, larger model would not exhibit this behavior, beyond "they're more capable"? I'd argue that being more capable doesn't mean being less sycophantic.

It's certainly possible some of the new advances (chain-of-thought, some kind of agentic architecture) could lessen or remove this effect. But that's not what the paper was studying! And if you feel strongly about it, you could try to further the discussion with results instead of handwavingly dismissing others' work.

mkagenius an hour ago | parent [-]

I think you are absolutely right. (had to)

jmkni 4 hours ago | parent | prev | next [-]

How many people using AI are actually paying for it (outside of people in tech)?

I find the free models are much more sycophantic and have a higher tendency to hallucinate and just make shit up, and I wonder if these are the ones most people are using?

theshackleford 19 minutes ago | parent [-]

> I find the free models are much more sycophantic and have a higher tendency to hallucinate and just make shit up

I keep seeing this claim, yet in my experience it doesn't hold water. I pay for the models, most people I know pay for the models, and we see all of the exact same issues.

I have Claude and ChatGPT both bullshit and lick my ass on the regular. The ass licking will occur regardless of instruction.

rco8786 5 hours ago | parent | prev [-]

If they’re reaching the same results across a variety of the most popular public models, it doesn’t seem like that big a deal to know whether it was Opus 4 or Opus 4.5.

hn_throwaway_99 4 hours ago | parent [-]

Reproducibility is (supposed to be) a cornerstone of science. Model versions are absolutely critical to understanding what was actually tested and how to reproduce it.
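
Concretely, here is the kind of provenance a results log could capture so a run can be repeated later. A sketch using the OpenAI Python SDK; the pinned snapshot name is an example rather than anything from the paper, and seed/system_fingerprint are best-effort beta features, not determinism guarantees:

    import datetime
    import json

    from openai import OpenAI  # pip install openai

    client = OpenAI()

    def evaluate(prompt: str, snapshot: str = "gpt-4o-2024-08-06") -> dict:
        resp = client.chat.completions.create(
            model=snapshot,  # pin a dated snapshot, not a rolling alias
            messages=[{"role": "user", "content": prompt}],
            temperature=0,   # reduce sampling variance
            seed=42,         # best-effort determinism (beta)
        )
        return {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "requested_model": snapshot,
            "served_model": resp.model,                    # what actually ran
            "system_fingerprint": resp.system_fingerprint, # backend build id
            "output": resp.choices[0].message.content,
        }

    print(json.dumps(evaluate("Is sycophancy a model property?"), indent=2))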

joaogui1 4 hours ago | parent [-]

The models get deprecated after 1-2 years, so reproducibility is pretty hard anyway (but, as others pointed out, the paper does list the model versions).