GPT-5.2 (openai.com)
351 points by atgctg 2 hours ago | 177 comments

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

zone411 36 minutes ago | parent | next [-]

I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/):

The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.

The medium-reasoning version also improves: 62.7 → 72.1.

The no-reasoning version also improves: 22.1 → 27.5.

Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.

Donald 25 minutes ago | parent [-]

Gemini 3 Pro Preview gets 96.8% on the same benchmark? That's impressive

capitainenemo 21 minutes ago | parent | next [-]

And performs very well on the latest 100 puzzles too, so isn't just learning the data set (unless I guess they routinely index this repo).

I wonder how well AIs would do at bracket city. I tried gemini on it and was underwhelmed. It made a lot of terrible connections and often bled data from one level into the next.

bigyabai 3 minutes ago | parent | prev [-]

GPT-5.2 might be Google's best Gemini advertisement yet.

simonw an hour ago | parent | prev | next [-]

Wow, there's a lot going on with this pelican riding a bicycle: https://gist.github.com/simonw/c31d7afc95fe6b40506a9562b5e83...

fxwin a minute ago | parent | next [-]

the only benchmark i trust

minimaxir an hour ago | parent | prev | next [-]

Is that the first SVG pelican with drop shadows?

simonw 41 minutes ago | parent [-]

No, I got drop shadows from DeepSeek 3.2 recently https://simonwillison.net/2025/Dec/1/deepseek-v32/ (probably others as well.)

belter 14 minutes ago | parent | prev | next [-]

What happens if you ask for a pterodactyl on a motorbike?

Would like to know how much they are optimizing for your pelican....

tmaly 44 minutes ago | parent | prev [-]

seems to be eating something

danans 28 minutes ago | parent [-]

Probably a jellyfish. You're seeing the tentacles

jumploops an hour ago | parent | prev | next [-]

> “a new knowledge cutoff of August 2025”

This (and the price increase) points to a new pretrained model under-the-hood.

GPT-5.1, in contrast, was allegedly using the same pretraining as GPT-4o.

xd1936 2 hours ago | parent | prev | next [-]

> While GPT‑5.2 will work well out of the box in Codex, we expect to release a version of GPT‑5.2 optimized for Codex in the coming weeks.

https://openai.com/index/introducing-gpt-5-2/

jstummbillig 2 hours ago | parent [-]

> For coding tasks, GPT-5.1-Codex-Max is a faster, more capable, and more token-efficient coding variant

Hm, yeah, strange. You would not be able to tell, looking at every chart on the page. Obviously not a gotcha, they put it on the page themselves after all, but how does that make sense with those benchmarks?

tempaccount420 an hour ago | parent [-]

Coding requires a mindset shift that the -codex fine-tunes provide. Codex will do all kinds of weird stuff like poking around in your ~/.cargo, ~/go, etc. to find docs and trying out code in isolation; these things definitely improve capability.

dmos62 33 minutes ago | parent [-]

The biggest advantage of the codex variants, for me, is terseness and reduced sycophancy. That, and presumably better adherence to requested output formats.

zug_zug 2 hours ago | parent | prev | next [-]

For me the last remaining killer feature of ChatGPT is the quality of the voice chat. Do any of the competitors have something like that?

whimsicalism 19 minutes ago | parent | next [-]

gemini does, grok does, nobody else does (except alibaba but it’s not there yet)

Robdel12 2 hours ago | parent | prev | next [-]

I have found Claude's voice chat to be better. I only recently tried it because I liked ChatGPT's enough, but I think I'm going to use Claude going forward. I find myself getting interrupted by ChatGPT a lot whenever I do use it.

lxgr 2 hours ago | parent [-]

Claude’s voice chat isn’t “native” though, is it? It feels like it’s speech-to-text-to-LLM and back.

sosodev 2 hours ago | parent [-]

You can test it by asking it to: change the pitch of its voice, make specific sounds (like laughter), differentiate between words that are spelled the same but pronounced differently (record and record), etc.

lxgr an hour ago | parent [-]

Good idea, but an external “bolted on” LLM-based TTS would still pass that in many cases, right?

sosodev an hour ago | parent | next [-]

Yes, a sufficiently advanced marrying of TTS and LLM could pass a lot of these tests. That kind of blurs the line between native voice model and not though.

You would need:

* A STT (ASR) model that outputs phonetics not just words

* An LLM fine-tuned to understand that and also output the proper tokens for prosody control, non-speech vocalizations, etc

* A TTS model that understands those tokens and properly generates the matching voice

At that point I would probably argue that you've created a native voice model even if it's still less nuanced than the proper voice to voice of something like 4o. The latency would likely be quite high though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup but I've not tried testing them.
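
For reference, the plain cascade (without any of the phonetic/prosody plumbing above) is just three calls chained together. Here's a rough sketch using the OpenAI Python SDK as a stand-in; the model ids are placeholders and the exact save-to-file method varies by SDK version:

    from openai import OpenAI

    client = OpenAI()

    # 1. STT: audio in, plain text out (phonetic detail is lost at this step)
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=open("user_turn.wav", "rb")
    ).text

    # 2. LLM: text in, text out (it never hears tone, pitch, or pronunciation)
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript}],
    ).choices[0].message.content

    # 3. TTS: text in, audio out (prosody is whatever the voice model guesses)
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.stream_to_file("assistant_turn.mp3")

Every hop is text-only, which is exactly why the pitch/laughter/"record vs. record" tests are a decent way to tell this apart from a real voice-to-voice model.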

barrkel an hour ago | parent | prev | next [-]

The model giving it text to speak would have to annotate the text in order for the TTS to add the affect. The TTS wouldn't "remember" such instructions from a speech to text stage previously.

jablongo an hour ago | parent | prev [-]

I tried to make ChatGPT sing "Mary Had a Little Lamb" recently and it's atonal but vaguely resembles the melody, which is interesting.

websiteapi an hour ago | parent | prev | next [-]

gemini live is a thing - never tried chatgpt, are they not similar?

spudlyo 31 minutes ago | parent | next [-]

Not for my use case. I can open it up, and in restored classical Latin pronunciation say "Hi, my name is X, how are you?" and it will respond (also in Latin) "Hello X, I am well, thanks for asking. I hope you are doing great." Its pronunciation is not great, but intelligible. In the written transcript, it butchers what I say, but its responses look good, although sans macrons indicating phonemic vowel length.

Gemini responds in what I think is Spanish, or perhaps Portuguese.

However, I can hand Gemini-3-pro-preview an 8-minute-long 48k mono mp3 of a nuanced Latin speaker who nasalizes his vowels and makes regular use of elision, and it will produce an accurate macronized Latin transcription. It's pretty mind blowing.

Dilettante_ 8 minutes ago | parent [-]

I have to ask: what use case requires you to speak Latin to the LLM?

spudlyo 2 minutes ago | parent [-]

I'm a Latin language learner, and part of developing fluency is practicing extemporaneous speech. My dog is a patient listener, but a poor interlocutor. There are Latin language Discord servers where you can speak to people, but I don't quite have the confidence to do that yet. I assume the machine doesn't judge my shitty grammar.

jeanlucas an hour ago | parent | prev [-]

no.

nickvec 41 minutes ago | parent | next [-]

What? The voice chat is basically identical on ChatGPT and Gemini AFAICT.

leaK_u an hour ago | parent | prev [-]

how.

CamelCaseName 44 minutes ago | parent [-]

I find ChatGPT's voice to text to be the absolute best in the world, nearly perfect.

I have constant frustrations with Gemini voice to text misunderstanding what I'm saying or worse, immediately sending my voice note when I pause or breathe even though I'm midway through a sentence.

tmaly 43 minutes ago | parent | prev | next [-]

I can't keep up with half the new features all the model companies keep rolling out. I wish they would solve that

sundarurfriend 33 minutes ago | parent | prev | next [-]

Are you saying ChatGPT's voice chat is of good quality? Because for me it's one of its most frustrating weaknesses. I vastly prefer voice input to typing, and would love it if the voice chat mode actually worked well.

But apart from the voices being pretty meh, it's also really bad at detecting and filtering out noise, treating vehicle sounds as pauses where it can start talking (even if I'm talking much louder at the same time) or transcribing them as random YouTube-style subtitles (car motor = "Thanks for watching, subscribe!").

The speech-to-text is really unreliable (the single-chat Dictate feature gets about 98% of my words correct, this Voice mode is closer to 75%), and they clearly use an inferior model for the AI backend for this too: with the same question asked in this back-and-forth Voice mode and a normal text chat, the answer quality difference is quite stark: the Voice mode answer is most often close to useless. It seems like they've overoptimized it for speed at the cost of quality, to the extent that it feels like it's a year behind in answer reliability and usefulness.

To your question about competitors, I've recently noticed that Grok seems to be much better at both the speech-to-text part and the noise handling, and the voices are less uncanny-valley sounding too. I'd say they also don't have that stark a difference between text answers and voice mode answers, and that would be true but unfortunately mainly because its text answers are also not great with hallucinations or following instructions.

So Grok has the voice part figured out, ChatGPT has the backend AI reliability figured out, but neither provide a real usable voice mode right now.

ivape 23 minutes ago | parent | prev | next [-]

I'm a big user of Gemini voice. My sense is that Gemini voice uses very tight system prompts that are designed to give you an answer and kind of get you off the phone as much as possible. It doesn't have large context at all.

That's how I judge quality at least. The quality of the actual voice is roughly the same as ChatGPT, but I notice Gemini will try to match your pitch and tone and way of speaking.

Edit: But it looks like Gemini Voice has been replaced with voice transcription in the mobile app? That was sudden.

codybontecou 2 hours ago | parent | prev | next [-]

Their voice agent is handy. Currently trying to build around it.

semiinfinitely 29 minutes ago | parent | prev | next [-]

try gemini voice chat

FrasiertheLion 2 hours ago | parent | prev | next [-]

Try elevenlabs

sosodev 2 hours ago | parent [-]

Does elevenlabs have a real-time conversational voice model? It seems like their focus is largely on text to speech and speech to text, which can approximate that type of thing but isn't at all the same as the native voice to voice that 4o does.

hi_im_vijay 20 minutes ago | parent | next [-]

[disclaimer, i work at elevenlabs] we specifically went with a cascading model for our agents platform because it's better suited for enterprise use cases where they have full control over the brain and can bring their own llm. with that said, even with a cascading model, we can capture a decent amount of nuance with our asr model, and it also supports capturing audio events like laughter or coughing.

a true speech to speech conversational model will perform better on things like capturing tone, pronunciations, phonetics, etc, but i do believe we'll also get better at that on the asr side over time.

dragonwriter an hour ago | parent | prev [-]

> Does elevenlabs have a real-time conversational voice model?

Yes.

> It seems like their focus is largely on text to speech and speech to text.

They have two main broad offerings (“Platforms”); you seem to be looking at what they call the “Creative Platform”. The real-time conversational piece is the centerpiece of the “Agents Platform”.

sosodev an hour ago | parent [-]

It specifically says in the architecture docs for the agents platform that it's STT (ASR) -> LLM -> TTS

https://elevenlabs.io/docs/agents-platform/overview#architec...

bigyabai 2 hours ago | parent | prev [-]

Qwen does.

sosodev 2 hours ago | parent [-]

Qwen's voice chat is nowhere near as good as ChatGPT's.

preetamjinka 2 hours ago | parent | prev | next [-]

It's actually more expensive than GPT-5.1. I've gotten used to prices going down with each latest model, but this time it's gone up.

https://platform.openai.com/docs/pricing

kingstnap 34 minutes ago | parent | next [-]

Flagship models have rarely been cheaper, and especially not on release day. There are only a few cases of this, really.

Notable exceptions are Deepseek 3.2 and Opus 4.5 and GPT 3.5 Turbo.

The price drops usually come in the form of flash and mini models being really cheap and fast, like when we got o4-mini or 2.0 Flash, which was a particularly significant one.

PhilippGille an hour ago | parent | prev | next [-]

Gemini 3 Pro Preview also got more expensive than 2.5 Pro.

2.5 Pro: $1.25 input, $10 output (million tokens)

3 Pro Preview: $2 input, $12 output (million tokens)

endorphine 35 minutes ago | parent | prev | next [-]

Reading this comment, it just occurred to me that we're still in the first phase of the enshittification process.

moralestapia an hour ago | parent | prev | next [-]

Previous model's prices usually go down, but their flagship has always been the most expensive one.

Handy-Man 2 hours ago | parent | prev [-]

It also seems much "smarter" though.

josalhor 2 hours ago | parent | prev | next [-]

From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!

verdverm 2 hours ago | parent | next [-]

We're also in benchmark saturation territory. I heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well on the day-to-day

quantumHazer 2 hours ago | parent | next [-]

Seems pretty false if you look at the model card and website of Opus 4.5, which is… (checks notes) their latest model.

verdverm 2 hours ago | parent [-]

Building a good model generally means it will do well on benchmarks too. The point of the speculation is that Anthropic is not focused on benchmaxxing which is why they have models people like to use for their day-to-day.

I use Gemini. Anthropic stole $50 from me (expired and kept my prepaid credits) and I have not forgiven them yet for it, but people rave about Claude for coding so I may try the model again through Vertex AI...

The person who made the speculation I believe was more talking about blog posts and media statements than model cards. Most ai announcements come with benchmark touting, Anthropic supposedly does less / little of this in their announcements. I haven't seen or gathered the data to know what is truth

elcritch 31 minutes ago | parent [-]

You could try Codex cli. I prefer it over Claude code now, but only slightly.

verdverm 21 minutes ago | parent [-]

No thanks, not touching anything Oligarchy Altman is behind

stego-tech an hour ago | parent | prev | next [-]

These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”

It’s 2000’s PC gaming all over again (“gotta game the benchmark!”).

verdverm 41 minutes ago | parent | next [-]

I'm not sure. Here's my anecdotal counter-example: I was able to get gemini-2.5-flash, in two turns, to understand and implement something I had done separately first, and it found another bug (one I had also fixed, but forgot was in this path).

That I was able to have a flash model replicate the same solution I had, to two problems, in two turns is just the opposite of your consistency experience. I'm using tasks I've already solved as the evals while developing my custom agentic setup (prompts/tools/envs). The models are able to do more of them today than they were even 6-12 months ago (pre-thinking models).

https://bsky.app/profile/verdverm.com/post/3m7p7gtwo5c2v

stego-tech 34 minutes ago | parent [-]

And therein lies the rub for why I still approach this technology with caution, rather than charge in full steam ahead: variable outputs based on immensely variable inputs.

I read stories like yours all the time, and it encourages me to keep trying LLMs from almost all the major vendors (Google being a noteworthy exception while I try and get off their platform). I want to see the magic others see, but when my IT-brain starts digging in the guts of these things, I’m always disappointed at how unstructured and random they ultimately are.

Getting back to the benchmark angle though, we’re firmly in the era of benchmark gaming - hence my quip about these things failing “the only benchmark that matters.” I meant for that to be interpreted along the lines of, “trust your own results rather than a spreadsheet matrix of other published benchmarks”, but I clearly missed the mark in making that clear. That’s on me.

verdverm 25 minutes ago | parent [-]

I mean more the guts of the agentic systems. Prompts, tool design, state and session management, agent transfer and escalation. I come from devops and backend dev, so getting in at this level, where LLMs are tasked and composed, is more interesting.

If you are only using the providers' LLM experiences, and not something specific to coding like Copilot or Claude Code, that would be the first step to getting the magic, as you say. It is also not instant. It takes time to learn any new tech, and this one has an above-average learning curve, despite the facade and hype of how it should just be magic.

Once you find the stupid shit in the vendor coding agents, like all of us IT/devops folks do eventually, you can go a level down and build on something like the ADK to bring your expertise and experience to the building blocks.

For example, I am now implementing environments for agents based on container layers and Dagger, which unlocks the ability to cheaply and reproducibly clone what one agent was doing and have a dozen variations iterate on the next turn. Real useful for long-term training data and eval synth, but also for my own experimentation as I learn how to get better at using these things. Another thing I did was change how filesystem operations look to the agent, in particular file reads. I did this to save context & money (finops), after burning $5 in 60s because of an error in my tool implementation. Instead of having them as message contents, they are now injected into the system prompt. Doing so made it trivial to add a key/val "cache" for the fun of it, since I could now inject things into the system prompt and let the agent have some control over that process through tools. Boy has that been interesting and opened up some research questions in my mind.
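
To make the file-read change concrete, the shape of it is roughly this (a toy sketch with made-up names, not the ADK API): instead of returning file contents as tool-result messages that pile up in the transcript, keep a dict of "open" files plus a small key/val scratchpad and re-render both into the system prompt every turn.

    # Toy sketch of the idea above. Names are illustrative, not any real framework's API.
    open_files: dict[str, str] = {}
    scratchpad: dict[str, str] = {}

    def tool_read_file(path: str) -> str:
        with open(path) as f:
            open_files[path] = f.read()
        return f"{path} loaded into context"  # the transcript only ever records this stub

    def tool_remember(key: str, value: str) -> str:
        scratchpad[key] = value  # lets the agent pin facts that survive compaction
        return f"stored {key}"

    def render_system_prompt(base: str) -> str:
        files = "\n\n".join(f"=== {p} ===\n{body}" for p, body in open_files.items())
        notes = "\n".join(f"- {k}: {v}" for k, v in scratchpad.items())
        return f"{base}\n\n# Open files\n{files}\n\n# Scratchpad\n{notes}"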

snet0 11 minutes ago | parent | prev [-]

To say that a model won't solve a problem is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.

If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.

jacquesm 7 minutes ago | parent [-]

That means you're probably asking it to do very simple things.

brokensegue an hour ago | parent | prev | next [-]

how do you quantitatively measure day-to-day quality? only thing i can think of is A/B tests, which take a while to evaluate

verdverm an hour ago | parent [-]

more or less this, but also synthetic

if you think about GANs, it's all the same concept

1. train model (agent)

2. train another model (agent) to do something interesting with/to the main model

3. gain new capabilities

4. iterate

You can use a mix of both real and synthetic chat sessions, or whatever you want your model to be good at. Mid/late training seems to be where you start crafting personality and expertise.

Getting into the guts of agentic systems has me believing we have quite a bit of runway for iteration here, especially as we move beyond single model / LLM training. I still need to get into what all is du jour in RL / late training; that's where a lot of the opportunity lies, from my understanding so far.

Nathan Lambert (https://bsky.app/profile/natolambert.bsky.social) from Ai2 (https://allenai.org/) & RLHF Book (https://rlhfbook.com/) has a really great video out yesterday about the experience training Olmo 3 Think

https://www.youtube.com/watch?v=uaZ3yRdYg8A

HDThoreaun an hour ago | parent | prev | next [-]

Arc-AGI is just an iq test. I don’t see the problem with training it to be good at iq tests because that’s a skill that translates well.

mrandish 8 minutes ago | parent | next [-]

It is similar to an IQ test in that it evaluates abstract reasoning; however, unlike IQ tests, it's been designed specifically for evaluating deeper reasoning in LLMs, including being resistant to LLMs 'training to the test'. If you read Francois' papers, he's well aware of the challenge and has done valuable work toward this goal.

Dismissing ARC-AGI as 'just an IQ test' doesn't seem justified, because most SOTA LLMs have been performing well on traditional IQ tests for a while, yet SOTA LLMs did terribly on ARC-AGI-1 and showed almost no improvement for ~4 years - while knocking down virtually every other benchmark. I suspect this is because ARC-AGI's design, which uniquely tries to resist not only 'training on the answers' but also 'training on the concepts', is working to some degree.

CamperBob2 an hour ago | parent | prev [-]

Exactly. In principle, at least, the only way to overfit to Arc-AGI is to actually be that smart.

Edit: if you disagree, try actually TAKING the Arc-AGI 2 test, then post.

npinsker 36 minutes ago | parent | next [-]

Completely false. This is like saying being good at chess is equivalent to being smart.

Look no farther than the hodgepodge of independent teams running cheaper models (and no doubt thousands of their own puzzles, many of which surely overlap with the private set) that somehow keep up with SotA, to see how impactful proper practice can be.

The benchmark isn’t particularly strong against gaming, especially with private data.

CamperBob2 3 minutes ago | parent [-]

> Completely false. This is like saying being good at chess is equivalent to being smart.

No, it isn't. Go take the test yourself and you'll understand how wrong that is. Arc-AGI is intentionally unlike any other benchmark.

esafak 32 minutes ago | parent | prev [-]

I would not be so sure. You can always prep to the test.

HDThoreaun 4 minutes ago | parent [-]

How do you prep for arc agi? If the answer is just "get really good at pattern recognition" I do not see that as a negative at all.

Mistletoe 2 hours ago | parent | prev [-]

How do you measure whether it works better day to day without benchmarks?

bulbar 2 hours ago | parent | next [-]

Manually labeling answers, maybe? There's a lot of infrastructure built around that, as it's been heavily used for two decades, and it's relatively cheap.

That's still benchmarking of course, but not utilizing any of the well known / public ones.

verdverm 2 hours ago | parent | prev | next [-]

Internal evals. Big AI certainly has good, proprietary training and eval data; it's one reason why their models are better.

aydyn an hour ago | parent [-]

Then publish the results of those internal evals. Public benchmark saturation isn't an excuse to be un-quantitative.

verdverm an hour ago | parent [-]

How would published numbers be useful without knowing what underlying data is being used to test and evaluate the models? They are proprietary for a reason.

To think that Anthropic is not being intentional and quantitative in their model building, because they care less for the saturated benchmaxxing, is to miss the forest for the trees

aydyn 5 minutes ago | parent [-]

Do you know everything that exists in public benchmarks?

They can give a description of what their metrics are without giving away anything proprietary.

standardUser 2 hours ago | parent | prev [-]

Subscriptions.

mrguyorama 43 minutes ago | parent [-]

Ah yes, humans are famously empirical in their behavior and we definitely do not have direct evidence of the "best" sports players being much more likely than the average to be superstitious or do things like wear "lucky underwear" or buy right into scam bracelets that "give you more balance" using a holographic sticker.

minimaxir 2 hours ago | parent | prev | next [-]

Note that GPT-5.2 newly supports an "xhigh" reasoning level, which could explain the better benchmarks.
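
If it follows the existing reasoning-effort knob in the Responses API, opting into it should look roughly like this (the "xhigh" value is from their docs; the model id is my assumption):

    from openai import OpenAI

    client = OpenAI()
    resp = client.responses.create(
        model="gpt-5.2",                 # assumed model id
        reasoning={"effort": "xhigh"},   # new level alongside low/medium/high
        input="How many rectangles are in a 3x3 grid?",
    )
    print(resp.output_text)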

It'll be noteworthy to see the cost-per-task on ARC AGI v2.

granzymes an hour ago | parent | next [-]

> It'll be noteworthy to see the cost-per-task on ARC AGI v2.

Already live. gpt-5.2-pro scores a new high of 54.2% with a cost/task of $15.72. The previous best was Gemini 3 Pro (54% with a cost/task of $30.57).

The best bang-for-your-buck is the new xhigh on gpt-5.2, which is 52.9% for $1.90, a big improvement on the previous best in this category which was Opus 4.5 (37.6% for $2.40).

https://arcprize.org/leaderboard

minimaxir an hour ago | parent [-]

Huh, that is indeed up and to the left of Opus.

walletdrainer 38 minutes ago | parent | prev [-]

5.1-codex supports that too, no? Pretty sure I’ve been using xhigh for at least a week now

causal 2 hours ago | parent | prev | next [-]

That ARC AGI score is a little suspicious. That's a really tough benchmark for AI. Curious if there were improvements to the test harness, because that's a wild jump in general problem-solving ability for an incremental update.

taurath 2 hours ago | parent [-]

I don’t think their words mean much of anything; only the behavior of the models does.

Still waiting on Full Self Driving myself.

fuddle 33 minutes ago | parent | prev | next [-]

I don't think SWE Verified is an ideal benchmark, as the solutions are in the training dataset.

poormathskills 2 hours ago | parent | prev | next [-]

For a minor version update (5.1 -> 5.2) that's a way bigger improvement than I would have guessed.

beering 2 hours ago | parent [-]

Model capability improvements are very uneven. Changes between one model and the next tend to benefit certain areas substantially without moving the needle on others. You see this across all frontier labs’ model releases. Also the version numbering is BS (remember GPT-4.5 followed by GPT-4.1?).

catigula 2 hours ago | parent | prev | next [-]

Yes, but it's not good enough. They needed to surpass Opus 4.5.

mikairpods 2 hours ago | parent [-]

that is better...?

thinkingtoilet an hour ago | parent | prev [-]

OpenAI has already been busted for getting benchmark information and training their models on it. At this point if you believe Sam Altman, I have a bridge to sell you.

sfmike 2 hours ago | parent | prev | next [-]

Everything is still based on 4/4o, right? Is training a new model just too expensive? Maybe they can consult the DeepSeek team for cost-constrained new models.

elgatolopez 2 hours ago | parent | next [-]

Where did you get that from? The cutoff date says August 2025. Looks like a newly pretrained model.

FergusArgyll an hour ago | parent | next [-]

> This stands in sharp contrast to rivals: OpenAI’s leading researchers have not completed a successful full-scale pre-training run that was broadly deployed for a new frontier model since GPT-4o in May 2024, highlighting the significant technical hurdle that Google’s TPU fleet has managed to overcome.

- https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...

It's also plainly obvious from using it. The "Broadly deployed" qualifier is presumably referring to 4.5

SparkyMcUnicorn 2 hours ago | parent | prev [-]

If the pretraining rumors are true, they're probably using continued pretraining on the older weights. Right?

verdverm 2 hours ago | parent | prev | next [-]

Apparently they have not had a successful pre training run in 1.5 years

fouronnes3 2 hours ago | parent | next [-]

I want to read a short sci-fi story set in 2150 about how, mysteriously, no one has been able to train a better LLM for 125 years. The binary weights are studied with unbelievably advanced quantum computers, but no one can really train a new AI from scratch. This starts cults, wars and legends and ultimately (by the third book) leads to the main protagonist learning to code by hand, something that no human left alive still knows how to do. Could this be the secret to making a new AI from scratch, more than a century later?

WhyOhWhyQ 2 hours ago | parent | next [-]

There's a scifi short story about a janitor who knows how to do basic arithmetic and becomes the most important person in the world when some disaster happens. Of course after things get set up again due to his expertise, he becomes low status again.

ssl-3 an hour ago | parent | prev | next [-]

Sounds good.

Might sell better with the protagonist learning iron age leatherworking, with hides tanned from cows that were grown within earshot, as part of a process of finding the real root of the reason for why any of us ever came to be in the first place. This realization process culminates in the formation of a global, unified steampunk BDSM movement and a wealth of new diseases, and then: Zombies.

(That's the end. Zombies are always the end.)

verdverm 2 hours ago | parent | prev | next [-]

You can ask 2025 AI to write such a book; it's happy to comply and may or may not actually write the book.

https://www.pcgamer.com/software/ai/i-have-been-fooled-reddi...

georgefrowny an hour ago | parent | prev | next [-]

A software version of Asimov's Holmes-Ginsbook device? https://sfwritersworkshop.org/node/1232

I feel like there was a similar one about software, but it might have been mathematics (also Asimov: The Feeling of Power)

barrenko 2 hours ago | parent | prev | next [-]

Monsieur, if I may offer a vaaaguely similar story on how things may progress https://www.owlposting.com/p/a-body-most-amenable-to-experim...

armenarmen 2 hours ago | parent | prev [-]

I’d read it!

ijl 2 hours ago | parent | prev [-]

What kind of issues could prevent a company with such resources from that?

verdverm an hour ago | parent [-]

Drama if I had to pick the symptom most visible from the outside.

A lot of talent left OpenAI around that time, most notably in this regard would be Ilya in May '24. Remember that time Ilya and the board ousted Sam only to reverse it almost immediately?

https://arstechnica.com/information-technology/2024/05/chief...

Wowfunhappy 2 hours ago | parent | prev | next [-]

I thought whenever the knowledge cutoff increased that meant they’d trained a new model, I guess that’s completely wrong?

rockinghigh 36 minutes ago | parent | next [-]

They add new data to the existing base model via continued pre-training. You save on the full from-scratch pre-training run (the next-token prediction task), but still have to re-run mid- and post-training stages like context length extension, supervised fine-tuning, reinforcement learning, safety alignment...
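
At toy scale, continued pre-training is literally just "load the old checkpoint and keep doing next-token prediction on newer text". A minimal sketch with Hugging Face transformers (checkpoint and file names are placeholders, and this obviously isn't how OpenAI does it at scale):

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    base = "gpt2"  # stand-in for the lab's existing base model
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)  # start from old weights, not a random init

    # Hypothetical corpus of post-cutoff documents, one text per line.
    new_data = load_dataset("text", data_files={"train": "fresh_crawl_2025.txt"})["train"]
    new_data = new_data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                            batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ckpt-continued", num_train_epochs=1,
                               per_device_train_batch_size=4, learning_rate=1e-5),
        train_dataset=new_data,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # plain next-token objective
    )
    trainer.train()  # afterwards: context extension, SFT, RL, safety alignment, etc.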

brokencode 2 hours ago | parent | prev [-]

Typically I think, but you could pre-train your previous model on new data too.

I don’t think it’s publicly known for sure how different the models really are. You can improve a lot just by improving the post-training set.

catigula 2 hours ago | parent | prev [-]

The irony is that Deepseek is still running with a distilled 4o model.

blovescoffee an hour ago | parent [-]

Source?

sigmar 2 hours ago | parent | prev | next [-]

Are there any specifics about how this was trained? Especially when 5.1 is only a month old. I'm a little skeptical of benchmarks these days and wish they put this up on llmarena

edit: noticed 5.2 is ranked in the webdev arena (#2 tied with gemini-3.0-pro), but not yet in text arena (last update 22hrs ago)

emp17344 an hour ago | parent | next [-]

I’m extremely skeptical because of all those articles claiming OpenAI was freaking out about Gemini - now it turns out they just casually had a better model ready to go? I don’t buy it.

tempaccount420 an hour ago | parent [-]

They had to rush it out, I'm sure the internal safety folks are not happy about it.

kouteiheika 2 hours ago | parent | prev [-]

Unfortunately there are never any real specifics about how any of their models were trained. It's OpenAI we're talking about after all.

Tiberium 2 hours ago | parent | prev | next [-]

The only table where they showed comparisons against Opus 4.5 and Gemini 3:

https://x.com/OpenAI/status/1999182104362668275

https://i.imgur.com/e0iB8KC.png

varenc 2 hours ago | parent [-]

100% on the AIME (assuming it's not in the training data) is pretty impressive. I got like 4/15 when I was in HS...

hellojimbo 38 minutes ago | parent [-]

The no-tools part is impressive; with tools, every model gets 100%.

sundarurfriend 27 minutes ago | parent | prev | next [-]

> new context management using compaction.

Nice! This was one of the more "manual" LLM management things to remember to regularly do, if I wanted to avoid it losing important context over long conversations. If this works well, this would be a significant step up in usability for me.
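
For anyone unfamiliar, the manual version is basically: once the conversation gets too long, fold the oldest turns into a summary message and keep only the recent ones verbatim. A rough sketch of what I've been doing by hand (the model id and thresholds are just placeholders):

    def compact(messages, client, keep_last=6, budget_chars=20_000):
        """Fold older turns into a summary once the transcript exceeds a rough budget."""
        history = messages[:-keep_last]
        if sum(len(m["content"]) for m in messages) < budget_chars or not history:
            return messages
        summary = client.chat.completions.create(
            model="gpt-5.2",  # placeholder model id
            messages=[{"role": "user",
                       "content": "Summarize the key facts, decisions and open questions in "
                                  "this conversation:\n\n" +
                                  "\n".join(f'{m["role"]}: {m["content"]}' for m in history)}],
        ).choices[0].message.content
        return ([{"role": "system", "content": f"Summary of earlier conversation: {summary}"}]
                + messages[-keep_last:])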

speedgoose 34 minutes ago | parent | prev | next [-]

Trying it now in VS Code Insiders with GitHub Copilot (Codex crashes with HTTP 400 server errors), and it eventually started using sed and grep in shells instead of the better tools it has access to. I guess this doesn't stop it from performing well on benchmarks.

pixelmelt 6 minutes ago | parent [-]

to be fair I've seen the other sota models do this as well

SkyPuncher an hour ago | parent | prev | next [-]

Given the price increase and speculation that GPT 5 is a MoE model, I'm wondering if they're simply "turning up the good stuff" without making significant changes under the hood.

minimaxir an hour ago | parent | next [-]

I'm not sure why being a MoE model would allow OpenAI to "turn up the good stuff". You can't just increase the number of E without training it as such.

throwaway314155 an hour ago | parent | prev [-]

GPT 4o was an MoE model as well.

ComputerGuru 2 hours ago | parent | prev | next [-]

Wish they would include or leak more info about what this is, exactly. 5.1 was just released, yet they are claiming big improvements (on benchmarks, obviously). Did they purposely not release the best they had to keep some cards to play in case of Gemini 3 success or is this a tweak to use more time/tokens to get better output, or what?

zamadatix 2 hours ago | parent | prev | next [-]

https://openai.com/index/introducing-gpt-5-2/

ImprobableTruth 2 hours ago | parent | prev | next [-]

An almost 50% price increase. Benchmarks look nice, but 50% more nice...?

arnaudsm 2 hours ago | parent [-]

#1 models are usually priced at 2x more than the competition, and they often decrease the price right when they lose the crown.

dumbmrblah an hour ago | parent | prev | next [-]

Great! It'll be SOTA for a couple of weeks until the quality degrades due to throttling.

I'll stick with plug and play API instead.

mrandish 39 minutes ago | parent [-]

Due to the "Code Red" threat from Gemini 3, I suspect they'll hold off throttling for longer than usual (by incinerating even more investor capital than usual).

Jump in and soak up that extra-discounted compute while the getting is good, kids! Personally, I recently retired so I just occasionally mess around with LLMs for casual hobby projects, so I've only ever used the free tier of all the providers. Having lived through the dot com bubble, I regret not soaking up more of the free and heavily subsidized stuff back then. Trying not to miss out this time. All this compute available for free or below cost won't last too much longer...

zhyder an hour ago | parent | prev | next [-]

Big knowledge cutoff jump from Sep 2024 to Aug 2025. How'd they pull that off for a small point release, which presumably hasn't done a fresh pre-training over the web?

Did they figure out how to do more incremental knowledge updates somehow? If yes, that'd be a huge change to these releases going forward. I'd appreciate the freshness that comes with that (without having to rely on web search as a RAG tool, which isn't as deeply intelligent and is game-able by SEO).

With Gemini 3, my only disappointment was 0 change in knowledge cutoff relative to 2.5's (Jan 2025).

throwaway314155 an hour ago | parent [-]

> which presumably hasn't done a fresh pre-training over the web

What makes you think that?

> Did they figure out how to do more incremental knowledge updates somehow?

It's simple. You take the existing model and continue pretraining with newly collected data.

mattas 2 hours ago | parent | prev | next [-]

Are benchmarks the right way to measure LLMs? Not because benchmarks can be gamed, but because the most useful outputs of models aren't things that can be bucketed into "right" and "wrong." Tough problem!

Sir_Twist 2 hours ago | parent | next [-]

Not an expert in LLM benchmarks, but generally I think of benchmarks as being good particularly for measuring usefulness for certain use cases. Even if measuring LLMs is not as straightforward as, say, read/write speeds when comparing different SSDs, if a certain model's responses are consistently measured as being higher quality / more useful, surely that means something, right?

olliepro 2 hours ago | parent | prev [-]

Do you have a better way to measure LLMs? Measurement implies quantitative evaluation... which is the same as benchmarks.

jstummbillig 16 minutes ago | parent | prev | next [-]

So, right off the bat: 5.2 code talk (through codex) feels really nice. The first coding attempt was a little meh compared to 5.1 codex max (reflecting what they wrote themselves), but simply planning / discussing things felt markedly better than anything I remember from any previous model, from any company.

I remain excited about new models. It's like finding my coworker be 10% smarter every other week.

jiggawatts 22 minutes ago | parent | prev | next [-]

Feels a bit rushed. They haven’t even updated their API playground yet; if I select 5.2-chat-latest, I get:

Unsupported parameter: 'top_p' is not supported with this model.
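
Presumably the same rejection happens via the SDK with the ordinary chat-completions shape (model id as shown in the playground picker):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-5.2-chat-latest",
        messages=[{"role": "user", "content": "What's new in .NET 10?"}],
        top_p=0.9,  # this is the parameter the playground refuses
    )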

Also, without access to the Internet, it does not seem to know things up to August 2025. A simple test is to ask it about .NET 10 which was already in preview at that time and had lots of public content about its new features.

The model just guessed and waved its hand about, like a student that hadn’t read the assigned book.

gkbrk 2 hours ago | parent | prev | next [-]

Is this the "Garlic" model people have been hyping? Or are we not there yet?

0x457 an hour ago | parent [-]

Garlic will be released 2026Q1.

johnsutor 2 hours ago | parent | prev | next [-]

https://platform.openai.com/docs/models/gpt-5.2 More information on the price, context window, etc.

dandiep 2 hours ago | parent | prev | next [-]

Still no GPT 5.x fine tuning?

I emailed support a while back to see if there was an early access program (99.99% sure the answer is yes). This is when I discovered that their support is 100% done by AI and there is no way to escalate a case to a human.

jazzyjackson 2 hours ago | parent [-]

Containment breach is going to occur from a disgruntled customer convincing the customer service bot it needs to get a hold of a supervisor

dandiep an hour ago | parent [-]

I tried this!

Me: Is it possible get early access to the GPT 5 fine tuning? It is much better at non-English languages, which is the majority of what we do, so it would be greatly beneficial to us.

Bot: I'm an AI support agent. I don't have information about early access to GPT-5 fine-tuning or unannounced features. Currently, there is no public information or program for early access to GPT-5 fine-tuning. Please keep an eye on official OpenAI channels for updates. Let me know if you have any questions about current features or products!

Me: Can you ask someone from the team if I can be part of the early access? I know there is no public information, but there is an early access program and I would like to be part of it.

Bot: I'm an AI support agent and do not have the ability to forward requests or make inquiries about early access programs with the OpenAI team. If early access becomes available, it will be announced through official channels. Let me know if you have any other questions about OpenAI products or current fine-tuning options.

Me: How do I get in touch with the open ai team?

Bot: I'm an AI support agent and can only assist with OpenAI products or services through this support channel. There is no separate contact method for the OpenAI team beyond official support and public announcements. If you have questions about current products or features, I'm here to help

Should have tried a jailbreak maybe...

fulafel 2 hours ago | parent | prev | next [-]

So GDPval is OpenAI's own benchmark. PDF link: https://arxiv.org/pdf/2510.04374

andreygrehov 27 minutes ago | parent | prev | next [-]

Every new model is ‘state-of-the-art’. This term is getting annoying.

coolfox 2 hours ago | parent | prev | next [-]

The halving of error rates for image inputs is pretty awesome; this makes it far more practical for issues where it isn't easy to input all the needed context. When I get lazy I'll just shift+win+s the problem and ask one of the chatbots to solve it.

JanSt 2 hours ago | parent | prev | next [-]

The benchmarks are very impressive. Codex and Opus 4.5 are really good coders already and they keep getting better.

No wall yet and I think we might have crossed the threshold of models being as good or better than most engineers already.

GDPval will be an interesting benchmark and I'll happily use the new model to test spreadsheet (and other office work) capabilities. If they can keep going like this just a little bit further, many office workers will stop being useful... I don't know yet how to feel about this.

Great for humanity probably, but for the individuals?

llmslave 2 hours ago | parent | next [-]

Yeah, there's no wall on this. It will be able to mimic all of human behavior given proper data.

ionwake 2 hours ago | parent | prev | next [-]

it was only about 2-3 weeks ago that several HNers told me "nah, you better re-check your code" when I explained that I have over 2 decades of coding xp, yet have not manually edited code (in memory) for the last 6 or so months, whilst performing 12-hour daily vibe code seshes

ipsum2 an hour ago | parent [-]

It really depends on the complexity of code. I've found models (codex-5.1-max, opus 4.5) to be absolutely useless writing shaders or ML training code, but really good at basic web development.

sheeshe an hour ago | parent [-]

Which is no surprise, as web development material exists in large amounts on the web the models feed off of.

sheeshe an hour ago | parent | prev [-]

Ok, so why aren't there mass layoffs ensuing right now?

ghosty141 an hour ago | parent [-]

Because from my experience using codex in a decently complex c++ environment at work, it works REALLY well when it has things to copy. Refactorings, documentation, code review etc. all work great. But those things only help actual humans and they also take time. I estimate that in a good case I save ~50% of time, in a bad case it's negative and costs time.

But what I generally found is that it's not that great at writing new code. Obviously an LLM can't think, and you notice that quite quickly: it doesn't create abstractions, use abstractions, or try to find general solutions to problems.

People who get replaced by Codex are those who do repetitive tasks in a well understood field. For example, making basic websites, very simple crud applications etc..

I think it's also not layoffs but rather companies will hire less freelancers or people to manage small IT projects.

slackr 5 minutes ago | parent | prev | next [-]

“…where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations.”

What a sociopathic way to sell

Ninjinka 2 hours ago | parent | prev | next [-]

Man this was rushed, typo in the first section:

> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.

petercooper an hour ago | parent [-]

Also, did they mention these features? I was looking out for it but got to the end and missed it.

(No, I just looked again and the new features listed are around verbosity, thinking level and the tool stuff rather than memory or knowledge.)

k2xl 2 hours ago | parent | prev | next [-]

The ARC AGI 2 bump to 52.9% is huge. Shockingly, GPT 5.2 Pro does not add much more (54.2%) for the increased cost.

ChrisArchitect an hour ago | parent | prev | next [-]

Discussion on blog post: https://openai.com/index/introducing-gpt-5-2/ (https://news.ycombinator.com/item?id=46234874)

HardCodedBias 2 hours ago | parent | prev | next [-]

Huge fan that Gemini-3 prompted OAI to ship this.

Competition works!

GDPval seems particularly strong.

I wonder why they held this back.

1) Maybe this is uneconomical?

2) Did the safety work somehow hold back the company?

looking forward to the internet trying this and posting their results over the next week or two.

COMPETITION!

gigatexal 2 hours ago | parent | prev | next [-]

So how much better is it than Opus or Gemini?

iwontberude 15 minutes ago | parent | prev | next [-]

I have already cancelled. Claude is more than enough for me. I don’t see any point in splitting hairs. They are all going to keep lying more and more sneakily.

HackerThemAll an hour ago | parent | prev | next [-]

No, thank you. OpenAI and ChatGPT don't cut it for me.

wayeq 19 minutes ago | parent [-]

thanks for letting us know.

Xiol 2 hours ago | parent | prev | next [-]

Yawn.

system2 2 hours ago | parent | prev | next [-]

"Investors are putting pressure, change the version number now!!!"

exe34 2 hours ago | parent [-]

I'm quite sad about the S-curve hitting us hard in the transformers. For a short period, we had the excitement of "ooh if GPT-3.5 is so good, GPT-4 is going to be amazing! ooh GPT-4 has sparks of AGI!" But now we're back to version inflation for inconsequential gains.

verdverm 2 hours ago | parent | next [-]

2025 is the year most Big AI released their first real thinking models

Now we can create new samples and evals for more complex tasks to train up the next gen, more planning, decomp, context, agentic oriented

OpenAI has largely fumbled their early lead, exciting stuff is happening elsewhere

ToValueFunfetti 2 hours ago | parent | prev | next [-]

Take this all with a grain of salt as it's hearsay:

From what I understand, nobody has done any real scaling since the GPT-4 era. 4.5 was a bit larger than 4, but not as much as the orders of magnitude difference between 3 and 4, and 5 is smaller than 4.5. Google and Anthropic haven't gone substantially bigger than GPT-4 either. Improvements since 4 are almost entirely from reasoning and RL. In 2026 or 2027, we should see a model that uses the current datacenter buildout and actually scales up.

Leynos an hour ago | parent | next [-]

4.5 is widely believed to be an order of magnitude larger than GPT-4, as reflected in the API inference cost. The problem is the quantity of parameters you can fit in the memory of one GPU. Pretty much every large GPT model from 4 onwards has been mixture of experts, but for a 10 trillion parameter scale model, you'd be talking a lot of experts and a lot of inter-GPU communication.

With FP4 in the Blackwell GPUs, it should become much more practical to run a model of that size at the deployment roll-out of GPT-5.x. We're just going to have to wait for the GBx00 systems to be physically deployed at scale.

snovv_crash an hour ago | parent | prev [-]

Datacenter capacity is being snapped up for inference too though.

JanSt 2 hours ago | parent | prev | next [-]

I don't feel the S-curve at all yet. Still an exponential for me

gessha 2 hours ago | parent | prev [-]

Because it will take thousands of underpaid researchers randomly searching through solution space to get to the next improvement, not 2-3 companies pressed to monetize and enshittify their product before money runs out. That, and winning more hardware lotteries.

orliesaurus 2 hours ago | parent | prev | next [-]

I told all my friends to upgrade or they're not my friends anymore /s

impulser_ an hour ago | parent | prev | next [-]

The thing about OpenAI is their models never fit anywhere for me. Yes, they may be smart, or even the smartest models, but they are always so fucking slow. The ChatGPT web app is practically unusable for me. I ask for a simple task and it does the most extreme shit just to get an answer that's the same as Claude's or Gemini's.

For example, I asked ChatGPT to take a chart and convert it into a table. It went and cut up the image and zoomed in for literally 5 minutes to get a worse answer than Claude, which did it in under a minute.

I see people talk about Codex like it's better than Claude Code, and I go and try it, and it takes a lifetime to do things and returns maybe an on-par result with Opus or Sonnet, but it takes 5 minutes longer.

I just tried out this model and it's the same exact thing. It just takes ages to give you an answer.

I don't get how these models are useful in the real world.

What am I missing, is this just me?

I guess it's truly an enterprise model.

wetoastfood 31 minutes ago | parent [-]

Are you using 5.1 Thinking? I tended to prefer Claude before this model.

I use models based on the task. They still seem specialized and better at specific tasks. If I have a question I tend to go to it. If I need code, I tend to go to Claude (Code).

I go to ChatGPT for questions I have because I value an accurate answer over a quick answer and, in my experience, it tends to give me more accurate answers because of its (over) willingness to go to the web for search results and question its instincts. Claude is much more likely to make an assumption and its search patterns aren't as thorough. The slow answers don't bother me because it's an expectation I have for how I use it and they've made that use case work really well with background processing and notifications.

egeres 2 hours ago | parent | prev [-]

It baffles me to see these last 2 announcements (GPT 5.1 as well) devoid of any metrics, benchmarks or quantitative analyses. Could it be because they are behind Google/Anthropic and they don't want to admit it?

(edit: I'm sorry I didn't read enough on the topic, my apologies)

zamadatix 2 hours ago | parent [-]

This isn't the announcement, it's the developer docs intro page to the model - https://openai.com/index/introducing-gpt-5-2/. Still doesn't answer cross-comparison, but at least has benchmark metrics they want to show off.