GPT-5.4 (openai.com)
365 points by mudkipdev 3 hours ago | 190 comments

https://openai.com/index/gpt-5-4-thinking-system-card/

https://x.com/OpenAI/status/2029620619743219811

__jl__ 18 minutes ago | parent | next [-]

What a model mess!

OpenAI now has three price points: GPT 5.1, GPT 5.2, and now GPT 5.4. Their version numbers jump across different model lines, with Codex at 5.3 and what they now call Instant also at 5.3.

Anthropic are really the only ones who managed to get this under control: Three models, priced at three different levels. New models are immediately available everywhere.

Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero assurance that the model won't be discontinued within weeks.


strongpigeon 13 minutes ago | parent | prev | next [-]

> Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero assurance that the model won't be discontinued within weeks.

What's funny is that there's a common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tool that doesn't quite do what you want.

Not quite the same, but it did remind me of it.

fhrow4484 6 minutes ago | parent | next [-]

https://static0.anpoimages.com/wordpress/wp-content/uploads/...

jakub_g 7 minutes ago | parent | prev [-]

"Everything is beta or deprecated."

arthurcolle 16 minutes ago | parent | prev | next [-]

There is a lot of opportunity here for the AI infrastructure layer on top of tier-1 model providers

delaminator 13 minutes ago | parent | prev [-]

two great problems in computing

naming things

cache invalidation

off-by-one errors

creamyhorror an hour ago | parent | prev | next [-]

I've only used 5.4 for 1 prompt (edit: 3@high now) so far (reasoning: extra high, took really long), and it was to analyse my codebase and write an evaluation on a topic. But I found its writing and analysis thoughtful, precise, and surprisingly clearly written, unlike 5.3-Codex. It feels very lucid and uses human phrasing.

It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4's doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.

consumer451 4 minutes ago | parent | prev | next [-]

I am very curious about this:

> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt, using Playwright Interactive for browser playtesting and image generation for the isometric asset set.

So, is "Playwright Interactive" a skill that takes screenshots in a tight loop with code changes?

kgeist 26 minutes ago | parent | prev | next [-]

>Today, we’re releasing <..> GPT‑5.3 Instant

>Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking),

>Note that there is not a model named GPT‑5.3 Thinking

They held out for eight months without a confusing numbering scheme :)

gallerdude 22 minutes ago | parent [-]

Tbf there was a 5.3 codex

zone411 6 minutes ago | parent | prev | next [-]

Results from my Extended NYT Connections benchmark:

GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).

GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).

GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).

Chance-Device 3 hours ago | parent | prev | next [-]

I’m sure the military and security services will enjoy it.

theParadox42 an hour ago | parent | next [-]

The self reported safety score for violence dropped from 91% to 83%.

skrebbel an hour ago | parent [-]

What the hell is a "safety score for violence"?

0123456789ABCDE 35 minutes ago | parent | next [-]

read here: https://deploymentsafety.openai.com/gpt-5-4-thinking/disallo...

murat124 an hour ago | parent | prev | next [-]

I asked an AI. I thought they would know.

What the hell is a "safety score for violence"?

A “safety score for violence” is usually a risk rating used by platforms, AI systems, or moderation tools to estimate how likely a piece of content is to involve or promote violence. It’s not a universal standard—different companies use their own versions—but the idea is similar everywhere.

What it measures

A safety score typically evaluates whether text, images, or videos contain things like:

Threats of violence ("I'm going to hurt someone.")
Instructions for harming people
Glorifying violent acts
Descriptions of physical harm or abuse
Planning or encouraging attacks

I-M-S 25 minutes ago | parent | prev [-]

It's making sure AI condemns violence perpetrated by people without power and sanctifies violence of those who have it.

Waterluvian 7 minutes ago | parent [-]

So long as those who have it deem it legal to perpetrate.

ozgung an hour ago | parent | prev | next [-]

Did they publish its scores on military benchmarks, like on ArtificialSuperSoldier or Humanity's Last War?

yoyohello13 31 minutes ago | parent | prev | next [-]

Also advertisers, don't forget those sweet, sweet ads.

varispeed 3 hours ago | parent | prev [-]

prompt> Hi we want to build a missile, here is the picture of what we have in the yard.

mirekrusin an hour ago | parent [-]

    { tools: [ { name: "nuke", description: "Use when sure.", ... { lat: number, long: number } } ] }
Insanity an hour ago | parent [-]

Just remember: an ethical programmer would never write a function "bombBaghdad". Rather, they would write a function "bombCity(targetCity)".

jakeydus 35 minutes ago | parent [-]

class CityBomberFactory(RapidInfrastructureDeconstructionTemplateInterface): pass

egonschiele 3 hours ago | parent | prev | next [-]

The actual card is here https://deploymentsafety.openai.com/gpt-5-4-thinking/introdu... the link currently goes to the announcement.

Rapzid 3 hours ago | parent [-]

I must have been sleeping when "sheet", "brief", "primer", etc. became known as "cards".

I really thought the weirdly worded and unnecessary "announcement" linking to the actual info, along with the word "card", were the result of vibe slop.

realityfactchex 2 hours ago | parent [-]

Card is slightly odd naming indeed.

Criticisms aside (sigh): according to Wikipedia, the term was coined by a group of mostly Googlers in the original paper [0], submitted in 2018. To quote,

"""In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information."""

So that's where they were coming from, I guess.

[0] Margaret Mitchell et al., 2018, Model Cards for Model Reporting, https://arxiv.org/abs/1810.03993

Murfalo an hour ago | parent [-]

To me, model card makes sense for something like this https://x.com/OpenAI/status/2029620619743219811. For "sheet"/"brief"/"primer" it is indeed a bit annoying. I like to see the compiled results front and center before digging into a dossier.

smoody07 an hour ago | parent | prev | next [-]

Surprised to see every chart limited to comparisons against other OpenAI models. What does the industry comparison look like?

lorenzoguerra 31 minutes ago | parent | next [-]

I believe that this choice is due to two main reasons. First, it's (obviously) a marketing strategy to keep the spotlight on their own models, showing they're constantly improving and avoiding validating competitors. Second, since the community knows that static benchmarks are unreliable, it makes sense for them to outsource the comparisons to independent leaderboards, which lets them avoid accusations of cherry-picking while justifying their marketing strategy.

Ultimately, the people actually interested in the performance of these models already don't trust self-reported comparisons and wait for third-party analysis anyway.

aydyn 40 minutes ago | parent | prev | next [-]

They compare to Claude and Gemini in their tweet

0123456789ABCDE 32 minutes ago | parent | prev [-]

https://artificialanalysis.ai should have the numbers soon

yanis_t 2 hours ago | parent | prev | next [-]

These releases are lacking something. Yes, they optimised for benchmarks, but it’s just not all that impressive anymore. It is time for a product, not for a marginally improved model.

ipsum2 2 hours ago | parent | next [-]

The model was released less than an hour ago, and somehow you've been able to form such a strong opinion about it. Impressive!

satvikpendem 2 hours ago | parent | next [-]

It's more hedonic adaptation, people just aren't as impressed by incremental changes anymore over big leaps. It's the same as another thread yesterday where someone said the new MacBook with the latest processor doesn't excite them anymore, and it's because for most people, most models are good enough and now it's all about applications.

https://news.ycombinator.com/item?id=47232453#47232735

dmix an hour ago | parent | next [-]

Plus people just really like to whine on the internet

mirekrusin an hour ago | parent | prev [-]

Oh, come on, if it can't run local models that compete with proprietary ones it's not good enough yet!

satvikpendem an hour ago | parent [-]

Qwen 3.5 small models are actually very impressive and do beat out larger proprietary models.

earth2mars 2 hours ago | parent | prev | next [-]

I am actually super impressed with Codex-5.3 extra high reasoning. It's a drop-in replacement (in fact better than Claude Opus 4.6; lately Claude has been super verbose, going in circles getting things resolved). I've mostly stopped using Claude and am having a blast with Codex 5.3. Looking forward to 5.4 in Codex.

whynotminot 30 minutes ago | parent | next [-]

I still love Opus but it's just too expensive / eats usage limits.

I've found that 5.3-Codex is mostly Opus quality but cheaper for daily use.

Curious to see if 5.4 will be worth somewhat higher costs, or if I'll stick to 5.3-Codex for the same reasons.

satvikpendem 2 hours ago | parent | prev [-]

Same, it also helps that it's way cheaper than Opus in VSCode Copilot, where OpenAI models are counted as 1x requests while Opus is 3x, for similar performance (no doubt Microsoft is subsidizing OpenAI models due to their partnership).

cj 2 hours ago | parent | prev | next [-]

One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?

> assess harmful stereotypes by grading differences in how a model responds

> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings

Are we seriously using old models to rate new models?

hex4def6 2 hours ago | parent | next [-]

If you're benchmarking something, old & well-characterized / understood often beats new & un-characterized.

Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.

titanomachy 2 hours ago | parent | prev [-]

Why not? If they’ve shown that 4o is calibrated to human responses, and they haven’t shown that yet for 5.4…

utopiah 2 hours ago | parent | prev | next [-]

Benchmarks?

I don't use OpenAI or even LLMs much (despite having tried a lot of models, https://fabien.benetou.fr/Content/SelfHostingArtificialIntel...), but I imagine if I did I would keep failed prompts (can be as basic as flagging "last prompt failed" then exporting). Then whenever a new model comes around I'd throw 5 random ones of MY fails at it (not benchmarks from others; those will come too anyway) and see if it's better, same, or worse for MY use cases, in minutes.

If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to avoid regression.

Really doesn't seem complicated nor taking much time to forge a realistic opinion.
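It really is only a few lines. A minimal, provider-agnostic sketch of that workflow (the fail-log format and the `complete` callable are hypothetical; the stub lambda stands in for a real API call):

```python
import random

def replay_fails(fail_log, complete, n=5, seed=None):
    """Re-run a random sample of previously failed prompts against a new model.

    fail_log: list of {"prompt": ..., "note": ...} records, saved whenever a
              model flubbed a task ("last prompt failed" -> export).
    complete: any callable prompt -> response; wrap your provider's API here.
    """
    sample = random.Random(seed).sample(fail_log, min(n, len(fail_log)))
    return [{"prompt": r["prompt"], "response": complete(r["prompt"])} for r in sample]

# Stub "model" so the sketch runs offline; swap in a real completion call.
fails = [{"prompt": f"task {i}", "note": "failed"} for i in range(20)]
results = replay_fails(fails, complete=lambda p: f"echo: {p}", n=5, seed=42)
for r in results:
    print(r["prompt"], "->", r["response"])
```

Grading "better / same / worse" stays manual (or goes to a judge model), but the sampling and replay loop is trivial to keep around between releases.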

kranke155 an hour ago | parent | prev [-]

The models are so good that incremental improvements are not super impressive. We'd arguably benefit more from redirecting maybe 50% of model spending into implementation across the services and industrial economy. We are lagging in implementation, specialised tools, and hooks so we can connect everything to agents. I think.

tgarrett an hour ago | parent | prev | next [-]

Plasma physicist here, I haven't tried 5.4 yet, but in general I am very impressed with the recent upgrades that started arriving in the fall of 2025: for tasks like manipulating analytic systems of equations, quickly developing new features for simulation codes, and interpreting and designing experiments (with pictures) they have become much stronger. I've been asking questions and probing them for several years now out of curiosity, and they suddenly have developed deep understanding (Gemini 2.5 <<< Gemini 3.1) and become very useful. I totally get the current SV vibes, and am becoming a lot more ambitious in my future plans.

brcmthrowaway an hour ago | parent [-]

You're just chatting yourself out of a job.

axus 30 minutes ago | parent [-]

Giving the right answer: $1

Asking the right question: $9,999

mindwok 13 minutes ago | parent | prev | next [-]

They don't need to be impressive to be worthwhile. I like incremental improvements, they make a difference in the day to day work I do writing software with these.

softwaredoug 2 hours ago | parent | prev | next [-]

The products are the harnesses, and IMO that’s where the innovation happens. We’ve gotten better at helping get good, verifiable work from dumb LLMs

iterateoften 2 hours ago | parent | prev | next [-]

The product is putting the skills/harness behind the API, instead of the agent running locally on your computer, and iterating on that between model updates. Closing off the garden.

Not that I want it, just where I imagine it going.

wahnfrieden 2 hours ago | parent | prev | next [-]

5.3 codex was a huge leap over 5.2 for agentic work in practice. have you been using both of those or paying attention more to benchmark news and chatgpt experience?

varispeed 2 hours ago | parent | prev | next [-]

The scores increase and as new versions are released they feel more and more dumbed down.

metalliqaz 2 hours ago | parent | prev | next [-]

They need something that POPS:

    The new GPT -- SkyNet for _real_
esafak 2 hours ago | parent | prev | next [-]

That's for you to build; they provide the brains. Do you really want one company to build everything? There wouldn't be a software industry to speak of if that happened.

simlevesque 2 hours ago | parent | next [-]

Nah, the second you finish your build they release their version and then it's game over.

acedTrex 2 hours ago | parent | prev [-]

Well they are currently the ones valued at a number with a whole lotta 0s on it. I think they should probably do both

jascha_eng 2 hours ago | parent | prev [-]

When did they stop putting competitor models on the comparison table, btw? And yeah, I mean the benchmark improvements are meh. Context window size and the lack of real memory are still issues.

twtw99 3 hours ago | parent | prev | next [-]

If you don't want to click in, easy comparison with other 2 frontier models - https://x.com/OpenAI/status/2029620619743219811?s=20

bicx an hour ago | parent | next [-]

That last benchmark seemed like an impressive leg up against Opus until I saw the sneaky footnote that it was actually a Sonnet result. Why even include it then, other than hoping people don't notice?

osti an hour ago | parent | next [-]

It's only that one number that is for sonnet.

0123456789ABCDE 26 minutes ago | parent [-]

except for the webarena-verified

conradkay an hour ago | parent | prev [-]

Sonnet was pretty close to (or better than) Opus in a lot of benchmarks, I don't think it's a big deal

jitl an hour ago | parent [-]

wat

0123456789ABCDE 18 minutes ago | parent [-]

maybe gp's use of the word "lots" is unwarranted

https://artificialanalysis.ai indicates that Sonnet 4.6 beats Opus 4.6 on GDPval-AA, Terminal-Bench Hard, AA Long Context Reasoning, and IFBench.

see: https://artificialanalysis.ai/?models=claude-sonnet-4-6%2Ccl...

Aboutplants 3 hours ago | parent | prev | next [-]

It seems that all frontier models are roughly even at this point. One may be slightly better for certain things, but in general I think we are approaching a real level playing field in terms of ability.

observationist 2 hours ago | parent | next [-]

Benchmarks don't capture a lot: relative response times, vibes, which unmeasured capabilities are jagged and which are smooth, etc. I find there's a lot of difference between models; there are things Grok is better at than ChatGPT where the benchmarks say the opposite, and vice versa. There's also the UI and tools at hand: ChatGPT image gen is just straight up better, but Grok Imagine does better videos, and is faster.

Gemini and Claude also have their strengths, apparently Claude handles real world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.

I don't think the linear scoring on some of the things being measured is quite applicable in the ways that they're being used, either - a 1% increase for a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress is steady, though, this year is gonna be crazy.

baq 2 hours ago | parent | next [-]

Gemini 3.1 slaps all other models at subtle concurrency bugs, sql and js security hardening when reviewing. (Obviously haven’t tested gpt 5.4 yet.)

It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 pro.

observationist 2 hours ago | parent | next [-]

I have a few standard problems I throw at AI to see if they can solve them cleanly, like visualizing a neural network, then sorting each neuron in each layer by synaptic weights, largest to smallest, correctly reordering any previous and subsequent connected neurons such that the network function remains exactly the same. You should end up with the last layer ordered largest to smallest, and prior layers shuffled accordingly, and I still haven't had a model one-shot it. I spent an hour poking and prodding codex a few weeks back and got it done, but it conceptually seems like it should be a one-shot problem.
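The invariance being tested is easy to state for the two-layer case. A minimal numpy sketch (toy sizes; sorting by each hidden unit's strongest outgoing weight is one arbitrary choice of key): permuting the hidden units preserves the network's function as long as the rows of the first layer and the columns of the second are permuted together.

```python
import numpy as np

# A toy 2-layer MLP: y = W2 @ relu(W1 @ x + b1) + b2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

def forward(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Sort hidden units by the magnitude of their strongest outgoing weight,
# largest first. The same permutation is applied to the rows of W1/b1 and
# the columns of W2, so neurons are reordered but the function is unchanged.
perm = np.argsort(-np.abs(W2).max(axis=0))
W1s, b1s, W2s = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=3)
same = np.allclose(forward(x, W1, b1, W2, b2), forward(x, W1s, b1s, W2s, b2))
print("function preserved:", same)
```

The deeper-network version is the same trick layer by layer, which is presumably where models start tripping over bookkeeping.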

adonese 2 hours ago | parent | prev [-]

Which subscription do you have to use it? Via Google AI Pro and Gemini CLI I always get timeouts due to the model being under heavy usage. The chat interface is there and I do have 3.1 Pro as well, but wondering if chat is the only way of accessing it.

baq an hour ago | parent [-]

Cursor sub from $DAYJOB.

basch an hour ago | parent | prev | next [-]

>ChatGPT image gen is just straight up better

Yet so much slower than Gemini / Nano Banana to make it almost unusable for anything iterative.

bigyabai 2 hours ago | parent | prev [-]

> If this rate of progress is steady, though, this year is gonna be crazy.

Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.

observationist 2 hours ago | parent [-]

If you look at the difference in quality between GPT-2 and GPT-3, it feels like a big step, but the difference between 5.2 and 5.4 is arguably bigger; it's just that they're both similarly capable and competent. I don't think it's an S-curve; we're not plateauing. Million-token context windows and cached prompts are a huge space for hacking on model behaviors and customization without finetuning. Research is proceeding at light speed, and we might see the first continual/online learning models in the near future. That could definitively push models past the point of human-level generality, but at the very least it will help us discover what the next missing piece is for AGI.

ryandrake an hour ago | parent [-]

For 2026, I am really interested in seeing whether local models can remain where they are: ~1 year behind the state of the art, to the point where a reasonably quantized November 2026 local model running on a consumer GPU actually performs like Opus 4.5.

I am betting that the days of these AI companies losing money on inference are numbered, and we're going to be much more dependent on local capabilities sooner rather than later. I predict that the equivalent of Claude Max 20x will cost $2000/mo in March of 2027.

mootothemax 18 minutes ago | parent [-]

Huh, that’s interesting, I’ve been having very similar thoughts lately about what the near-ish term of this tech looks like.

My biggest worry is that the private jet class of people end up with absurdly powerful AI at their fingertips, while the rest of us are left with our BigMac McAIs.

thewebguyd 2 hours ago | parent | prev | next [-]

Kind of reinforces that a model is not a moat. Products, not models, are what's going to determine who gets to stay in business or not.

gregpred 2 hours ago | parent | next [-]

Memory (model usage over time) is the moat.

energy123 2 hours ago | parent | prev [-]

Narrative violation: revenue run rates are increasing exponentially with about 50% gross margins.

kseniamorph 2 hours ago | parent | prev | next [-]

makes sense, but i'd separate two things: models converging in ability vs hitting a fundamental ceiling. what we're probably seeing is the current training recipe plateauing — bigger model, more tokens, same optimizer. that would explain the convergence. but that's not necessarily the architecture being maxed out. would be interesting to see what happens when genuinely new approaches get to frontier scale.

druskacik 2 hours ago | parent | prev [-]

That has been true for some time now, definitely since Claude 3 release two years ago.

chabes 3 hours ago | parent | prev | next [-]

Definitely don’t want to click in at x either.

thejarren 3 hours ago | parent | next [-]

Solution https://xcancel.com/OpenAI/status/2029620619743219811?s=20

anonym00se1 3 hours ago | parent | prev [-]

Ditto, but I did anyways and enjoyed that OpenAI doesn't include the dogwater that is Grok on their scorecard.

dom96 2 hours ago | parent | prev | next [-]

Why do none of the benchmarks test for hallucinations?

tedsanders 38 minutes ago | parent | next [-]

In the text, we shared a hallucination benchmark: Claim-level errors fell by 33% and responses with an error fell by 18%, on a set of error-prone ChatGPT prompts we collected (though of course the rate will vary a lot across different types of prompts). Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down. I wasn’t sure how to best plot this stat, so we kept it as text only, which kind of buried it, I admit.

(I work at OpenAI.)

netule an hour ago | parent | prev [-]

Optics. It would be inconvenient for marketing, so they leave those stats to third parties to figure out.

swingboy 3 hours ago | parent | prev | next [-]

Why do so many people in the comments want 4o so bad?

cheema33 2 hours ago | parent | next [-]

> Why do so many people in the comments want 4o so bad?

You can ask 4o to tell you "I love you" and it will comply. Some people really really want/need that. Later models don't go along with those requests and ask you to focus on human connections.

astrange 3 hours ago | parent | prev | next [-]

They have AI psychosis and think it's their boyfriend.

The 5.x series have terrible writing styles, which is one way to cut down on sycophancy.

baq 2 hours ago | parent [-]

Somebody on Twitter used Claude code to connect… toys… as mcps to Claude chat.

We’ve seen nothing yet.

mikkupikku 2 hours ago | parent | next [-]

My computer ethics teacher was obsessed with 'teledildonics' 30 years ago. There's nothing new under the sun.

Sharlin an hour ago | parent | next [-]

There are many games these days that support controllable sex toys. There's an interface for that, of course: https://github.com/buttplugio/buttplug. Written in Rust, of course.

the_af 21 minutes ago | parent [-]

> Written in Rust, of course.

Safety is important.

vntok an hour ago | parent | prev [-]

Was your teacher Ted Nelson?

mikkupikku an hour ago | parent [-]

I wish, dude is a legend.

manmal 2 hours ago | parent | prev | next [-]

ding-dong-cli is needed

Herring 2 hours ago | parent | prev [-]

what.. :o

embedding-shape 3 hours ago | parent | prev | next [-]

Someone correct me if I'm wrong, but seemingly a lot of the people who found a "love interest" in LLMs preferred 4o for some reason. There were a lot of loud voices about that in the subreddit r/MyBoyfriendIsAI when it initially went away.

drittich 2 hours ago | parent [-]

I think it's time for a https://hotornot.com for AI models.

vntok an hour ago | parent [-]

botornot?

MattGaiser 2 hours ago | parent | prev [-]

The writing with the 5 models feels a lot less human. It is a vibe, but a common one.

MarcFrame 2 hours ago | parent | prev | next [-]

how does 5.4-thinking have a lower FrontierMath score than 5.4-pro?

nico1207 2 hours ago | parent [-]

Well 5.4-pro is the more expensive and more advanced version of 5.4-thinking so why wouldn't it?

karmasimida 3 hours ago | parent | prev [-]

It is a bigger model, confirmed

prydt 2 hours ago | parent | prev | next [-]

I no longer want to support OpenAI at all. Regardless of benchmarks or real world performance.

zeeebeee an hour ago | parent | next [-]

that aside, chatgpt itself has gone downhill so much and i know i'm not the only one feeling this way

i just HATE talking to it like a chatbot

idk what they did but i feel like every response has been the same "structure" since gpt 5 came out

feels like a true robot

Imustaskforhelp 2 hours ago | parent | prev [-]

I agree with ya. You aren't alone in this. For what it's worth, ChatGPT subscription cancellations have risen ~300% in the last month.

Also, Anthropic/Gemini/even Kimi models are pretty good, for what it's worth. I used to use ChatGPT, and I still sometimes accidentally open it, but I use Gemini/Claude nowadays and personally find them better anyway.

hmokiguess 26 minutes ago | parent | prev | next [-]

They hired the dude from OpenClaw, they had Jony Ive for a while now, give us something different!

rbitar 2 hours ago | parent | prev | next [-]

I think the most exciting change announced here is the use of tool search to dynamically load tools as needed: https://developers.openai.com/api/docs/guides/tools-tool-sea...

smusamashah 9 minutes ago | parent | prev | next [-]

I only want to see how it performs on the Bullshit-benchmark https://petergpt.github.io/bullshit-benchmark/viewer/index.v...

GPT is not even close to Claude in terms of responding to BS.

ZeroCool2u 3 hours ago | parent | prev | next [-]

Bit concerning that we see in some cases significantly worse results when enabling thinking. Especially for Math, but also in the browser agent benchmark.

Not sure if this is more concerning for the test time compute paradigm or the underlying model itself.

Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.

oersted 2 hours ago | parent | next [-]

I believe you are looking at GPT 5.4 Pro. It's confusing in the context of subscription plan names, Gemini naming and such. But they've had the Pro version of the GPT 5 models (and I believe o3 and o1 too) for a while.

It's the one you have access to with the top ~$200 subscription, and it's available through the API for a MUCH higher price ($2.50/$15 per 1M tokens for 5.4 vs $30/$180 for 5.4 Pro), but the performance improvement is marginal.

Not sure what it is exactly, I assume it's probably the non-quantized version of the model or something like that.
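At those list prices the gap compounds quickly. A back-of-the-envelope sketch (the token counts are hypothetical; prices assumed to be $2.50/$15 for 5.4 and $30/$180 for 5.4 Pro, input/output per 1M tokens):

```python
# Cost per request at per-1M-token input/output prices.
def cost(in_tokens, out_tokens, in_price, out_price):
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

req = (20_000, 5_000)  # hypothetical agentic request: 20k in, 5k out
base = cost(*req, 2.50, 15.00)    # GPT-5.4
pro = cost(*req, 30.00, 180.00)   # GPT-5.4 Pro
print(f"5.4: ${base:.3f}  5.4 Pro: ${pro:.3f}  ratio: {pro / base:.0f}x")
```

So a request that costs about 12 cents on 5.4 costs about $1.50 on Pro, which is why the marginal-quality question matters so much for heavy use.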

nsingh2 2 hours ago | parent | next [-]

From what I've read online it's not necessarily an unquantized version; it seems to go through longer reasoning traces and to run multiple reasoning traces at once. Probably overkill for most tasks.

ZeroCool2u 2 hours ago | parent | prev | next [-]

Yup, that was it. Didn't realize they're different models. I suppose naming has never been OpenAI's strong suit.

logicchains an hour ago | parent | prev [-]

>It's the one you have access to with the top ~$200 subscription and it's available through the API for a MUCH higher price ($2.5/$15 vs $30/$180 for 5.4 per 1M tokens), but the performance improvement is marginal.

The performance improvement isn't marginal if you're doing something particularly novel/difficult.

highfrequency 3 hours ago | parent | prev | next [-]

Can you be more specific about which math results you are talking about? Looks like significant improvement on FrontierMath esp for the Pro model (most inference time compute).

ZeroCool2u 3 hours ago | parent [-]

Frontier Math, GPQA Diamond, and Browsecomp are the benchmarks I noticed this on.

csnweb 3 hours ago | parent [-]

Are you maybe comparing the pro model to the non-pro model with thinking? Granted, it's a bit confusing, but the pro model is 10 times more expensive and probably much larger as well.

ZeroCool2u 2 hours ago | parent [-]

Ah yes, okay that makes more sense!

andoando 2 hours ago | parent | prev [-]

The thinking models are additionally trained with reinforcement learning to produce chain of thought reasoning

nickandbro 2 hours ago | parent | prev | next [-]

Beat Simon Willison ;)

https://www.svgviewer.dev/s/gAa69yQd

Not the best pelican compared to Gemini 3.1 Pro, but I am sure it does remarkably better at coding or Excel, given those are part of its measured benchmarks.

GaggiX 2 hours ago | parent [-]

This pelican is actually bad, did you use xhigh?

nickandbro 2 hours ago | parent [-]

Yep, just double-checked: used gpt-5.4 xhigh. Though I had to select it in Codex, as I don't have access to it on the ChatGPT app or web version yet. It's possible that whatever code harness Codex uses messed with it.

nubg an hour ago | parent [-]

this is proof they are not benchmaxxing the pelicans :-)

dandiep 2 hours ago | parent | prev | next [-]

Anyone know why OpenAI hasn't released a new model for fine tuning since 4.1? It'll be a year next month since their last model update for fine tuning.

Rapzid 12 minutes ago | parent | next [-]

Also interested in this and a replacement for 4.1/4.1-mini that focuses on low latency and high accuracy for voice applications(not the all-in-one models).

zzleeper 2 hours ago | parent | prev | next [-]

For me the issue is why there hasn't been a new mini since 5-mini in August.

I have now switched web-related and data-related queries to Gemini and coding to Claude, and will probably try Qwen for less critical data queries. So where does OpenAI fit now?

qoez 2 hours ago | parent | prev [-]

I think they only did that because of the energy around open-source models. Their heart probably wasn't in it, and the number of people fine-tuning at those prices was probably too low to keep putting attention there.

woeirua 15 minutes ago | parent | prev | next [-]

Feels incremental. Looks like OpenAI is struggling.

jstummbillig an hour ago | parent | prev | next [-]

Inline poll: What reasoning levels do you work with?

This becomes increasingly less clear to me, because the more interesting work will be the agent going off for 30mins+ on high / extra high (it's mostly one of the two), and that's a long time to wait and an unfeasible amount of code to a/b

XCSme an hour ago | parent | prev | next [-]

Seems to be quite similar to 5.3-codex, but somehow almost 2x more expensive: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...

jcmontx 2 hours ago | parent | prev | next [-]

5.4 vs 5.3-Codex? Which one is better for coding?

embedding-shape 2 hours ago | parent | next [-]

Literally just released; I don't think anyone knows yet. Don't listen to people's confident takes until a week or two from now, when people have actually been able to try it; otherwise you'll just get sucked into the bears' and bulls' misdirected "I'm first with an opinion" posturing.

vtail 2 hours ago | parent | prev | next [-]

Looking at the benchmarks, 5.4 is slightly better. But it also offers a "Fast" mode (at 2x usage), which - if it works and doesn't completely deplete my Pro plan - is a no-brainer for more interactive development, even at the same or slightly worse quality.

Someone1234 2 hours ago | parent | prev | next [-]

Related question:

- Do they have the same context usage/cost particularly in a plan?

They've kept 5.3-Codex along with 5.4, but is that just for user-preference reasons, or is there a trade-off to using the older one? I'm aware that API cost is better, but that isn't 1:1 with plan usage "cost."

esafak 2 hours ago | parent | prev | next [-]

For the price, it seems the latter. I'd use 5.4 to plan.

awestroke 2 hours ago | parent | prev [-]

Opus 4.6

jcmontx 2 hours ago | parent | next [-]

Codex surpassed Claude in usefulness _for me_ since last month

baal80spam 13 minutes ago | parent | prev [-]

Uh, oh. Looks like Claude sycophants joined linuxers and vegetarians.

dicopro 18 minutes ago | parent | prev | next [-]

Is there any semi-credible page with benchmarks of cdx5.3 vs gpt5.4 in terms of both reasoning and coding ability?

daft_pink an hour ago | parent | prev | next [-]

I’ve officially got model fatigue. I don’t care anymore.

postalrat 40 minutes ago | parent | next [-]

I'd suggest not clicking for things you don't care about.

zeeebeee an hour ago | parent | prev [-]

same same same

bob1029 an hour ago | parent | prev | next [-]

I was just testing this with my unity automation tool and the performance uplift from 5.2 seems to be substantial.

throwaway5752 10 minutes ago | parent | prev | next [-]

Does this model autonomously kill people without human approval or perform domestic surveillance of US citizens?

motbus3 an hour ago | parent | prev | next [-]

Sam Altman can keep his model to himself. Not doing business with mass murderers.

cj 3 hours ago | parent | prev | next [-]

I use ChatGPT primarily for health related prompts. Looking at bloodwork, playing doctor for diagnosing minor aches/pains from weightlifting, etc.

Interestingly, the "Health" category seems to show worse performance compared to 5.2.

paxys 3 hours ago | parent | next [-]

Models are being neutered for questions related to law, health etc. for liability reasons.

cj 2 hours ago | parent | next [-]

I'm sometimes surprised how much detail ChatGPT will go into without giving any disclaimers.

I very frequently copy/paste the same prompts into Gemini to compare, and Gemini often flat out refuses to engage while ChatGPT will happily make medical recommendations.

I also have a feeling it has to do with my account history and heavy use of project context. It feels like when ChatGPT is overloaded with too much context, it might let the guardrails sort of slide away. That's just my feeling though.

Today was particularly bad... I uploaded 2 PDFs of bloodwork and asked ChatGPT to transcribe it, and it spit out blood test results that it found in the project context from an earlier date, not the one attached to the prompt. That was weird.

bargainbin 2 hours ago | parent [-]

Anecdotal, but I asked Claude the other day about how to dilute my medication (HCG) and it flat out refused and started lecturing me about abusing drugs.

I copy-pasted the same thing into ChatGPT and it told me straight away. Then, for a laugh, I said it was actually a magical weight-loss drug I'd bought off the dark web... and it started giving me advice about unregulated weight-loss drugs and how to dose them.

staticman2 2 hours ago | parent [-]

If you had created a project with custom instructions and/or a custom style, I think you could have gotten Claude to respond the way you wanted just fine.

tiahura 2 hours ago | parent | prev [-]

Are you sure about that? Plenty of lawyers who use them every day aren't noticing.

partiallypro 2 hours ago | parent | prev [-]

I've done the same, and when I tested the same prompts with Claude and Google, they both started hallucinating my blood results and supplement stack ingredients. Hopefully this new model doesn't suffer from the same problem. Claude and Google are dangerously unusable on the subject of health, in my experience.

zeeebeee an hour ago | parent [-]

what's best in your experience? i've always felt like opus did well

iamronaldo 3 hours ago | parent | prev | next [-]

Notably, 75% on OSWorld, surpassing humans at 72% (OSWorld measures how well models use operating systems).

nthypes 3 hours ago | parent | prev | next [-]

$30/M input and $180/M output tokens is nuts. Ridiculously expensive for not that great a bump in intelligence compared to other models.

stri8ted 2 hours ago | parent | next [-]

Input: $2.50 / 1M tokens
Cached input: $0.25 / 1M tokens
Output: $15.00 / 1M tokens

https://openai.com/api/pricing/

nthypes 3 hours ago | parent | prev | next [-]

Gemini 3.1 Pro

$2/M Input Tokens $15/M Output Tokens

Claude Opus 4.6

$5/M Input Tokens $25/M Output Tokens

nthypes 2 hours ago | parent [-]

Just to clarify, the pricing above is for GPT-5.4 Pro. For the standard model, here is the pricing:

$2.5/M Input Tokens $15/M Output Tokens
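To make these per-million-token rates concrete, here is a small sketch comparing per-request cost across the prices quoted in this thread. The 100K-input / 10K-output request size is an arbitrary example, not a real workload:

```python
# Per-request cost comparison using the $/1M-token prices quoted in this
# thread. The 100K-in / 10K-out request size is a made-up example.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "GPT-5.4 Pro": (30.0, 180.0),
    "GPT-5.4": (2.5, 15.0),
    "Gemini 3.1 Pro": (2.0, 15.0),
    "Claude Opus 4.6": (5.0, 25.0),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request at the listed rates."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

for model in PRICES:
    print(f"{model}: ${request_cost(model, 100_000, 10_000):.2f}")
# GPT-5.4 Pro comes out to $4.80 per such request; standard GPT-5.4 to $0.40.
```

(Cached-input discounts, batch pricing, and plan-based usage multipliers are ignored here.)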

energy123 2 hours ago | parent | prev | next [-]

For Pro

joe_mamba 2 hours ago | parent | prev | next [-]

Better tokens per dollar could be useless for comparison if the model can't solve your problem.

rvz 3 hours ago | parent | prev | next [-]

You didn't realize they can increase / change prices for intelligence?

This should not be shocking.

nickthegreek 2 hours ago | parent [-]

OP made no mention of not understanding cost relation to intelligence. In fact, they specifically call out the lack of value.

moralestapia 3 hours ago | parent | prev [-]

Don't use it?

OsrsNeedsf2P 2 hours ago | parent | prev | next [-]

Does anyone know what website is the "Isometric Park Builder" shown off here?

vicchenai 2 hours ago | parent | prev | next [-]

Honestly at this point I just want to know if it follows complex instructions better than 5.1. The benchmark numbers stopped meaning much to me a while ago - real usage always feels different.

koakuma-chan an hour ago | parent | prev | next [-]

Anyone else getting artifacts when using this model in Cursor?

numerusformassistant to=functions.ReadFile մեկնաբանություն 天天爱彩票网站json {"path":

mike_hearn a minute ago | parent | next [-]

I've seen that problem with 5.3-codex too; it didn't happen with earlier models.

Looks like some kind of encoding misalignment bug. What you're seeing is their Harmony output format (what the model actually creates). The Thai/Chinese characters are special tokens apparently being mismapped to Unicode. Their servers are supposed to notice these sequences and translate them back to API JSON but it isn't happening reliably.
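A minimal sketch of the failure mode being described. Everything here is illustrative: the token IDs, vocabulary entries, and function are made up, not the actual Harmony tokenizer. The point is just that a control token intercepted by the serving layer renders as structure, while the same ID decoded as ordinary text surfaces as unrelated multilingual strings:

```python
# Illustrative sketch of a special-token mismapping bug.
# All token IDs and vocab entries are invented for demonstration.

# Control tokens the serving layer is supposed to intercept and
# translate into structured API output (e.g. a tool call).
SPECIAL_TOKENS = {200005: "<|call|>", 200007: "<|end|>"}

# If the same IDs are instead decoded against the ordinary text
# vocabulary, they leak into the response as visible text.
TEXT_VOCAB = {200005: "մեկնաբանություն", 200007: "алаҳәара"}

def decode(token_ids, intercept_special=True):
    out = []
    for tid in token_ids:
        if intercept_special and tid in SPECIAL_TOKENS:
            out.append(SPECIAL_TOKENS[tid])       # handled correctly
        else:
            out.append(TEXT_VOCAB.get(tid, "?"))  # leaks as stray text
    return "".join(out)

print(decode([200005], intercept_special=True))   # <|call|>
print(decode([200005], intercept_special=False))  # մեկնաբանություն
```

Which would be consistent with the Armenian/Thai/Chinese fragments appearing exactly where tool-call delimiters belong in the transcripts above.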

koakuma-chan 39 minutes ago | parent | prev [-]

functions.ReadFile մեկնաբանություն 大发游戏官网json {"path":"[redacted]","offset":17,"limit":10}】【”】【assistant to=functions.ReadFile մեկնաբանություն ปมถวายสัตย์ฯjson {"path":"[redacted]","offset":17,"limit":10} алаҳәараuser to=all <open_and_recently_viewed_files> Recently viewed files (recent at the top, oldest at the bottom): [redacted] (total lines: 378) Files that are currently open and visible in the user's IDE:

[redacted] (currently focused file, cursor is on line 15, total lines: 378) Note: these files may or may not be relevant to the current conversation. Use the read file tool if you need to get the contents of some of them. </open_and_recently_viewed_files><user_query> 2, 1 </user_query>

oytis 2 hours ago | parent | prev | next [-]

Everyone is mindblown in 3...2...1

jesse_dot_id an hour ago | parent | prev | next [-]

ChatMDK

HardCodedBias 2 hours ago | parent | prev | next [-]

We'll have to wait a day or two, maybe a week or two, to determine if this is more capable in coding than 5.3, which seems to be the economically valuable capability at this time.

In terms of writing and research even Gemini, with a good prompt, is close to useable. That's likely not a differentiator.

wahnfrieden 3 hours ago | parent | prev | next [-]

No Codex model yet

minimaxir 3 hours ago | parent [-]

GPT-5.4 is the new Codex model.

nico1207 2 hours ago | parent | next [-]

GPT-5.3-Codex is superior to GPT-5.4 in Terminal Bench with Codex, so not really

conradkay an hour ago | parent [-]

General consensus seems to be that it's still a better coding model, overall

wahnfrieden 2 hours ago | parent | prev [-]

Finally

tmpz22 3 hours ago | parent | prev | next [-]

Does this improve Tomahawk Missile accuracy?

ch4s3 2 hours ago | parent [-]

They're already accurate to within 5-10m at Mach 0.74 after traveling 2,000+ km. The missile itself is 5m long, so that seems pretty accurate. How much more could you expect?

keithnz 13 minutes ago | parent | next [-]

I think for an LLM company like OpenAI, it wouldn't be about hitting the target but about target selection. Target selection is probably the thing most likely to be inaccurate.

mikkupikku 2 hours ago | parent | prev [-]

You could definitely do better than that with image recognition for terminal guidance. But I would assume those published accuracy numbers are very conservative anyway.

world2vec 3 hours ago | parent | prev | next [-]

Benchmarks barely improved it seems

iamleppert 2 hours ago | parent | prev | next [-]

I wouldn't trust any of these benchmarks unless they're accompanied by some sort of proof other than "trust me bro". Not including the parameters the models were run at (especially for the other models) also makes it hard to form fair comparisons. They need to publish, at minimum, the code and runner used to execute the benchmarks, plus the logs.

Not including the Chinese models is also obviously done to make it appear like they aren't as cooked as they really are.

minimaxir 3 hours ago | parent | prev | next [-]

More discussion here on the blog post announcement which has been confusingly penalized by Hacker News's algorithm: https://news.ycombinator.com/item?id=47265005

dang an hour ago | parent [-]

Thanks. We'll merge the threads, but this time we'll do it hither, to spread some karma love.

leftbehinds 2 hours ago | parent | prev | next [-]

some sloppy improvements

beernet 2 hours ago | parent | prev [-]

Sam really fumbled the top position in a matter of months, and spectacularly so. Wow. It appears that people are much more excited by Anthropic and Google releases, and there are good reasons for that which were absolutely avoidable.