_fat_santa 3 days ago

So at this point OpenAI has 6 reasoning models, 4 flagship chat models, and 7 cost-optimized models. That's 17 models in total, and that's not even counting their older and more specialized ones. Compare this with Anthropic, which has 7 models in total and 2 main ones that they promote.

This is just getting to be a bit much; it seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things, and released it as an entirely new model rather than updating the existing ones. In fact, based on some of the other comments here, it sounds like these are just updates to their existing models, but they release them as new models to create more media buzz.

shmatt 3 days ago | parent | next [-]

I'm old enough to remember the mystery and hype before o*/o1/strawberry, which was supposed to be essentially AGI. We had serious news outlets write about senior people at OpenAI quitting because o1 was Skynet.

Now we're up to o4, and AGI is still nowhere in sight (depending on your definition, I know). And OpenAI is up to about 5,000 employees. I'd think even before AGI, a new model would be able to cover for at least 4,500 of those employees being fired; is that not the case?

pants2 3 days ago | parent | next [-]

Remember that Docusign has 7,000 employees. I think OpenAI is pretty lean for what they're accomplishing.

steamrolled 3 days ago | parent | next [-]

I don't think these comparisons are useful. Every time you look at companies like LinkedIn or Docusign, yeah - they have a lot of staff, but a significant proportion of that headcount is in functions like sales, customer support, and regulatory compliance across a bazillion different markets, along with all the internal tooling and processes you need to support that.

OpenAI is at a much earlier stage in their adventures and probably doesn't have that much baggage. Given their age and revenue streams, their headcount is quite substantial.

shmatt 3 days ago | parent | prev | next [-]

If we're making comparisons, it's more like someone selling a $10,000 course on how to be a millionaire.

Not directly from OpenAI - but people in the industry are advertising how these advanced models can replace employees, yet they keep going on hiring tears (including OpenAI). Let's see the first company stand behind its models and replace 50% of its existing headcount with agents. That to me would be a sign these things are going to replace people's jobs. Until I see that, I'll assume that if OpenAI can't figure out how to replace humans with models, then no one will.

I mean, could you imagine if today's announcement was that the chatgpt.com webdev team has been laid off, and all new features and fixes will be completed by Codex CLI + o4-mini? That would mean they believe in the product they're advertising. Until they do something like that, they'll keep trusting those human engineers and keep trying to sell other people on the dream.

mrandish 3 days ago | parent | next [-]

I'm also a skeptic on AI replacing many human jobs anytime soon. It's mostly going to assist, accelerate, or amplify humans in completing work better or faster. That's the typical historical technology cycle where better tech makes work more efficient. Eventually that does allow the same work to be done with fewer people, like a better IP telephony system enabling a 90-person call center to handle the same call volume that previously required 100 people. But designing, manufacturing, selling, installing, and supporting the new IP phone system also creates at least 10 new jobs.

So far the only significant human replacement I'm seeing AI enable is in low-end, entry-level work. For example, fulfilling "gig work" for Fiverr, like spending an hour or two whipping up a relatively low-quality graphic logo or other basic design work for $20. This is largely done at home by entry-level graphic design students in second-world locales like the Philippines or rural India. A good graphical AI can take (and is taking) some of this work from the humans doing it. Although it's not even a big impact yet, primarily because for non-technical customers, the Fiverr workflow can still be easier or more comfortable than figuring out which AI tool to use and how to get what they really want from it.

The point is that this Fiverr piecemeal gig work is the lowest-paying, least desirable work in graphic design. No one doing it wants to still be doing it a year or two from now. It's the McDonald's counter of their industry. They all aspire to higher-skill, higher-paying design jobs. They're only doing Fiverr gig work because they don't yet have a degree, enough resume credits, or decent portfolio examples. Much like steam-powered bulldozers and pile drivers displaced pickaxe-swinging humans digging railroad tunnels in the 1800s, the new technology is displacing some of the least desirable, lowest-paying jobs first. I don't yet see any clear reason this well-established 200+ year trend will be fundamentally different this time. And history is littered with those who predicted "but this time it'll be different."

I've read the scenarios which predict that AI will eventually be able to fundamentally and repeatedly self-improve autonomously, at scale and without limit. I do think AI will continue to improve but, like many others, I find the "self-improve" step to be a huge and unevidenced leap of faith. So, I don't think it's likely, for reasons I won't enumerate here because domain experts far smarter than I am have already written extensively about them.

esafak 3 days ago | parent | prev [-]

Not really. It could also mean their company's effective headcount is much greater than its nominal one.

scarface_74 3 days ago | parent | prev | next [-]

Yes and Amazon has 1.52 million employees. How many developers could they possibly need?

Or maybe it’s just nonsensical to compare the number of employees across companies - especially when they don’t do nearly the same thing.

On a related note, wait until you find out how many more employees Apple has than Google, since Apple has hundreds of retail stores.

jsnell 3 days ago | parent | next [-]

Apple has fewer employees than Google (164k < 183k).

solardev 3 days ago | parent [-]

Siri must be really good.

kupopuffs 3 days ago | parent | prev [-]

What kind of employees does Docusign employ? Surely digital documents don't require physical onsite distribution centers and labor.

scarface_74 3 days ago | parent [-]

Just look at their careers page

kupopuffs 2 days ago | parent [-]

It's a lot of sales/account managers, and some engineers.

Wow, sales goes hard in this product.

throwanem 3 days ago | parent | prev [-]

[flagged]

andrewinardeer 3 days ago | parent | next [-]

The US is not a signatory to the International Criminal Court so you won't see Musk on trial there.

throwanem 3 days ago | parent [-]

I hope I don't have to link this adjacent reply of mine too many more times: https://news.ycombinator.com/item?id=43709056 Specifically "The venue is a matter of convenience, nothing more," and if you prefer another, that would work about as well. Perhaps Merano; I hear it's a lovely little town.

ein0p 3 days ago | parent | prev [-]

The closest Elon ever came to anything Hague-worthy is allowing Starlink to be used in Ukrainian attacks on Russian civilian infrastructure. I don't think the Hague would be interested in anything like that. And if his life is worthless, then what would you say about your own? Nonetheless, I commend you on your complete lack of hinges. /s

throwanem 3 days ago | parent [-]

Oh, I'm thinking more in the sense of the special one-off kinds of trials, the sort Gustave Gilbert so ably observed. The venue is a matter of convenience, nothing more. To the rest I would say the worth of my life is no more mine to judge than anyone else is competent to do the same for themselves, or indeed other than foolish to pursue the attempt.

fsndz 3 days ago | parent | prev | next [-]

True.

Deep learning models will continue to improve as we feed them more data and use more compute, but they will still fail at even very simple tasks as long as the input data are outside their training distribution. The numerous examples of ChatGPT (even the latest, most powerful versions) failing at basic questions or tasks illustrate this well. Learning from data is not enough; there is a need for the kind of system-two thinking we humans develop as we grow. It is difficult to see how deep learning and backpropagation alone will help us model that. https://medium.com/thoughts-on-machine-learning/why-sam-altm...

stavros 3 days ago | parent | prev | next [-]

> Im old enough to remember the mystery and hype before o*/o1/strawberry

So at least two years old?

throwanem 3 days ago | parent | next [-]

Honestly, sometimes I wonder if most people these days kinda aren't at least that age, you know? Or less inhibited about acting it than I believe I recall people being last decade. Even compared to just a few years back, people seem more often to struggle to carry a thought, and resort much more quickly to emotional belligerence.

Oh, not that I haven't been as knocked about in the interim, of course. I'm not really claiming I'm better, and these are frightening times; I hope I'm neither projecting nor judging too harshly. But even trying to discount for the possibility, there still seems something new left to explain.

Hackbraten 3 days ago | parent [-]

> Even compared to just a few years back, people seem more often to struggle to carry a thought, and resort much more quickly to emotional belligerence.

We're living in extremely uncertain times, with multiple global crises taking place at the same time, each of which could develop into a turning point for humankind.

At the same time, predatory algorithms do whatever it takes to make people addicted to media, while mental health care remains inaccessible for many.

I feel like throwing a tantrum almost every single day.

throwanem 3 days ago | parent [-]

I feel perhaps I've been unkind to many people in my thoughts, but I'm conflicted. I don't understand myself to be particularly fearless, but what times call more for courage than times like these? How do people afraid even to try to practice courage expect to find it, when there isn't time for practice any more?

Hackbraten 3 days ago | parent [-]

You have only so many spoons available per crisis. Even picking your battle can become a problem.

I've been out in the streets, protesting and raising awareness of climate change. I no longer do. It's a pointless waste of time. Today, the climate change deniers are in charge.

throwanem 2 days ago | parent [-]

I don't assume I'm going to be given the luxury of picking my battles, and - though I've been aware of "spoon theory" since I watched it getting invented at Shakesville back in the day - I've never held to it all that strongly, even as I acknowledge I've also never been quite the same since a nasty bout of wild-type covid in early 2020. Now as before, I do what needs doing as best I can, then count the cost. Some day that will surely prove too high, and my forward planning efforts will be put to the test. Till then I'm happy not to borrow trouble.

I've lived in this neighborhood a long time, and there are a couple of old folks' homes a block or so from here. Both have excellent views, on one frontage each, of an extremely historic cemetery, which I have always found a wonderfully piquant example of my adopted hometown's occasionally wire-brush sense of humor. But I bring it up to mention that the old folks don't seem to have much concern for spoons other than to eat with, and they are protesting the present situation regularly and at considerable volume, and every time I pass about my errands I make a point of raising a fist and hollering "hell yeah!" just like most of the people who drive past honk in support.

Will you tell them it's pointless?

bananaflag 3 days ago | parent | prev [-]

I think people expected reasoning to be more than just trained chain of thought (which was known already at the time). On the other hand, it is impressive that CoT can achieve so much.

irthomasthomas 3 days ago | parent | prev | next [-]

Yeah, I don't know exactly what an AGI model will look like, but I think it would have more than a 200k context window.

doug_durham 3 days ago | parent | next [-]

Do you have a 200k context window? I don't. Most humans can only keep 6 or 7 things in short-term memory. Beyond those 6 or 7, you are pulling data from your latent space, or replacing one of the short-term slots with new content.

sixQuarks 3 days ago | parent [-]

But context windows for LLMs include all the “long term memory” things you’re excluding from humans

kadushka 3 days ago | parent [-]

Long term memory in an LLM is its weights.

echoangle 3 days ago | parent [-]

Not really, because humans can form long term memories from conversations, but LLM users aren’t finetuning models after every chat so the model remembers.

esafak 3 days ago | parent | next [-]

He's right, but most people don't have the resources, nor indeed the weights themselves, to keep training the models. But the weights are very much long term memory.

kadushka 3 days ago | parent | prev [-]

> users aren’t finetuning models after every chat

Users can do that if they want, but it’s more effective and more efficient to do that after every billion chats, and I’m sure OpenAI does it.

echoangle 3 days ago | parent [-]

If you want the entire model to remember everything it talked about with every user, sure. But ideally, I would want the model to remember what I told it a few million tokens ago, but not what you told it (because to me, the model should look like my private copy that only talks to me).

kadushka 2 days ago | parent [-]

> ideally, I would want the model to remember what I told it a few million tokens ago

Yes, you can keep finetuning your model on every chat you have with it. You can definitely make it remember everything you have ever said. LLMs are excellent at remembering their training data.

boznz 3 days ago | parent | prev | next [-]

I'm not quite AGI, but I work quite adequately with a much, much smaller memory. Maybe AGI just needs to know how to use other computers and work with storage a bit better.

kurthr 3 days ago | parent | prev [-]

I'd think it would be able to at least suggest which model to use rather than just having 6 for you to choose from.

chrsw 3 days ago | parent | prev | next [-]

I’m not an AI researcher, but I’m not convinced these contemporary artificial neural networks will get us to AGI, even assuming an acceleration of the current scaling pace. Maybe my definition of AGI is off, but what I think it means is a machine that can think, learn, and behave in the world in ways very close to a human. I think we need a fundamentally different paradigm for that: not something that is just trained and deployed like current models, but something that is constantly observing, constantly learning, and constantly interacting with the real world like we do. AHI, not AGI. True AGI may not exist because there are always compromises of some kind.

But, we don’t need AGI/AHI to transform large parts of our civilization. And I’m not seeing this happen either.

chpatrick 3 days ago | parent [-]

I feel like every time AI gets better we shift the goalposts of AGI to something else.

chrsw 3 days ago | parent [-]

I don't think we shift the goalposts for AGI. I'm not getting the sense that people are redefining what AGI is when a new model is released. I'm getting the sense that some people are thinking like me when a new model is released: we got a better transformer, and a more useful model trained on more or better data, but we didn't get closer to AGI. And people are saying this not because they've pushed out what AGI really means, they're saying this because the models still have the same basic use cases, the same flaws and the same limitations. They're just better at what they already do. Also, the better these models get at what they already do, the more starkly they contrast with human capabilities, for better or worse.

MoonGhost 3 days ago | parent | prev | next [-]

> Now we're up to o4, and AGI is still nowhere in sight (depending on your definition, I know)

It's not only a matter of definition. Some Googler was sure their model was conscious.

actsasbuffoon 3 days ago | parent | prev | next [-]

Meanwhile even the highest ranked models can’t do simple logic tasks. GothamChess on YouTube did some tests where he played against a bunch of the best models and every single one of them failed spectacularly.

They’d happily lose a queen to take a pawn. They failed to understand how pieces are even allowed to move, hallucinated the existence of new pieces, repeatedly declared checkmate when it wasn’t, etc.

I tried it last night with Gemini 2.5 Pro, and it made it 6 turns before it started making illegal moves, and 8 turns before it got so confused about the state of the board that it refused to play with me any longer.

I was in the chess club in 3rd grade. One of the top ranked LLMs in the world is vastly dumber than I was in 3rd grade. But we’re going to pour hundreds of billions into this in the hope that it can end my career? Good luck with that, guys.

schindlabua 3 days ago | parent | next [-]

Chess is not exactly a simple logic task. It requires you to keep track of 32 things in a 2d space.

I remember being extremely surprised when I could ask GPT-3 to rotate a 3D model of a car in its head and ask it what I would see when sitting inside, or which doors would refuse to open because they're in contact with the ground.

It really depends on how much you want to shift the goalposts on what constitutes "simple".

yusina 3 days ago | parent [-]

> Chess is not exactly a simple logic task.

Compared to what a software engineer is able to do, it is very much a simple logic task. Or to what the average person with a non-trivial job does. Or to a beehive organizing its existence, from its amino acids up to hive organization. All those things are orders of magnitude harder than chess.

> I remember being extremely surprised when I could ask GPT3 to rotate a 3d model of a car in it's head and ask it about what I would see when sitting inside, or which doors would refuse to open because they're in contact with the ground.

It's not reasoning its way there. Somebody asked something similar somewhere in the corpus, and that corpus also contained the answers. That's why it can answer. After quite a small number of moves, the chess board is unique and you can't fake it. You need to think ahead, a task which computers are traditionally very good at. Even trained chess players are. That LLMs are not goes to show that they are very far from AGI.

JFingleton 3 days ago | parent | prev | next [-]

I'm not sure why people are expecting a language model to be great at chess. Remember they are trained on text, which is not the best medium for representing things like a chess board. They are also "general models", with limited training on pretty much everything apart from human language.

An AlphaStar-type model would wipe the floor at chess.

actsasbuffoon 3 days ago | parent | next [-]

This misses the point. LLMs will do things like move a knight by a single square as if it were a pawn. Chess is an extremely well-understood game, and the rules about how pieces move are almost certainly well represented in the training data.

These models cannot even make legal chess moves. That’s incredibly basic logic, and it shows how LLMs are still completely incapable of reasoning or understanding. Many kinds of task are never going to be possible for LLMs unless that changes. Programming is one of those tasks.

og_kalu 3 days ago | parent | next [-]

>These models cannot even make legal chess moves. That’s incredibly basic logic, and it shows how LLMs are still completely incapable of reasoning or understanding.

Yeah they can. There's a link I shared to prove it which you've conveniently ignored.

LLMs learn by predicting, failing and getting a little better, rinse and repeat. Pre-training is not like reading a book. LLMs trained on chess games play chess just fine. They don't make the silly mistakes you're talking about and they very rarely make illegal moves.

There's gpt-3.5-turbo-instruct, which I already shared, and it plays at around 1800 Elo. Then there's this grandmaster-level chess transformer - https://arxiv.org/abs/2402.04494. There are also a couple of models that were trained in the EleutherAI Discord that reached about 1100-1300 Elo.

I don't know what the peak of LLM chess playing looks like, but this is clearly less of an 'LLMs can't do this' problem and more of an 'OpenAI/Anthropic/Google etc. don't care if their models can play chess or not' problem.

So are they capable of reasoning now, or would you like to shift the goalposts?

int_19h 3 days ago | parent [-]

I think the point here is that if you have to pretrain it for every specific task, it's not artificial general intelligence, by definition.

og_kalu 3 days ago | parent [-]

There isn't any general intelligence that isn't receiving pre-training. People spend 14 to 18+ years in school to have any sort of career.

You don't have to pretrain it for every little thing but it should come as no surprise that a complex non-trivial game would require it.

Even if you explained all the rules of chess clearly to someone brand new to it, it will be a while and lots of practice before they internalize it.

And like I said, LLM pre-training is less like a machine reading text and more like evolution. If you only gave it a corpus of chess rules, you'd be training a model that knows how to converse about chess rules.

Do humans require less 'pre-training' ? Sure, but then again, that's on the back of millions of years of evolution. Modern NNs initialize random weights and have relatively very little inductive bias.

sceptic123 2 days ago | parent [-]

People are focusing on chess, which is complicated, but an LLM fails at even simple games like tic-tac-toe, where you'd think, if it were capable of "reasoning", it would be able to understand where it went wrong. That doesn't seem to be the case.

What it can do is write and execute code to generate the correct output, but isn't that cheating?

int_19h 2 days ago | parent [-]

Which SOTA LLM fails at tic-tac-toe?

simonw 3 days ago | parent | prev [-]

Saying programming is a task that is "never going to be possible" for an LLM is a big claim, given how many people have derived huge value from having LLMs write code for them over the past two years.

(Unless you're arguing against the idea that LLMs are making programmers obsolete, in which case I fully agree with you.)

sceptic123 2 days ago | parent [-]

I think "useful as an assistant for coding" and "being able to program" are two different things.

When I was trying to understand what is happening with hallucination, GPT gave me this:

> It's called hallucinating when LLMs get things wrong because the model generates content that sounds plausible but is factually incorrect or made-up—similar to how a person might "see" or "experience" things that aren't real during a hallucination.

From that we can see that they fundamentally don't know what is correct. While they can get better at predicting correct answers, no one has explained how they are expected to cross the boundary from "sounding plausible" to "knowing they are factually correct". All the attempts so far seem to be about reducing the likelihood of hallucination, not fixing the problem that they fundamentally don't understand what they are saying.

Until/unless they are able to understand the output enough to verify the truth then there's a knowledge gap that seems dangerous given how much code we are allowing "AI" to write.

simonw 2 days ago | parent [-]

Code is one of the few applications of LLMs where they DO have a mechanism for verifying if what they produced is correct: they can write code, run that code, look at the output and iterate in a loop until it does what it's supposed to do.
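
A rough sketch of that loop with the Python SDK (the task, model name, and retry count here are just illustrative):

  import subprocess, tempfile
  from openai import OpenAI

  client = OpenAI()
  task = "Write a Python script that prints the first 10 Fibonacci numbers, one per line."
  feedback = ""

  for attempt in range(3):
      reply = client.chat.completions.create(
          model="o4-mini",
          messages=[{"role": "user", "content": task + feedback + "\nReturn only raw Python code, no markdown."}],
      ).choices[0].message.content
      code = reply.strip().removeprefix("```python").removesuffix("```")  # strip fences if the model adds them
      with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
          f.write(code)
      run = subprocess.run(["python", f.name], capture_output=True, text=True, timeout=10)
      if run.returncode == 0:
          print(run.stdout)  # it works: keep the output
          break
      feedback = "\nYour previous attempt failed with:\n" + run.stderr  # feed the error back and retry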

wijwp 3 days ago | parent | prev [-]

> I'm not sure why people are expecting a language model to be great at chess.

Because the conversation is about AGI, and how far away we are from AGI.

megablast 3 days ago | parent [-]

Does AGI mean good at chess?

What if it is a dumb AGI?

code_biologist 2 days ago | parent | prev | next [-]

Claude can't beat Pokemon Red. Not even close yet: https://arstechnica.com/ai/2025/03/why-anthropics-claude-sti...

og_kalu 3 days ago | parent | prev [-]

LLMs can play chess fine.

The best model you can play with is decent for a human - https://github.com/adamkarvonen/chess_gpt_eval

SOTA models can't play it because these companies don't really care about it.

LinuxAmbulance 3 days ago | parent | prev [-]

> We had serious news outlets write about senior people at OpenAI quitting because o1 was SkyNet

I wonder if any of the people that quit regret doing so.

Seems a lot like Chicken Little behavior - "Oh no, the sky is falling!"

How anyone with technical acumen thinks current AI models are conscious, let alone capable of writing new features and expanding their own abilities, is beyond me. Might as well be afraid of calculators revolting and taking over the world.

leesec 3 days ago | parent | prev | next [-]

"haven't actually done much" being popularizing the chat llm and absolutely dwarfing the competition in paid usage

caconym_ 3 days ago | parent | next [-]

Relative to the hype they've been spinning to attract investment, casting the launch and commercialization of ChatGPT as their greatest achievement really is a quite significant downgrade, especially given that they really only got there first because they were the first entity reckless enough to deploy such a tool to the public.

It's easy to forget what smart, connected people were saying about how AI would evolve by <current date> ~a year ago, when in fact what we've gotten since then is a whole bunch of diminishing returns and increasingly sketchy benchmark shenanigans. I have no idea when a real AGI breakthrough will happen, but if you're a person who wants it to happen (I am not), you have to admit to yourself that the last year or so has been disappointing---even if you won't admit it to anybody else.

littlestymaar 3 days ago | parent | prev | next [-]

ChatGPT was released two and a half years ago though. Pretty sure that at some point Sam Altman had promised us AGI by now.

The person you're responding to is correct that OpenAI feels a lot more stagnant than other players (like Google, which was nowhere to be seen even a year and a half ago and now has the leading model on pretty much every metric, but also DeepSeek, which built a competitive model in a year that runs much cheaper).

kadushka 3 days ago | parent [-]

> Google has the leading model on pretty much every metric

Correction: Google had the leading model for three weeks. Today it’s back to second place.

littlestymaar 3 days ago | parent [-]

press X to doubt

o3-mini wasn't even in second place for non-STEM tasks, and in today's announcement they don't even publish benchmarks for those. What's impressive about Gemini 2.5 Pro (and was also really impressive with R1) is how good the model is across a very broad range of tasks, not just benchmaxing on AIME.

kadushka 3 days ago | parent [-]

I had a philosophical discussion with the o3 model earlier today. It was much better than 2.5 Pro. In fact, it was pretty much what I would expect from a professional philosopher.

littlestymaar 3 days ago | parent [-]

I'm not expecting someone paying $200 a month to access something to be objective about that particular something.

Also, “what I would expect from a professional philosopher”: is that your argument, really?

kadushka 3 days ago | parent [-]

I’m paying $20/mo, and I’m paying the same for Gemini and for Claude.

What’s wrong with my argument? You questioned the performance of the model on non-STEM tasks, and I gave you my impression.

littlestymaar 3 days ago | parent [-]

Writing philosophy that looks convincing has been something LLMs do well since the first release of ChatGPT back in 2022 (in my country, back in early 2023, TV featured a kind of competition between ChatGPT and a philosopher turned media personality, with university professors blindly reviewing both essays and attempting to determine which was which).

To get an idea of how good a model is at non-STEM tasks, you need to challenge it on stuff that is harder than this for LLMs, like summarization without hallucination or creative writing. OpenAI's non-thinking models are usually very good at these, but not their thinking models, whereas other players (be it Google, Anthropic, or DeepSeek) manage to make models that can be very good at both.

kadushka 2 days ago | parent [-]

I've been discussing a philosophical topic (brain uploading) with all major models over the last two years. This is a topic I've read and thought about for a long time. Until o3, the responses I got from all other models (Gemini 2.5 Pro most recently) had been underwhelming - generic, high level, not interesting to an expert. They struggled to understand the points I was making and the ideas I wanted to explore. o3 was the first model that could keep up and provide interesting insights. It was communicating at the level of a professional in the field, though not an expert on this particular topic - this is a significant improvement over all existing models.

amarcheschi 3 days ago | parent | prev | next [-]

I guess it was related to the last period, rather than the full picture

flkenosad 3 days ago | parent [-]

What are people expecting here honestly? This thread is ridiculous.

spaceywilly 3 days ago | parent | prev | next [-]

They have 500M weekly users now. I would say that counts as doing something.

namaria 2 days ago | parent [-]

While bleeding cash faster than anything else in history.

iLoveOncall 3 days ago | parent | prev | next [-]

ChatGPT was released in 2022, so OP's point stands perfectly well.

echelon 3 days ago | parent [-]

They're rumored to be working on a social network to rival X, with the focus being on image generation.

https://techcrunch.com/2025/04/15/openai-is-reportedly-devel...

The play now seems to be less AGI, more "too big to fail" / use all the capital to morph into a FAANG bigtech.

My bet is that they'll develop a suite of office tools that leverage their model, chat/communication tools, a browser, and perhaps a device.

They're going to try to turn into Google (with maybe a bit of Apple and Meta) before Google turns into them.

Near-term, I don't see late-stage investors recouping their investment. But in time, this may work out well for them. There's a tremendous amount of inefficiency and lack of competition amongst the big tech players. They've been so large that nobody else could effectively challenge them. Now there's a "startup" with enough capital to start eating into big tech's more profitable business lines.

refulgentis 3 days ago | parent | next [-]

I don't know how anyone could look at any of this and say ponderously: it's basically the same as Nov 2022 ChatGPT. Thus strategically they're pivoting to social to become too big to fail.

echelon 3 days ago | parent [-]

I mean, it's not fucking AGI/ASI. No amount of LLM flip floppery is going to get us terminators.

If this starts looking differently and the pace picks up, I won't be giving analysis on OpenAI anymore. I'll start packing for the hills.

But to OpenAI's credit, I also don't see how minting another FAANG isn't an incredible achievement. Like - wow - this tech giant was willed into existence. Can't we marvel at that a little bit without worrying about LLMs doing our taxes?

refulgentis 3 days ago | parent | next [-]

I don't know what AGI/ASI means to you.

I'm bullish on the models, and my first quiet 5 minutes after the announcement were spent thinking about how many of the people I walked past would have different days if the computer Just Did It(tm). (I don't think their days would be different, so I'm not bullish on ASI-even-if-achieved, I guess?)

I think binary analysis that flips between "this is a propped up failure, like when banks get bailouts" and "I'd run away from civilization" isn't really worth much.

kadushka 3 days ago | parent | prev [-]

So to you AGI == terminators? Interesting.

hu3 3 days ago | parent | prev | next [-]

I appreciate the info and I have a question:

Why would anyone use a social network run by Sam Altman? No offense but his reputation is chaotic neutral to say the least.

Social networks require a ton of momentum to get going.

BlueSky already ate all the momentum that X lost.

flkenosad 3 days ago | parent | next [-]

Social networks have to be the most chaotic neutral thing ever made. It's like, "hey everyone! Come share what ever you want on my servers!"

echelon 3 days ago | parent | prev [-]

Most people don't care about techies or tech drama. They just use the platforms their friends do.

ChatGPT images are the biggest thing on social media right now. My wife is turning photos of our dogs into people. There's a new GPT4o meme trending on TikTok every day. Using GPT4o as the basis of a social media network could be just the kickstart a new social media platform needs.

fkyoureadthedoc 3 days ago | parent | prev | next [-]

Not surprising. Add comments to sora.com and you've got a social network.

echelon 3 days ago | parent [-]

Seriously. The users on sora.com are already trying to. They're sending messages to each other with the embedded image text and upvoting it.

GPT-4o and Sora are incredibly viral and organic, and they're taking over TikTok, Instagram, and all other social media.

If you're not watching casual social media you might miss it, but it's nothing short of a phenomenon.

ChatGPT is now the most downloaded app this month. Images are the reason for that.

chucky_z 3 days ago | parent [-]

Honestly I popped on sora.com the other day and the memes are great. I can totally understand where folks are coming from and why this is happening.

paul7986 3 days ago | parent | prev [-]

ChatGPT should be built into my iMessage threads with friends. @chatGPT "Is there an evening train on Thursdays from Brussels to Berlin?" That's something a friend and I were discussing, but we had to exit out of iMessage, use GPT, then go back to iMessage.

For UX, the GPT info in the thread would be collapsed by default, and both users would have the discretion to click to expand it.

swyx 3 days ago | parent | prev [-]

Seriously. The level of arrogance combined with ignorance is awe-inspiring.

buzzerbetrayed 3 days ago | parent [-]

True. They've blown their absolutely massive lead with power users to Anthropic and Google. So they definitely haven't done nothing.

nopinsight 3 days ago | parent | prev | next [-]

Research by METR suggests that frontier LLMs can perform software tasks over exponentially longer time horizons (measured by how long those tasks take human engineers), with a doubling roughly every 7 months. o3 is above the trend line.

https://x.com/METR_Evals/status/1912594122176958939

—-

The AlexNet paper, which kickstarted the deep learning era in 2012, was ahead of the 2nd-best entry by 11 percentage points. Many published AI papers at the time advanced SOTA by just a couple of percentage points.

o3 high is about 9 points ahead of o1 high on livebench.ai, and there are also quite a few testimonials about their differences.

Yes, AlexNet made major strides in other aspects as well, but it’s been just 7 months since o1-preview, the first publicly available reasoning model, which is a seminal advance beyond previous LLMs.

It seems some people have become desensitized to how rapidly things are moving in AI, despite its largely unprecedented pace of progress.

Ref:

- https://proceedings.neurips.cc/paper_files/paper/2012/file/c...

- https://livebench.ai/#/

kadushka 3 days ago | parent [-]

AlexNet improved the ImageNet error rate by 100*11/25 = 44%.

The o1-to-o3 error rate went from 28 to 19, so 100*9/28 = 32%.
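
(Spelling the same arithmetic out:)

  # relative reduction in error rate: what fraction of the remaining error went away
  def relative_reduction(old_err, new_err):
      return 100 * (old_err - new_err) / old_err

  print(relative_reduction(25, 14))  # ImageNet, AlexNet era: 44.0
  print(relative_reduction(28, 19))  # o1 high -> o3 high on livebench: ~32.1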

But these are meaningless comparisons because it’s typically harder to improve already good results.

paxys 3 days ago | parent | prev | next [-]

OpenAI isn't selling GPT-4 or o1 or o4-mini or turbo or whatever else to the general public. These announcements may as well be them releasing GPT v12.582.599385. No one outside of a small group of nerds cares. The end consumer is going to chatgpt.com and typing things in the box.

astrange 3 days ago | parent | next [-]

They have an enterprise business too. I think it's relevant for that.

M4v3R 3 days ago | parent [-]

And that’s exactly why their model naming and release process looks like this right now.

MoonGhost 3 days ago | parent | prev [-]

The $20 Plus subscription gives access to o1 and Deep Research (10 uses/month). I'm pretty sure the general public can get access through the API as well.

swat535 3 days ago | parent [-]

Right, and most people are not going to spend $200+/mo on ChatGPT. Maybe businesses will, but at this point they have too many choices.

w10-1 3 days ago | parent | prev | next [-]

> This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much

Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about. Subjectively, customers get locked in by feeling they have the inside track, and these small tweaks prove that. Objectively, the small change might make a real difference to the customer's use case.

Similarly, it's important to force development teams to actually ship, and shipping more frequently reduces risk, so this could reflect internal discipline.

As for media buzz, OpenAI is probably trying to tamp that down; they have plenty of first-mover advantage. More puffery just makes their competitors seem more important, and the risk to their reputation of a flop is a lot larger than the reward of the next increment.

As for "a bit much", before 2023 I was thinking I could meaningfully track progress and trade-off's in selecting tech, but now the cat is not only out of the bag, it's had more litters than I can count. So, yeah - a bit much!

sksxihve 3 days ago | parent [-]

> Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about

Or make important investors happy, they need to justify the latest $40 billion round

amarcheschi 3 days ago | parent | prev | next [-]

The old Chinese strategy of having 7343 different phone models with almost the same specs to confuse the customer better

kylehotchkiss 3 days ago | parent | next [-]

This sounds like recent Dell and Lenovo strategies

whalesalad 3 days ago | parent [-]

Recent? They've been doing this for decades.

person a: "I just got an new macbook pro!"

person b: "Nice! I just got a Lenovo YogaPilates Flipfold XR 3299 T92 Thinkbookpad model number SRE44939293X3321"

...

person a: "does that have oled?"

person b: "Lol no silly that is model SRE44939293XB3321". Notice the B in the middle?!?! That is for OLED.

anizan 3 days ago | parent [-]

They should launch o999 and count backwards for each release till they hit oagi

3 days ago | parent | prev | next [-]
[deleted]
MoonGhost 3 days ago | parent | prev [-]

Not only that. Filling search listings on eBay with your products is an old seller tactic. Try searching for a used Dell workstation or server and you will see pages and pages from the same seller.

kristofferR 3 days ago | parent | prev | next [-]

To use that criticism for this release ain't really fair, as these will replace the old models (o3 will replace o1, o4-mini will replace o3-mini).

On a more general level - sure, but they aren't planning to use this release to add a larger number of models, it's just that deprecating/killing the old models can't be done overnight.

drcongo 3 days ago | parent [-]

As someone who doesn't use anything OpenAI (for all the reasons), I have to agree with the GP. It's all baffling. Why is there an o3-mini and an o4-mini? Why on earth are there so many models?

Once you get to this point, you're putting the paradox of choice on the user. I used a particular brand of toothpaste for years, until it got to the point where I'd be in the supermarket looking at a wall of toothpaste, all by the same brand, with no discernible difference between the products. Why is one of them called "whitening"? Do the others not do that? Why is this one called "complete" and that one called "complete ultra"? That would suggest that the "complete" one wasn't actually complete. I stopped using that brand of toothpaste as it became impossible to know which was the right product within the brand.

If I was assessing the AI landscape today, where the leading models are largely indistinguishable in day to day use, I'd look at OpenAI's wall of toothpaste and immediately discount them.

tedsanders 3 days ago | parent | next [-]

(I work at OpenAI.)

In ChatGPT, o4-mini is replacing o3-mini. It's a straight 1-to-1 upgrade.

In the API, o4-mini is a new model option. We continue to support o3-mini so that anyone who built a product atop o3-mini can continue to get stable behavior. By offering both, developers can test both and switch when they like. The alternative would be to risk breaking production apps whenever we launch a new model and shut off developers without warning.
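
Concretely, an API developer can stay pinned to the dated snapshot they've already tested, roughly like this with the Python SDK (an illustrative snippet, not official docs):

  from openai import OpenAI

  client = OpenAI()

  # a dated snapshot keeps behavior stable; the bare alias would track whatever it currently points to
  resp = client.chat.completions.create(
      model="o3-mini-2025-01-31",  # the pinned snapshot the app was tested against
      messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
  )
  print(resp.choices[0].message.content)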

I don't think it's too different from what other companies do. Like, consider Apple. They support dozens of iPhone models with their software updates and developer docs. And if you're an app developer, you probably want to be aware of all those models and docs as you develop your app (not an exact analogy). But if you're a regular person and you go into an Apple store, you only see a few options, which you can personalize to what you want.

If you have concrete suggestions on how we can improve our naming or our product offering, happy to consider them. Genuinely trying to do the best we can, and we'll clean some things up later this year.

Fun fact: before GPT-4, we had a unified naming scheme for models that went {modality}-{size}-{version}, which resulted in names like text-davinci-002. We considered launching GPT-4 as something like text-earhart-001, but since everyone was calling it GPT-4 anyway, we abandoned that system to use the name GPT-4 that everyone had already latched onto. Kind of funny how our unified naming scheme originally made room for 999 versions, but we didn't make it past 3.

daveguy 3 days ago | parent | next [-]

Have any of the models been deprecated? It seems like a deprecation plan and definition of timelines would be extraordinarily helpful.

I have not seen any sort of "If you're using X.122, upgrade to X.123, before 202X. If you're using X.120, upgrade to anything before April 2026, because the model will no longer be available on that date." ... Like all operating systems and hardware manufacturers have been doing for decades.

Side note, it's amusing that stable behavior is only available on a particular model with a sufficiently low temperature setting. As near-AGI shouldn't these models be smart enough to maintain consistency or improvement from version to version?

tedsanders 3 days ago | parent [-]

Yep, we have a page of announced API deprecations here: https://platform.openai.com/docs/deprecations

It's got all deprecations, ordered by date of announcement, alongside shutdown dates and recommended replacements.

Note that we use the term deprecated to mean slated for shutdown, and shutdown to mean when it's actually shut down.

In general, we try to minimize developer pain by supporting models for as long as we reasonably can, and we'll give a long heads up before any shutdown. (GPT-4.5-preview was a bit of an odd case because it was launched as a potentially temporary preview, so we only gave a 3-month notice. But generally we aim for much longer notice.)

meander_water 3 days ago | parent [-]

On that page I don't see any mention of o3-mini. Is o3-mini a legacy model now which is slated to be deprecated later on?

tedsanders 3 days ago | parent [-]

Nothing announced yet.

Our hypothesis is that o4-mini is a much better model, but we'll wait to hear feedback from developers. Evals only tell part of the story, and we wouldn't want to prematurely deprecate a model that developers continue to find value in. Model behavior is extremely high dimensional, and it's impossible to prevent regression on 100% use cases/prompts, especially if those prompts were originally tuned to the quirks of the older model. But if the majority of developers migrate happily, then it may make sense to deprecate at some future point.

We generally want to give developers as stable an experience as possible, and not force them to swap models every few months whether they want to or not. Personally, I want developers to spend >99% of their time thinking about their business and <1% of their time thinking about what the OpenAI API is requiring of them.

dmd 3 days ago | parent | prev [-]

Any idea when v1/models will be updated? As of right now, https://api.openai.com/v1/models has "id": "o3-mini-2025-01-31" and "id": "o3-mini", but no plain 'o3'.
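
(For reference, I'm checking with roughly this via the Python SDK, with OPENAI_API_KEY set:)

  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  # print every model id this key can currently see
  for m in client.models.list():
      print(m.id)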

tedsanders 3 days ago | parent [-]

Ah, I know this is a pain, but by default o3 is only available to developers on tiers 4–5.

If you're in tiers 1–3, you can still get access - you just need to verify your org with us here:

https://help.openai.com/en/articles/10910291-api-organizatio...

I recognize that verification is annoying, but we eventually had to resort to this as otherwise bad actors will create zillions of accounts to violate our policies and/or avoid paying via credit card fraud/etc.

dmd 3 days ago | parent [-]

Aha! Verified and now I see o3. Thanks.

petesergeant 3 days ago | parent | prev | next [-]

> Why is there an o3-mini and an o4-mini? Why on earth are there so many models?

Because if they removed access to o3-mini — which I have tested, costed, and built around — I would be very angry. I will probably switch to o4-mini when the time is right.

TuxSH 3 days ago | parent [-]

They just did that, at least for chat

petesergeant 2 days ago | parent [-]

It seems clear to me I would have built an app around the API, not the chat window.

mkozlows 3 days ago | parent | prev | next [-]

They keep a lot of models around for backward compatibility for API users. This is confusing, but not inherently a bad idea.

louthy 3 days ago | parent | prev [-]

You could develop an AI model to help pick the correct AI model.

Now you’ve got 18 problems.

skygazer 3 days ago | parent [-]

I think you're trying to re-contextualize the old Standards joke, but I actually think you're right -- if a front-end model could dispatch as appropriate to the best back-end model for a given prompt, and turn everything into a high-level sort of mixture of models, I think that would be great, and a great simplifying step. Then they can specialize and optimize all they want, CPU usage goes down, responses get better, and we only see one interface.

louthy 3 days ago | parent | next [-]

> I think you're trying to re-contextualize the old Standards joke

Regex joke [1], but the standards joke will do just fine also :)

[1] Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

calmoo 3 days ago | parent | prev [-]

Isn't this basically the idea of agents?

mrcwinn 3 days ago | parent | prev | next [-]

Well, in fairness, Anthropic has fewer because 1) they started later, 2) they could learn from competitors' mistakes, 3) they focused on enterprise rather than consumer, and 4) they have fewer resources.

The point is taken — and OpenAI agrees. They have said they are actively working on simplifying the offering. I just think it's a bit unfair. We have perfect hindsight today here on HackerNews and also did zero of the work to produce the product.

whalesalad 3 days ago | parent | prev | next [-]

Model fatigue is a real thing - particularly with their billing model, which is wildly different from model to model and gives you more headroom as you spend more. We spend a lot of time and effort running tests across many models to balance that cost/performance ratio. When you can run 300k tokens per min on a shittier model, or 10k tokens per min on a better model, you want to use the cheaper model, but if the performance isn't there then you've got to pivot. Can I use tools here? Can I use function calling here? Do I use the chat API, the chat completions API, or the responses API? Do any of those work with the model I want to use, or only with other models?

I almost wonder if this is intentional... because when you create a quagmire of insane interdependent billing scenarios, you end up with a product like AWS that can generate substantial amounts of revenue from sheer ignorance or confusion. Then you can hire special consultants to come in and offer solutions to your customers to help them wade through the muck.

Dealing with OpenAI's APIs is a straight-up nightmare.

crowcroft 3 days ago | parent | prev | next [-]

Most industries, or categories go through cycles of fragmentation and consolidation.

AI is currently in a high-growth expansion phase. This leads to rapid iteration and fragmentation, because getting things released is the most important thing.

When the models start to plateau, or the demands on the industry shift toward profit, you will see consolidation start.

airstrike 3 days ago | parent [-]

Having many models from the same company in some haphazard strategy doesn't equate to "industry fragmentation". It's just confusion.

crowcroft 3 days ago | parent [-]

OpenAI's continued growth and press coverage relative to their peers leads me to believe it isn't *just* confusion, even if it is confusing.

airstrike 3 days ago | parent [-]

I'd attribute that more to first mover advantage than a benefit from poor naming choices, though I do think they are likely to misattribute that to a causal relationship so that they keep doing the latter

resters 3 days ago | parent | prev | next [-]

They do this because people like to have predictability. A new model may behave quite differently on something that’s important for a use case.

Also, there are a lot of cases where very small models are just fine and others where they are not. It would always make sense to have the smallest highest performing models available.

t-writescode 3 days ago | parent [-]

I have *no idea* why you're being downvoted on this.

If I want to take advantage of a new model, I must validate that the structured queries I've made to the older models still work on the new models.

The last time I did a validation and update. Their Responses. Had. Changed.

API users need dependability, which means they need older models to keep being usable.
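
For example, a rough sketch of the kind of regression check I mean (the prompt, schema, and model names are illustrative, and it assumes JSON mode is available on both models):

  import json
  from openai import OpenAI

  client = OpenAI()
  PROMPT = 'Return JSON with keys "city" and "country" for: "I flew from Berlin to Tokyo."'

  def ask(model: str) -> dict:
      resp = client.chat.completions.create(
          model=model,
          messages=[{"role": "user", "content": PROMPT}],
          response_format={"type": "json_object"},  # ask for strict JSON
      )
      return json.loads(resp.choices[0].message.content)

  old, new = ask("o3-mini"), ask("o4-mini")
  assert old.keys() == new.keys(), f"schema drift: {old} vs {new}"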

resters 3 days ago | parent [-]

> I have no idea why you're being downvoted on this.

I probably offended someone at YC and my account is being punished.

jstummbillig 3 days ago | parent | prev | next [-]

I cannot believe that this is what we feel is most worth talking about here (by visibility). At this point I truly wonder if AI is what will make HN side with the Luddites.

siva7 3 days ago | parent | next [-]

Is there some new HN with more insightful discussions?

flkenosad 3 days ago | parent | prev [-]

It's giving "they took our jerbs"

Seattle3503 3 days ago | parent | prev | next [-]

This seems like a perfect use case for "agentic" AI. OpenAI can enrich the context window with the strengths and weaknesses of each model, and when a user prompts for something, the model can say "Hey, I'm gonna switch to another model that is better at answering this sort of question," and the user can accept or reject.
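
A rough sketch of that kind of dispatcher (the routing table, category labels, and model names are all made up for illustration):

  from openai import OpenAI

  client = OpenAI()

  ROUTES = {
      "quick factual lookup": "gpt-4o-mini",
      "multi-step math or code": "o4-mini",
      "long-form analysis": "o3",
  }

  def answer(prompt: str) -> str:
      # ask a cheap model which category fits, then dispatch to the mapped model
      choice = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[{"role": "user", "content": f"Pick exactly one of {list(ROUTES)} for this request and reply with just that phrase:\n{prompt}"}],
      ).choices[0].message.content.strip()
      model = ROUTES.get(choice, "o4-mini")  # fall back if the router free-texts
      resp = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
      return f"[{model}] {resp.choices[0].message.content}"

The user could then be shown which model was picked and accept or reject before the second call actually runs.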

kgeist 3 days ago | parent | prev | next [-]

> This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model

OpenAI's progress lately:

  2024 December - first reasoning model (official release)

  2025 February - Deep Research

  2025 March - true multi-modal image generation

  2025 April - reasoning model with tools

I'm not sure why people say they haven't done much. We couldn't even dream of stuff like this five years ago, and now releasing groundbreaking/novel features every month is considered "meh"... I think we're spoiled and can't appreciate anything anymore :)
jasondigitized 3 days ago | parent | prev | next [-]

If there are incremental gains in each release, why would they hold them back? The amount of exhaust coming off of each release is gold for the internal teams. The naming convention is bad, and the CPO just admitted as much on Lenny's podcast, but I am not sure why incremental releases are a bad thing.

vunderba 3 days ago | parent | prev | next [-]

> This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much.

Did you miss the 4o image generation announcement from roughly three weeks ago?

https://news.ycombinator.com/item?id=43474112

Combining a multimodal LLM with image generation puts them pretty significantly ahead of the curve, at least in that domain.

Demonstration of the capabilities:

https://mordenstar.com/blog/chatgpt-4o-images

irthomasthomas 3 days ago | parent | prev | next [-]

That would explain why they all have a knowledge cutoff (likely training date) of ~August 2023.

3 days ago | parent | prev | next [-]
[deleted]
wilg 3 days ago | parent | prev | next [-]

There are 9 models in the ChatGPT model picker and they have stated that it's their goal to get rid of the model picker because everyone finds it annoying.

danielmarkbruce 3 days ago | parent | prev | next [-]

Think for 30 seconds about why they might in good faith do what they do.

Do you use any of them? Are you a developer? Just because a model is non-deterministic it doesn't mean developers don't want some level of consistency, whether it be about capabilities, cost, latency, call structure etc.

ren_engineer 3 days ago | parent | prev | next [-]

You'd think they could use AI to determine the best model for your use case so you don't even have to think about it. Run the first few API calls in parallel, grade the results, and then send the rest to whatever works best.

onlyrealcuzzo 3 days ago | parent | prev [-]

> All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model rather than updating the existing ones.

That's not a problem in and of itself. It's only a problem if the models aren't good enough.

Judging by ChatGPT's adoption, people seem to think they're doing just fine.