| ▲ | simianwords 5 days ago |
| My interpretation of the progress: 3.5 to 4 was the biggest leap. It went from being a party trick to legitimately useful sometimes. It did hallucinate a lot, but I was still able to get some use out of it; I wouldn't count on it for most things, however. It could answer simple questions and mostly get them right, but never one or two levels deep. I clearly remember 4o was also a decent leap - the accuracy increased substantially. It could answer niche questions without much hallucination. I could essentially replace Google with it for basic to slightly complex fact checking. 4o was the first time I actually considered paying for this tool; the $20 price was finally worth it. The o1 models were also a big leap over 4o (I realise I have been saying "big leap" too many times but it is true). The accuracy increased again and I got even more confident using it for niche topics; I had to verify the results much less often. Oh, and coding capabilities dramatically improved here in the thinking model - o1 essentially invented oneshotting: slightly non-trivial apps could be made from a single prompt for the first time. The o3 jump was incremental, and so was GPT-5. |
|
| ▲ | furyofantares 4 days ago | parent | next [-] |
| I have a theory about why it's so easy to underestimate long-term progress and overestimate short-term progress. Before a technology hits a threshold of "becoming useful", it may have a long history of progress behind it. But that progress is only visible to, and felt by, researchers. In practical terms, there is no progress being made as long as the thing is going from not-useful to still not-useful. So then it goes from not-useful to useful-but-bad and it feels like instantaneous progress. Then as more applications cross the threshold, and as they go from useful-but-bad to useful-but-OK, progress all feels very fast. Even if it's the same speed as before. So we overestimate short-term progress because we overestimate how fast things are moving when they cross these thresholds. But then as fewer applications cross the threshold, and as things go from OK-to-decent instead of bad-to-OK, that progress feels a bit slowed. And again, it might not be any different in reality, but that's how it feels. So then we underestimate long-term progress because we've extrapolated a slowdown that might not really exist. I think it's also why we see a divide where there are lots of people here who are way overhyped on this stuff, and also lots of people here who think it's all totally useless. |
| |
| ▲ | svantana 4 days ago | parent | next [-] | | > why it's so easy to underestimate long-term progress and overestimate short-term progress I dunno, I think that's mostly post-hoc rationalization. There are equally many cases where long-term progress has been overestimated after some early breakthroughs: think space travel after the moon landing, supersonic flight after the concorde, fusion energy after the H-bomb, and AI after the ENIAC. Turing himself guesstimated that human-level AI would arrive in the year 2000. The only constant is that the further into the future you go, the harder it is to predict. | | |
| ▲ | brookst 4 days ago | parent [-] | | With the exception of h-bomb/fusion and ENIAC/AI, I think all of those examples reflect a change in priority and investment more than anything. There was a trajectory of high investment / rapid progress, then market and social and political drivers changed and space travel / supersonic flight just became less important. | | |
| ▲ | svantana 4 days ago | parent [-] | | That's the conceit for the tv show For All Mankind - what if the space race didn't end? But I don't buy it, IMO the space race ended for material reasons rather than political. Space is just too hard and there is not much of value "out there". But regardless, it's a futile excuse, markets and politics should be part of any serious prognostication. | | |
| ▲ | ndiddy 4 days ago | parent | next [-] | | I think it was a combination of the two. The Apollo program was never popular. It took up an enormous portion of the federal budget, which the Republicans argued was fiscally unwise and the Democrats argued that the money should have been used to fund domestic social programs. In 1962, the New York Times noted that the projected Apollo program budget could have instead been used to create over 100 universities of a similar size to Harvard, build millions of homes, replace hundreds of worn-out schools, build hundreds of hospitals, and fund disease research. The Apollo program's popularity peaked at 53% just after the moon landing, and by April 1970 it was back down to 40%. It wasn't until the mid-80s that the majority of Americans thought that the Apollo program was worth it. Because of all this, I think it's inevitable that the Apollo program would wind down once it had achieved its goal of national prestige. | | |
| ▲ | burnerRhodo 3 days ago | parent [-] | | But think about that... what if, in the '70s, they had used the budget to build millions of homes instead? The moral there is that tech progress does not always mean social progress. |
| |
| ▲ | 4 days ago | parent | prev | next [-] | | [deleted] | |
| ▲ | brookst 4 days ago | parent | prev [-] | | I think the space race ended because we got all the benefit available, which wasn’t really in space anyway, it was the ancillary technical developments like computers, navigation, simulation, incredible tolerances in machining, material science, etc. We’re seeing a resurgence in space because there is actually value in space itself, in a way that scales beyond just telecom satellites. Suddenly there are good reasons to want to launch 500 times a year. There was just a 50-year discontinuity between the two phases. | | |
| ▲ | ghurtado 4 days ago | parent [-] | | > I think the space race ended because we got all the benefit available We did get all the things that you listed but you missed the main reason it was started: military superiority. All of the other benefits came into existence in service of this goal. |
|
|
|
| |
| ▲ | strken 4 days ago | parent | prev | next [-] | | I think that for a lot of examples, the differentiating factor is infrastructure rather than science. The current wave of AI needed fast, efficient computing power in massive data centres powered by a large electricity grid. The textiles industry in England needed coal mining, international shipping, tree trunks from the Baltic region, cordage from Manilla, and enclosure plus the associated legal change plus a bunch of displaced and desperate peasantry. Mobile phones took portable radio transmitters, miniaturised electronics, free space on the spectrum, population density high enough to make a network of towers economically viable, the internet backbone and power grid to connect those towers to, and economies of scale provided by a global shipping industry. Long term progress seems to often be a dance where a boom in infrastructure unlocks new scientific inquiry, then science progresses to the point where it enables new infrastructure, then the growth of that new infrastructure unlocks new science, and repeat. There's also lag time based on bringing new researchers into a field and throwing greater funding into more labs, where the infrastructure is R&D itself. | |
| ▲ | xbmcuser 4 days ago | parent | prev | next [-] | | There is also an adoption curve. The people that grew up without it won't use it as much as children that grew up with it and know how to use it. My sister is an admin in a private school (not in the USA) and the owner of the school is someone willing to adopt new tech very quickly, so he got all the school admins subscriptions for ChatGPT. At the time my sister used to complain a lot about being overworked and having to bring work home every day. Two years later she uses it for almost everything, and despite her duties increasing she says she gets a lot more done and rarely has to bring work home. And in the past they had an English major specifically to go over all correspondence to make sure there were no grammatical or language mistakes; that person was assigned a different role as she was no longer needed. I think as newer generations used to using LLMs start getting into the workforce and into higher roles, the real effect of LLMs will be felt more broadly, as currently, apart from early adopters, the number of people that use LLMs for all the things they can be used for is still not that high. | |
| ▲ | hirako2000 4 days ago | parent | prev | next [-] | | GPT-3 is when the masses started to get exposed to this tech, and it felt like a revolution. GPT-3.5 felt like things were improving super fast and created the feeling that the near future would be unbelievable. By the 4/4o series, it felt like things had improved, but users weren't as thrilled as with the leap to 3.5. You can call that bias, but clearly the version 5 improvements show an even greater slowdown, and that's 2 long years since GPT-4. For context:
- GPT-3 came out in 2020
- GPT-3.5 in 2022
- GPT-4 in 2023
- GPT-4o and company, 2024
After 3.5 things slowed down, in terms of impact at least. Larger context windows, multi-modality, mixture of experts, and more efficiency: all great, significant features, but all pale compared to the impact made by RLHF already 4 years ago. | |
| ▲ | vczf 4 days ago | parent | prev | next [-] | | The more general pattern is “slowly at first, then all at once.” It almost universally describes complex systems. | |
| ▲ | heywoods 4 days ago | parent | prev [-] | | Your threshold theory is basically Amara's Law with better psychological scaffolding. Roy Amara nailed the what ("we tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run") [1] but you're articulating the why better than most academic treatments. The invisible-to-researchers phase followed by the sudden usefulness cascade is exactly how these transitions feel from the inside.
This reminds me of the CPU wars circa 2003-2005. Intel spent years squeezing marginal gains out of Pentium 4's NetBurst architecture, each increment more desperate than the last. From 2003 to 2005, Intel shifted development away from NetBurst to focus on the cooler-running Pentium M microarchitecture [2]. The whole industry was convinced we'd hit a fundamental wall. Then boom, Intel released dual-core processors under the Pentium D brand in May 2005 [2] and suddenly we're living in a different computational universe. But the multi-core transition wasn't sudden at all. IBM shipped the POWER4 in 2001, the first non-embedded microprocessor with two cores on a single die [3]. Sun had been preaching parallelism since the 90s. It was only "sudden" to those of us who weren't paying attention to the right signals.
Which brings us to the $7 trillion question: where exactly are we on the transformer S-curve? Are we approaching what Richard Foster calls the "performance plateau" in "Innovation: The Attacker's Advantage" [4], where each new model delivers diminishing returns? Or are we still in that deceptive middle phase where progress feels linear but is actually exponential?
The pattern-matching pessimist in me sees all the classic late-stage S-curve symptoms. The shift from breakthrough capabilities to benchmark gaming. The pivot from "holy shit it can write poetry" to "GPT-4.5-turbo-ultra is 3% better on MMLU." The telltale sign of technological maturity: when the marketing department works harder than the R&D team. But the timeline compression with AI is unprecedented. What took CPUs 30 years to cycle through, transformers have done in 5. Maybe software cycles are inherently faster than hardware. Or maybe we've just gotten better at S-curve jumping (OpenAI and Anthropic aren't waiting for the current curve to flatten before exploring the next paradigm).
As for whether capital can override S-curve dynamics... Christ, one can dream. IBM torched approximately $5 billion on Watson Health acquisitions alone (Truven, Phytel, Explorys, Merge) [5]. Google poured resources into Google+ before shutting it down in April 2019 due to low usage and security issues [6]. The sailing ship effect (coined by W.H. Ward in 1967, where new technology accelerates innovation in incumbent technology) [7] is real, but you can't venture-capital your way past physics. I think all this capital pouring into AI might actually accelerate S-curve maturation rather than extend it. All that GPU capacity, all those researchers, all that parallel experimentation? We're speedrunning the entire innovation cycle, which means we might hit the plateau faster too.
You're spot on about the perception divide imo. The overhyped folks are still living in 2022's "holy shit ChatGPT" moment, while the skeptics have fast-forwarded to 2025's "is that all there is?" Both groups are right, just operating on different timescales.
It's Schrödinger's S-curve, where things feel simultaneously revolutionary and disappointing, depending on which part of the elephant you're touching. The real question I have isn't whether we're approaching the limits of the current S-curve (we probably are), but whether there's another curve waiting in the wings. I'm not a researcher in this space, nor do I follow the AI research beat closely enough to weigh in, but hopefully someone in the thread can? With CPUs, we knew dual-core was coming because the single-core wall was obvious. With transformers, the next paradigm is anyone's guess. And that uncertainty, more than any technical limitation, might be what makes this moment feel so damn weird.
References:
[1] "Amara's Law" https://en.wikipedia.org/wiki/Roy_Amara
[2] "Pentium 4" https://en.wikipedia.org/wiki/Pentium_4
[3] "POWER4" https://en.wikipedia.org/wiki/POWER4
[4] Innovation: The Attacker's Advantage - https://annas-archive.org/md5/3f97655a56ed893624b22ae3094116...
[5] IBM Watson Slate piece - https://slate.com/technology/2022/01/ibm-watson-health-failu...
[6] "Expediting changes to Google+" - https://blog.google/technology/safety-security/expediting-ch...
[7] "Sailing ship effect" https://en.wikipedia.org/wiki/Sailing_ship_effect | | |
| ▲ | techpineapple 4 days ago | parent [-] | | One thing I think is weird in the debate is that people seem to be equating LLMs with CPUs - a whole category of devices that process and calculate and can support endless architectural innovation. But what if LLMs are more like a specific implementation, like DSPs: sure, there are lots of interesting ways to make things sound better, but they're never going to fundamentally revolutionize computing as a whole. | |
| ▲ | vczf 4 days ago | parent [-] | | I think LLMs are more like the invention of high level programming languages when all we had before was assembly. Computers will be programmable and operable in “natural language”—for all of its imprecision and mushiness. |
|
|
|
|
| ▲ | stavros 4 days ago | parent | prev | next [-] |
| All the replies are spectacularly wrong, and biased by hindsight. GPT-1 to GPT-2 is where we went from "yes, I've seen Markov chains before, what about them?" to "holy shit this is actually kind of understanding what I'm saying!" Before GPT-2, we had plain old machine learning. After GPT-2, we had "I never thought I would see this in my lifetime or the next two". |
| |
| ▲ | reasonableklout 4 days ago | parent | next [-] | | I'd love to know more about how OpenAI (or Alec Radford et al.) even decided GPT-1 was worth investing more into. At a glance the output is barely distinguishable from Markov chains. If in 2018 you told me that scaling the algorithm up 100-1000x would lead to computers talking to people/coding/reasoning/beating the IMO I'd tell you to take your meds. | | |
| ▲ | arugulum 4 days ago | parent | next [-] | | GPT-1 wasn't used as a zero-shot text generator; that wasn't why it was impressive. The way GPT-1 was used was as a base model to be fine-tuned on downstream tasks. It was the first case of a (fine-tuned) base Transformer model just trivially blowing everything else out of the water. Before this, people were coming up with bespoke systems for different tasks (a simple example is that for SQuAD, a passage question-answering task, people would have an LSTM to read the passage and another LSTM to read the question, because of course those are different sub-tasks with different requirements and should have different sub-models). Once GPT-1 came out, you just dumped all the text into the context, YOLO fine-tuned it, and trivially got state of the art on the task. On EVERY NLP task. Overnight, GPT-1 single-handedly upset the whole field. It was somewhat overshadowed by BERT and T5 models that came out very shortly after, which tended to perform even better on the pretrain-and-finetune format. Nevertheless, the success of GPT-1 definitely already warranted scaling up the approach. A better question is how OpenAI decided to scale GPT-2 to GPT-3. It was an awkward in-between model. It generated better text for sure, but the zero-shot performance reported in the paper, while neat, was not great at all. On the flip side, its fine-tuned task performance paled compared to much smaller encoder-only Transformers. (The answer is: scaling laws allowed for predictable increases in performance.) | |
| ▲ | gnerd00 4 days ago | parent [-] | | > Transformer model just trivially blowing everything else out of the water no, this is the winners rewriting history. Transformer style encoders are now applied to lots and lots of disciplines but they do not "trivially" do anything. The hype re-telling is obscuring the facts of history. Specifically in human language text translation, "Attention is All You Need" Transformers did "blow others out of the water" yes, for that application. | | |
| ▲ | arugulum 3 days ago | parent [-] | | My statement was >a (fine-tuned) base Transformer model just trivially blowing everything else out of the water "Attention is All You Need" was a Transformer model trained specifically for translation, blowing all other translation models out of the water. It was not fine-tuned for tasks other than what the model was trained from scratch for. GPT-1/BERT were significant because they showed that you can pretrain one base model and use it for "everything". |
|
| |
| ▲ | hadlock 4 days ago | parent | prev | next [-] | | There's a performance plateau with training time and number of parameters, and then once you get over "the hump" the error rate starts going down again almost linearly. GPT existed before OpenAI but it was theorized that the plateau was a dead end. The sell to VCs in the early GPT-3 era was "with enough compute, enough time, and enough parameters... it'll probably just start thinking and then we have AGI". Sometime around the o3 era they realized they'd hit a wall and performance actually started to decrease as they added more parameters and time. But yeah, basically at the time they needed money for more compute, parameters, and time. I would have loved to have been a fly on the wall in those "AGI" pitches. Don't forget Microsoft's agreement with OpenAI specifically concludes with the invention of AGI. At the time, getting over the hump, it really did look like we were going to do AGI in a few months. I'm really looking forward to "the social network" treatment movie about OpenAI whenever that happens. | |
| ▲ | whimsicalism 4 days ago | parent [-] | | source? i work in this field and have never heard of the initial plateau you are referring to | |
| |
| ▲ | muzani 4 days ago | parent | prev | next [-] | | I don't have a source for this (there are probably no sources from back then) but anecdotally, someone at an AI/ML talk said they just added more data and quality went up. Doubling the data doubled the quality. With other breakthroughs, people saw diminishing gains. It's sort of why Sam back then tweeted that he expected the amount of intelligence to double every N years. I have the feeling they kept on this until GPT-4o (which was a different kind of data). | |
| ▲ | robrenaud 4 days ago | parent [-] | | The mapping from input size to output quality is not linear. This is why we are in the regime of "build nuclear power plants to power datacenters": fixed-size improvements in loss require exponential increases in parameters/compute/data. | |
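To make that concrete - a rough sketch, assuming a Kaplan-style power law where loss falls as L(N) ≈ (N_c / N)^α in parameter count N, with α ≈ 0.076 used purely as an illustrative exponent (from the 2020 scaling-laws paper) rather than a claim about any particular model:

    # Sketch: under an assumed power law L(N) ~ (N_c / N)**alpha, cutting loss by a
    # fixed factor multiplies the required parameter count by factor**(1/alpha).
    ALPHA = 0.076  # illustrative exponent for parameters (Kaplan et al., 2020)

    def params_multiplier(loss_factor: float, alpha: float = ALPHA) -> float:
        """How many times more parameters are needed to cut loss by `loss_factor`."""
        return loss_factor ** (1.0 / alpha)

    for factor in (1.1, 1.5, 2.0):
        print(f"{factor}x lower loss -> ~{params_multiplier(factor):,.0f}x more parameters")

Under these assumptions, even a 10% loss reduction costs a few times more parameters, and halving the loss costs thousands of times more - which is the kind of blow-up being described.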
| ▲ | brookst 4 days ago | parent [-] | | Most of the reason we are re-commissioning a nuclear power plant is demand for quantity, not quality. If demand for compute had scaled this fast in the 1970’s, the sudden need for billions of CPUs would not have disproven Moore’s law. It is also true that mere doubling of training data quantity does not double output quality, but that’s orthogonal to power demand at inference time. Even if output quality doubled in that case, it would just mean that much more demand and therefore power needs. |
|
| |
| ▲ | kevindamm 4 days ago | parent | prev | next [-] | | Transformers can train models with much larger parameter counts than other model architectures (with the same amount of compute and time), so they have an evident advantage in terms of being able to scale. Whether scaling the models up to multi-billion parameters would eventually pay off was still a bet, but it wasn't a wild bet out of nowhere. | |
| ▲ | 4 days ago | parent | prev | next [-] | | [deleted] | |
| ▲ | stavros 4 days ago | parent | prev | next [-] | | I assume the cost was just very low? If it was 50-100k, maybe they figured they'd just try and see. | | | |
| ▲ | therein 4 days ago | parent | prev [-] | | Probably prior DARPA research or something. Also, slightly tangentially, people will tell me it was just that it was new and novel and that's why we were impressed, but I almost think things went downhill after ChatGPT 3. I felt like 2.5 (or whatever they called it) was able to give better insights from the model weights itself. The moment tool use became a thing and we started doing RAG and memory and search engine tool use, it actually got worse. I am also pretty sure we are lobotomizing the things that would feel closer to critical thinking by training it to be sensitive to the taboo of the day. I suspect earlier ones were less broken due to that. How would it distinguish and decide between knowing something from training and needing to use a tool to synthesize a response anyway? |
| |
| ▲ | ACCount37 4 days ago | parent | prev | next [-] | | GPT-2 was the first wake-up call - one that a lot of people slept through. Even within ML circles, there was a lot of skepticism or dismissive attitudes about GPT-2 - despite it being quite good at NLP/NLU. I applaud those who had the foresight to call it out as a breakthrough back in 2019. | | |
| ▲ | whimsicalism 4 days ago | parent [-] | | i think it was already pretty clear among practitioners by 2018 at the latest | | |
| ▲ | ACCount37 3 days ago | parent [-] | | It was obvious that "those AI architectures kick ass at NLP". It wasn't at all obvious that they might go all the way to something like GPT-4. I totally underestimated this back then myself. |
|
| |
| ▲ | faitswulff 4 days ago | parent | prev | next [-] | | What you're saying isn't necessarily mutually exclusive with what GP said. GPT-2 was the most impressive leap in terms of whatever LLMs pass off as cognitive abilities, but GPT 3.5 to 4 was actually the point at which it became a useful tool (I'm assuming to programmers in particular). GPT-2: really convincing stochastic parrot. GPT-4: can one-shot ffmpeg commands. | |
| ▲ | stavros 4 days ago | parent [-] | | Sure, but the GP said "the most major leap", and I disagree that that was 3.5 to 4. |
| |
| ▲ | paulddraper 4 days ago | parent | prev [-] | | That’s true, but not contradictory. |
|
|
| ▲ | jkubicek 5 days ago | parent | prev | next [-] |
| > I could essentially replace Google with it for basic to slightly complex fact checking. I know you probably meant "augment fact checking" here, but using LLMs for answering factual questions is the single worst use-case for LLMs. |
| |
| ▲ | rich_sasha 4 days ago | parent | next [-] | | I disagree. Some things are hard to Google, because you can't frame the question right. For example, you know the context but can only give a poor explanation of what you are after. Googling will take you nowhere, while LLMs will give you the right answer 95% of the time. Once you get an answer, it is easy enough to verify it. | |
| ▲ | mrandish 4 days ago | parent | next [-] | | I agree. Since I'm recently retired and no longer code much, I don't have much need for LLMs, but refining a complex, niche web search is the one thing where they're uniquely useful to me. It's usually when targeting the specific topic involves several keywords which have multiple plain English meanings that return a flood of erroneous results. Because LLMs abstract keywords to tokens based on underlying meaning, you can specify the domain in the prompt and it'll usually select the relevant meanings of multi-meaning terms - which isn't possible in general-purpose web search engines. So it helps narrow down closer to the specific needle I want in the haystack. As other posters said, relying on LLMs for factual answers to challenging questions is error prone. I just want the LLM to give me the links and I'll then assess veracity like a normal web search. I think a web search interface that allowed disambiguating multi-meaning keywords might be even better. | |
| ▲ | yojo 4 days ago | parent [-] | | I’ll give you another use: LLMs are really good at unearthing the “unknown unknowns.” If I’m learning a new topic (coding or not) summarizing my own knowledge to an LLM and then asking “what important things am I missing” almost always turns up something I hadn’t considered. You’ll still want to fact check it, and there’s no guarantee it’s comprehensive, but I can’t think of another tool that provides anything close without hours of research. | | |
| ▲ | elictronic 4 days ago | parent [-] | | Coworkers and experts in a field. I can trust them much more but the better they are the less access you have. |
|
| |
| ▲ | LoganDark 4 days ago | parent | prev | next [-] | | > Some things are hard to Google, because you can't frame the question right. I will say LLMs are great for taking an ambiguous query and figuring out how to word it so you can fact check with secondary sources. Also tip-of-my-tongue style queries. | |
| ▲ | bloudermilk 4 days ago | parent | prev | next [-] | | If you’re looking for a possibly correct answer to an obscure question, that’s more like fact finding. Verifying it afterward is the “fact checking” step of that process. | |
| ▲ | crote 4 days ago | parent | prev | next [-] | | A good part of that can probably be attributed to how terrible Google has gotten over the years, though. 15 years ago it was fairly common for me to know something exists, be able to type the right combination of very specific keywords into Google, and get the exact result I was looking for. In 2025 Google is trying very hard to serve the most profitable results instead, so it'll latch onto a random keyword, completely disregard the rest, and serve me whatever ad-infested garbage it thinks is close enough to look relevant for the query. It isn't exactly hard to beat that - just bring back the 2010 Google algorithm. It's only a matter of time before LLMs will go down the same deliberate enshittification path. | |
| ▲ | KronisLV 4 days ago | parent | prev | next [-] | | > For example you know context and a poor explanation of what you are after. Googling will take you nowhere, LLMs will give you the right answer 95% of the time. This works nicely when the LLM has a large knowledgebase to draw upon (formal terms for what you're trying to find, which you might not know) or the ability to generate good search queries and summarize results quickly - with an actual search engine in the loop. Most large LLM providers have this, even something like OpenWebUI can have search engines integrated (though I will admit that smaller models kinda struggle, couldn't get much useful stuff out of DuckDuckGo backed searches, nor Brave AI searches, might have been an obscure topic). | |
| ▲ | littlestymaar 4 days ago | parent | prev [-] | | It's not the LLM alone though, it's “LLM with web search”, and as such 4o isn't really a leap at all there (IIRC Perplexity was using an early Llama version and was already very good, long before OpenAI added web search to ChatGPT). |
| |
| ▲ | mkozlows 4 days ago | parent | prev | next [-] | | Modern ChatGPT will (typically on its own; always if you instruct it to) provide inline links to back up its answers. You can click on those if it seems dubious or if it's important, or trust it if it seems reasonably true and/or doesn't matter much. The fact that it provides those relevant links is what allows it to replace Google for a lot of purposes. | | |
| ▲ | pram 4 days ago | parent | next [-] | | It does citations (Grok and Claude etc do too) but I've found when I read the source on some stuff (GitHub discussions and so on) it sometimes actually has nothing to do with what the LLM said. I've actually wasted a lot of time trying to find the actual spot in a threaded conversation where the example was supposedly stated. | | |
| ▲ | sarchertech 4 days ago | parent [-] | | Same experience with Google search AI. The links frequently don’t support the assertions, they’ll just say something that might show up in a google search for the assertion. For example if I’m asking about whether a feature exists in some library, the AI says yes it does and links to a forum where someone is asking the same question I did, but no one answered (this has happened multiple times). | | |
| ▲ | Nemi 4 days ago | parent [-] | | It is funny, Perplexity seems to work much better in this use case for me. When I want some sort of "conclusive answer", I use Gemini Pro (just what I have available). It is good with coding, formulating thoughts, rewriting text, and so on. But when I want to actually search for content on the web for, say, product research or opinions on a topic, Perplexity is so much better than either Gemini or Google search AI. It lists reference links for each block of assertions that are EASILY clicked on (unlike Gemini or search AI, where the references are just harder to click on for some reason, not the least of which is that they OPEN IN THE SAME TAB, whereas Perplexity always opens in a new tab). This is often a Reddit-specific search, as I want people's opinions on something. Perplexity's search UI is the one thing it does just so much better than Google's offering, and it's the main thing going for it. I think there is some irony there. Full disclosure, I don't use Anthropic or OpenAI, so this may not be the case for those products. |
|
| |
| ▲ | platevoltage 4 days ago | parent | prev [-] | | In my experience, 80% of the links it provides are either 404, or go to a thread on a forum that is completely unrelated to the subject. I'm also someone who refuses to pay for it, so maybe the paid versions do better. Who knows. | |
| ▲ | cout 4 days ago | parent | next [-] | | The 404 links are truly bizarre. Nearly every link to github.com seems to be 404. That seems like something that should be trivial for a tool to verify. | | |
| ▲ | weatherlite 4 days ago | parent | next [-] | | > The 404 links are truly bizarre. Nearly every link to github.com seems to be 404. That seems like something that should be trivial for a tool to verify.
Same issue with Gemini. Intuitively I'd also assume it's trivial to fix, but perhaps there's more going on than we think. Perhaps validating every part of a response is a big overhead, both financially and in that it might even throw off the model and make it less accurate in other ways. | |
| ▲ | platevoltage 4 days ago | parent | prev [-] | | Yeah. The fact that I can't ask ChatGPT for a source makes the tool way less useful. It will straight up say "I verified all of these links" too. | | |
| ▲ | mh- 4 days ago | parent [-] | | As you identified, not paying for it is a big part of the issue. Running these things is expensive, and they're just not serving the same experience to non-paying users. One could argue this is a bad idea on their part, letting people get a bad taste of an inferior product. And I wouldn't disagree, but I don't know what a sustainable alternative approach is. | | |
| ▲ | xigoi 4 days ago | parent | next [-] | | Surely the cost of sending a few HTTP requests and seeing if they 404 is negligible compared to AI inference. | |
| ▲ | platevoltage 3 days ago | parent | prev [-] | | I would have no issue if the free version of ChatGPT told me straight up “You gotta pay for links and sources”. It doesn’t do that. | | |
| ▲ | mh- 14 hours ago | parent [-] | | 100% agree with that, as I alluded to in my last sentence. And that honestly seems like it might be a good product strategy in the short term. |
|
|
|
| |
| ▲ | mkozlows 4 days ago | parent | prev [-] | | That's a thing I've experienced, but not remotely at 80% levels. | | |
| ▲ | platevoltage 3 days ago | parent [-] | | It might have been the subject I was researching being insanely niche. I was using it to help me fix an arcade CRT monitor from the 80’s that wasn’t found in many cabinets that made it to the USA. It would spit out numbers that weren’t on the schematic, so I asked for context. |
|
|
| |
| ▲ | password54321 5 days ago | parent | prev | next [-] | | This was true before it could use search. Now the worst use-case is for life advice because it will contradict itself a 100 times over while sounding confident each time on life-altering decisions. | |
| ▲ | SirHumphrey 4 days ago | parent | prev | next [-] | | Most of the value I got from Google was just becoming aware that something exists. LLMs do far better in this regard. Once I know something exists, it's usually easy enough to use traditional search to find official documentation or a more reputable source. | |
| ▲ | oldsecondhand 4 days ago | parent | prev | next [-] | | The most useful feature of LLMs is giving sources (with URL preferably). It can cut through a lot of SEO crap, and you still get to factcheck just like with a Google search. | | |
| ▲ | sefrost 4 days ago | parent | next [-] | | I like using LLMs and I have found they are incredibly useful writing and reviewing code at work. However, when I want sources for things, I often find they link to pages that don't fully (or at all) back up the claims made. Sometimes other websites do, but the sources given to me by the LLM often don't. They might be about the same topic that I'm discussing, but they don't seem to always validate the claims. If they could crack that problem it would be a major major win for me. | | |
| ▲ | joegibbs 4 days ago | parent [-] | | It would be difficult to do with a raw model, but a two-step method in a chat interface would work - first the model suggests the URLs, then a tool call fetches them and returns the actual text of the pages, and the response can be based on that. | |
| ▲ | mh- 4 days ago | parent [-] | | I prototyped this a couple months ago using OpenAI APIs with structured output. I had it consume a "deep thought" style output (where it provides inline citations with claims), and then convert that to a series of assertions and a pointer to a link that supposedly supports the assertion. I also split out a global "context" (the original meaning) paragraph to provide anything that would help the next agents understand what they're verifying. Then I fanned this out to separate (LLM) contexts and each agent verified only one assertion::source pair, with only those things + the global context and some instructions I tuned via testing. It returned a yes/no/it's complicated for each one. Then I collated all these back in and enriched the original report with challenges from the non-yes agent responses. That's as far as I took it. It only took a couple hours to build and it seemed to work pretty well. |
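A minimal sketch of that fan-out shape (the model name, prompts, and schema here are illustrative assumptions, not the original prototype; the only API feature assumed is the Chat Completions JSON-schema response_format option):

    import json
    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    client = OpenAI()

    # Each verifier returns a strict yes/no/complicated verdict as JSON.
    VERDICT_SCHEMA = {
        "name": "verdict",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "verdict": {"type": "string", "enum": ["yes", "no", "complicated"]},
                "note": {"type": "string"},
            },
            "required": ["verdict", "note"],
            "additionalProperties": False,
        },
    }

    def verify_one(context: str, assertion: str, source_text: str) -> dict:
        # One agent sees only the global context plus a single assertion::source pair.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model name
            messages=[
                {"role": "system", "content": "Judge whether the source supports the assertion."},
                {"role": "user", "content": f"Context:\n{context}\n\nAssertion:\n{assertion}\n\nSource:\n{source_text}"},
            ],
            response_format={"type": "json_schema", "json_schema": VERDICT_SCHEMA},
        )
        return json.loads(resp.choices[0].message.content)

    def verify_report(context: str, pairs: list[tuple[str, str]]) -> list[dict]:
        # Fan the assertion::source pairs out to independent verifier calls.
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(lambda p: verify_one(context, *p), pairs))

Collating the non-"yes" verdicts back into the original report is then plain post-processing.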
|
| |
| ▲ | IgorPartola 4 days ago | parent | prev [-] | | From what I have seen, a lot of what it does is read articles also written by AI or forum posts with all the good and bad that comes with that. |
| |
| ▲ | cm2012 4 days ago | parent | prev | next [-] | | On average, they outperform asking humans, unless you are asking an expert. | |
| ▲ | lottin 4 days ago | parent [-] | | When I have a question, I don't usually "ask" that question and expect an answer. I figure out the answer. I certainly don't ask the question to a random human. | | |
| ▲ | gnerd00 4 days ago | parent [-] | | you ask yourself .. for most people, that means closer to average reply, from yourself, when you try to figure it out. There is a working paper from McKinnon Consulting in Canada that states directly that their definition of "General AI" is when the machine can match or exceed fifty percent of humans who are likely to be employed for a certain kind of job. It implies that low-education humans are the test for doing many routine jobs, and if the machine can beat 50% (or more) of them with some consistency, that is it. | | |
| ▲ | lottin 3 days ago | parent [-] | | By definition the average answer will be average, that's kind of a tautology. The point is that figuring things out is an essential intellectual skill. Figuring things out will make you smarter. Having a machine figure things out for you will make you dumber. By the way, doing a better job than the average human is NOT a sign of intelligence. Through history we have invented plenty of machines that are better at certain tasks than us. None of them are intelligent. |
|
|
| |
| ▲ | yieldcrv 4 days ago | parent | prev | next [-] | | It covers 99% of my use cases. And it is googling behind the scenes in ways I would never think to query and far faster. When I need to cite a court case, well the truth is I'll still use GPT or a similar LLM, but I'll scrutinize it more and at the bare minimum make sure the case exists and is about the topic presented, before trying to corroborate the legal strategy with a new context window, different LLM, google, reddit, and different lawyer. At least I'm no longer relying on my own understanding, and what 1 lawyer procedurally generates for me. | |
| ▲ | Spivak 5 days ago | parent | prev | next [-] | | It doesn't replace legitimate source finding, but LLM vs. the top Google results is no contest, which says more about Google and the current state of the web than about the LLMs at this point. | |
| ▲ | lightbendover 4 days ago | parent | prev | next [-] | | [dead] | |
| ▲ | marsven_422 4 days ago | parent | prev | next [-] | | [dead] | |
| ▲ | simianwords 5 days ago | parent | prev [-] | | Disagree. You have to try really hard and go very niche and deep for it to get some fact wrong. In fact I'll ask you to provide examples: use GPT-5 with thinking on and search disabled and get it to give you inaccurate facts for non-niche, non-deep topics. Non-niche meaning something that is taught at the undergraduate level and is relatively popular. Non-deep meaning you aren't going so deep as to confuse even humans, like solving an extremely hard integral. Edit: probably a bad idea, because this sort of "challenge" works only statistically, not anecdotally. Still interesting to find out. | |
| ▲ | malfist 5 days ago | parent | next [-] | | Maybe you should fact check your AI outputs more if you think it only hallucinates in niche topics | | |
| ▲ | simianwords 5 days ago | parent [-] | | The accuracy is high enough that I don't have to fact check too often. | | |
| ▲ | platevoltage 4 days ago | parent | next [-] | | I totally get that you meant this in a nuanced way, but at face value it sort of reads like... Joe Rogan has high enough accuracy that I don't have to fact check too often.
Newsmax has high enough accuracy that I don't have to fact check too often, etc. If you accept the output as accurate, why would fact checking even cross your mind? | | |
| ▲ | gspetr 4 days ago | parent | next [-] | | Not a fan of that analogy. There is no expectation (from a reasonable observer's POV) of a podcast host to be an expert at a very broad range of topics from science to business to art. But there is one from LLMs, even just from the fact that AI companies diligently post various benchmarks including trivia on those topics. | |
| ▲ | simianwords 4 days ago | parent | prev [-] | | Do you question everything your dad says? | | |
| |
| ▲ | collingreen 5 days ago | parent | prev | next [-] | | Without some exploratory fact checking how do you estimate how high the accuracy is and how often you should be fact checking to maintain a good understanding? | | |
| ▲ | simianwords 4 days ago | parent [-] | | I did initial tests so that I don't have to do it anymore. | | |
| ▲ | jibal 4 days ago | parent | next [-] | | Everyone else has done tests that indicate that you do. | | |
| ▲ | glenstein 4 days ago | parent [-] | | And this is why you can't use personal anecdotes to settle questions of software performance. Comment sections are never good at being accountable for how vibes-driven they are when selecting which anecdotes to prefer. |
| |
| ▲ | malfist 4 days ago | parent | prev [-] | | If there's one thing that's constant it's that these systems change. |
|
| |
| ▲ | mvdtnz 4 days ago | parent | prev [-] | | If you're not fact checking it how could you possibly know that? |
|
| |
| ▲ | JustExAWS 5 days ago | parent | prev [-] | | I literally just had ChatGPT create a Python program and it used .ends_with instead of .endswith. This was with ChatGPT 5. I mean it got a generic built in function of one of the most popular languages in the world wrong. | | |
| ▲ | simianwords 5 days ago | parent [-] | | "but using LLMs for answering factual questions" this was about fact checking. Of course I know LLM's are going to hallucinate in coding sometimes. | | |
| ▲ | JustExAWS 5 days ago | parent [-] | | So it isn't a "fact" that the built-in Python function that tests whether a string ends with a substring is "endswith"? See https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect If you know that a source isn't to be believed in an area you know about, why would you trust that source in an area you don't know about? Another funny anecdote: ChatGPT just got the Gell-Mann effect wrong. https://chatgpt.com/share/68a0b7af-5e40-8010-b1e3-ee9ff3c8cb... | |
| ▲ | simianwords 5 days ago | parent | next [-] | | It got it right with thinking which was the challenge I posed.
https://chatgpt.com/share/68a0b897-f8dc-800b-8799-9be2a8ad54... | | |
| ▲ | OnlineGladiator 4 days ago | parent [-] | | The point you're missing is it's not always right. Cherry-picking examples doesn't really bolster your point. Obviously it works for you (or at least you think it does), but I can confidently say it's fucking god-awful for me. | | |
| ▲ | glenstein 4 days ago | parent | next [-] | | >The point you're missing is it's not always right. That was never their argument. And it's not cherry picking to make an argument that there's a definable set of examples where it returns broadly consistent and accurate information that they invite anyone to test. They're making a legitimate point and you're strawmanning it and randomly pointing to your own personal anecdotes, and I don't think you're paying attention to the qualifications they're making about what it's useful for. | |
| ▲ | simianwords 4 days ago | parent | prev [-] | | Am I really the one cherry picking? Please read the thread. | | |
| ▲ | OnlineGladiator 4 days ago | parent [-] | | Yes. If someone gives an example of it not working, and you reply "but that example worked for me" then you're cherry picking when it works. Just because it worked for you does not mean it works for other people. If I ask ChatGPT a question and it gives me a wrong answer, ChatGPT is the fucking problem. | | |
| ▲ | simianwords 4 days ago | parent [-] | | The poster didn't use "thinking" model. That was my original challenge!! Why don't you try the original prompt using thinking model and see if I'm cherry picking? | | |
| ▲ | OnlineGladiator 4 days ago | parent [-] | | Every time I use ChatGPT I become incredibly frustrated with how fucking awful it is. I've used it more than enough, time and time again (just try the new model, bro!), to know that I fucking hate it. If it works for you, cool. I think it's dogshit. | | |
| ▲ | simianwords 4 days ago | parent | next [-] | | Share your examples so that it can be useful to everyone | |
| ▲ | glenstein 4 days ago | parent | prev | next [-] | | They just spent like six comments imploring you to understand that they were making a specific point: generally reliable on non-niche topics using thinking mode. And that nuance bounced off of you every single time as you keep repeating it's not perfect, dismiss those qualifications as cherry picking and repeat personal anecdotes. I'm sorry but this is a lazy and unresponsive string of comments that's degrading the discussion. | | |
| ▲ | OnlineGladiator 4 days ago | parent [-] | | The neat thing about HN is we can all talk about stupid shit and disagree about what matters. People keep upvoting me, so I guess my thoughts aren't unpopular and people think it's adding to the discussion. I agree this is a stupid comment thread, we just disagree about why. | | |
| ▲ | glenstein 3 days ago | parent [-] | | Again, they were making a specific argument with specific qualifications and you weren't addressing their point as stated. And your objections such as they are would be accounted for if you were reading carefully. You seem more to be completely missing the point than expressing a disagreement so I don't agree with your premise. |
|
| |
| ▲ | ninetyninenine 4 days ago | parent | prev | next [-] | | Objectively he didn't cherry pick. He responded to the person and it got it right when he used the "thinking" model WHICH he did specify in his original comment. Why don't you stick to the topic rather than just declaring it's utter dog shit. Nobody cares about your "opinion" and everyone is trying to converge on a general ground truth no matter how fuzzy it is. | | |
| ▲ | OnlineGladiator 4 days ago | parent [-] | | All anybody is doing here is sharing their opinion unless you're quoting benchmarks. My opinion is just as useless as yours, it's just some find mine more interesting and some find yours more interesting. How do you expect to find a ground truth from a non-deterministic system using anecdata? | | |
| ▲ | glenstein 3 days ago | parent [-] | | This isn't a people having different opinions thing, this is you overlooking specific caveats and talking past comments that you're not understanding. They weren't cherry picking, and they made specific qualifications about the circumstances where it behaves as expected, and your replies keep losing track of those details. | | |
| ▲ | OnlineGladiator 2 days ago | parent [-] | | And I think you're completely missing the point. And you say this comment thread is a waste and yet you keep replying. What exactly are you trying to accomplish here? Do you think repeating yourself for a fifth time is going to achieve something? | | |
| ▲ | glenstein 2 days ago | parent [-] | | The difference is I can name specific things that you are in fact demonstrably ignoring, and already did name them. You're saying you just have a different opinion, in an attempt to mirror the form of my criticism, but you can't articulate a comparable distinction and you're not engaging with the distinction I'm putting forward. | | |
| ▲ | OnlineGladiator 2 days ago | parent [-] | | So your goal here is to say the same thing over and over again and hope I eventually give the affirmation you so desperately need? You've already declared that you're right multiple times. Nobody cares but you. https://xkcd.com/386/ You might want to develop a sense of humor. You'll enjoy life more. | | |
| ▲ | glenstein a day ago | parent [-] | | My goal is to invite you to think critically about the specific caveats in the comment you are replying to instead of ignoring those caveats. They said that generally speaking using thinking mode on non niche topics they can get reliable answers, and invited anyone who disagreed with it to offer examples where it fails to perform as expected, a constructive structure for counter examples in case anyone disagreed. You basically ignored all of those specifics, and spuriously accused them of cherry picking when they weren't, and now you don't want to take responsibility for your own words and are using this conversation as a workshopping session for character attacks in hopes that you can make the conversation about something else. | | |
| ▲ | OnlineGladiator a day ago | parent [-] | | As I've said many times before, I am aware of everything you have said. I just don't care. You seem to be really upset that someone on the internet disagrees with you. And from my perspective, you are the one that has no self-awareness and is completely missing the point. You don't even understand the conversation we're having and yet you're constantly condescending. I'm sure if you keep repeating yourself though I'll change my mind. | | |
| ▲ | glenstein 9 hours ago | parent [-] | | Simianwords said: "use GPT 5 with thinking and search disabled and get it to give you inaccurate facts for non niche, non deep topics" and noted that mistakes were possible, but rare. JustExAWS replied with an example of getting Python code wrong and suggested it was a counter example. Simianwords correctly noted that their comment originally said thinking mode for factual answers on non-niche topics and posted a link that got the python answer right with thinking enabled. That's when you entered, suggesting that Simian was "missing" the point that GPT (not distinguishing thinking or regular mode), was "not always right". But they had already acknowledged multiple times that it was not always right. They said the accuracy was "high enough", noted that LLMs get coding wrong, and reiterating that their challenge was specifically about thinking mode. You, again without acknowledging the criteria they had noted previously, insisted this was cherry picking, missing the point that they were actually being consistent from the beginning, inviting anyone to give an example showing otherwise. At no point between then and here have you demonstrated an awareness of this criteria despite your protestations to the contrary. Instead of paying attention to any of the details you're insulting me and retreating into irritated resentment. | | |
| ▲ | OnlineGladiator 8 hours ago | parent [-] | | Thank you for repeating yourself again. It's really hammering home the point. Please, continue. |
|
|
|
|
|
|
|
|
| |
| ▲ | 4 days ago | parent | prev [-] | | [deleted] |
|
|
|
|
|
| |
| ▲ | cdrini 5 days ago | parent | prev [-] | | I sometimes feel like we throw around the word fact too often. If I misspell a wrd, does that mean I have committed a factual inaccuracy? Since the wrd is explicitly spelled a certain way in the dictionary? |
|
|
|
|
|
|
| ▲ | ralusek 5 days ago | parent | prev | next [-] |
| The real jump was 3 to 3.5. 3.5 was the first “ChatGPT.” I had tried GPT-3 and it was certainly interesting, but when they released 3.5 as ChatGPT, it was a monumental leap. 3.5 to 4 was also huge compared to what we see now, but 3.5 was really the first shock. |
| |
| ▲ | muzani 4 days ago | parent | next [-] | | ChatGPT was a proper product, but as an engine, GPT-3 (davinci-001) has been my favorite all the way until 4.1 or so. It's absolutely raw and they didn't even guardrail it. 3.5 was like Jenny from customer service. davinci-001 was like Jenny the dreamer trying to make ends meet by scriptwriting, who was constantly flagged for racist opinions. Both of these had an IQ of around 70 or so, so the customer service training made it a little more useful. But I mourn the loss of the "completion" way of interacting with AI vs "instruct" or "response". Unfortunately with all the money in AI, we'll just see companies develop things that "pass all benchmarks", resulting in more creations like GPT-5. Grok at least seems to be on a slightly different route. | | |
| ▲ | vineyardmike 4 days ago | parent | next [-] | | > 3.5 was like Jenny from customer service. davinci-001 was like Jenny the dreamer trying to make ends meet by scriptwriting, who was constantly flagged for racist opinions. How do you use the product to get this experience? All my questions warrant answers with no personality. | | | |
| ▲ | andai 4 days ago | parent | prev [-] | | davinci-002 is still available, and pretty close. |
| |
| ▲ | mat_b 4 days ago | parent | prev [-] | | This was my experience as well. 3.5 was the point where stackoverflow essentially became obsolete in my workflow. |
|
|
| ▲ | verelo 4 days ago | parent | prev | next [-] |
| Everyone talks about 4o so positively but I’ve never consistently relied on it in a production environment. I’ve found it to be inconsistent in JSON generation, and often its writing and its adherence to the system prompt were very poor. In fact it was a huge part of what got me looking closer at Anthropic’s models. I’m really curious what people did with it because, while it’s cool, it didn’t compare well in my real-world use cases. |
| |
| ▲ | althea_tx 4 days ago | parent | next [-] | | I preferred o3 for coding and analysis tasks, but appreciated 4o as a “companion model” for brainstorming creative ideas while taking long walks. Wasn’t crazy about the sycophancy but it was a decent conceptual field for playing with ideas. Steve Jobs once described the PC as a “bicycle for the mind.” This is how I feel when using models like 4o for meandering reflection and speculation. | |
| ▲ | nojs 4 days ago | parent | prev [-] | | For JSON generation (and most API things) you should be using “structured outputs”. |
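A minimal sketch of what that looks like against the Chat Completions API's JSON-schema response_format (the model name and schema fields are illustrative assumptions):

    import json
    from openai import OpenAI

    client = OpenAI()

    schema = {
        "name": "song",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {"title": {"type": "string"}, "bpm": {"type": "integer"}},
            "required": ["title", "bpm"],
            "additionalProperties": False,
        },
    }

    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{"role": "user", "content": "Invent a song title and a tempo in BPM."}],
        response_format={"type": "json_schema", "json_schema": schema},
    )
    # The output is constrained to match the schema, so parsing won't hit free-form prose.
    print(json.loads(resp.choices[0].message.content))

Constraining the output to the schema is what avoids the inconsistent-JSON failures described above.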
|
|
| ▲ | iammrpayments 5 days ago | parent | prev | next [-] |
| I must be crazy, because I clearly remember ChatGPT 4 being downgraded before they released 4o, and I felt it was a worse model with a different label. I even chose the old ChatGPT 4 when they gave me the option. I canceled my subscription around that time. |
| |
| ▲ | barrell 4 days ago | parent | next [-] | | Nah that rings a bell. 4o for me was the beginning of the end - a lot faster, but very useless for my purposes [1][2]. 4 was a very rocky model even before 4o, but shortly after the 4o launch it updated to be so much worse, and I cancelled my subscription. [1] I’m not saying it was a useless model for everyone, just for me. [2] I primarily used LLMs as divergent thinking machines for programming. In my experience, they all start out great at this, then eventually get overtrained and are terrible at this. Grok 3 when it came out had this same magic; it’s long gone now. | |
| ▲ | mastercheif 4 days ago | parent | prev [-] | | Not crazy. 4o was a hallucination machine. 4o had better “vibes” and was really good at synthesizing information in useful ways, but GPT-4 Turbo was a bigger model with better world knowledge. |
|
|
| ▲ | simonw 4 days ago | parent | prev | next [-] |
| 4o also added image input (previously only previewed in GPT4-vision) and enabled advanced voice mode audio input and output. |
|
| ▲ | whazor 4 days ago | parent | prev | next [-] |
| I think that the models 4o, o3, and 4.1 each have their own strengths and weaknesses - reasoning, performance, speed, tool usage, friendliness, etc. - and that for GPT-5 they put in a router that decides which model is best. I think they increased the major version number because their router outperforms every individual model. At work, I used a tool that could only call tasks. It would set up a plan, perform searches, read documents, then give advanced answers to my questions. But a problem I had was that it couldn't give a simple answer, like a summary; it would always spin up new tasks. So I copied the results over to a different tool and continued there. GPT-5 should do this all out of the box. |
|
| ▲ | helsinkiandrew 3 days ago | parent | prev | next [-] |
| It’s interesting that the Polymarket betting for “Which company has best AI model end of August?” went from heavily OpenAI to heavily Google when GPT-5 was released: https://polymarket.com/event/which-company-has-best-ai-model... |
|
| ▲ | atoav 4 days ago | parent | prev | next [-] |
| To me, 4 to 5 got much faster, but also worse. It much more often ignores explicit instructions like "generate 10 song-titles with varying length" and generates 10 song titles that are nearly identical in length. This worked somewhat well with version 3 already. |
| |
| ▲ | ath3nd 4 days ago | parent [-] | | Shows that they can't solve the fundamental problems: the technology, while amusing and of some utility, is a dead end if we are going after cognition. |
|
|
| ▲ | GaggiX 4 days ago | parent | prev | next [-] |
| The actual major leap was o1. Going from 3.5 to 4 was just scaling; o1 is a different paradigm that skyrocketed performance on math/physics problems (or reasoning more generally). It also made the model much more precise (essential for coding). |
|
| ▲ | senectus1 4 days ago | parent | prev | next [-] |
| When you adjust the improvements for the amount of debt incurred and the amount of profit made... ALL the versions are incremental. This isn't sustainable. |
|
| ▲ | jascha_eng 5 days ago | parent | prev [-] |
| The real leap was going from GPT-4 to Sonnet 3.5. 4o was meh, and o1 was barely better than Sonnet and slow as hell in comparison. The native voice mode of 4o is still interesting and not very deeply explored though, imo. I'd love to build a Chinese teaching app that can actually critique tones etc., but it isn't good enough for that. |
| |
| ▲ | simianwords 5 days ago | parent | next [-] | | It's strange how Claude achieves similar performance without reasoning tokens. Did you try advanced voice mode? Apparently it got a big upgrade during the GPT-5 release - it may solve what you are looking for. | |
| ▲ | Alex-Programs 4 days ago | parent | prev [-] | | Yeah, I'd love something where you pronounce a word and it critiques your pronunciation in detail. Maybe it could give you little exercises for each sound, critiquing it, guiding you to doing it well. If I were any good at ML I'd make it myself. |
|