zamadatix 4 hours ago

I've always found LiveBench confusing to compare over time, since the dataset isn't meant to be compared over time. It also currently claims GPT-5 Mini High from last summer is within ~15% of Claude 4.5 Opus Thinking High Effort on the average, but I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up (or, more likely, to be told in 6 months how these 2 benchmarks weren't the ones that should matter either). Artificial Analysis at least puts the same gap at 20% from the top, so maybe that's the one we all agree to use for now, since it implies faster growth.

> FWIW GPT 5.2 unofficial marketing includes the Erdos thing you say isn’t happening.

Certainly not, unless you're about to tell me I can pop into ChatGPT and pop out Erdos proofs regularly, given that #728 was massaged out with multiple prompts and external tooling a few weeks ago - which is what I was writing about. It was great, it was exciting, but it's exactly the slow growth I'm talking about.

I like using LLMs, I use them regularly, and I'm hoping they continue to get better for a long time... but this is in no way the GPT 3 -> 3.5 -> 4 era of mind-boggling growth of frontier models anymore. At best, people are finding out how to attach various tooling to the models to eke more out as the models themselves very slowly improve.

nl an hour ago | parent | next [-]

> I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up

Appstore releases were roughly linear until July 25 and are up 60% since then:

https://www.coatue.com/c/takes/chart-of-the-day-2026-01-22

refulgentis an hour ago | parent [-]

One of the best surgically executed nukes on HN in my 16 years here.

refulgentis an hour ago | parent | prev [-]

See peer reply re: yes, your self-chosen benchmark has been reached.

Generally, I've learned to warn myself off a take when I start writing emotionally charged stuff like [1] without any prompting (who mentioned apps? and why claim it without checking?), while reading minds and assigning weak arguments to others, both now and in my imagined future. [2]

At the very least, [2] is a signal to give the keyboard a rest, and ideally my mind.

Bailey: > "If [there were] new LLMs...consistently solving Erdos problems at rapidly increasing rates then they'd be showing...that"

Motte: > "I can['t] pop into ChatGPT and pop out Erdos proofs regularly"

No less than Terence Tao pointed out, a month ago, that your bailey was newly happening with the latest generation: https://mathstodon.xyz/@tao/115788262274999408. Not sure how you only saw one Erdos problem.

[1] "I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up"

[2] "...or, more likely, be told in 6 months how these 2 benchmarks weren't the ones that should matter either"