starchild3001 | 4 days ago
A few data points that highlight the scale of progress in a year:

1. LMSYS (human preference benchmark): GPT-5 High currently scores 1463, compared to GPT-4 Turbo (2024-04-03) at 1323, a 140 Elo point gap. That translates into GPT-5 winning about two-thirds of head-to-head comparisons, with GPT-4 Turbo winning only one-third. In practice, people clearly prefer GPT-5's answers. (https://lmarena.ai/leaderboard)

2. Livebench.ai (reasoning benchmark with internet-new questions): GPT-5 High scores 78.59, while GPT-4o reaches just 47.43. Unfortunately, no direct GPT-4 Turbo comparison is available here, but against one of the strongest non-reasoning models, GPT-5 demonstrates a massive leap. (https://livebench.ai/)

3. IQ-style testing: In mid-2024, the best AI models scored roughly 90 on standard IQ tests. Today they are pushing 135, and the improvement holds even on unpublished, internet-unseen datasets. (https://www.trackingai.org/home)

4. IMO gold, vibe coding: A year ago, AI coding was limited to small code snippets, not whole vibe-coded applications. Vibe coding and strength in math have many applications across the sciences and engineering.

My verdict: Too often, critics miss the forest for the trees, fixating on mistakes while overlooking the magnitude of these gains. Errors are shrinking by the day, while the successes keep growing fast.
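The "two-thirds" figure in point 1 can be sanity-checked with the standard Elo expected-score formula. This is a generic calculation from the ratings quoted above, not something published by the leaderboard itself:

```python
def expected_score(rating_diff: float) -> float:
    """Probability the higher-rated model wins a pairwise comparison,
    per the standard Elo logistic formula E = 1 / (1 + 10^(-d/400))."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# GPT-5 High (1463) vs GPT-4 Turbo (1323): a 140-point gap.
p = expected_score(1463 - 1323)
print(f"Expected win rate for the higher-rated model: {p:.1%}")
```

This evaluates to about 0.69, i.e. roughly two wins in three head-to-head comparisons, consistent with the comment's claim.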
NoahZuniga | 4 days ago | parent
The 135 IQ result is on the Mensa Norway test; the offline test gives 120. It seems probable that questions similar to Mensa's are in the training data, so the 135 figure probably overestimates "general intelligence".