starchild3001 | 4 days ago
A few data points that highlight the scale of progress in a year:

1. LMSYS (human preference benchmark): GPT-5 High currently scores 1463, compared to GPT-4 Turbo (2024-04-03) at 1323, a 140 Elo point gap. That translates into GPT-5 winning about two-thirds of head-to-head comparisons, with GPT-4 Turbo winning only one-third. In practice, people clearly prefer GPT-5's answers. (https://lmarena.ai/leaderboard)

2. Livebench.ai (reasoning benchmark with internet-new questions): GPT-5 High scores 78.59, while GPT-4o reaches just 47.43. Unfortunately, no direct GPT-4 Turbo comparison is available here, but against one of the strongest non-reasoning models, GPT-5 demonstrates a massive leap. (https://livebench.ai/)

3. IQ-style testing: In mid-2024, the best AI models scored roughly 90 on standard IQ tests. Today they are pushing 135, and the improvement holds even on unpublished, internet-unseen datasets. (https://www.trackingai.org/home)

4. IMO gold, vibe coding: A year ago, AI coding was limited to small code snippets, not whole vibe-coded applications. Vibe coding and strength in math have many applications across the sciences and engineering.

My verdict: Too often, critics miss the forest for the trees, fixating on mistakes while overlooking the magnitude of these gains. Errors are shrinking by the day, while the successes keep growing fast.
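The "two-thirds" figure in point 1 can be sanity-checked with the standard Elo expected-score formula. This is a generic calculation from the ratings quoted above, not something published by the leaderboard itself:

```python
def expected_score(rating_diff: float) -> float:
    """Probability the higher-rated model wins a pairwise comparison,
    per the standard Elo logistic formula E = 1 / (1 + 10^(-d/400))."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# GPT-5 High (1463) vs GPT-4 Turbo (1323): a 140-point gap.
p = expected_score(1463 - 1323)
print(f"Expected win rate for the higher-rated model: {p:.1%}")
```

This evaluates to about 0.69, i.e. roughly two wins in three head-to-head comparisons, consistent with the comment's claim.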
NoahZuniga | 4 days ago | parent
The 135 IQ result is on the Mensa Norway test; the offline test gives 120. It seems probable that questions similar to Mensa's are in the training data, so the 135 figure probably overestimates "general intelligence".