| ▲ | johnfn 4 days ago |
| Literally yesterday we had a post about GPT-5.2, which jumped 30% on ARC-AGI 2, hit 100% on AIME without tools, and posted a bunch of other impressive stats. A layman's reading (mine) of those numbers suggests the models continue to improve as fast as they always have. Then today we have people saying every iteration is further from AGI. What really perplexes me is how split-brain HN is on this topic. |
|
| ▲ | qouteall 4 days ago | parent | next [-] |
| Goodhart's law: When a measure becomes a target, it ceases to be a good measure. AI companies have a strong incentive to make scores go up. They may employ humans to write benchmark-like training data to game the benchmarks (while not directly training on the test set). Throwing the hard problems from your own work at an LLM is a better metric than any benchmark. |
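| (For concreteness: one standard, if imperfect, decontamination check is n-gram overlap between training documents and benchmark items - roughly what the GPT-3 paper did with 13-grams. A minimal Python sketch; the function names are illustrative, not from any real toolkit:) |

    def ngrams(text, n=13):
        # Lowercase, whitespace-tokenize, and collect every n-gram.
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def is_contaminated(train_doc, benchmark_items, n=13):
        # Flag a training document that shares any n-gram with a benchmark item.
        # This only catches verbatim leakage; paraphrased "similar-to-benchmark"
        # data passes untouched, which is exactly the loophole described above.
        doc_grams = ngrams(train_doc, n)
        return any(doc_grams & ngrams(item, n) for item in benchmark_items)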
| |
| ▲ | idopmstuff 4 days ago | parent [-] | | I own a business and am constantly working on using AI in every part of it, both for actual time savings and as my very practical eval. On the "can this successfully be used to do work that I do or pay someone else to do more quickly/cheaply/etc." eval, I can confirm that models are progressing nicely! | | |
| ▲ | unaesoj 4 days ago | parent [-] | | I work in construction. GPT-5.2 is the first model that has been able to produce a quantity takeoff for concrete and rebar from a set of drawings. I've been testing this since o1. |
|
|
|
| ▲ | vlovich123 4 days ago | parent | prev | next [-] |
| One classic problem in all of ML is ensuring the benchmark is representative and that the algorithm isn’t overfitting the benchmark. This remains an open problem for LLMs - we don’t have true AGI benchmarks, and LLMs frequently learn the benchmark problems without necessarily getting much better in the real world. Gemini 3 has been hailed precisely because it delivered huge gains across the board that don’t look like benchmark overfitting. |
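| (One way to probe for that kind of overfitting, sketched below, is to re-score a model on paraphrased benchmark items: if accuracy drops sharply on semantically identical rewordings, the model likely learned the items rather than the skill. `ask_model` is a stand-in for whatever query function you have, not a real API:) |

    def accuracy(ask_model, items):
        # items: list of (question, expected_answer) pairs
        return sum(ask_model(q) == a for q, a in items) / len(items)

    def memorization_gap(ask_model, originals, paraphrases):
        # Paraphrases keep meaning and answers fixed; only wording changes.
        # A large positive gap suggests the model learned the benchmark
        # items themselves rather than the underlying skill.
        return accuracy(ask_model, originals) - accuracy(ask_model, paraphrases)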
| |
| ▲ | ipaddr 4 days ago | parent [-] | | This could be a solved problem. Come up with problems that aren't online and compare. Later, use LLMs to sort through your problems and classify them from easy to difficult | | |
| ▲ | vlovich123 4 days ago | parent | next [-] | | Hard to do for an industry benchmark, since running the test requires sending the questions to the LLM provider, which basically puts them into a public training set. This has been tried multiple times by multiple people, and over time such benchmarks lose their immunity to “cheating”. |
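| (One partial mitigation people try is templated items whose concrete values are re-rolled on every run, so the exact question you scored is never the one you send next time - though, per the above, this only protects the instances, not the template. A toy sketch:) |

    import random

    def fresh_item(rng=random):
        # Re-roll concrete values on every run: the template may still leak
        # once sent to a provider, but a memorized past instance won't match
        # the next instance you actually score.
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        return f"What is {a} * {b}?", str(a * b)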
| ▲ | kalkin 4 days ago | parent | prev [-] | | How do you imagine existing benchmarks were created? |
|
|
|
| ▲ | FuckButtons 4 days ago | parent | prev | next [-] |
| HN is not an entity with a single perspective, and there are plenty of people on here who have a financial stake in you believing their perspective on the matter. |
| |
| ▲ | rester324 4 days ago | parent [-] | | My honest question: isn't simonw one of those people? It feels that way to me | | |
| ▲ | simonw 4 days ago | parent | next [-] | | You mean having a financial stake? Not really. I have a set of disclosures on my blog here: https://simonwillison.net/about/#disclosures I'm beginning to pick up a few more consulting opportunities based on my writing and my revenue from GitHub sponsors is healthy, but I'm not particularly financially invested in the success of AI as a product category. | | |
| ▲ | rester324 4 days ago | parent [-] | | Thanks for the link. I see that you get credits and access to embargoed releases. So I understand that's not a financial stake, but it seems like enough of an incentive to say positive things about those services, doesn't it? Not that it matters to me, and I might be wrong, but to an outsider it might seem so | | |
| ▲ | simonw 3 days ago | parent [-] | | Yeah it is, that's why I disclose this stuff. The counter-incentive here is that my reputation and credibility are more valuable to me than early access to models. This very post is an example of me taking a risk of annoying a company that I cover. I'm exposing the existence of the ChatGPT skills mechanism here (which I found out about from a tip on Twitter - it's not something I got given early access to via an NDA). It's very possible OpenAI didn't want that story out there yet and aren't happy that it's sat at the top of Hacker News right now. |
|
| |
| ▲ | yojat661 4 days ago | parent | prev [-] | | Of course he is |
|
|
|
| ▲ | noitpmeder 4 days ago | parent | prev | next [-] |
| Just because they're better at writing CS algorithms doesn't mean they're taking steps closer to anything resembling AGI. |
| |
|
| ▲ | tintor 4 days ago | parent | prev [-] |
| HN is not a single person. Different people on HN have different opinions. |
| |