gertlabs 6 hours ago

Neither intelligence nor context is what really differentiates the most successful model for programming (Claude Opus 4.6) from slightly 'smarter' competitors (Codex 5.3, Gemini 3.1 Pro).

It's tool use and personality. If models stopped advancing today, we could still reach effective AGI with years of refining harnesses. There is still incredible untapped potential there.

I maintain a benchmark at https://gertlabs.com that pits models against each other in competitive, open-ended games. It's harder to game the benchmark because there's no correct answer (at least none that any of the models have gotten remotely close to), and success requires anticipating other players' behavior.

One thing I've found is that Codex and Gemini models tend to perform best at one-shotting problems. But when given a harness and tools to iterate toward a solution, Anthropic models keep improving, while Codex and Gemini struggle to use tools they weren't trained on, or to take the initiative to pursue high-level objectives.
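The harness-plus-tools setup described above can be sketched as a simple loop: the model proposes a tool call, the harness executes it and feeds the result back, repeating until the model declares it's done. Everything here is hypothetical for illustration; `call_model` is a scripted stand-in for a real LLM API, and the tool names are made up:

```python
# Minimal sketch of an agent harness (iterate-with-tools, as opposed
# to one-shotting). `call_model` stands in for a real LLM call.

TOOLS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def call_model(objective, history):
    # Hypothetical scripted model: works toward (2 + 3) * 4 step by step,
    # using the tool results fed back into `history`.
    if not history:
        return {"tool": "add", "args": (2, 3)}
    if len(history) == 1:
        return {"tool": "mul", "args": (history[-1], 4)}
    return {"done": True, "answer": history[-1]}

def run_harness(objective, max_steps=10):
    history = []
    for _ in range(max_steps):
        action = call_model(objective, history)
        if action.get("done"):
            return action["answer"]
        result = TOOLS[action["tool"]](*action["args"])
        history.append(result)  # feed the tool output back to the model
    raise RuntimeError("step budget exhausted")

print(run_harness("compute (2 + 3) * 4"))  # → 20
```

The point of the loop is that a model which reads tool results and adjusts can recover from a bad first step; a one-shot answer can't.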

mr_00ff00 6 hours ago | parent [-]

“If models stopped advancing today, we could still reach effective AGI with years of refining harnesses.”

Unless you’re a machine learning engineer with something new to share, our current models are not even close to AGI, and won’t get there.

My understanding (as just an engineer) is that LLMs continue to improve at crazy rates, but it’s clear they are not the answer for AGI.

gertlabs 6 hours ago | parent [-]

I think if I had asked most HN users for their requirements for AGI 8 years ago, today's models would already be well past them. Now that we see how artificial intelligence is actually unfolding, and how different it is from human intelligence, everyone is moving their goalposts (including me).

But if we're being honest, frontier LLMs are effectively more intelligent than a non-negligible proportion of the population (for example, at pretty much all white-collar IC work, pattern matching, problem solving, etc.). As for the ways most people are still smarter (having the sentience/emotions/desires that drive us to take initiative toward meaningful goals), I think it's great that AI does not match us there, but that also doesn't disqualify it from being intelligent. The harness can bridge the gap.