m_ke 20 hours ago

It's the new underpaid employee that you're training to replace you.

People need to understand that we have the technology to train models to do anything that you can do on a computer; the only thing that's missing is the data.

If you can record a human doing anything on a computer, we'll soon have a way to automate it.
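
To make this concrete, here's a minimal sketch of the kind of recording I mean, using the pynput library (the JSONL schema and file name are just illustrative):

    # Sketch: log a user's raw input events as timestamped JSONL.
    # Assumes the third-party pynput library; the schema is illustrative.
    import json
    import time
    from pynput import keyboard, mouse

    log_file = open("session.jsonl", "a")

    def log(event, **fields):
        log_file.write(json.dumps({"t": time.time(), "event": event, **fields}) + "\n")
        log_file.flush()

    def on_press(key):
        log("key_press", key=str(key))

    def on_click(x, y, button, pressed):
        log("mouse_click", x=x, y=y, button=str(button), pressed=pressed)

    # Pair this event stream with periodic screen captures and you have the
    # (state, action) trajectories that imitation learning needs.
    keyboard.Listener(on_press=on_press).start()
    with mouse.Listener(on_click=on_click) as listener:
        listener.join()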

xyzzy123 19 hours ago | parent | next [-]

Sure, but do you want abundance of software, or scarcity?

The price of having "star trek computers" is that people who work with computers have to adapt to the changes. Seems worth it?

worldsayshi 19 hours ago | parent | next [-]

My only objection here is that technology won't save us unless we also have a voice in how it is used. I don't think personal adaptation is enough for that. We need to adapt how we engage with power.

krackers 15 hours ago | parent | prev | next [-]

Abundance of services before abundance of physical resources seems like the worst of both worlds.

lanfeust6 13 hours ago | parent [-]

Aggressively expanding solar would make electrical power a solved problem, and previously hard-to-abate uses of kinetic energy are innovating to run on electricity instead of fossil fuels.

almostdeadguy 18 hours ago | parent | prev | next [-]

Both abundance and scarcity can be bad. If you can't imagine a world where abundance of software is a very bad thing, I'd suggest you have a limited imagination?

jimbokun 12 hours ago | parent | prev [-]

It’s not worth it because we don’t have the Star Trek culture to go with it.

Given current political and business leadership across the world, we are headed to a dystopian hellscape and AI is speeding up the journey exponentially.

agumonkey 19 hours ago | parent | prev | next [-]

It's a strange, morbid economic dependency. AI companies promise incredible things, but AI agents cannot produce them themselves; they need to eat you slowly first.

gtowey 18 hours ago | parent [-]

Perfect analogy for capitalism.

xnx 19 hours ago | parent | prev | next [-]

Exactly. If there's any opportunity around AI, it goes to those who have big troves of custom data (Google Workspace, Office 365, Adobe, Salesforce, etc.) or to consultants adding data capture/surveillance of workers (especially highly paid ones like engineers, doctors, lawyers).

mylifeandtimes 18 hours ago | parent | prev | next [-]

> the new underpaid employee that you're training to replace you.

and who is also compiling a detailed log of your every action (and inaction) into a searchable data store -- which will certainly never, NEVER be used against you

Gigachad 19 hours ago | parent | prev | next [-]

Data clearly isn't the only issue. LLMs have been trained on orders of magnitude more data than any person has ever seen.

polotics 19 hours ago | parent | prev | next [-]

How much practice do you have with agentic assistance in software development? Which rough edges, surprising failure modes, and unexpected strengths and weaknesses have you already identified?

How much do you wish someone else had done your favorite SOTA LLM's RLHF?

badgersnake 19 hours ago | parent | prev | next [-]

I think we’re past the “if only we had more training data” myth now. There are pretty obviously far more fundamental issues with LLMs than that.

m_ke 17 hours ago | parent [-]

I've been working in this field for a very long time, and I promise you: if you can collect a dataset of a task, you can train a model to repeat it.

The models do an amazing job interpolating, and I actually think the lack of extrapolation is a feature that will let us have amazing tools without as much risk of uncontrollable "AGI".

Look at Seedance 2.0: if a transformer can fit that, it can fit anything with enough data.

cesarvarela 19 hours ago | parent | prev [-]

LLMs have a large quantity of chess data and still can't play for shit.

dwohnitmok 19 hours ago | parent | next [-]

Not anymore. This benchmark measures LLM chess ability: https://github.com/lightnesscaster/Chess-LLM-Benchmark?tab=r.... LLMs are graded according to FIDE rules, so e.g. two illegal moves in a game lead to an immediate loss.

This benchmark doesn't include the latest models from the last two months, but Gemini 3 (with no tools) is already at 1750-1800 FIDE, which is roughly 1900-2000 USCF (about USCF expert level). That's enough to beat almost everyone at your local chess club.
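
The grading rule is simple to state in code. A minimal sketch of the arbiter loop using the python-chess library (get_move_from_llm is a hypothetical stand-in for however the model is actually prompted):

    # Sketch of the "two illegal moves = loss" rule, using python-chess.
    # get_move_from_llm() is a hypothetical stand-in for the real prompting.
    import chess

    def referee(get_move_from_llm, max_illegal=2):
        board = chess.Board()
        illegal = 0
        while not board.is_game_over():
            uci = get_move_from_llm(board.fen())  # the model sees the position
            try:
                move = chess.Move.from_uci(uci)
            except ValueError:
                move = None
            if move is None or move not in board.legal_moves:
                illegal += 1
                if illegal >= max_illegal:
                    return "forfeit"  # immediate loss under the FIDE-style rule
                continue  # ask again after the first illegal attempt
            board.push(move)
        return board.result()  # "1-0", "0-1" or "1/2-1/2"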

cesarvarela 19 hours ago | parent | next [-]

Yeah, but 1800 FIDE players don't make illegal moves, and Gemini does.

dwohnitmok 16 hours ago | parent | next [-]

1800 FIDE players do make illegal moves, though I believe one to two orders of magnitude fewer than Gemini 3 does here. IIRC the usual statistic is that about 0.02% of expert chess games contain an illegal move (I can look that up later if there's interest), but that only counts moves that made it into the final game notation (and weren't e.g. corrected at the board by an opponent or arbiter). So it should be a lower bound, which is why the gap could be as small as one order of magnitude, although I suspect two is still closer to the truth.

Whether LLMs will keep lowering their error rate enough to make up those orders of magnitude remains to be seen (I could see it going either way in the next two years, given the current rate of progress).

cesarvarela 13 hours ago | parent [-]

A player at that level who makes an illegal move is tired, distracted, drunk, etc. An LLM makes one because it does not really "understand" the rules of chess.

famouswaffles 18 hours ago | parent | prev [-]

That benchmark's methodology isn't great, but regardless, LLMs can be trained to play chess with a 99.8% legal-move rate.
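
For a sense of scale (the game length is my assumption, not a number from those sources), a 99.8% per-move rate compounds like this over a game:

    # How often a 99.8%-legal-per-move player produces a fully legal game.
    p_legal = 0.998
    moves = 40                  # assumed moves played by the model per game
    p_clean = p_legal ** moves  # ~0.923
    print(1 - p_clean)          # ~7.7% of games contain an illegal attempt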

recursive 17 hours ago | parent [-]

That doesn't exactly sound like strong chess play.

dwohnitmok 15 hours ago | parent [-]

It's enough to reliably beat amateur (e.g. maia-1900) chess engines.

overgard 15 hours ago | parent | prev | next [-]

They have literally every chess game in existence to train on, and they can't do better than 1800?

jimbokun 12 hours ago | parent [-]

Why do you think they won’t continue to improve?

runarberg 19 hours ago | parent | prev | next [-]

Wait, I may be missing something here. These benchmarks are gathered by having models play each other, and the second illegal move forfeits the game. This seems like a flawed method, since models that are more prone to illegal moves will inflate the ratings of models that are less prone to them.

Additionally, how do we know a model isn't benchmaxxed to eliminate illegal moves?

For example, here is the list of games by Gemini-3-pro-preview. In 44 games it performed 3 illegal moves (if I counted correctly) but won 5 because the opponent forfeited due to illegal moves.

https://chessbenchllm.onrender.com/games?page=5&model=gemini...

I suspect the ratings here may be significantly inflated due to a flaw in the methodology.

EDIT: I want to suggest a better methodology here (I am not gonna do it; I really really really don't care about this technology). Have the LLMs play rated engines and rated humans, where the first illegal move forfeits the game (the same rule applies to the humans).

dwohnitmok 16 hours ago | parent | next [-]

The LLMs do play rated engines (maia and eubos). They provide the baselines. Gemini e.g. consistently beats the different maia versions.

The rest is taken care of by Elo. That is, they then play each other as well, but it is not really possible for Gemini to have a higher Elo than maia with such a small sample size (and such weak other LLMs).

Elo doesn't let you inflate your score by playing low-ranked opponents if there are known baselines (rated engines), because the rated engines will promptly crush your Elo.

You could add humans into the mix, but the benchmark gets expensive.
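
The arithmetic is easy to check. A minimal sketch of the standard Elo expected-score and update formulas (the K-factor and ratings are illustrative):

    # Standard Elo model: expected score and post-game rating update.
    def expected(r_a, r_b):
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    def update(r_a, r_b, score, k=32):
        return r_a + k * (score - expected(r_a, r_b))

    # Beating a much weaker opponent is worth almost nothing:
    print(update(2100, 900, 1.0) - 2100)   # ~ +0.03
    # ...while a single loss to a genuinely 1900-rated anchor engine
    # costs an inflated 2100 rating about 24 points:
    print(update(2100, 1900, 0.0) - 2100)  # ~ -24.3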

emp17344 18 hours ago | parent | prev [-]

That’s a devastating benchmark design flaw. Sick of these bullshit benchmarks designed solely to hype AI. AI boosters turn around and use them as ammo, despite not understanding them.

famouswaffles 18 hours ago | parent | next [-]

Relax. Anyone who's genuinely interested in the question will see with a few searches that LLMs can play chess fine, although the post-trained models mostly seem to have regressed. The problem is that people are more interested in validating their own assumptions than anything else.

https://arxiv.org/abs/2403.15498

https://arxiv.org/abs/2501.17186

https://github.com/adamkarvonen/chess_gpt_eval

dwohnitmok 16 hours ago | parent | prev | next [-]

> That’s a devastating benchmark design flaw

I think parent simply missed until their later reply that the benchmark includes rated engines.

runarberg 18 hours ago | parent | prev [-]

I like this game between grok-4.1-fast and maia-1100 (engine, not LLM).

https://chessbenchllm.onrender.com/game/37d0d260-d63b-4e41-9...

This exact game has been played 60 thousand times on lichess. The piece sacrifice Grok performed on move 6 has been played 5 million times on lichess. Every single move Grok made is also the top played move on lichess.

This reminds me of Stefan Zweig's The Royal Game, where the protagonist survives Nazi torture by memorizing every game in a chess book his torturers dropped (excellent book btw; I am aware I just invoked Godwin's law, and of the irony of doing so). The protagonist became "good" at chess simply by memorizing a lot of games.
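
Anyone can reproduce this check against the public lichess opening-explorer endpoint. A sketch (the UCI move list is a placeholder for the actual game, and I'm assuming the returned "moves" array is sorted by popularity):

    # Check whether each move in a game is the top-played move on lichess,
    # via the public opening-explorer endpoint. Move list is a placeholder.
    import requests

    game_uci = ["e2e4", "e7e5", "g1f3"]  # substitute the actual game's moves

    played = []
    for i, move in enumerate(game_uci):
        data = requests.get(
            "https://explorer.lichess.ovh/lichess",
            params={"variant": "standard", "play": ",".join(played)},
        ).json()
        top = data["moves"][0]["uci"] if data["moves"] else None
        print(i + 1, move, "== top" if move == top else f"(top is {top})")
        played.append(move)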

famouswaffles 18 hours ago | parent [-]

The LLMs that can play chess (i.e., that don't make an illegal move every game) do not play simply by memorized lines.

deadbabe 19 hours ago | parent | prev [-]

Why do we care about this? Chess AI has long been a solved problem, and LLMs are just an overly brute-forced approach. They will never become very efficient chess players.

The correct solution is to have a conventional chess AI as a tool and use the LLM as a front end for humanized output. A software engineer who proposes doing it all via a raw LLM should be fired.
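
That split is a few lines with existing libraries. A sketch using python-chess's UCI wrapper (the local "stockfish" binary and the describe_for_user LLM call are assumptions):

    # Sketch: a conventional engine does the chess; the LLM only narrates.
    # Assumes a "stockfish" binary on PATH; describe_for_user() is a
    # hypothetical LLM call.
    import chess
    import chess.engine

    def best_move_tool(fen, think_time=0.5):
        board = chess.Board(fen)
        with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
            result = engine.play(board, chess.engine.Limit(time=think_time))
        return result.move.uci()

    # The LLM front end just wraps the tool's answer in prose:
    # move = best_move_tool(fen)
    # reply = describe_for_user(f"The engine suggests {move}; explain it simply.")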

rodiger 19 hours ago | parent [-]

It's a proxy for generalized reasoning.

The point isn't that LLMs are the best AI architecture for chess.

deadbabe 17 hours ago | parent | next [-]

Why? Beating chess is about searching a probability space, not reasoning.

Reasoning would be more like the car wash question.

famouswaffles 13 hours ago | parent [-]

It's not entirely clear how LLMs that can play chess do so, but it is clearly very different from how other machines do. They construct an internal representation of the board, they can estimate a player's skill and adjust accordingly, and, unlike other machines but similarly to humans, they are sensitive to how a position came about when predicting the next move.

Regardless, there's plenty of reasoning in chess.

runarberg 19 hours ago | parent | prev [-]

> It's a proxy for generalized reasoning.

And so far I am only convinced that they have succeeded at appearing to have generalized reasoning. That is, when an LLM plays chess it is performing Searle's Chinese Room thought experiment while claiming to pass the Turing test.

iugtmkbdfil834 19 hours ago | parent | prev | next [-]

Hm... but do they need it? At this point, we do have custom tools that beat humans. In a sense, all LLMs need is a way to connect to those tools (and the same is true for counting and many other tasks).

Windchaser 19 hours ago | parent [-]

Yeah, but you know that manually telling the LLM to operate other custom tools is not going to be a long-term solution. And if an LLM could design, create, and operate a separate model, and then return/translate its results to you, that would be huge, but it also seems far away.

But I'm ignorant here. Can anyone with a better background of SOTA ML tell me if this is being pursued, and if so, how far away it is? (And if not, what are the arguments against it, or what other approaches might deliver similar capacities?)

yunyu 18 hours ago | parent [-]

This has been happening for the past year on verifiable problems (did the change you made to your codebase work end-to-end, does this mathematical expression validate, did I win this chess match, etc.). The bulk of data, RL-environment, and inference spend right now is on coding agents (or, broadly speaking, tool-use agents that can make their own tools).

Recent advances in mathematical/physics research have all been with coding agents making their own "tools" by writing programs: https://openai.com/index/new-result-theoretical-physics/
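
The "verifiable" part is often as blunt as an exit code. A toy sketch of a coding-agent reward signal (pytest as the checker is just one example):

    # Toy verifiable reward for a coding agent: did the test suite pass
    # after applying the agent's change? pytest is one possible checker.
    import subprocess

    def reward(repo_dir: str) -> float:
        result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
        return 1.0 if result.returncode == 0 else 0.0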

BeetleB 19 hours ago | parent | prev | next [-]

Are you saying an LLM can't produce a chess engine that will easily beat you?

emp17344 18 hours ago | parent [-]

Plagiarizing Stockfish doesn’t make me good at chess. Same principle applies.

cindyllm 18 hours ago | parent [-]

[dead]

menaerus 19 hours ago | parent | prev [-]

Did you already forget about AlphaZero?