raincole 3 hours ago

Even before this, Gemini 3 has always felt unbelievably 'general' to me. It can beat Balatro (ante 8) with a text description of the game alone[0]. Yeah, it's not an extremely difficult goal for humans, but considering:

1. It's an LLM, not something trained to play Balatro specifically

2. Most players (probably >99.9%) can't do that on their first attempt

3. I don't think many people have posted their Balatro playthroughs in text form online

I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, DeepSeek can't play Balatro at all.

[0]: https://balatrobench.com/

ankit219 4 minutes ago | parent | next [-]

Agreed. Gemini 3 Pro has always felt to me like it has a pretraining alpha, if you will, and many data points continue to support that. Even Flash, which was post-trained with different techniques than Pro, is as good as or better than Pro at tasks that require post-training, occasionally even beating it (e.g., in Apex bench from Mercor, which is basically a tool-calling test, simplifying a bit, Flash beats Pro). The score on ARC-AGI-2 is another data point in the same direction. Deepthink is sort of parallel test-time compute with some level of distilling and refinement from certain trajectories (guessing, based on my usage and understanding), same as GPT-5.2 Pro, and can extract more because of the pretraining datasets.

(I'm sort of basing this on papers like Limits of RLVR and the pass@k vs. pass@1 differences in RL post-training of models; this score just shows how "skilled" the base model was, or how strong the priors were. I apologize if this is not super clear; happy to expand on what I'm thinking.)
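For readers unfamiliar with the pass@k vs. pass@1 distinction mentioned above: pass@k is commonly computed with the unbiased estimator introduced in the Codex paper (Chen et al., 2021), which estimates the probability that at least one of k samples (drawn from n generated attempts, of which c are correct) solves the task. A minimal sketch (function name is mine):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of those samples that were correct
    k: budget of attempts being evaluated
    Returns P(at least one of k randomly chosen samples is correct).
    """
    if n - c < k:
        # Fewer incorrect samples than k: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A large gap between pass@1 and pass@k (e.g. `pass_at_k(100, 5, 1)` vs. `pass_at_k(100, 5, 32)`) is the kind of evidence the RLVR-limits line of work uses to argue that RL post-training mostly sharpens sampling of abilities already present in the base model.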

tl 5 minutes ago | parent | prev | next [-]

Per BalatroBench, gemini-3-pro-preview makes it to round (not ante) 19.3 ± 6.8 on the lowest difficulty on the deck aimed at new players. Round 24 is ante 8's final round. Per BalatroBench, this includes giving the LLM a strategy guide, which first-time players do not have. Gemini isn't even emitting legal moves 100% of the time.

ebiester 2 hours ago | parent | prev | next [-]

It's trained on YouTube data. It's going to get roffle and drspectred at the very least.

silver_sun 2 hours ago | parent | prev | next [-]

Google has a library of millions of scanned books from their Google Books project that started in 2004. I think we have reason to believe that there are more than a few books about effectively playing different traditional card games in there, and that an LLM trained with that dataset could generalize to understand how to play Balatro from a text description.

Nonetheless I still think it's impressive that we have LLMs that can just do this now.

mjamesaustin an hour ago | parent | next [-]

Winning in Balatro has very little to do with understanding how to play traditional poker. Yes, you do need a basic knowledge of different types of poker hands, but the strategy for succeeding in the game is almost entirely unrelated to poker strategy.

gilrain an hour ago | parent | prev [-]

If it tried to play Balatro using knowledge of, e.g., poker, it would lose badly rather than win. Have you played?

gcr an hour ago | parent [-]

I think I weakly disagree. Poker players have an intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.

barnas2 an hour ago | parent [-]

> Poker players have an intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.

Maybe in the early rounds, but deck fixing (e.g. Hanged Man, Immolate, Trading Card, DNA, etc) quickly changes that. Especially when pushing for "secret" hands like the 5 of a kind, flush 5, or flush house.

winstonp 3 hours ago | parent | prev | next [-]

DeepSeek hasn't been SotA in at least 12 calendar months, which might as well be a decade in LLM years

cachius 3 hours ago | parent [-]

What about Kimi and GLM?

zozbot234 an hour ago | parent [-]

These are well behind the general state of the art (1yr or so), though they're arguably the best openly-available models.

tgrowazay 26 minutes ago | parent [-]

According to the Artificial Analysis ranking, GLM-5 is at #4, after Claude Opus 4.5, GPT-5.2-xhigh, and Claude Opus 4.6.

tehsauce 15 minutes ago | parent | prev | next [-]

How does it do on gold stake?

dudisubekti 2 hours ago | parent | prev | next [-]

But... there's DeepSeek V3.2 in your link (rank 7).

littlestymaar 2 hours ago | parent | prev | next [-]

> I don't think there are many people who posted their Balatro playthroughs in text form online

There is *tons* of Balatro content on YouTube though, and there's absolutely no doubt that Google uses YouTube content to train their models.

sdwr 2 hours ago | parent [-]

Yeah, or just the Steam text guides would be a huge advantage.

I really doubt it's playing completely blind

acid__ 2 hours ago | parent | prev | next [-]

> Most (probably >99.9%) players can't do that at the first attempt

Eh, both myself and my partner did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.
