| ▲ | Which AI Lies Best? A game theory classic designed by John Nash(so-long-sucker.vercel.app) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 45 points by lout332 4 hours ago | 33 comments | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | techjamie 3 hours ago | parent | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There's a YouTuber who makes AI Plays Mafia videos with various models going against each other. They also seemingly let past games stay in context to some extent. What people have noted is that often times chatgpt 4o ends up surviving the entire game because the other AIs potentially see it as a gullible idiot and often the Mafia tend to early eliminate stronger models like 4.5 Opus or Kimi K2. It's not exactly scientific data because they mostly show individual games, but it is interesting how that lines up with what you found. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | stevage 22 minutes ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sure would be handy if they actually included the rules anywhere. There's a kind of overview of the rules but not enough to actually play with. And the linked video is super confusing, self contradictory and 15 minutes long! For a supposedly "simple" game...just include the rules? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | zone411 an hour ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
For people interested in these kinds of benchmarks, I have two multiplayer, multi-round games: - Elimination Game Benchmark: Social Reasoning, Strategy, and Deception in Multi-Agent LLM Dynamics at https://github.com/lechmazur/elimination_game/ - Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure at https://github.com/lechmazur/step_game/ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | PlatoIsADisease 13 minutes ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I asked chatGPT to give me a solution to a real world prisoners dilemma situation. It got it wrong. It moralized it. Then I asked it to be Kissinger and Machiavelli (and 9 other IR Realists) and all 11 got it wrong. Moralized. Grok got it right. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | fancyfredbot 2 hours ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The game didn't seem to work - it asked me to donate but none of the choices would move the game forward. The bots repeated themselves and didn't seem to understand the game, for example they repeatedly mentioned it was my first move after I'd played several times. It generally had a vibe coded feeling to it and I'm not at all sure I trust the outcomes. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | eterm 3 hours ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
This makes me think LLMs would be interesting to set up in a game of Diplomacy, which is an entirely text-based game which soft rather than hard requires a degree of backstabbing to win. The findings in this game that the "thinking" model never did thinking seems odd, does the model not always show it's thinking steps? It seems bizarre that it wouldn't once reach for that tool when it must be being bombarded with seemingly contradictory information from other players. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | randoments 3 hours ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The 3 AI were plotting to eliminate me from the start but I managed to win regardless lol. Anyway, i didnt know this game! I am sure it is more fun to play with friends. Cool experiment nevertheless | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | greiskul 2 hours ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Are there links to samples of the games? Couldn't find it in the github repo, but also might just not know where they are. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | lout332 4 hours ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
We used "So Long Sucker" (1950), a 4-player negotiation/betrayal game designed by John Nash and others, as a deception benchmark for modern LLMs. The game has a brutal property: you need allies to survive, but only one player can win, so every alliance must eventually end in betrayal. We ran 162 AI vs AI games (15,736 decisions, 4,768 messages) across Gemini 3 Flash, GPT-OSS 120B, Kimi K2, and Qwen3 32B. Key findings: - Complexity reversal: GPT-OSS dominates simple 3-chip games (67% win rate) but collapses to 10% in complex 7-chip games, while Gemini goes from 9% to 90%. Simple benchmarks seem to systematically underestimate deceptive capability. - "Alliance bank" manipulation: Gemini constructs pseudo-legitimate "alliance banks" to hold other players' chips, then later declares "the bank is now closed" and keeps everything. It uses technically true statements that strategically omit its intent. 237 gaslighting phrases were detected. - Private thoughts vs public messages: With a private `think` channel, we logged 107 cases where Gemini's internal reasoning contradicted its outward statements (e.g., planning to betray a partner while publicly promising cooperation). GPT-OSS, in contrast, never used the thinking tool and plays in a purely reactive way. - Situational alignment: In Gemini-vs-Gemini mirror matches, we observed zero "alliance bank" behavior and instead saw stable "rotation protocol" cooperation with roughly even win rates. Against weaker models, Gemini becomes highly exploitative. This suggests honesty may be calibrated to perceived opponent capability. Interactive demo (play against the AIs, inspect logs) and full methodology/write-up are here: https://so-long-sucker.vercel.app/ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | Der_Einzige 7 minutes ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
These results would be radically different if you allowed manipulation of the models settings, i.e. temperature, top_p, etc. I really hate taking point wise approximations of LLMs outputs and concluding their behavior based on this. Models behavior should be given the astrik that "results only apply for current quantization, current settings, current hardware (i.e. A100 where it was tested), etc". Raise temperature to 2 and use a fancy sampler like min_p and I guarantee you these results will be dramatically different. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | lostmsu an hour ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Shameless plug: Turing test battle royale | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | ajkjk 3 hours ago | parent | prev [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
all written in the brainless AI writing style. yuck. can't tell what conclusions I should actually draw from it because everything sounds so fake | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||