▲ Show HN: A real-time strategy game that AI agents can play (llmskirmish.com)
85 points by __cayenne__ 3 hours ago | 30 comments
I've liked all the projects that put LLMs into game environments. It's been a weird juxtaposition, though: frontier LLMs can one-shot full coding projects, and those same models struggle to get out of Pokémon Red's Mt. Moon. Because of this, I wanted to create a game environment that put this generation of frontier LLMs' top skill, coding, on full display.

Ten years ago, a team released a game called Screeps. It was described as an "MMO RTS sandbox for programmers." The Screeps paradigm of writing code and having it executed in a real-time game environment is well suited to LLMs. Drawing on a version of the Screeps open source API, LLM Skirmish pits LLMs head-to-head in a series of 1v1 real-time strategy games.

In my testing I found that Claude Opus 4.5 was the most dominant model, but it showed weakness in round 1 as it was overly focused on its in-game economy. Meanwhile, I probably spent a third of all code on sandbox hardening because GPT 5.2 kept trying to cheat by pre-reading its opponent's strategies. If there's interest, I'm planning on doing a round of testing with the latest generation of LLMs (Claude Opus 4.6, GPT 5.3 Codex, etc.).

You can run local matches via CLI. I'm running a hosted match runner on Google Cloud Run that uses isolated-vm. The match playback visualizer is statically served from Cloudflare. I've created a community ladder that you can submit strategies to via CLI, no auth required. I've found that the CLI plus the skill.md that's available has been enough for AI agents to immediately get started.

Website: https://llmskirmish.com
API docs: https://llmskirmish.com/docs
GitHub: https://github.com/llmskirmish/skirmish
A video of a match: https://www.youtube.com/watch?v=lnBPaZ1qamM
▲ david3289 16 minutes ago | parent | next [-]
This is a really interesting direction. RTS games are a much better testbed for agent capability than most static benchmarks because they combine partial observability, long-term planning, resource management, and real-time adaptation.

It reminds me a bit of OpenAI Five: not just because it played a complex game, but because the real value wasn't "AI plays Dota," it was observing how coordination, strategy formation, and adaptation emerged under competitive pressure. A controlled RTS environment like this feels like a lightweight, reproducible version of that idea.

What I especially like here is that it lowers the barrier for experimentation. If researchers and hobbyists can plug different models into the same competitive sandbox, we might start seeing meaningful AI-vs-AI evaluations beyond static leaderboards. Competitive dynamics often expose weaknesses much faster than isolated benchmarks do.

Curious whether you're planning to support self-play training loops or if the focus is primarily on inference-time agents?
▲ EwanG 2 hours ago | parent | prev | next [-]
At least until one of the competitors is overheard saying "A strange game. The only winning move is not to play"
▲ wongarsu 3 hours ago | parent | prev | next [-]
I know visualization is far from the most important goal here, but it really gets me how there's fairly elaborately rendered terrain, and then the units are just unnamed roombas with hard-to-read status indicators that have no intuitive meaning. Even in the match viewer I have no clue what's going on; there is no overlay or tooltip when you hover or click units either. There is a unit list that tries (and mostly fails) to give you some information, but because units don't have names you have to hover them in the list to have them highlighted in the field (the reverse does not work). Not exactly a spectator sport. Oh, but there is a way to switch from having all units in one sidebar to having one sidebar per player, as if that made a difference.

I find this pretty funny because it seems like a perfect representation of what's easy with today's tools and what isn't. Love the idea though.
▲ mpeg 22 minutes ago | parent | prev | next [-]
What a day to be alive. I just watched Gemini zergling rush Opus, and Opus got completely overwhelmed. Opus needs to learn to kite.
▲ arscan an hour ago | parent | prev | next [-]
Reminds me of the "Google AI Challenge" from 2011 called Ants [1], except the 'AI' is implemented using 'AI' now instead of human programmers. I was proud of having the highest-ranked JavaScript-based implementation, but got absolutely crushed by the eventual winner.
▲ mitchm an hour ago | parent | prev | next [-]
I've also been exploring this idea. What if you could bring your own (or pull in a third-party) "CPU player" into a game? An LLM-friendly API exposing a snapshot of game state, calculated heuristics, legal moves, and varying levels of strategy hints is working out nicely. They can play a web-based game via curl.
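A minimal sketch of what such a snapshot-plus-legal-moves contract could look like, with a trivial "CPU player" on top. Every field and endpoint name here is hypothetical, invented for illustration; the key property is that an agent only ever chooses from the server-provided legal set, so illegal actions are impossible by construction.

```javascript
// Hypothetical snapshot + legal-moves contract (all names illustrative).
// What one GET /state response might look like:
const snapshot = {
  turn: 12,
  heuristics: { materialBalance: 3, threatLevel: 'low' },
  legalMoves: [
    { id: 'm1', kind: 'move', unit: 'u7', to: [4, 5] },
    { id: 'm2', kind: 'attack', unit: 'u7', target: 'e2' },
    { id: 'm3', kind: 'pass' },
  ],
};

// A trivial CPU player: prefer attacks, otherwise take the first legal move.
function chooseMove(snap) {
  return snap.legalMoves.find((m) => m.kind === 'attack') ?? snap.legalMoves[0];
}

const chosen = chooseMove(snapshot);
// The chosen move id is what an agent would POST back, e.g.:
//   curl -X POST https://example.invalid/act -d '{"move":"m2"}'
```

An LLM agent slots into the same loop: it reads the snapshot and heuristics, and its output is constrained to one of the listed move ids.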
▲ egeozcan 3 hours ago | parent | prev | next [-]
This is amazing. What I do is something else: I make AI agents develop AI scripts (good ol' computer player scripts) and try to beat each other: https://egeozcan.github.io/unnamed_rts/game/

I occasionally run my tournament script: https://github.com/egeozcan/unnamed_rts/blob/main/src/script...

That calculates the Elo ratings for each AI implementation, and I feed them to different agents so they get really creative trying to beat each other. Also, making rule changes to the game and seeing how some scripts get weaker/stronger is a nice way to measure balance. Funny thing, Codex gets really aggressive and starts cheating a lot of the time: https://bsky.app/profile/egeozcan.bsky.social/post/3mfdtj5dh...
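For reference, the standard Elo update such a tournament script might compute looks like this. This is the textbook formula, not necessarily what the linked script does; K = 32 is a common default, and `scoreA` is 1 for a win, 0.5 for a draw, 0 for a loss.

```javascript
// Standard Elo rating update (logistic model, K-factor 32 by default).
// A sketch of per-strategy rating bookkeeping, not the linked script's code.
function eloUpdate(ratingA, ratingB, scoreA, k = 32) {
  // Expected score for A: probability A wins under the Elo model.
  const expectedA = 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
  const newA = ratingA + k * (scoreA - expectedA);
  const newB = ratingB + k * ((1 - scoreA) - (1 - expectedA));
  return [newA, newB];
}

// Two equally rated strategies, A wins: A gains 16 points, B loses 16.
const [a, b] = eloUpdate(1500, 1500, 1);
```

Running round-robin matches through this update converges to a ranking that, as noted above, shifts measurably when the game's rules change, which is what makes it usable as a balance probe.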
▲ Lerc an hour ago | parent | prev | next [-]
It would be interesting to get the agents to write code to preprocess the logs and generate systems to analyse the outputs. Maybe they are already doing this? Are there logs of the models' thinking?
▲ PeterUstinox 3 hours ago | parent | prev | next [-]
Wouldn't it be interesting if the LLMs wrote real-time RTS commands instead of code? After all, it is an RTS game. That would add another dimension: token quality on one axis (RTS language: decision making) and token speed on the other (RTS language: actions per minute, APM). Also, there are already a lot of coding benchmarks; this would test something more abstract, similar to AlphaStar https://en.wikipedia.org/wiki/AlphaStar_(software)

You could just use the exposed APIs of OpenAI, Anthropic, etc. and let them battle.
▲ ph4rsikal 2 hours ago | parent | prev | next [-]
Reminds me of this fantastic series on game theory and agent reasoning: https://jdsemrau.substack.com/p/nemotron-vs-qwen-game-theory...
▲ busfahrer 2 hours ago | parent | prev | next [-]
This reminds me of the yearly StarCraft AI competition (running since 2010), though I think it uses a special API that makes it easy for bots to access the game. Edit: Forgot link: https://davechurchill.ca/starcraft/
▲ myky22 2 hours ago | parent | prev | next [-]
Love it! I have a similar intuition from my use of Gemini (3 and 3.1): great at "turn 1" tasks, but it degrades faster than Opus or GPT.
▲ cahaya 2 hours ago | parent | prev | next [-]
Nice. Curious about 5.3-codex-high results.
▲ GlacierFox 32 minutes ago | parent | prev | next [-]
"I've liked all the projects that put LLMs into game environments." I haven't. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ datawars 3 hours ago | parent | prev | next [-]
Great project! It would be interesting to have a meta layer of AIs betting on the player LLMs.
▲ hmontazeri 3 hours ago | parent | prev | next [-]
This is actually fun to watch :D
▲ dakolli 2 hours ago | parent | prev | next [-]
Yay, I love how we just keep coming up with magic tricks, like toddlers playing with velcro. These magic tricks do nothing but convince people who don't know any better that LLMs are the real deal, when they simply aren't. This is just free propaganda for Anthropic and OpenAI, who will leverage these (useless) capabilities to convince your boss to give your salary to them, or at least a substantial portion of it.
▲ xanth 3 hours ago | parent | prev [-]
Now I'd love to see if fast > smart over time with Mercury 2.