My fav coding benchmark for frontier models is to build a simple RTS game in one file (js/html/css). Claude Code with Opus 4.8 in ultracode mode nailed it, the best result so far:

https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v

The prompt was: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).

H3X_K1TT3N 2 minutes ago

Thanks for also sharing the prompt. I've been testing claude by asking it to make similar things, so it's useful to see what other people are doing.

I do find it interesting that the visual style is pretty similar to things it's produced for me.

apitman 3 hours ago

I like that benchmark. You should throw the results up on GitHub pages so people can try out the games.

brandly 2 hours ago

Yeah! Host on GitHub pages, so it's easy to click a link and play!

	senko an hour ago
		Great idea! I have a static server of my own, so here's my list (of all the tests I published so far): https://senko.net/vibecode-bench/

jclay 3 hours ago

It almost appears as if the code was minified. The variable names are short and the formatting looks like it was written to minimize whitespace. Did it write it in this compact format all on its own?

	senko 2 hours ago
		Yeah, it looks extremely compact. I didn't instruct it to use as few lines of code or characters as possible, or anything of the sort. Not sure why it did that. Its own rationale (which is highly suspect, but the only lead I have) is that it defaults to a dense style if it has to write a file in a single go. There may be a kernel of truth somewhere in there.
	andai 2 hours ago
		A friend sent me something he vibe coded which included a massive webassembly blob in the HTML file. My friend is not a programmer so he was not able to explain to me how it did that.

digdugdirk 2 hours ago

Do you have a collection of these benchmark apps saved anywhere? I'd be particularly interested in seeing the relative cost differences between different models in a use case like this.

	senko an hour ago
		I'm saving them all as gists here: https://gist.github.com/senko But I just vibe-coded a handy list of all the tests I did (unfortunately without the commentary I usually leave in social media posts -- I should add those at some point): https://senko.net/vibecode-bench/

elAhmo 3 hours ago

What is ultracode mode?

	senko 2 hours ago
		It's a combination of reasoning effort (max) + enabling a workflow that orchestrates multiple sub-agents. After some interrogation, here's how it organized the work:

		1. Design workflow (rts-game-design, 11 agents, ~13 min) ran first and produced SPEC.md + DESIGN.md:
		1.1. Proposals (3 parallel agents): each designed a complete RTS from a different philosophy.
		1.2. Judge (1 agent): evaluated all three and synthesized one unified design, committing to specific numbers (costs, HP, map size, etc.).
		1.3. Deep-dives (6 parallel agents): each wrote an implementation-ready spec for one subsystem, all consistent with the chosen design.
		1.4. Synthesis (1 agent): merged the design + all six subsystem specs into one conflict-free master spec.

		2. Code-review workflow (rts-code-review, 25 agents, ~5 min) ran after the main agent had written and tested the code:
		2.1. Review (6 agents, read-only Explore type): each scrutinized one dimension and returned structured findings.
		2.2. Verify (19 agents): every finding got its own skeptic agent told to try to refute it. Result: 19 flagged → 16 confirmed, 3 rejected as non-bugs.

		What the main agent did in the main loop:
		- Wrote all ~2,400 lines of index.html by hand from the spec.
		- All browser testing/debugging via headless Chrome (I told it to use rodney by @simonw, love the tool :)
		- Applied all 16 fixes from the review and re-verified them in the browser.
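
The two workflows described above follow a fan-out/fan-in shape: several agents run in parallel, then a single agent merges their output. Below is a minimal Python sketch of that shape, not the actual ultracode implementation. Everything in it is an assumption made for illustration: `run_agent` is a hypothetical stub that only records the call, `Finding.is_real_bug` is a made-up ground-truth flag standing in for the skeptic's verdict, and the findings are passed in rather than produced by the reviewers.

```python
import asyncio
from dataclasses import dataclass

calls: list[str] = []  # roles of every (stubbed) agent call, for counting


@dataclass
class Finding:
    claim: str
    is_real_bug: bool  # hypothetical ground truth used by the stub skeptic


async def run_agent(role: str, task: str) -> str:
    """Hypothetical stand-in for a sub-agent call; just labels the task."""
    calls.append(role)
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[{role}] {task}"


async def design_workflow(philosophies: list[str], subsystems: list[str]) -> str:
    # 1.1 Proposals: one agent per design philosophy, in parallel.
    proposals = await asyncio.gather(
        *(run_agent("proposer", p) for p in philosophies)
    )
    # 1.2 Judge: a single agent merges the proposals into one design.
    design = await run_agent("judge", " + ".join(proposals))
    # 1.3 Deep-dives: one agent per subsystem, each given the chosen design.
    specs = await asyncio.gather(
        *(run_agent("deep-dive", f"{s} :: {design}") for s in subsystems)
    )
    # 1.4 Synthesis: a single agent merges the design and subsystem specs.
    return await run_agent("synthesis", f"{design} + {len(specs)} specs")


async def verify(finding: Finding) -> bool:
    # 2.2 Verify: one skeptic per finding; the stub reads the ground truth.
    await run_agent("skeptic", finding.claim)
    return finding.is_real_bug


async def review_workflow(dimensions: list[str], findings: list[Finding]):
    # 2.1 Review: one reviewer per dimension, in parallel.
    await asyncio.gather(*(run_agent("reviewer", d) for d in dimensions))
    # Stub: findings are supplied by the caller instead of the reviewers.
    verdicts = await asyncio.gather(*(verify(f) for f in findings))
    confirmed = [f for f, ok in zip(findings, verdicts) if ok]
    rejected = [f for f, ok in zip(findings, verdicts) if not ok]
    return confirmed, rejected


if __name__ == "__main__":
    asyncio.run(design_workflow(["a", "b", "c"], [f"sub{i}" for i in range(6)]))
    print(len(calls), "agent calls in the design workflow")
```

With 3 proposers, 1 judge, 6 deep-dives and 1 synthesis agent, the design workflow makes 11 agent calls, and 6 reviewers plus 19 skeptics make 25, which matches the agent counts quoted in the comment.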
	colechristensen an hour ago
		Biases the model to solve problems with teams of agents
	tcoff91 2 hours ago
		it's a brand new mode

jryan49 2 hours ago

Kinda buggy, but impressive nonetheless. How long did it take?

	senko 2 hours ago
		It took 50 minutes and would be ~$20 in API costs (I'm on a Pro sub).

l3x4ur1n 3 hours ago

Played it to the end. Pretty neat!