thefourthchime 5 hours ago

I like to ask "Make a pacman game in a single html page". No model has ever produced a decent game in one shot. My attempt with Gemini 3 was no better than with 2.5.

bitexploder 2 hours ago

Something else to consider: I often have much better success with something like "Create a prompt that creates a specification for a pacman game in a single html page. Consider edge cases and key implementation details that result in bugs." Then <take prompt>, execute prompt. That often yields a much better result than one generic prompt. Now that models are trained to generate prompts for themselves, this is quite productive. You can also ask it to implement everything in stages, write tests, and even evaluate its own tests! I know that isn't quite the same as "Implement pacman on an HTML page", but still, with very minimal human effort you can get the intended result.
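
A rough sketch of that two-step flow, in case it helps. `complete` here is a hypothetical stand-in for whatever function sends a prompt to your model and returns its text; it is an assumption for illustration, not any real library's API, and `two_stage_generate` is likewise just an illustrative name.

```python
from typing import Callable


def two_stage_generate(task: str, complete: Callable[[str], str]) -> str:
    """Ask the model to write a detailed spec prompt, then execute that prompt.

    `complete` is a hypothetical callable: prompt text in, model text out.
    """
    # Stage 1: have the model write the prompt (the specification request).
    meta_prompt = (
        f"Create a prompt that creates a specification for {task}. "
        "Consider edge cases and key implementation details that result in bugs."
    )
    generated_prompt = complete(meta_prompt)

    # Stage 2: feed the generated prompt back in and return the result.
    return complete(generated_prompt)
```

You would call it as `two_stage_generate("a pacman game in a single html page", complete)`. Staged implementation and test generation would just be further calls chained the same way.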

amelius an hour ago

I thought this kind of chaining was already part of these systems.

Workaccount2 3 hours ago

It made a working game for me (with a slightly expanded prompt), but the ghosts got trapped in the box after coming back from getting killed. A second prompt fixed it. The art and animation, however, were really impressive.

ofa0e 5 hours ago

Your benchmarks should not involve IP.

sowbug 4 hours ago

The only intellectual property here would be trademark. No copyright, no patent, no trade secret. Unless someone wants to market the test results as a genuine Pac-Man-branded product, or otherwise dilute that brand, there's nothing should-y about it.

bongodongobob 3 hours ago

It's not an ethics thing. It's a guardrails thing.

sowbug 2 hours ago

That's a valid point, though an average LLM would certainly understand the difference between trademark and other forms of IP. I was responding to the earlier comment, whose author later clarified that it represented an ethical stance ("stealing the hard work of some honest, human souls").

ComplexSystems 5 hours ago

Why? This seems like a reasonable task to benchmark on.

adastra22 5 hours ago

Because you hit guard rails.

ofa0e 5 hours ago

Sure, reasonable to benchmark on if your goal is to find out which companies are the best at stealing the hard work of some honest, human souls.

scragz 4 hours ago

correction: pacman is not a human and has no soul.

tomalbrc 2 hours ago

tech bros hate reality