athrowaway3z 6 hours ago
This benchmark inspired me to have Codex/Claude build a DnD battlemap tool with SVGs. They got surprisingly far, but I did need to iterate a few times to have them build tools that would check for things like: don't put walls on roads or water. What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context, compaction, etc. As a next benchmark you could try giving the task to one agent and telling it to use a coding agent (via tmux) to build you a pelican.
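For anyone wondering what a "don't put walls on roads or water" check could look like, here is a minimal sketch. It assumes the map is reduced to a grid of tiles keyed by `(x, y)`, which is a simplification and not necessarily how the SVG tool represents things; `Wall`, `BLOCKED_FOR_WALLS`, and `find_wall_violations` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical terrain labels; a real SVG-based tool may model terrain differently.
BLOCKED_FOR_WALLS = {"road", "water"}


@dataclass(frozen=True)
class Wall:
    x: int
    y: int


def find_wall_violations(terrain, walls):
    """Return (wall, terrain_type) pairs where a wall sits on a forbidden tile.

    `terrain` maps (x, y) to a terrain label. Tiles missing from the map are
    treated as "grass" (an assumption made for this sketch).
    """
    violations = []
    for wall in walls:
        kind = terrain.get((wall.x, wall.y), "grass")
        if kind in BLOCKED_FOR_WALLS:
            violations.append((wall, kind))
    return violations


terrain = {(0, 0): "road", (1, 0): "grass", (2, 0): "water"}
walls = [Wall(0, 0), Wall(1, 0), Wall(2, 0)]

for wall, kind in find_wall_violations(terrain, walls):
    print(f"wall at ({wall.x}, {wall.y}) overlaps {kind}")
```

The point of a checker like this is that the agent can run it after each edit and get a concrete list of violations to fix, instead of relying on it to eyeball the rendered map.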