JB_5000 2 hours ago
Interesting benchmark, but worth noting the methodology: skills are generated before the task, with no feedback loop. In practice, useful skills tend to emerge from doing: you attempt the task, observe what failed, then codify what worked (generate → execute → observe → refine). The paper tests cold generation, which is a different and less realistic setup.
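
To make the distinction concrete, here's a rough Python sketch of the generate → execute → observe → refine loop. Everything in it is hypothetical and not from the paper: `refine_skill`, the `max_rounds` budget, and the toy `toy_generate` / `toy_execute` stand-ins (a real setup would presumably call a model and a task environment there). With `max_rounds=1` it collapses to cold generation, since the generator never sees any feedback.

```python
from typing import Callable, List, Tuple


def refine_skill(
    generate: Callable[[List[str]], str],
    execute: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Generate -> execute -> observe -> refine until success or the budget runs out.

    Returns (last skill tried, whether it succeeded, rounds used).
    """
    feedback: List[str] = []  # observations collected from failed attempts
    skill, ok = "", False
    for round_no in range(1, max_rounds + 1):
        skill = generate(feedback)         # round 1 sees no feedback (cold generation)
        ok, observation = execute(skill)   # attempt the task with the current skill
        if ok:
            return skill, True, round_no
        feedback.append(observation)       # keep what failed for the next round
    return skill, ok, max_rounds


# Toy stand-ins (assumptions for illustration only).
def toy_generate(feedback: List[str]) -> str:
    # Adds a retry step only after a timeout has been observed.
    return "fetch with retry" if any("timeout" in f for f in feedback) else "fetch"


def toy_execute(skill: str) -> Tuple[bool, str]:
    if "retry" in skill:
        return True, "ok"
    return False, "timeout on first request"


if __name__ == "__main__":
    print(refine_skill(toy_generate, toy_execute))                # feedback loop
    print(refine_skill(toy_generate, toy_execute, max_rounds=1))  # cold generation
```

In this toy case the cold run fails and the looped run succeeds on the second round, but that's an artifact of how the stubs are written, not evidence about the benchmark.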