▲ | anorwell 4 days ago | |
Some of the comments so far seem to be misunderstanding this submission. As I understand it: 1. Custom scaffolding (system prompt and tools) using Qwen3-32B achieved 13.75% on Terminal-Bench. No training was involved. 2. The author has built an RL system, but it has not been used for anything due to cost limitations. So there's actually no result related to training here. It is well known that the scaffolding used can have a large impact on benchmark outcomes (the Terminal-Bench leaderboard also demonstrates this [1]). | ||
▲ | esafak 4 days ago | parent [-] | |
It looks like the submission has two aspects that are being conflated: 1. Tooling for training a terminal agent. 2. An agent that was _not_ trained with this tooling but was instead prompt-engineered. I could not find the author's discussion on this point. |