Remix.run Logo
Aerroon 5 hours ago

I think the workflows can be really interesting to read about. The other week I read a reddit post how someone got Qwen3.5 35B-A3B to go from 22.2% on the 45 hard problems of swebench-verified to 37.8% (opus 4.6 gets 40%).

All they essentially did was tell the LLM to test and verify whether the answer is correct with a prompt like the following:

>"You just edited X. Before moving on, verify the change is correct: write a short inline python -c or a /tmp test script that exercises the changed code path, run it with bash, and confirm the output is as expected."

Now whether this is true, I don't know, but I think talking about this kind of stuff is cool!