kristianp 6 days ago
The size of that SWE-bench Verified prompt shows how much work went into tuning it to get the highest possible score for that model. A third party might switch to a model from a different provider before putting that much effort into tuning the prompt.