wolttam 2 days ago
Why would your test include scores from failed responses/runs? That seems confusing. (I am confused by the results your website is presenting.)
XCSme 2 days ago | parent
Because the idea of those benchmarks is to see how well a model performs in real-world scenarios, where most models are served via APIs, not self-hosted. For example, if a hypothetical GPT-5.5 were super intelligent but its API failed 50% of the time, using it in real-life scenarios would make your workflows fail far more often than using a "dumber" but more stable model. My plan is also to re-test models over time, which should account for infrastructure improvements and also test for model "nerfing".
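A minimal sketch of that scoring rule, with all names hypothetical (call_model and grade stand in for whatever API call and grader the benchmark actually uses): a failed API call simply contributes a zero to the average, so an unreliable model is penalized against a stable one.

    import random
    import statistics

    def call_model(model: str, task: str) -> str:
        # Placeholder: a real implementation would call the provider's API
        # and raise on timeouts or 5xx errors.
        if random.random() < 0.1:
            raise RuntimeError("simulated API failure")
        return "model output"

    def grade(task: str, response: str) -> float:
        # Placeholder grader returning a 0.0-1.0 quality score.
        return 1.0

    def run_benchmark(model: str, tasks: list[str]) -> float:
        """Average score across all runs; a failed API call scores 0,
        so infrastructure reliability directly affects the result."""
        scores = []
        for task in tasks:
            try:
                scores.append(grade(task, call_model(model, task)))
            except Exception:
                scores.append(0.0)  # failed run counts against the model
        return statistics.mean(scores)

    print(run_benchmark("gpt-5.5", ["task-1", "task-2", "task-3"]))

Under this rule, a model that answers perfectly but fails half its calls averages roughly 0.5, below a stable model that scores 0.7 on every call.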