skysniper · 4 hours ago
Both are shown on the battle detail page already. Time is shown in the Scores table, and the number of tokens is shown in the Cost details at the bottom of the Scores. (I figured most people just want to see cost in USD, so I put the token details at the bottom.)
johndough · 2 hours ago
I would have liked aggregated results instead; expanding 300 tables is a bit tiresome. But I guess that is easy with AI now. Here is a scatter plot of quality vs. duration (https://i.imgur.com/wFVSpS5.png) and quality vs. cost (https://i.imgur.com/fqM4edw.png).

But I just noticed that my plot is meaningless because it conflates model quality with provider uptime: Claude Haiku has a higher average quality than Claude Opus, which does not make sense. The explanation is that network errors were credited with a quality score of 0, and there were _a lot_ of network errors.
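A minimal sketch of how one might redo that aggregation with network-error runs dropped instead of zero-scored. The results.csv file and its column names (model, quality, duration_s, cost_usd, error) are my assumption, not the site's actual export format:

```python
# Aggregate per-model quality/duration/cost, excluding failed runs.
# Assumes a hypothetical results.csv export; column names are made up.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results.csv")

# Drop network-error runs rather than scoring them 0, so the average
# reflects model quality instead of provider uptime.
ok = df[df["error"].isna()]

agg = ok.groupby("model").agg(
    quality=("quality", "mean"),
    duration_s=("duration_s", "mean"),
    cost_usd=("cost_usd", "mean"),
)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(agg["duration_s"], agg["quality"])
ax2.scatter(agg["cost_usd"], agg["quality"])
for name, row in agg.iterrows():
    ax1.annotate(name, (row["duration_s"], row["quality"]))
    ax2.annotate(name, (row["cost_usd"], row["quality"]))
ax1.set(xlabel="mean duration (s)", ylabel="mean quality")
ax2.set(xlabel="mean cost (USD)", ylabel="mean quality")
plt.tight_layout()
plt.show()
```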
hadlock · 2 hours ago
Some kind of top-level metric like average tokens per task would be useful. E.g., yes, StepFun is 5% the price of Sonnet, but does it use 1x, 10x, or 1000x as many tokens to accomplish similar tasks (median per task)? For example, I am willing to eat a 20% quality dive from Sonnet if the token use is less than 10% more than Sonnet's. If token use is 1000x, that's something I want to know.
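The back-of-envelope arithmetic behind this, with illustrative multipliers (the 5% price and the 10x/1000x token figures come from the comment above; the model labels are placeholders, not measurements):

```python
# Effective cost per task = price per token relative to Sonnet
#                         * tokens used per task relative to Sonnet.
models = {
    "sonnet":  {"rel_price": 1.00, "rel_tokens": 1.0},
    "cheap":   {"rel_price": 0.05, "rel_tokens": 10.0},    # 5% price, 10x tokens
    "greedy":  {"rel_price": 0.05, "rel_tokens": 1000.0},  # 5% price, 1000x tokens
}

for name, m in models.items():
    rel_cost = m["rel_price"] * m["rel_tokens"]
    print(f"{name}: effective cost = {rel_cost:.2f}x Sonnet per task")

# "cheap" comes out at 0.50x Sonnet per task, but "greedy" at 50x:
# the per-token price alone hides exactly this difference.
```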