charcircuit 2 hours ago
A pretty simple one would be to have every model try to one-shot every ticket your company has, and then measure each model's acceptance rate.
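A minimal sketch of that measurement, assuming each attempt is logged as a (model, ticket, accepted) record. The record shape, the model names, and the ticket IDs are made up for illustration, and "accepted" stands for whatever your review process counts as a merged or approved result.

```python
from collections import defaultdict

def acceptance_rates(results):
    """Return {model: fraction of attempts accepted}.

    `results` is an iterable of (model, ticket_id, accepted) tuples.
    This record shape is an assumption, not a real tool's format.
    """
    attempts = defaultdict(int)
    accepted = defaultdict(int)
    for model, _ticket_id, ok in results:
        attempts[model] += 1
        accepted[model] += int(ok)
    return {model: accepted[model] / attempts[model] for model in attempts}

# Hypothetical log: two models, one attempt each on two tickets.
results = [
    ("model_a", "T-1", True),
    ("model_a", "T-2", False),
    ("model_b", "T-1", True),
    ("model_b", "T-2", True),
]
print(acceptance_rates(results))  # {'model_a': 0.5, 'model_b': 1.0}
```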
sam_goody 2 hours ago
Except that if you tried one-shotting the same ticket twenty times, at different hours of the day and on different days of the week, you would see enough variation to fill a benchmark even if you used the same model every time. Much more so if you fiddled with the thinking settings or changed the prompt. That is because the models are non-deterministic, because they are constantly updated and changed, and because they are throttled according to the number of users, releases, etc.
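One rough way to see how much of a gap could be plain run-to-run noise is to put an interval around each acceptance rate. The sketch below uses a Wilson score interval, which is my choice of technique rather than anything proposed in the thread, and the counts (12 and 15 accepted out of 20 runs) are invented. It also assumes the runs are independent with a fixed success probability; if behaviour drifts with updates or load, as argued above, the real spread is probably wider than this.

```python
import math

def wilson_interval(accepted, attempts, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = accepted / attempts
    denom = 1 + z * z / attempts
    centre = (p + z * z / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts
                         + z * z / (4 * attempts * attempts)) / denom
    return centre - half, centre + half

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical counts: the same ticket set, 20 runs per configuration.
run_a = wilson_interval(12, 20)  # 60% accepted
run_b = wilson_interval(15, 20)  # 75% accepted
print(run_a, run_b, intervals_overlap(run_a, run_b))
```

With only twenty runs each, a 60% versus 75% acceptance rate gives overlapping intervals, so that gap alone would not separate two models, or two runs of the same model.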