A pretty simple one would be to have every model try to one-shot every ticket your company has, then measure each model's acceptance rate.
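A rough sketch of that tally, assuming each attempt is logged as a (model, accepted) pair. The model names and outcomes below are made up for illustration and are not from any real run.

```python
from collections import defaultdict

def acceptance_rates(results):
    """Return {model: fraction of attempts accepted}.

    `results` is an iterable of (model_name, accepted) pairs,
    one pair per one-shot attempt at a ticket.
    """
    attempts = defaultdict(int)
    accepted = defaultdict(int)
    for model, ok in results:
        attempts[model] += 1
        accepted[model] += int(ok)
    return {model: accepted[model] / attempts[model] for model in attempts}

# Hypothetical log: two models, two tickets each.
results = [
    ("model_a", True),
    ("model_a", False),
    ("model_b", True),
    ("model_b", True),
]
rates = acceptance_rates(results)
print(rates)
```

With only a handful of tickets per model the rates are very noisy, so a comparison like this only means something over a large ticket set.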

▲ sam_goody 2 hours ago | parent [-]

Except that if you tried one-shotting the same ticket twenty times, at different hours of the day and on different days of the week, you would see enough variation in the results to pass for a benchmark difference, even if you used the same model every time. Even more so if you fiddled with the thinking settings or changed the prompt.

That's because the models are non-deterministic, because they are constantly updated and changed, and because they are throttled according to the number of users, releases, etc.

	▲	serial_dev an hour ago | parent [-]
		You never get "the same" Steph Curry either: he might be tired, annoyed by a fan, getting older... but if he and I each threw 100 3-pointers, we could all correctly guess who would perform better.
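To put a number on that intuition, here is a small simulation sketch. The success rates (0.45 and 0.10) and the per-shot jitter are invented for illustration, not real shooting or model statistics; the jitter stands in for "tired, annoyed, throttled" style noise.

```python
import random

def shoot(base_rate, shots, rng, jitter=0.10):
    """Count makes when each shot's success probability wobbles around base_rate."""
    makes = 0
    for _ in range(shots):
        # Per-shot noise, clamped to a valid probability.
        p = min(1.0, max(0.0, base_rate + rng.uniform(-jitter, jitter)))
        makes += rng.random() < p
    return makes

def win_fraction(strong, weak, shots=100, contests=2000, seed=0):
    """Fraction of contests in which the stronger shooter makes strictly more shots."""
    rng = random.Random(seed)
    wins = sum(
        shoot(strong, shots, rng) > shoot(weak, shots, rng)
        for _ in range(contests)
    )
    return wins / contests

# Hypothetical rates: a big skill gap over 100 shots vs. a small gap over 5 shots.
big_gap = win_fraction(0.45, 0.10, shots=100)
small_gap = win_fraction(0.45, 0.40, shots=5)
print(big_gap, small_gap)
```

Under these assumptions the large gap survives the noise almost every time, while a small gap measured over a few attempts does not, which is the case where run-to-run variation really does swamp the comparison.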