jansan 3 hours ago
So you believe one marketing department more than the other?
NitpickLawyer 3 hours ago
The Brits have a step-based benchmark that they use for this: https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5... The two seem pretty close, in both average and "best run" scores. And in a highly verifiable domain, "best run" or pass@n is what you're looking for.