plagiarist 2 hours ago
IMO it would need a third party running the LLM anyway. Otherwise the evaluated company could notice it's receiving the same requests every day and detect the benchmarking that way.
jabedude 42 minutes ago
But that's removing a component that's critical for the test. We as users/benchmark consumers care that the service as provided by Anthropic/OpenAI/Google is consistent over time given the same model/prompt/context.