ofirpress 3 hours ago

Benchmarks can get costly to run. You can reach out to frontier model creators and ask for free credits, but usually they'll only agree to that once your benchmark is already fairly popular.

Dolores12 3 hours ago

So basically they know that requests made with your API key should be treated with care?

epolanski 3 hours ago

The last thing a proper benchmark should do is reveal its own API key.

sejje 3 hours ago

That's a good thought I hadn't had, actually.

plagiarist 2 hours ago

IMO a benchmark should have a third party run the LLM anyway. Otherwise the company being evaluated could notice they're receiving the same requests daily and detect the benchmarking that way.

jabedude 44 minutes ago

But that removes a component that's critical for the test. As users/benchmark consumers, we care that the service as provided by Anthropic/OpenAI/Google is consistent over time given the same model/prompt/context.

mohsen1 3 hours ago

Yes, I reached out to them, but as you say it's a chicken-and-egg problem.

Thanks!