mohsen1 2 hours ago

Hope you don't mind the unrelated question:

How do you pay for those SWE-bench runs?

I am trying to run a benchmark, but it is too expensive to do enough runs to get a fair comparison.

https://mafia-arena.com

ofirpress 2 hours ago | parent [-]

Benchmarks can get costly to run. You can reach out to frontier model creators to try to get free credits, but usually they'll only agree to that once your benchmark is pretty popular.

Dolores12 2 hours ago | parent | next [-]

so basically they know requests using your API key should be treated with care?

Deklomalo 2 hours ago | parent [-]

[dead]

epolanski 2 hours ago | parent | prev | next [-]

The last thing a proper benchmark should do is reveal its own API key.

sejje an hour ago | parent | next [-]

That's a good thought I hadn't had, actually.

plagiarist 32 minutes ago | parent | prev [-]

IMO it should need a third party running the LLM anyway. Otherwise the evaluated company could notice they're receiving the same requests daily and discover benchmarking that way.

mohsen1 2 hours ago | parent | prev [-]

Yes, I reached out to them, but as you say, it's a chicken-and-egg problem.

Thanks!