I guess there are two things I'm still stuck on:

1. What is the purpose of the benchmark?

2. What is the purpose of publicly discussing a benchmark's results but keeping the methodology secret?

To me it's in the same spirit as claiming to have defeated AlphaZero but refusing to share the game.

1. The purpose of the benchmark is to choose what models I use for my own system(s). This is extremely common practice in AI - I think every company I've worked with doing LLM work in the last 2 years has done this in some form.

2. I discussed that up-thread, but https://github.com/microsoft/private-benchmarking and https://arxiv.org/abs/2403.00393 discuss some further motivation for this if you are interested.

> To me it's in the same spirit as claiming to have defeated AlphaZero but refusing to share the game.

This is an odd way of looking at it. There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.

grog454 4 days ago | parent [-]

I see the potential value of private evaluations. They aren't scientific, but they can certainly beat a "vibe test".

I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.

> There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.

Then you must not be working in an environment where a better benchmark yields a competitive advantage.

eru 4 days ago | parent [-]

> I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.

In principle, we have ways: if nl's reports consistently predict how public benchmarks will turn out later, they can build up a reputation. Of course, that requires that we follow nl around for a while.