prodigycorp 5 hours ago

I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.

I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar as it encapsulates my coding preferences and communication style, it's the proper benchmark for me.

gregsadetsky 5 hours ago | parent [-]

I asked a semi-related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret, so it serves as an "actually real" test that was definitely withheld from the training of new LLMs?

I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that, generally speaking, LLMs tend to (I don't want to say cheat, but) overfit on known problems and then do poorly on anything they haven't seen?

Thanks

[0] https://news.ycombinator.com/item?id=45968665

adastra22 5 hours ago | parent [-]

> if it's not public, presumably LLMs would never get better at them.

Why? This is not obvious to me at all.

gregsadetsky 4 hours ago | parent [-]

You're correct, of course - LLMs may get better at any task, but I meant that publishing the evals might (optimistically speaking) help LLMs get better at that task - if the eval were actually picked up / used in the training loop.

adastra22 4 hours ago | parent [-]

That kind of “get better at” doesn’t generalize. It will regurgitate its training data, which now includes the exact answer being looked for. It will get better at answering that exact problem.

But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric.

Problem-solving ability largely does not come from the pretraining data.

gregsadetsky 4 hours ago | parent [-]

Yeah, great point.

I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and have a known, definitive answer). I guess this would be more valuable than publishing a fixed set of problems with known solutions. (And I take your point that in the end it might not matter, because it's still about problem solving, not just rote memorization.)
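
In case it helps, here's a rough sketch of what I mean by "dynamically generate" - the problem family (shortest path in a random DAG), the make_problem/score names, and the scoring scheme are all hypothetical placeholders, just to illustrate the shape of it. Each call produces a fresh instance plus a ground-truth answer computed by a deterministic solver, so there's no fixed answer key to memorize.

  import random

  def make_problem(rng: random.Random):
      # Hypothetical problem family: shortest path in a small random weighted DAG.
      # Every call yields a fresh instance plus its definitive answer.
      n = rng.randint(5, 9)
      edges = [(u, v, rng.randint(1, 9))
               for u in range(n) for v in range(u + 1, n)
               if rng.random() < 0.4]
      # Ground truth by relaxation in topological order (0..n-1 is already topological).
      INF = float("inf")
      dist = [INF] * n
      dist[0] = 0
      for u, v, w in sorted(edges):
          dist[v] = min(dist[v], dist[u] + w)
      answer = dist[n - 1] if dist[n - 1] < INF else None
      prompt = (f"A DAG has nodes 0..{n - 1} and weighted edges (u, v, weight): {edges}. "
                f"What is the length of the shortest path from node 0 to node {n - 1}? "
                "Reply with a single integer, or 'none' if it is unreachable.")
      return prompt, answer

  def score(model_reply: str, answer) -> bool:
      # Exact-match grading against the computed ground truth.
      expected = "none" if answer is None else str(answer)
      return model_reply.strip().lower() == expected

  rng = random.Random(42)   # fixed seed -> reproducible eval set
  problems = [make_problem(rng) for _ in range(100)]

That way even the generator code could be published without handing over a fixed answer key - fresh instances are cheap to produce.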