nl 4 days ago
I have a bunch of private benchmarks that I run against new models I'm evaluating. The reason I don't disclose them generally isn't that I think some individual is going to read my post and update a model to include them. It's that if I write "I ask question X and expect Y", that data ends up in the training corpus of new LLMs.

However, one set of my benchmarks is a more generalized type of test (think a parlor-game type of thing) that actually works quite well. That set is the kind of thing that could be learned very well via reinforcement learning, and just mentioning it could be enough for a training company or data provider to try it. You can generate thousands of verifiable tests, potentially with verifiable reasoning traces, quite easily.
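To make the "thousands of verifiable tests" point concrete, here is a minimal sketch. It is not the undisclosed benchmark: it uses Mastermind-style peg scoring purely as a stand-in parlor game, and every name in it (`COLORS`, `score`, `make_case`, `verify`) is hypothetical. The idea is only that a deterministic scorer gives you an answer key, and a simple per-position log gives you a checkable trace.

```python
import random
from collections import Counter

COLORS = "RGBYOP"  # hypothetical peg colors for the stand-in game


def score(secret: str, guess: str) -> tuple[int, int]:
    """Return (exact, partial) matches, Mastermind-style."""
    exact = sum(s == g for s, g in zip(secret, guess))
    # Multiset overlap counts every shared color, exact matches included.
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return exact, overlap - exact


def make_case(rng: random.Random, length: int = 4) -> dict:
    """Generate one test case with a ground-truth answer and a step trace."""
    secret = "".join(rng.choice(COLORS) for _ in range(length))
    guess = "".join(rng.choice(COLORS) for _ in range(length))
    exact, partial = score(secret, guess)
    trace = [
        f"position {i}: {s} vs {g} -> {'exact' if s == g else 'not exact'}"
        for i, (s, g) in enumerate(zip(secret, guess))
    ]
    return {
        "prompt": f"Secret {secret}, guess {guess}: how many exact and partial matches?",
        "secret": secret,
        "guess": guess,
        "answer": (exact, partial),
        "trace": trace,
    }


def verify(case: dict, model_answer: tuple[int, int]) -> bool:
    """Check a model's answer by recomputing the score from scratch."""
    return score(case["secret"], case["guess"]) == tuple(model_answer)


rng = random.Random(0)  # fixed seed so the generated set is reproducible
cases = [make_case(rng) for _ in range(1000)]
```

Because `verify` recomputes the answer instead of trusting the stored one, the same function could serve as a reward signal, which is the reason a game like this is easy to train against once someone knows it exists.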
grog454 4 days ago
OK, but then your "post" isn't scientific by definition, since it cannot be verified. "Post" is in quotes because I don't know what you're trying to do, but you're implying some sort of public discourse. For fun: https://chatgpt.com/s/t_694361c12cec819185e9850d0cf0c629