echelon 3 hours ago
That's the prompt. Every existing model is given that prompt and compared side by side. You can generate a few such sentences for more samples. Alternatively, take the top ten Fortune 500 stock performers. The point is to use some easy signal that provides enough randomness, is easy to agree upon, and doesn't leave enough time to game. Teams can also pre-generate candidate problems from that kind of signal and try to improve across the board, but they won't have the exact questions on test day.
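Roughly what I have in mind, as a minimal Python sketch rather than a real harness. The function name `pick_prompts`, the ticker strings, and the candidate pool are all made up for illustration; the only idea taken from above is that a publicly agreed signal, known only on test day, decides which questions get asked.

```python
import hashlib
import random


def pick_prompts(signal, candidates, k):
    """Deterministically sample k prompts, seeded by a public signal.

    `signal` is any list of strings everyone can agree on at test time
    (hypothetically, the day's top-performing tickers). `candidates` is a
    pool of pre-generated problems. Both are assumptions for this sketch.
    """
    # Normalize so every party derives the same seed from the same list.
    canonical = ",".join(s.strip().upper() for s in signal)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")

    # A private RNG instance avoids touching global random state.
    rng = random.Random(seed)
    return rng.sample(candidates, k)


# Placeholder values only; these are not real tickers or real prompts.
todays_signal = ["AAA", "BBB", "CCC"]
pool = [f"candidate problem {i}" for i in range(20)]
chosen = pick_prompts(todays_signal, pool, 3)
```

Anyone with the same signal and the same pool gets the same selection (assuming the same Python version, since the sampling algorithm isn't guaranteed stable across releases), but nobody can know the selection before the signal exists.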