zarzavat 2 hours ago
It makes sense to me. Originally LLMs would get stuck in infinite loops, generating tokens forever. This was bad, so we trained them to strongly prefer stopping once they reached the end of their answer. But training models to stop also gave them "laziness": they may now prefer a short answer over a meandering one that actually answers the user's question.

Mathematics is unusual in that it has an external source of truth (the proof assistant), and in that it requires long, meandering thinking that explores many dead ends. This is in tension with what models have been trained to do, so giving them some encouragement keeps them in the right state to actually attempt the problem.