porridgeraisin a day ago
> The fact that it was ever seriously entertained that a "chain of thought" was giving some kind of insight into the internal processes of an LLM bespeaks the lack of rigor in this field

This is correct. Lack of rigor, or rather no lack of overzealous marketing and investment-chasing :-)

> CoT improves results, sure. And part of that is probably because you are telling the LLM to add more things to the context window, which increases the potential of resolving some syllogism in the training data

The main reason CoT improves results is that the model simply does more computation that way. Complexity theory tells you that some computations need more time than others (provided, of course, that you have not already stored the answer partially or fully).

A neural network uses a fixed amount of compute to output a single token. Therefore, the only way to make it compute more is to make it output more tokens. CoT is just that. You blindly make it output more tokens and _hope_ that a portion of those tokens constitute useful computation in whatever latent space it is using to solve the problem at hand. Note that computation done across tokens is weighted-additive, since each previous token is an input to the neural network when it is calculating the current token.

This was confirmed as a good idea when DeepSeek-R1-Zero was trained from a base model using pure RL, and outputting more tokens turned out to be the path the optimization algorithm chose to take. A good sign, usually.
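To make the fixed-compute-per-token argument concrete, here is a toy sketch, not a model of any real LLM. Everything in it is a made-up assumption for illustration: the `STEPS_PER_TOKEN` budget, the `forward_pass` and `generate` helpers, and the choice of Collatz steps as the "work". The only point is that if one pass can do a bounded amount of sequential work, a longer computation has to be spread over more emitted tokens, each fed back in as input.

```python
STEPS_PER_TOKEN = 4  # hypothetical fixed compute budget of one forward pass


def step(x: int) -> int:
    """One unit of sequential work (a Collatz step, chosen arbitrarily)."""
    return x // 2 if x % 2 == 0 else 3 * x + 1


def reference(x: int, n_steps: int) -> int:
    """Ground truth: apply `step` n_steps times with no compute limit."""
    for _ in range(n_steps):
        x = step(x)
    return x


def forward_pass(state: int, remaining: int) -> tuple[int, int]:
    """Toy forward pass: at most STEPS_PER_TOKEN steps, then it must emit."""
    done = min(STEPS_PER_TOKEN, remaining)
    for _ in range(done):
        state = step(state)
    return state, remaining - done


def generate(x: int, n_steps: int, max_tokens: int) -> tuple[int, list[int]]:
    """Emit up to max_tokens tokens; each token is the intermediate state
    and becomes the input to the next forward pass."""
    tokens: list[int] = []
    state, remaining = x, n_steps
    for _ in range(max_tokens):
        state, remaining = forward_pass(state, remaining)
        tokens.append(state)
        if remaining == 0:
            break
    return state, tokens


if __name__ == "__main__":
    target = reference(27, 10)
    direct, _ = generate(27, 10, max_tokens=1)    # "answer immediately"
    cot, trace = generate(27, 10, max_tokens=8)   # "think out loud"
    print("target:", target)
    print("one token:", direct, "correct" if direct == target else "wrong")
    print("with intermediate tokens:", cot, "trace:", trace)
```

In this toy the intermediate tokens are guaranteed to be useful work. A real model has no such guarantee, which is the "hope" part of the argument.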