FromTheFirstIn 4 hours ago
And sitting right next to the data and compute terms in every cross-entropy loss equation is the entropy of the language, which is just a fixed constant. That puts a hard floor under cross-entropy loss training, and I never hear it come up!
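To make the floor concrete, here is a minimal sketch using the standard identity that cross-entropy equals the entropy of the true distribution plus a non-negative KL term. The four-token distribution `p` and the `mix` helper are made up for illustration; nothing here is measured from a real model.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats: expected loss when predicting q while p is true."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def mix(p, q, t):
    """Hypothetical 'model' that moves from q (t=0) to the true p (t=1)."""
    return [(1 - t) * qi + t * pi for pi, qi in zip(p, q)]

# Toy "true" next-token distribution (an assumption, not real language data).
p = [0.5, 0.25, 0.125, 0.125]
uniform = [0.25] * 4

floor = entropy(p)
losses = [cross_entropy(p, mix(p, uniform, t)) for t in (0.0, 0.5, 0.9, 1.0)]

for t, loss in zip((0.0, 0.5, 0.9, 1.0), losses):
    # The gap (loss - floor) is the KL divergence; it shrinks to 0 but never goes negative.
    print(f"t={t:.1f}  loss={loss:.4f}  gap above entropy={loss - floor:.4f}")
```

However good the model gets, the loss only approaches `entropy(p)`; the part that training can reduce is the gap above it.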
aspenmartin 2 hours ago
Right, but that is context dependent: it drops with context length, depends on the tokenizer, etc. It doesn't end up being super relevant, even though the loss for real models is relatively large in absolute terms. That doesn't really matter, though: all of the interesting stuff happens once you start getting closer and closer to that entropy. By then you've gotten past all of the easy tokens that dominate the entropy, and you get to the really challenging ones that we care about (e.g. very difficult reasoning about a next step).
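The "drops with context length" point can be illustrated with a rough sketch: a plug-in estimate of the per-character conditional entropy of a toy string, given the previous `k` characters. The string and the `conditional_entropy` helper are assumptions for illustration only; this is not an estimate of the entropy of any real language, and it says nothing about tokenizer effects.

```python
import math
from collections import Counter

def conditional_entropy(text, k, max_k):
    """Empirical H(next char | previous k chars) in bits.

    Every k is scored on the same positions (i >= max_k), so the estimates
    are marginals of one empirical distribution and cannot rise with k.
    """
    pairs = Counter((text[i - k:i], text[i]) for i in range(max_k, len(text)))
    contexts = Counter()
    for (ctx, _), n in pairs.items():
        contexts[ctx] += n
    total = sum(pairs.values())
    return -sum(n / total * math.log2(n / contexts[ctx])
                for (ctx, _), n in pairs.items())

# Toy corpus (made up); real text would need far more data for a meaningful estimate.
text = "the cat sat on the mat and the cat ate the rat " * 20
MAX_K = 4
h = [conditional_entropy(text, k, MAX_K) for k in range(MAX_K + 1)]

for k, bits in enumerate(h):
    print(f"context={k} chars  entropy={bits:.3f} bits/char")
```

The constant in the loss equation is therefore only constant for a fixed setup: change how much context the model sees and the floor itself moves.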