▲ dTal 2 hours ago
Yeah, but not all tokens are created equal. Some tokens are hard to predict and thus encode useful information; some are highly predictable and therefore don't. Spending an entire forward pass through the token-generation machine just to generate a very low-entropy token like "is" is wasteful. The LLM doesn't get to "remember" that thinking; it just gets to see a trivial grammar-filling token that a very dumb LLM could just as easily have produced. They aren't steganographically hiding useful computation state in words like "the" and "and".
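The "low-entropy token" intuition can be made precise with Shannon entropy over the next-token distribution. A minimal sketch (the distributions below are made up for illustration, not taken from any real model):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distribution after "The capital of France ...":
# "is" is almost certain, so the token carries little information.
filler = [0.97, 0.01, 0.01, 0.01]

# Hypothetical distribution where four continuations are equally likely:
# the chosen token resolves real uncertainty.
content = [0.25, 0.25, 0.25, 0.25]

print(token_entropy(filler))   # ~0.24 bits: nearly free to predict
print(token_entropy(content))  # 2.0 bits: genuinely informative
```

Under this view, a forward pass spent emitting the near-deterministic token buys the model almost no new information, which is the waste the comment is pointing at.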
▲ Chance-Device 2 hours ago
> They aren't steganographically hiding useful computation state in words like "the" and "and".

Do you know that's true? These aren't just tokens; they're tokens with specific position encodings, preceded by specific context. The position as a whole is a lot richer than you make it out to be. I think this is probably an unanswered empirical question, unless you've read otherwise.
▲ 8note 2 hours ago
Can you prove this? Train an LLM to leave out the filler words and see whether it gets the same performance at a lower cost? Or do it at token-selection time?