| ▲ | mccoyb 5 hours ago | |
I don't disagree with this, but I don't think the memory cost is the only issue right? I remember using Sonnet 4.5 (or 4, I can't remember the first of Anthropic's offerings with a million context) and how slow the model would get, how much it wanted to end the session early as tokens accrued (this latter point, of course, is just an artifact of bad training). Less worried about memory, more worried about compute speed? Are they obviously related and is it straightforward to see? | ||
| ▲ | sosodev 3 hours ago | parent | next [-] | |
The compute speed is definitely correlated with the memory consumption in LLM land. More efficient attention means both less memory and faster inference. Which makes sense to me because my understanding is that memory bandwidth is so often the primary bottleneck. We're also seeing a recent rise in architectures boosting compute speed via multi-token prediction (MTP). That way a single inference batch can produce multiple tokens and multiply the token generation speed. Combine that with more lean ratios of active to inactive params in MOE and things end up being quite fast. The rapid pace of architectural improvements in recent months seems to imply that there are lots of ways LLMs will continue to scale beyond just collecting and training on new data. | ||
| ▲ | whimsicalism 2 hours ago | parent | prev [-] | |
The parent commentator is a bit confused - most of the innovation in these hybrid architectures comes from reducing the computation pressure not just the memory pressure. | ||