bcjdjsndon 4 days ago

There's only so far engineers can optimise the underlying transformer technique, which has been doing all the heavy lifting in the recent AI boom. It's going to take another genius to move this forward. We might see improvements here and there, but I don't think the magnitude of the data and VRAM requirements will change significantly.

zozbot234 4 days ago | parent | next [-]

State space models are already being combined with transformers to form new hybrid models. The state-space part of the architecture is weaker at retrieving information from context (it can't find a needle in the haystack as context gets longer; the details effectively get compressed away because everything has to fit in a fixed-size state), but computationally it's quite strong: O(N) rather than O(N^2).
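A rough sketch of why the scaling differs (hypothetical operation counts, not any real model's numbers): self-attention scores every token against every other token, while a state-space recurrence does constant work per token against a fixed-size state, which is also why details get compressed away:

```python
def attention_ops(n: int, d: int = 64) -> int:
    """Self-attention: every token attends to every other token,
    so pairwise score computation grows as O(n^2 * d)."""
    return n * n * d

def ssm_ops(n: int, state: int = 64) -> int:
    """State-space recurrence: each token updates a fixed-size
    state, so total work grows as O(n * state)."""
    return n * state

for n in (1_000, 10_000, 100_000):
    ratio = attention_ops(n) / ssm_ops(n)
    print(f"n={n:>7}: attention does {ratio:,.0f}x the work of the SSM")
```

With equal dimensions the ratio is just n, so the gap widens linearly as context grows.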

aerhardt 4 days ago | parent | prev | next [-]

I've read and heard from Semi Analysis and other best-in-class analysts that the amount of software optimization possible up and down the stack is staggering…

How do you explain that, capabilities being equal, the cost per token is going down dramatically?

bcjdjsndon 3 days ago | parent [-]

Optimizations, like I said. They'll never hack away the massive memory requirements, however, or the pre-training... imagine the memory requirements without the pre-training step... this is just part and parcel of the transformer architecture.

bcjdjsndon 3 days ago | parent [-]

And a lot of these improvements are really just classic automation, or chaining together yet more transformer architectures to fix issues the transformer architecture creates in the first place (hallucinations, limited context).

abarth23 2 days ago | parent | prev [-]

Exactly this. To actually visualize the sheer scale of the VRAM wall we are hitting, I recently built an LLM VRAM estimator (bytecalculators.com/llm-vram-calculator).

If you play around with the math, you quickly realize that even if we heavily quantize models down to INT4 to save memory, simply scaling the context window (which everyone wants now) immediately eats back whatever VRAM we just saved. The underlying math is extremely unforgiving without fundamentally changing the architecture.
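A back-of-envelope version of that math (hypothetical model configuration; the linked calculator's actual formula may differ): INT4 quantization shrinks the weights, but the KV cache, usually kept at fp16, grows linearly with context length and can overtake the weight savings:

```python
def vram_estimate_gib(params_b: float, bits_per_weight: int,
                      layers: int, kv_heads: int, head_dim: int,
                      context_len: int, kv_bytes: int = 2):
    """Back-of-envelope VRAM split: quantized weights + fp16 KV cache.
    Hypothetical formula; ignores activations, framework overhead, etc."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    # One K and one V tensor per layer, per KV head, per token position.
    kv_cache_bytes = 2 * layers * kv_heads * head_dim * context_len * kv_bytes
    gib = 1024 ** 3
    return weight_bytes / gib, kv_cache_bytes / gib

# Hypothetical 70B-class model with grouped-query attention, INT4 weights.
for ctx in (8_192, 32_768, 131_072):
    w, kv = vram_estimate_gib(70, 4, layers=80, kv_heads=8,
                              head_dim=128, context_len=ctx)
    print(f"ctx={ctx:>7}: weights {w:5.1f} GiB + KV cache {kv:5.1f} GiB")
```

Under these assumed numbers, by the time the context reaches the ~128k range the KV cache alone is comparable to the entire INT4-quantized weight footprint, which is the "eats back whatever we saved" effect.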