What about real workloads? Because as context gets larger, these local LLMs aproxiate the useless end of the spectrum with regards to t/s.

▲

zozbot234 3 hours ago | parent | next [-]

The thing about context/KV cache is that you can swap it out efficiently, which you can't with the activations because they're rewritten for every token. It will slow down as context grows (decode is often compute-limited when context is large) but it will run.

▲

Someone1234 9 hours ago | parent | prev | next [-]

I strongly agree. People see local "GPT-4 level" responses, and get excited, which I totally get. But how quickly is the fall-off as the context size grows? Because if it cannot hold and reference a single source-code file in its context, the efficiency will absolutely crater.

That's actually the biggest growth area in LLMs, it is no longer about smart, it is about context windows (usable ones, note spec-sheet hypotheticals). Smart enough is mostly solved, combating larger problems is slowly improving with every major release (but there is no ceiling).

▲

satvikpendem 8 hours ago | parent | prev [-]

That should be covered by the harness rather than the LLM itself, no? Compaction and summarization should be able to allow the LLM to still run smoothly even on large contexts.

	▲	hu3 6 hours ago \| parent [-]
		Sometimes it really needs a lot of data to work.