giancarlostoro 3 days ago

Here's a paper from MIT that covers how this could be resolved in an interesting fashion:

https://hanlab.mit.edu/blog/streamingllm

The AI field is reusing existing CS concepts that we never had the hardware for, and now these people are learning how applied software engineering can make their theoretical models more efficient. It's kind of funny; I've seen this in tech over and over: people discover a new thing, then optimize it using a known thing.
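The linked StreamingLLM post describes keeping a few initial "attention sink" tokens in the KV cache alongside a sliding window of recent tokens, evicting everything in between. A minimal sketch of that eviction policy (function and parameter names are mine, not from the paper's code):

```python
# Hypothetical sketch of StreamingLLM-style KV cache eviction:
# keep the first `num_sinks` tokens (the "attention sinks") plus the
# most recent `window` tokens, and drop the middle of the sequence.

def evict_kv_cache(cache, num_sinks=4, window=1024):
    """Return the cache entries to keep: sink tokens + recent window."""
    if len(cache) <= num_sinks + window:
        return cache  # nothing to evict yet
    return cache[:num_sinks] + cache[-window:]
```

The point is that the cache stays bounded at `num_sinks + window` entries no matter how long generation runs, which is exactly the kind of bounded-buffer trick systems programmers have used forever.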

kridsdale3 3 days ago | parent | next [-]

The fact that this is happening is where the tremendous opportunity to make money as an experienced Software Engineer currently lies.

For instance, a year or two ago, the AI people discovered "cache". Imagine how many millions the people who implemented it earned for that one.

nxobject 6 hours ago | parent | next [-]

What we need are "idea dice" or "concept dice" for CS – each side could have a vague architectural nudge like "parallelize", "interpret", "precompute", "predict and unwind", "declarative"...

giancarlostoro 3 days ago | parent | prev [-]

I've been thinking the same, and it's the kind of thing you don't need some crazy ML degree to know how to do... A lot of the algorithms have been known... for a while now... Milk it while you can.

mamp 3 days ago | parent | prev [-]

Unfortunately, I think the context rot paper [1] found that performance still degraded as context length increased, even in models using attention sinks.

1. https://research.trychroma.com/context-rot

giancarlostoro 3 days ago | parent [-]

Saw that paper, but haven't had a chance to read it yet. Are there other techniques that help, then? I assume there's a few different ones in use.