qiine 4 hours ago

It seems strange to me that the only way to use an LLM is to fit it entirely in volatile memory from the get-go.

To render movies we happily wait for the computer to calculate how light bounces around, for hours or even days.

So why not do the same with AIs? Ask a big question of a big model and get the answer to the universe tomorrow?

Aurornis 4 hours ago | parent | next [-]

If you don’t care about turnaround time you can do that.

Most LLM use cases are about accelerating workflows. If you have to wait all night for a response and then possibly discover that it took the wrong direction, misunderstood your intent, or that your prompt was missing some key information, then you have to start over.

I don’t let LLMs write my code, but I do a lot of codebase exploration, review, and throwaway prototyping. I have hundreds to maybe thousands of turns in LLM conversations each day. If I had to wait 10X or 100X as long, then it wouldn’t be useful. I’d be more productive ignoring a slow LLM and doing it all myself.

zozbot234 2 hours ago | parent | next [-]

> If you have to wait all night for a response and then possibly discover that it took the wrong direction, misunderstood your intent, or that your prompt was missing some key information, then you have to start over.

If you have to wait overnight because the model is offloading to disk, that's a model you wouldn't have been able to run otherwise without very expensive hardware. You haven't really lost anything. If anything, it's even easier to check on what a model is doing during a partial inference or agentic workload if the inference process is slower.
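The offloading idea above can be sketched in miniature. This is a toy illustration (not any real inference engine's API): a "weight matrix" far too large to want in RAM lives on disk via `numpy.memmap`, and the OS pages in one slice at a time as it is used, trading speed for memory. All names, shapes, and values here are invented for the sketch.

```python
import os
import tempfile
import numpy as np

# Hypothetical setup: write a "trained" weight matrix to disk once.
ROWS, COLS = 1024, 256
path = os.path.join(tempfile.mkdtemp(), "weights.bin")

w = np.memmap(path, dtype=np.float32, mode="w+", shape=(ROWS, COLS))
w[:] = 0.001  # pretend these are trained weights
w.flush()
del w

# At "inference" time, reopen read-only: nothing is loaded up front;
# the OS pages each slice in from disk on demand.
weights = np.memmap(path, dtype=np.float32, mode="r", shape=(ROWS, COLS))

x = np.ones(COLS, dtype=np.float32)
out = np.zeros(ROWS, dtype=np.float32)

# Process one 128-row "layer" at a time, so RAM only ever holds a slice.
for i in range(0, ROWS, 128):
    out[i:i + 128] = weights[i:i + 128] @ x
```

Each output element is 256 × 0.001 ≈ 0.256. Real offloading engines do the same thing at the layer level with the actual model weights, which is why a model that dwarfs your RAM can still run, just much more slowly.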

qiine 3 hours ago | parent | prev [-]

"If you have to wait all night for a response and then possibly discover that it took the wrong direction, misunderstood your intent, or that your prompt was missing some key information, then you have to start over."

This exact problem exists in rendering: you realize after a long render that an object was missing from the background, and the costly frame is now useless. To counter that, you make multiple "draft" renders first to make sure everything is in frame and your parameters are properly tuned.

andoando 4 hours ago | parent | prev [-]

There are definitely use cases for this in long-running tasks, like doing research, but typical use cases require way too much constant supervision and interaction.