nowittyusername a day ago
Correct me if I'm wrong, but by the looks of that chart, the reduction in token use and the better score both come from the fact that this method used 512 samples. That doesn't seem to be of any use for local agents or anything with severe VRAM restrictions, such as the local models people can run at home. So this would only benefit enterprise-level systems, no?
Der_Einzige a day ago | parent
This is inference-time scaling: as it generates a sample that "looks wrong" based on its logprobs, it cuts that sample off early. It has a vLLM implementation which is easy to install and use. You can apply the technique to some 4-bit 7B model on your old laptop-tier NVIDIA GPU easily.

Well, the folks on this website think installing vLLM (pip install vllm...) is hard and that ollama - a far slower and shittier inference engine - is better. Enormous damage has been done to the hobbyist LLM ecosystem by folks not knowing what tools work on what platform. The one exception is mac peasants, for whom llama.cpp is still probably the best implementation, but if you have NVIDIA hardware and you're not using SGLang or vLLM, you're doing it wrong.

But this is of ENORMOUS use for folks who want to run tiny models at home. Go to bed, wake up with a K=512 answer to your problem.
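For the curious, here's a rough sketch of the idea in Python with vLLM. It is not the paper's exact implementation (the real method terminates low-confidence samples mid-generation inside the engine); this just generates K candidates and keeps the one with the highest average per-token logprob, which is the crude offline version of the same confidence filtering. The model name, K value, and scoring function are placeholders I'm assuming, not something taken from the paper.

    # Sketch only: best-of-K sampling with logprob-based confidence filtering.
    # Assumptions: an AWQ-quantized 7B model, K=16 (scale toward 512 if you
    # have the time), and mean per-token logprob as the confidence score.
    from vllm import LLM, SamplingParams

    K = 16
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct-AWQ", quantization="awq")

    params = SamplingParams(
        n=K,                 # K parallel samples per prompt
        temperature=0.8,
        max_tokens=1024,
        logprobs=1,          # return per-token logprobs so we can score confidence
    )

    prompt = "Solve: what is the sum of the first 100 positive integers?"
    candidates = llm.generate([prompt], params)[0].outputs

    def mean_logprob(c):
        # Average per-token logprob: a crude proxy for "does this look wrong?"
        return c.cumulative_logprob / max(len(c.token_ids), 1)

    best = max(candidates, key=mean_logprob)
    print(best.text)

On a 4-bit 7B model this fits comfortably on a consumer GPU; the cost is wall-clock time, not VRAM, which is the point of the "go to bed, wake up with an answer" workflow.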