Der_Einzige (a day ago):
This is inference-time scaling: as it generates a sample, it cuts off early when the logprobs make the sample "look wrong". It has a vLLM implementation which is easy to install and use. You can easily apply the technique to some 4-bit 7B model on your old laptop-tier Nvidia GPU.

The trouble is that the folks on this website think installing vLLM (pip install vllm...) is hard and that ollama - a far slower and shittier inference engine - is better. Enormous damage has been done to the hobbyist LLM ecosystem by folks not knowing which tools work on which platform. The one exception is Mac peasants, for whom llama.cpp is still probably the best implementation; but if you have Nvidia and you're not using SGLang or vLLM, you're doing it wrong.

But this is of ENORMOUS use for folks who want to run tiny models at home. Go to bed, wake up to a K=512-sample answer to your problem.
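A rough sketch of the sampling-and-scoring idea, assuming vLLM's offline Python API: draw K candidates per prompt, score each by its mean token logprob, and drop the low-confidence ones. The model name, K, threshold, and scoring heuristic are illustrative placeholders, and this version filters finished samples rather than cutting generation off early mid-sample the way the comment describes.

    # Rough sketch of logprob-based sampling and filtering with vLLM.
    # NOT the linked implementation: model, K, threshold, and the
    # mean-logprob heuristic are all illustrative assumptions, and this
    # filters finished samples instead of aborting generation early.
    from vllm import LLM, SamplingParams

    K = 512            # number of candidates per prompt
    THRESHOLD = -1.5   # assumed cutoff: mean token logprob below this "looks wrong"

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder; any small model fits

    params = SamplingParams(
        n=K,           # K parallel samples
        temperature=0.8,
        max_tokens=512,
        logprobs=1,    # return per-token logprobs so we can score candidates
    )

    request = llm.generate(["Prove that the sum of two even integers is even."], params)[0]

    def mean_logprob(completion):
        # Average logprob of the tokens the model actually sampled.
        lps = [step[tok].logprob
               for tok, step in zip(completion.token_ids, completion.logprobs)]
        return sum(lps) / max(len(lps), 1)

    # Drop low-confidence candidates, then keep the most confident survivor.
    kept = [c for c in request.outputs if mean_logprob(c) > THRESHOLD] or list(request.outputs)
    print(max(kept, key=mean_logprob).text)

On a modest GPU you would scale K down or let it run overnight, as the comment suggests.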
vlovich123 (a day ago):
If you think getting vLLM working correctly is just a pip install vllm, you haven't tried it in very many environments.
jxf (a day ago):
As someone who operates an enterprise platform with vLLM in the stack, it's immensely harder than "pip install vllm" to keep it working at scale and up to date.