dmezzetti 4 hours ago

Seeing a lot of Ollama vs. running llama.cpp directly talk here. I agree that setting up llama.cpp with CUDA isn't always the easiest. But there is a cost to routing every inference call over a local HTTP API: in-process inference will be faster. Perhaps that doesn't matter in some cases, but it's worth noting.
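
To make that concrete, here's a rough sketch of the two call paths: in-process via llama-cpp-python vs. an HTTP round trip to a local Ollama server. The model path, model tag, and timings are placeholders for illustration, not a benchmark.

    # Rough sketch: in-process inference vs. a local HTTP API call.
    # Assumes "pip install llama-cpp-python requests", a GGUF file at the
    # placeholder path below, and an Ollama server on its default port.
    import time
    import requests
    from llama_cpp import Llama

    prompt = "Say hello in five words."

    # In-process: the model lives in this process; no serialization, no socket hop.
    llm = Llama(model_path="models/llama-3-8b-q4.gguf", n_gpu_layers=-1, verbose=False)
    t0 = time.time()
    llm(prompt, max_tokens=32)
    print("in-process:", time.time() - t0)

    # Over HTTP: every request pays JSON encoding plus a localhost round trip.
    t0 = time.time()
    requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
    )
    print("http:", time.time() - t0)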

I find PyTorch easier to get up and running. For quantization, AWQ models work well and are just a "pip install" away.
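
For example, here's a minimal sketch of loading an AWQ checkpoint with transformers. The model ID is just an example, and it assumes "pip install transformers autoawq accelerate":

    # Minimal sketch: running an AWQ-quantized model in-process with transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"  # example checkpoint

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    inputs = tokenizer("Explain AWQ in one sentence.", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))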