dmezzetti 4 hours ago

Seeing a lot of Ollama vs. running llama.cpp directly talk here. I agree that setting up llama.cpp with CUDA isn't always the easiest. But there is a cost to routing every inference call over a local HTTP API: in-process inference will be faster. Perhaps that doesn't matter in some cases, but it's worth noting.
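
To make that concrete, here's a rough sketch of the two call paths: in-process via llama-cpp-python vs. an HTTP round trip to a local Ollama server. The model path, model tag, and timings are placeholders for illustration, not a benchmark.

    # Rough sketch: in-process inference vs. a local HTTP API call.
    # Assumes "pip install llama-cpp-python requests", a GGUF file at the
    # placeholder path below, and an Ollama server on its default port.
    import time
    import requests
    from llama_cpp import Llama

    prompt = "Say hello in five words."

    # In-process: the model lives in this process; no serialization, no socket hop.
    llm = Llama(model_path="models/llama-3-8b-q4.gguf", n_gpu_layers=-1, verbose=False)
    t0 = time.time()
    llm(prompt, max_tokens=32)
    print("in-process:", time.time() - t0)

    # Over HTTP: every request pays JSON encoding plus a localhost round trip.
    t0 = time.time()
    requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
    )
    print("http:", time.time() - t0)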

I find PyTorch easier to get up and running. For quantization, AWQ models work well and are just a "pip install" away.
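
For example, here's a minimal sketch of loading an AWQ checkpoint with transformers. The model ID is just an example, and it assumes "pip install transformers autoawq accelerate":

    # Minimal sketch: running an AWQ-quantized model in-process with transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"  # example checkpoint

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    inputs = tokenizer("Explain AWQ in one sentence.", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))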