Have you tried llama.cpp with unsloth and models suited to it? GLM flash? It seemed to allow more models to be tried soon after they are released. Haven’t tried for long term deployment though, that’s the next step.

▲

pet_the_bird 6 hours ago | parent | next [-]

Highy anecdotal: I have tried various self-hosted models using both vllm and llama.cpp. I am in a situation where I have access to large amount of memory (~320 GB).

While experimenting with quantization I found that there is a non-trivial tradeoff between quality and memory footprint. Overall my experience follows the reported pattern of "2-bit is mwah, 4-bit half decent and 6-bit required for programming. Still, although MiniMax-m2.7 is useable with the 6-bit quantizations that unsloth provides, it felt like such a breath of fresh air when I used the reference full-size model.

I find it difficult to say why. I had mostly the same setup as before (parsing had to be slightly adjusted in Zed). Aside from not experiencing the thinking loops (where minimax would get stuck generating the same sentences over and over) there is little evidence of any real improvement (although the average thinking time felt shorter).

I would recommend against very low quantizations of GLM 5.0/5.1/5.2 or Kimi 2.5/2.6. Smaller models were more reliable, and therefore more useful.

	▲	embedding-shape 3 hours ago \| parent [-]
		I only have access to 96GB VRAM locally, but I'd agree with the general approach of avoiding lower quantizations, often anything below Q8 seems to suffer greatly on quality and seemingly never worth going below it, better to go for smaller model in that case. With the exception of DwarfStar + DS4-Flash with IQ2_XXS quantization, which somehow seems to not suffer as much as I'd thought. I'd still opt for a smaller model + at least Q8.

▲

verdverm 5 hours ago | parent | prev [-]

I have tried llama-cpp, vllm is nicer (ray, handles queueing, doesn't have the cache invalidation bug for qwen/gemma models) and unsloth has toxic employees in their discord.

I've run 2 qwen/gemma @8bit with full context window side-by-side. Right now I have 4 models on my spark (qwen36moe, embedding, reranker, qwen3-1.7B) to support my markdown kb tool.

The setup is not as capable, but still good and gets better with models/algos. To me, it's more about the freedom to tinker, freedom from token bill anxiety, and potential right to compute should the government/oligarchy decides it gets to decide who can access which models.

	▲	woadwarrior01 an hour ago \| parent [-]
		> unsloth has toxic employees in their discord Would you mind elaborating on this?