I wish they would release the requirements to run on llama.cpp with any announcements of open models.

A bonus would be tok/s on common hardware.

I don't think llama.cpp supports any of the LongCat models, actually.

They haven't posted weights/inference solutions for LongCat-2.0 [1], but LongCat-Next had transformers support, which I assume means it works with vLLM/SGLang.

Given it's 1.6T parameters, "common hardware" is probably out of the question: even at 2bpw the weights come to about 400GB, and that's before considering the bandwidth requirements for 48B active parameters. I haven't read the LongCat-2.0 architecture docs, but if you're not running GLM-5.2, you're probably not running this either.
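For anyone who wants to redo that arithmetic, here is a minimal sketch. It counts weight storage only (no KV cache, activations, or runtime overhead), and the function names and the 400 GB/s bandwidth figure are illustrative assumptions, not numbers from the LongCat release. The tok/s figure is a rough upper bound that assumes every active weight is read from memory once per generated token.

```python
GB = 1e9  # decimal gigabytes, matching the "400GB" figure above


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes needed to store `params` weights at `bits_per_weight` bits each."""
    return params * bits_per_weight / 8


def decode_tokens_per_second_bound(
    active_params: float, bits_per_weight: float, bandwidth_bytes_per_s: float
) -> float:
    """Rough ceiling on decode speed if each token reads all active weights once."""
    return bandwidth_bytes_per_s / weight_bytes(active_params, bits_per_weight)


# 1.6T total parameters and 48B active, both at 2 bits per weight.
total_gb = weight_bytes(1.6e12, 2) / GB
active_gb = weight_bytes(48e9, 2) / GB

# Hypothetical machine with 400 GB/s of memory bandwidth (an assumption).
bound = decode_tokens_per_second_bound(48e9, 2, 400e9)

print(f"total weights:  {total_gb:.0f} GB")
print(f"active weights: {active_gb:.0f} GB read per token")
print(f"decode ceiling: {bound:.1f} tok/s at 400 GB/s")
```

Real quantization formats carry extra scale/metadata bits, so actual files tend to come out somewhat larger than this estimate.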

[1] https://huggingface.co/meituan-longcat/LongCat-2.0: "Model weights coming soon — stay tuned!"

nl 3 hours ago | parent | next [-]

Yeah, to me it seems like an "if you have to ask, you can't run it" type of question.

In general the TL;DR is that anything above 35B needs hardware you buy basically only to run large LLMs, and if you have that hardware you don't need to ask the question.

	hnfong 17 minutes ago | parent [-]
		That's simply not true. ~70B models can run fine (albeit somewhat slowly) on consumer hardware with 64GB RAM. There are heavily quantized (Q1.x) models that are still usable on similar hardware. Granted, there haven't been a lot of models of this size recently, but still, 35B isn't really the practical limit. 35B is mostly the limit if you're using consumer-grade GPUs with limited VRAM and need the model to run fast. People have been toying with running large-ish models by partially offloading to CPU+RAM, with mixed results, but as long as you're OK with reduced speed and you quantize the hell out of the big models, you can apparently try a lot more models locally than popular belief suggests.
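A quick way to sanity-check the 64GB claim is a back-of-the-envelope fit test. This is only a sketch: `fits_in_ram` is a hypothetical helper, and the 8GB of headroom for the OS, KV cache, and runtime buffers is an assumed figure, not a measured one.

```python
def fits_in_ram(
    params: float, bits_per_weight: float, ram_gb: float, headroom_gb: float = 8.0
) -> bool:
    """True if the quantized weights plus an assumed headroom fit in `ram_gb`."""
    weights_gb = params * bits_per_weight / 8 / 1e9
    return weights_gb + headroom_gb <= ram_gb


# A ~70B model on a 64GB machine at a few quantization levels.
for bpw in (8, 4, 2, 1.5):
    weights_gb = 70e9 * bpw / 8 / 1e9
    verdict = "fits" if fits_in_ram(70e9, bpw, 64) else "does not fit"
    print(f"70B @ {bpw} bpw: {weights_gb:.1f} GB of weights, {verdict} in 64 GB")
```

Under these assumptions 8 bits per weight is too large for 64GB, while 4 bits per weight and below fit with room to spare, which is consistent with the comment above. Speed is a separate question that this check says nothing about.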

aetherspawn an hour ago | parent | prev [-]

Ah yes, but because it’s an MoE model with 48B active parameters, it’s possible that we might be able to run it locally on specialised setups, such as machines with 256GB of unified memory.

Many MoE models seem (?) to only require enough memory to load the active expert.