geerlingguy 2 days ago

I've just tried replicating this on my Pi 5 16GB, running the latest llama.cpp... and it segfaults:

    ./build/bin/llama-cli -m "models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf" -e --no-mmap -t 4
    ...
    Loading model...
    ggml_aligned_malloc: insufficient memory (attempted to allocate 24576.00 MB)
    ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 25769803776
    alloc_tensor_range: failed to allocate CPU buffer of size 25769803776
    llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
    Segmentation fault
I'm not sure how they're running it... is there any kind of guide for replicating their results? It does take up a little over 10 GB of RAM (watching with btop) before it segfaults and quits.

[Edit: had to add -c 4096 to cut down the context size, now it loads]
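
For anyone else hitting the same allocation failure: the working invocation is essentially the original command above plus the reduced context window, something like this (model path as in my run):

    ./build/bin/llama-cli -m "models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf" \
        -e --no-mmap -t 4 -c 4096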

LargoLasskhyfv 2 days ago | parent | next [-]

Have you tried anything with https://codeberg.org/ikawrakow/illama

https://github.com/ikawrakow/ik_llama.cpp and their 4-bit quants?

Or maybe even Microsoft's BitNet? https://github.com/microsoft/BitNet

https://github.com/ikawrakow/ik_llama.cpp/pull/337

https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf ?

That would be an interesting comparison for running local LLMs on such low-end/edge devices, or on common office machines with only an iGPU.
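
If someone wants to try that comparison: ik_llama.cpp is a llama.cpp fork and, as far as I know, keeps the same CMake build and llama-cli interface, so the rough recipe would be something like this (the quant filename is just a placeholder):

    git clone https://github.com/ikawrakow/ik_llama.cpp
    cd ik_llama.cpp
    cmake -B build && cmake --build build --config Release -j4
    ./build/bin/llama-cli -m path/to/some-4bit-quant.gguf -t 4 -c 4096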

thcuk 2 days ago | parent | prev | next [-]

Tested the same model on an Intel N100 mini PC with 16 GB (the hundred-bucks PC):

    llama-server -m /Qwen3-30B-A3B-Instruct-2507-GGUF:IQ3_S --jinja -c 4096 --host 0.0.0.0 --port 8033

Got <= 10 t/s, which I think is not so bad!

On an AMD Ryzen 5 5500U with Radeon Graphics, compiled for Vulkan: got 15 t/s (could swear this morning it was <= 20 t/s).

On an AMD Ryzen 7 H 255 with Radeon 780M Graphics, compiled for Vulkan: got 40 t/s. On that last one I did a quick comparison with the Unsloth version (unsloth/Qwen3-30B-A3B-GGUF:Q4_K_M) and got 25 t/s. Can't really comment on quality of output - seems similar.
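
In case it helps anyone reproduce the Vulkan numbers: a minimal build sketch, assuming a current llama.cpp tree where the Vulkan backend is enabled via the GGML_VULKAN CMake option (older trees used LLAMA_VULKAN) and the Vulkan SDK/drivers are already installed:

    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release -j$(nproc)
    # -ngl 99 asks it to offload all layers to the GPU; model path is a placeholder
    ./build/bin/llama-server -m model.gguf --jinja -c 4096 -ngl 99 --host 0.0.0.0 --port 8033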

westpfelia 2 days ago | parent | prev | next [-]

Would you be able to get genuinely useful results from it? I'm looking into self-hosting LLMs for Python/JS development, but I don't know whether the output would actually be usable.

graemep 2 days ago | parent [-]

I have been wondering the same thing and have experimented a little with some small models, with some useful results.

I have not figured out which models that fit in the available memory (say 16 GB) would be best for this. Something I could run on a laptop CPU would be nice. The models I have tried so far are much smaller than 30B.

batch12 2 days ago | parent | prev [-]

Could they have added some swap?
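
(For reference, "adding swap" here would just mean the usual Debian/Raspberry Pi OS swapfile dance; size and path below are arbitrary examples:)

    sudo fallocate -l 8G /swapfile    # reserve an 8 GB file for swap
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile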

geerlingguy 2 days ago | parent [-]

No, just updated the parent comment: I added -c 4096 to cut down the context size, and now the model loads.

I'm able to get 6-7 tokens/sec generation and 10-11 tokens/sec prompt processing with their model. Seems quite good, actually—much more useful than llama 3.2:3b, which has comparable performance on this Pi.

Aurornis 2 days ago | parent | next [-]

> I added -c 4096 to cut down the context size

That’s a pretty big caveat. In my experience, using a small context size is only okay for very short questions and answers. The output looks coherent until you try to use it for anything; then it turns into the classic LLM babble, where the words are in a coherent order but the sum total of the output is just rambling.

layoric 2 days ago | parent | prev | next [-]

Thanks for posting the performance numbers from your own validation. 6-7 tokens/sec is quite remarkable for the hardware.

geerlingguy 2 days ago | parent [-]

Some more benchmarking: with larger outputs (like writing an entire, relatively complex TODO list app) it seems to drop to 4-6 tokens/s. Still impressive.

geerlingguy 2 days ago | parent [-]

Decided to do a full llama-bench run and let it go for the hour or two it needs. I'm posting my full results here (https://github.com/geerlingguy/ai-benchmarks/issues/47), but in short: 8-10 t/s prompt processing and 7.99 t/s tg128, on a Pi 5 with no overclocking. Could probably increase the numbers slightly with an overclock.

You need a fan/heatsink to get that speed, of course; it's maxing out the CPU the entire time.
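
For anyone who wants to run the same benchmark, the invocation is roughly this (thread count and model path are just my setup; llama-bench's defaults cover the pp512 and tg128 tests):

    ./build/bin/llama-bench -m models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf -t 4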

nallic 2 days ago | parent | prev [-]

For some reason I only get 3-4 tokens/sec. I checked that the CPU does not throttle or anything.
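
(In case anyone else wants to rule this out on a Pi, the firmware tools report it directly; a sketch of the usual checks:)

    vcgencmd get_throttled       # throttled=0x0 means no undervoltage/throttling flags set
    vcgencmd measure_clock arm   # current ARM core clock in Hz
    vcgencmd measure_temp        # SoC temperature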