batch12 2 days ago

Could they have added some swap?

geerlingguy 2 days ago | parent [-]

No; I just updated the parent comment: I added -c 4096 to cut down the context size, and now the model loads.

I'm able to get 6-7 tokens/sec generation and 10-11 tokens/sec prompt processing with their model. Seems quite good, actually; much more useful than llama 3.2:3b, which performs comparably on this Pi.
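For context, the invocation being discussed looks roughly like the following. This is only a sketch: the model filename, thread count, and prompt are placeholders, but -c 4096 is the llama.cpp flag that caps the context window.

    # Sketch of a llama.cpp run with a reduced context window so the model
    # fits in the Pi 5's RAM; model path, threads, and prompt are placeholders.
    ./llama-cli -m ./model-q4_k_m.gguf -c 4096 -t 4 -n 256 \
      -p "Write a short TODO list app in Python."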

Aurornis 2 days ago | parent | next [-]

> I added -c 4096 to cut down the context size

That’s a pretty big caveat. In my experience, a small context size is only okay for very short questions and answers. The output looks coherent until you actually try to use it for anything; then it turns into the classic LLM babble where the words are in a coherent order but the sum total of the output is just rambling.

layoric 2 days ago | parent | prev | next [-]

Thanks for posting the performance numbers from your own validation. 6-7 tokens/sec is quite remarkable for the hardware.

geerlingguy 2 days ago | parent [-]

Some more benchmarking: with larger outputs (like writing an entire, relatively complex TODO list app) it seems to go down to 4-6 tokens/s. Still impressive.

geerlingguy 2 days ago | parent [-]

Decided to do an actual llama-bench run and let it go for the hour or two it needs. I'm posting my full results here (https://github.com/geerlingguy/ai-benchmarks/issues/47), but the short version is 8-10 t/s pp and 7.99 t/s tg128, on a Pi 5 with no overclocking. Could probably increase the numbers slightly with an overclock.

You need a fan/heatsink to get that speed, of course; it's maxing out the CPU the entire time.
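For anyone wanting to reproduce the numbers, a llama-bench run along these lines should produce pp and tg figures like the ones above. The model path and thread count are placeholders; -p 512 and -n 128 correspond to the standard pp512/tg128 tests.

    # Sketch of the benchmark: -p 512 measures prompt processing,
    # -n 128 measures 128-token generation (the tg128 figure).
    ./llama-bench -m ./model-q4_k_m.gguf -p 512 -n 128 -t 4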

nallic 2 days ago | parent | prev [-]

For some reason I only get 3-4 tokens/sec. I checked that the CPU isn't throttling or anything.
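One way to double-check that on a Pi, for what it's worth, is vcgencmd from Raspberry Pi OS; get_throttled returning 0x0 means no undervoltage or thermal throttling has been flagged since boot.

    # Throttling/undervoltage flags since boot (0x0 = none)
    vcgencmd get_throttled
    # Current ARM core clock and SoC temperature
    vcgencmd measure_clock arm
    vcgencmd measure_temp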