wild_egg 2 hours ago

Don't have a GPU so tried the CPU option and got 0.6t/s on my old 2018 laptop using their llama.cpp fork.

Then found out they didn't implement an AVX2 path for their Q1_0_g128 CPU kernel. Added that and I'm getting ~12 t/s, which isn't shabby for this old machine.

Cool model.

UncleOxidant 42 minutes ago | parent | next [-]

Are you getting anything besides gibberish out of it? I tried their recommended command line and it's dog slow even though I built their llama.cpp fork with AVX2 enabled. This is what I get:

    $ ./build/bin/llama-cli     -hf prism-ml/Bonsai-8B-gguf -p "Explain quantum computing in simple terms." -n 256 --temp 0.5 --top-p 0.85 --top-k 20 -ngl 99
    > Explain quantum computing in simple terms.

     \( ,

      None ( no for the. (,./. all.2... the                                                                                                                                ..... by/
cubefox an hour ago | parent | prev | next [-]

"Not shabby" is a big understatement.

ddtaylor 12 minutes ago | parent [-]

Why so?
