freakynit 2 days ago
I have an older M1 Air with 8GB, but I'm still getting over 23 t/s on a 4B model, and the quality of the outputs is on par with top models of similar size.

1. Clone their forked repo: `git clone https://github.com/PrismML-Eng/llama.cpp.git`

2. Then build it (assuming you already have the Xcode build tools installed):
3. Finally, run it with (you can adjust the arguments):
The model was first downloaded from: https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main
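(The actual build and run commands were dropped from the comment above. As an illustration only, the standard upstream llama.cpp build and server invocation look roughly like this; the model path, port, and context size below are placeholder assumptions, not the author's exact arguments:)

```shell
# Standard llama.cpp CMake build (Metal backend is enabled by default
# on Apple Silicon, so no extra flags are needed for an M1).
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Serve a local GGUF file over HTTP. The path and flags here are
# illustrative placeholders — adjust to your downloaded model.
./build/bin/llama-server -m ./models/model.gguf --port 8080 -c 4096
```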
freakynit 2 days ago | parent
To the author: why is this taking 4.56GB? I was expecting it to be under 1GB for a 4B model. https://ibb.co/CprTGZ1c And this is while serving zero prompts — I've only loaded the model (using llama-server).
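For what it's worth, ~4.5GB is roughly what the weights alone cost for a 4B-parameter model at 8-bit quantization; getting under 1GB would need around 2 bits per weight. A back-of-the-envelope sketch (the bits-per-weight figures for the GGUF quant types are approximate, and this ignores KV cache and runtime buffers):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB: params * bits / 8 bits-per-byte / 1e9."""
    return n_params * bits_per_weight / 8 / 1e9

n = 4e9  # a 4B-parameter model
# Approximate effective bits-per-weight for common GGUF quantizations.
for name, bits in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q2_K", 2.6)]:
    print(f"{name:7s} ~{approx_size_gb(n, bits):.2f} GB")
```

At ~8.5 effective bits per weight (Q8_0), 4B parameters already come to about 4.25GB before any runtime overhead, which is in the ballpark of the 4.56GB observed.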