cjbgkagh 8 hours ago

I’m running Gemma4 31B (Q8) on my 2 4090s (48GB) with no problem.

Glemllksdf 7 hours ago | parent [-]

I have the same setup, but when I tried paperclip ai with it, it seems that either I'm unable to set it up properly or multiple agents struggle with this setup. In particular, paperclip ai and opencode (used for the connection) blow up the context to 20-30k tokens.

Any tips around your setup running this?

I use LM Studio with default settings, with priority ordering instead of splitting across GPUs.

cjbgkagh 6 hours ago | parent [-]

I asked AI for help setting it up. I use 128k context for the 31B model and 256k context for the 26B-A4B variant. Ollama worked out of the box for me, but I wanted more control with llama.cpp.
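For anyone wondering whether a long context fits next to the weights: the KV cache is usually the thing that eats the remaining VRAM, and you can estimate it with simple arithmetic. A rough sketch below; the layer/head dimensions are hypothetical placeholders (I don't know the real config for this model), so plug in the values from your GGUF's metadata.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token,
    # so total cache size scales linearly with context length.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

GIB = 1024 ** 3
Q8_0 = 34 / 32  # q8_0 packs 32 values into 34 bytes (8-bit quants + scale)

# Hypothetical dimensions for illustration only -- not the real model config.
size = kv_cache_bytes(n_layers=48, n_kv_heads=16, head_dim=128,
                      ctx=131072, bytes_per_elem=Q8_0)
print(f"~{size / GIB:.1f} GiB")  # rough KV cache footprint at 128k context
```

This is why the `-ctk q8_0 -ctv q8_0` flags matter at large contexts: quantizing the cache roughly halves its footprint versus f16 (which would be `bytes_per_elem=2`).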

My command for llama-server:

llama-server -m /models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 -sm layer -ts 10,12 \
  --jinja --flash-attn on --cont-batching -np 1 \
  -c 262144 -b 4096 -ub 512 \
  -ctk q8_0 -ctv q8_0 \
  --host 0.0.0.0 --port 8080 --timeout 18000
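In case it helps with the opencode hookup: llama-server exposes an OpenAI-compatible API, so a quick curl smoke test can confirm the server side is healthy before pointing agents at it. This assumes the host/port from my command; adjust to your setup.

```shell
# Smoke test against llama-server's OpenAI-compatible endpoint.
# If this returns a JSON completion, the server side is fine and any
# context blowup is coming from the client/agent side.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16
  }'
```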