vunderba 11 hours ago

Smaller models might not make the best agentic coding assistants, but I have a headless machine with 128GB of RAM serving a number of local models via llama.cpp; it handles various tasks on a daily basis and works great.

- Qwen3-VL:30b > A file watcher on my NAS sends new images to it; the model auto-captions them, writes the text description into a hidden EXIF field in the image, and adds an entry to a Qdrant vector database for fuzzy searching and organization.

- Gemma3:27b > Used for personal translation work (mostly English and Chinese). Haven't had a chance to try out the Gemma4 models yet.

- Llama3.1:8b > Performs sentiment analysis on texts / comments / etc.
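The captioning step in the first bullet can be sketched roughly like this. Everything concrete here is my assumption, not from the comment: the hostname, port, model name, prompt, and the choice of EXIF tag are all hypothetical, and it presumes llama.cpp is running its OpenAI-compatible server with multimodal support enabled.

```python
import base64
import json
import subprocess
import urllib.request

LLAMA_URL = "http://nas-llm:8080/v1/chat/completions"  # hypothetical endpoint

def build_caption_request(image_bytes: bytes, model: str = "qwen3-vl-30b") -> dict:
    """Build an OpenAI-style chat request with the image inlined as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in one short paragraph."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

def caption(path: str) -> str:
    """Send one image to the llama.cpp server and return the caption text."""
    with open(path, "rb") as f:
        payload = build_caption_request(f.read())
    req = urllib.request.Request(
        LLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def write_exif_description(path: str, text: str) -> None:
    """Embed the caption in the image; shells out to exiftool, and
    ImageDescription is just one plausible tag to use."""
    subprocess.run(["exiftool", f"-ImageDescription={text}", path], check=True)
```

The Qdrant step is omitted here since it also needs an embedding model to turn the caption into a vector before upserting it.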

verdverm 10 hours ago

Look into updating to Gemma4 and Qwen3.6; they are good at agentic tasks. qwen36moe with Unsloth's 8-bit quant is my daily driver now.

nateb2022 9 hours ago

Have you noticed a gap between the 8-bit and 4-bit quants? I've always run 4-bit quants because they require less memory.

verdverm 6 hours ago

I run the biggest quant because it is more capable; the Spark has enough memory for two Qwen instances at 8-bit with full context length (roughly 48GB each).
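A back-of-the-envelope check on those numbers (my own arithmetic, not from the thread): weight memory scales linearly with quant bit-width, roughly parameters times bits/8 bytes.

```python
def weight_gib(params_b: float, bits: int) -> float:
    """Approximate weight footprint: params * (bits / 8) bytes, in GiB."""
    return params_b * 1e9 * bits / 8 / 2**30

# A hypothetical ~30B-parameter model at common quant widths:
for bits in (4, 8, 16):
    print(f"{bits:>2}-bit: {weight_gib(30, bits):5.1f} GiB")
```

At 8-bit a ~30B model needs roughly 28 GiB for weights alone, so the KV cache at full context length would plausibly account for the rest of the ~48GB per-instance figure; dropping to 4-bit halves the weight footprint at some cost in capability.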

I find Gemini/Gemma to have become worse at coding. They are better for non-coding tasks, but maybe not even that; the hallucinations and instruction following have both degraded, in my experience.