| ▲ | a96 16 hours ago |
| I'm only superficially familiar with these, but curious. Your comment above mentioned the VL model. Isn't that a different model or is there an a3b with vision? Would it be better to have both if I'd like vision or does the vision model have the same abilities as the text models? |
|
| ▲ | solarkraft 15 hours ago | parent | next [-] |
| Looks like it: https://ollama.com/library/qwen3-vl:30b-a3b |
| |
| ▲ | thot_experiment 5 hours ago | parent [-] |
| FWIW, on my machine it's about 1.5x faster to run inference in llama.cpp. These are the settings I use for the Qwen model I keep in VRAM permanently: |

    llama-server --host 0.0.0.0 --model Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
      --mmproj qwen3-VL-mmproj-F16.gguf --port 8080 --jinja \
      --temp 0.7 --top-k 20 --top-p 0.8 -ngl 99 -c 65536 \
      --repeat_penalty 1.0 --presence_penalty 1.5
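(A minimal sketch of querying that server once it's up: llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, and with --mmproj loaded it should accept an image passed inline as a base64 data URI. The file name photo.jpg is just a placeholder.)

    # assumes the llama-server command above is running on port 8080 with --mmproj loaded
    # photo.jpg is a placeholder; base64 -w0 is GNU coreutils (drop the flag on macOS)
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64,'"$(base64 -w0 photo.jpg)"'"}}
          ]
        }]
      }'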
|
|
|
| ▲ | mark_l_watson 13 hours ago | parent | prev [-] |
| This has been my question also: I spend a lot of time experimenting with local models and almost all of my use cases involve text data, but having image processing and understanding would be useful. How much do I give up (in performance, and running on my 32GB M2 Pro Mac) using the VL version of a model? For MoE models, hopefully not much. |
| |
| ▲ | thot_experiment 5 hours ago | parent [-] |
| All the Qwen flavors have a VL version, and the vision part is a separate tensor stack, so keeping it resident only costs a bit of extra VRAM. Vision-based queries take longer to process context, but generation is still fast. I think the model itself is actually "smarter" because they split the thinking and instruct models, so both modalities get better in their respective model. I use it almost exclusively to OCR handwritten todo lists into my todo app, and I don't think it's missed yet; it does a great job of tool-calling everything. |
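(For that OCR-to-todo workflow, a rough sketch of what such a request could look like against the llama-server setup from the sibling comment, assuming --jinja enables OpenAI-style tool calls for this model; the add_todo tool schema here is hypothetical, invented for illustration rather than taken from the commenter's actual app.)

    # assumes the same llama-server instance; list.jpg and add_todo are placeholders
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Transcribe this handwritten list and add each item as a todo."},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64,'"$(base64 -w0 list.jpg)"'"}}
          ]
        }],
        "tools": [{
          "type": "function",
          "function": {
            "name": "add_todo",
            "description": "Add one todo item to the list",
            "parameters": {
              "type": "object",
              "properties": {"text": {"type": "string"}},
              "required": ["text"]
            }
          }
        }]
      }'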
|