simonw 10 hours ago

The model weights are 70GB (Hugging Face recently added a file size indicator - see https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/tree... ) so this one is reasonably accessible to run locally.
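If you want to pull the weights down to poke at locally, a minimal sketch with the huggingface_hub client (repo id taken from the link above; assumes you have roughly 70GB of free disk):

    # Minimal sketch: fetch the Qwen3-Omni-30B-A3B-Instruct weights locally.
    # Requires the huggingface_hub package and ~70GB of free disk space.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="Qwen/Qwen3-Omni-30B-A3B-Instruct",
        # Uncomment to grab only the small config files first:
        # allow_patterns=["*.json", "*.txt"],
    )
    print("Downloaded to:", local_dir)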

I wonder if we'll see a macOS port soon - currently it very much needs an NVIDIA GPU as far as I can tell.

a_e_k 10 hours ago | parent | next [-]

That's at BF16, so it should fit fairly well on 24GB GPUs after quantization to Q4, I'd think. (Much like the other 30B-A3B models in the family.)

I'm pretty happy about that - I was worried it'd be another 200B+.
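Back-of-the-envelope math (a rough sketch assuming ~30B total parameters; real quant files add scales and metadata and keep some layers at higher precision):

    # Rough weight-size estimates for a ~30B-parameter model at different precisions.
    params = 30e9

    for name, bits in [("BF16", 16), ("Q8_0", 8), ("Q4_K", 4)]:
        print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")

    # BF16: ~60 GB, Q8_0: ~30 GB, Q4_K: ~15 GB -- hence a Q4 quant plus KV cache
    # fitting on a 24GB card.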

zenmac 7 hours ago | parent [-]

Are there any that would run on a 16GB Apple M1?

bigyabai 7 hours ago | parent [-]

Not quite. The smallest Qwen3 A3B quants are ~12GB and use more like ~14GB depending on your context settings. You'll thrash the SSD pretty hard swapping it on a 16GB machine.
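Illustrative arithmetic (assumed numbers, not measurements; actual usage depends on the runtime and context length):

    # Why a ~12GB quant is tight on a 16GB unified-memory machine.
    weights_gb = 12.0          # smallest Qwen3 A3B quant, roughly
    kv_cache_gb = 1.5          # grows with context length
    overhead_gb = 1.0          # runtime buffers, compute graphs, etc.

    print(f"~{weights_gb + kv_cache_gb + overhead_gb:.1f} GB needed, "
          f"out of 16 GB shared with the OS and everything else you have open")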

growthwtf 10 hours ago | parent | prev | next [-]

A fun project for somebody with more time than I have would be to see if they can get it working with the new Mojo support for Apple silicon announced yesterday. I don't know if that functionality is baked enough yet to actually pull the port off, but it would be an interesting try.

wsintra2022 6 hours ago | parent [-]

New Mojo stuff from Apple?

wsintra2022 6 hours ago | parent [-]

Nvm, found it: https://news.ycombinator.com/item?id=45326388

dcreater 9 hours ago | parent | prev | next [-]

Is there an inference engine for this on macOS?

simonw 7 hours ago | parent [-]

Not yet, as far as I can tell - it might take a while for someone to pull that together given the complexity of handling audio, image, text, and video at once.

varispeed 7 hours ago | parent | prev [-]

Would it run on a 5090? Or is it possible to link multiple GPUs, or has NVIDIA locked that down?

axoltl 7 hours ago | parent [-]

It'd run on a 5090 with 32GB of VRAM at fp8 quantization, which is generally a very acceptable size/quality trade-off. (I run GLM-4.5-Air at 3-bit quantization!) The transformer architecture also lends itself quite well to having different layers of the model running in different places, so you can 'shard' the model across different compute nodes.
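For the multi-GPU case, the common approach with Hugging Face models is to let accelerate place layers across devices automatically. A rough sketch (the model class and dtype here are assumptions; Qwen3-Omni ships its own multimodal classes, so check the model card for the real loading code):

    # Sketch: shard a large model's layers across all visible GPUs.
    # Requires transformers + accelerate; device_map="auto" splits layers by
    # available memory, so no special GPU linking (NVLink etc.) is required.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(  # illustrative class only
        model_id,
        torch_dtype=torch.bfloat16,  # or a quantization config for fp8/int4
        device_map="auto",           # accelerate spreads layers across GPUs
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)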