red2awn 3 days ago
This is a stack of models:

- 650M Audio Encoder
- 540M Vision Encoder
- 30B-A3B LLM
- 3B-A0.3B Audio LLM
- 80M Transformer/200M ConvNet audio token to waveform

This is a closed-source weight update to their Qwen3-Omni model. They previously released an open-weight version, Qwen/Qwen3-Omni-30B-A3B-Instruct, and a closed version, Qwen3-Omni-Flash.

You basically can't use this model right now, since none of the open-source inference frameworks have the model fully implemented. It works on transformers, but it's extremely slow.
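For a rough sense of scale, the component sizes listed in this comment can be tallied up. This is only a back-of-the-envelope sketch under two assumptions that the comment does not state outright: that "30B-A3B" and "3B-A0.3B" follow the usual mixture-of-experts naming (total parameters, then parameters active per token), and that both the 80M Transformer and the 200M ConvNet are part of the stack rather than alternatives. The names `COMPONENTS` and `totals` are made up for illustration.

```python
# Parameter counts in millions, copied from the component list in the comment.
# Each entry: (name, total_params_M, active_params_M).
# Assumption: "30B-A3B" means 30B total with about 3B active per token,
# and the dense components (encoders, waveform modules) use all their weights.
COMPONENTS = [
    ("Audio Encoder", 650, 650),
    ("Vision Encoder", 540, 540),
    ("LLM (30B-A3B)", 30_000, 3_000),
    ("Audio LLM (3B-A0.3B)", 3_000, 300),
    ("Audio-token-to-waveform Transformer", 80, 80),
    ("Audio-token-to-waveform ConvNet", 200, 200),
]


def totals(components):
    """Return (total, active) parameter counts in millions."""
    total = sum(t for _, t, _ in components)
    active = sum(a for _, _, a in components)
    return total, active


total_m, active_m = totals(COMPONENTS)
print(f"total:  {total_m / 1000:.2f}B")   # total:  34.47B
print(f"active: {active_m / 1000:.2f}B")  # active: 4.77B
```

Under those assumptions the whole stack holds roughly 34.5B parameters, of which about 4.8B are active per token.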