| ▲ | red2awn 3 days ago |
None of the inference frameworks (vLLM/SGLang) support the full model, let alone on non-NVIDIA hardware.
|
| ▲ | AndreSlavescu 3 days ago | parent | next [-] |
We actually deployed working speech-to-speech inference that builds on top of vLLM as the backbone. The main thing was to support the "Talker" module, which is currently not supported on the qwen3-omni branch for vLLM. Check it out here:
https://models.hathora.dev/model/qwen3-omni
| ▲ | sosodev 3 days ago | parent | next [-] |
Is your work open source?
| ▲ | red2awn 3 days ago | parent | prev [-] |
Nice work. Are you working on streaming input/output?
| ▲ | AndreSlavescu 3 days ago | parent [-] |
Yeah, that's something we currently support. Feel free to try the platform out! No cost to you for now; you just need a valid email to sign up.
| ▲ | valleyer 2 days ago | parent [-] |
I tried this out, and it's not passing the record (n.) vs. record (v.) test mentioned elsewhere in this thread. (I can ask it to repeat one, and it often repeats the other.) Am I not enabling the speech-to-speech-ness somehow?
|
| ▲ | sosodev 3 days ago | parent | prev | next [-] |
| That's unfortunate but not too surprising. This type of model is very new to the local hosting space. |
|
| ▲ | whimsicalism 3 days ago | parent | prev [-] |
| Makes sense, I think streaming audio->audio inference is a relatively big lift. |
| ▲ | red2awn 2 days ago | parent [-] |
Correct, it breaks the single-prompt, single-completion assumption baked into those frameworks. Conceptually it's still prompt/completion, but for low-latency responses you have to do streaming KV cache prefill behind a websocket server.
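Roughly, the server loop looks something like this. It's just a sketch of the shape of it: load_talker_engine, new_session, prefill_chunk, and decode_stream are stand-ins for whatever chunked-prefill interface the engine exposes, not actual vLLM APIs.

    import asyncio
    import websockets

    # Sketch only: load_talker_engine, new_session, prefill_chunk and
    # decode_stream are made-up placeholders for an engine that supports
    # chunked prefill, not vLLM's (or Hathora's) real API.
    ENGINE = load_talker_engine("qwen3-omni")    # placeholder for model setup

    async def handle_session(ws):
        session = ENGINE.new_session()           # placeholder: owns this caller's KV cache
        async for chunk in ws:                   # audio chunks arrive over the websocket
            if chunk == b"<end_of_utterance>":
                break
            await session.prefill_chunk(chunk)   # placeholder: extend the KV cache as audio lands
        async for frame in session.decode_stream():   # placeholder: decoding can start immediately
            await ws.send(frame)                 # stream response audio back as it's generated

    async def main():
        async with websockets.serve(handle_session, "0.0.0.0", 8765):
            await asyncio.Future()               # run forever

    asyncio.run(main())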
| ▲ | whimsicalism 2 days ago | parent [-] |
I imagine you have to start decoding many speculative completions in parallel to get truly low latency.
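Something like this, conceptually. Purely a toy sketch; decode_candidate and wait_for_final_prompt are invented placeholders.

    import asyncio

    # Toy sketch: race a few speculative decodes while the user is still
    # talking, keep the one whose guessed prompt matches the finished
    # utterance, and cancel the rest. decode_candidate and
    # wait_for_final_prompt are made-up placeholders.
    async def speculate(guessed_prompts):
        tasks = {
            guess: asyncio.create_task(decode_candidate(guess))   # start decoding each guess now
            for guess in guessed_prompts
        }
        final_prompt = await wait_for_final_prompt()   # resolves once the utterance is complete
        for guess, task in tasks.items():
            if guess != final_prompt:
                task.cancel()                          # throw away the losing speculations
        if final_prompt in tasks:
            return await tasks[final_prompt]           # the winner already has a head start
        return await decode_candidate(final_prompt)    # nothing matched: decode from scratch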