goodroot 7 hours ago
Ah yeah, longform is interesting. Not sure how you're running it, via whichever "app thing", but on resource-limited machines, "Continuous recording" mode outputs whenever silence is detected, using a configurable threshold. You get output as you speak, in more reasonable chunks; in aggregate it's the same output, just chunked efficiently. Maybe you can try hackin' that up?
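For anyone who wants to try hacking that up, here is a rough sketch of silence-based chunking. It is not the app's actual implementation, just one way to do it. It assumes the audio arrives as fixed-size frames of 16-bit mono PCM samples, and `rms_threshold` and `min_silence_frames` are made-up names standing in for the configurable threshold.

```python
import math
from typing import Iterable, Iterator, List

Frame = List[int]  # one frame of 16-bit mono PCM samples (assumed format)


def rms(frame: Frame) -> float:
    """Root-mean-square energy of a frame; 0.0 for an empty frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def chunk_on_silence(
    frames: Iterable[Frame],
    rms_threshold: float = 500.0,   # hypothetical knob: below this is "silence"
    min_silence_frames: int = 3,    # hypothetical knob: pause length that ends a chunk
) -> Iterator[List[Frame]]:
    """Yield chunks of frames, closing a chunk after a long enough silent run."""
    chunk: List[Frame] = []
    silent_run = 0
    has_speech = False
    for frame in frames:
        if rms(frame) >= rms_threshold:
            silent_run = 0
            has_speech = True
            chunk.append(frame)
            continue
        silent_run += 1
        if not has_speech:
            continue  # skip leading silence entirely
        chunk.append(frame)  # keep short pauses inside the chunk
        if silent_run >= min_silence_frames:
            # Pause is long enough: emit the chunk without its trailing silence.
            yield chunk[: len(chunk) - silent_run]
            chunk, silent_run, has_speech = [], 0, False
    if has_speech:
        # Flush whatever speech is left when the stream ends.
        yield chunk[: len(chunk) - silent_run]
```

Each yielded chunk could then be handed to the transcriber on its own, which is where the latency win would come from. Plain RMS energy is a crude silence test and will misfire on noisy input, so treat the threshold as something to tune per microphone.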
LuxBennu 7 hours ago
Yeah, that makes sense; chunking on silence would sidestep the latency issue pretty cleanly. I've been running it through a basic FastAPI wrapper, so it just takes whatever audio blob gets thrown at it, with no chunking logic on the server side. Might be worth adding a VAD pass before sending to Whisper though; that would cut down on processing dead air too.
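Something like the following could serve as a first pass on the server side. It is only a sketch: a simple energy gate standing in for a real VAD, assuming the blob is raw 16-bit signed mono PCM in native byte order. The function name `drop_dead_air` and its parameters are hypothetical, and the FastAPI and Whisper wiring is left out on purpose.

```python
import array
import math


def drop_dead_air(
    pcm: bytes,
    frame_samples: int = 480,       # hypothetical: 30 ms at 16 kHz
    rms_threshold: float = 500.0,   # hypothetical: below this counts as dead air
) -> bytes:
    """Return the PCM bytes with low-energy frames removed.

    Assumes raw 16-bit signed mono PCM in native byte order.
    """
    samples = array.array("h")
    # Ignore a dangling odd byte so the buffer is a whole number of samples.
    samples.frombytes(pcm[: len(pcm) - len(pcm) % samples.itemsize])

    kept = array.array("h")
    for start in range(0, len(samples), frame_samples):
        frame = samples[start : start + frame_samples]
        energy = math.sqrt(sum(s * s for s in frame) / len(frame))
        if energy >= rms_threshold:
            kept.extend(frame)  # keep only frames that look like speech
    return kept.tobytes()
```

Cutting frames this hard can clip the starts and ends of words, so a real VAD would normally keep a little padding around each speech region. The upside is that the transcriber never sees the silent stretches at all.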