if anyone is looking to get into this. pipecat is a great open-source repo and community. https://github.com/pipecat-ai/pipecat

▲

pncnmnp 5 hours ago | parent | next [-]

I wish I had known about Pipecat a lot sooner. I found out about it a few weeks back, and since Gemma 4 launched, I've been building my own entirely local voice assistant using Gemma 4 + Kokoro TTS + Whisper from scratch - https://github.com/pncnmnp/strawberry.

Pipecat's smart turn model is really good for VAD - https://huggingface.co/pipecat-ai/smart-turn-v3

▲

zarldev 3 hours ago | parent | next [-]

Yeah Gemma4 was and is great fun to do this with - I too am building pretty much the same as yourself in Go.

https://github.com/zarldev/zarl & https://www.zarl.dev/posts/hal-by-any-other-name

	▲	jaggederest 2 hours ago \| parent [-]
		Looks like everyone is building one of these, I have my own little version that's using streaming STT, it can actually be too fast in some cases, and I have a little ring buffer grabbing audio from before the wake word detection fires (so it can hear "Hey Jarvis, turn on the lights" without deliberate pause) https://github.com/jaggederest/pronghorn/

▲

AnthOlei 4 hours ago | parent | prev [-]

What do you have going on the hardware side? I want to plug this into hass but don’t know what hardware I need for reasonable latency

▲

pncnmnp 4 hours ago | parent | next [-]

The whole setup works on my M2 MacBook Pro with 16 GB RAM. I use Gemma 4B via LiteRT-LM.

I've found that LiteRT-LM has a much lower DRAM footprint than Ollama. I've also made tons of optimizations in the code - for eg, you can do quite a bit with a 16k context window for a voice assistant while managing a good footprint, so I keep track of the token usage and then perform an auto-compaction after a while. I use sub-agents and only do deep-think calls with them, so the context window is separated out. In a multi-turn conversation, if Gemma 4 directly processes audio input, the KV cache fills up within a few turns, so I channel it all via Whisper.

Also, by far the biggest optimization is: 3-stage producer-consumer architecture. The LiteRT-LM streams tokens and I split them into sentences. A synthesizer thread then converts each sentence to audio via Kokoro TTS - the main thread then plays audio chunks sequentially. There's a parallel barge-in monitor thread. https://github.com/pncnmnp/strawberry/blob/main/main.py#L446

I did not want to use openWakeWord or Picovoice because they had limitations on which wake word you could choose. Alternative was to train a model of my own. So I created my own wake word detection pipeline using Whisper Tiny - works surprisingly well: https://github.com/pncnmnp/strawberry/blob/main/main.py#L143...

Also, I have VAD going with smart turn v3 (like I mentioned above) + I use browser/websocket for AEC + Barge-in (https://github.com/pncnmnp/strawberry/blob/main/audio_ws.py).

I'm using the MacBook's built-in microphones for this, though, and I haven't fully tested it with other microphones. I've been ironing out the rough edges on a daily basis. I should write a quick blog on this too.

▲

Sean-Der 4 hours ago | parent | prev [-]

Check out [0]. You can do 'Voice AI' on small/cheap hardware. It's the most fun you can have in the space ATM :) It's been a while, but posted a demo here [1]

[0] https://github.com/pipecat-ai/pipecat-esp32

[1] https://www.youtube.com/watch?v=6f0sUEUuruw

	▲	AnthOlei 4 hours ago \| parent [-]
		beautiful demo - is it running fully locally or talking to 3rd party API’s? That box was jaw dropping small

▲

BoxedEmpathy 5 hours ago | parent | prev [-]

I've been looking at this! Great project.