This looks like so much fun! I have recently gotten into working with electronics, so it seems like a nice little project to undertake.

I noticed that it depends on OpenAI's Realtime API, which got me wondering what open alternatives there are, as I would love a more real-time, Alexa-like device in my home that doesn't contact the cloud. I have only played with software, but the existing solutions have never felt real-time to me.

The only one I could find that seems to really work in real time is <https://github.com/fixie-ai/ultravox>. It seems to be a model that wires up Llama and Whisper somehow, rather than treating them as separate steps, as is common in other projects.

What other options are available for this kind of real-time behaviour?

Sean-Der 3 months ago | parent | next [-]

My plan is that Espressif’s WebRTC code[0] will hook up to Pipecat[1], which gives you the freedom to do whatever you want.

The design of OpenAI + WebRTC was to lean on WebRTC as much as possible to make it easier for users.

[0] https://github.com/espressif/esp-webrtc-solution

[1] https://github.com/pipecat-ai/pipecat

	▲	akadeb 2 months ago \| parent \| next [-]
		Pipecat is awesome! Is it similar to what LiveKit provides? I think Realtime API adoption would be higher if it were offered on Arduino rather than ESP-IDF, as the latter is not very beginner-friendly. That was one of the main reasons I built this repo using edge functions instead of a direct WebRTC connection.
	▲	supermatt 3 months ago \| parent \| prev [-]
		Fantastic! This will save a ton of work.

_neil 3 months ago | parent | prev | next [-]

Not on-device, but for the local network I’ve been looking at Speaches[0]. Haven’t tried it yet, but I have been running kokoro-web[1], and the quality and speed are really good.

[0] https://speaches.ai/

[1] https://huggingface.co/spaces/Xenova/kokoro-web

3D30497420 3 months ago | parent | prev [-]

Maybe take inspiration from how Home Assistant does local speech-to-text and text-to-speech? https://www.home-assistant.io/voice_control/voice_remote_loc...

Pretty sure you'd need to host this on something more robust than an ESP32 though.

	▲	supermatt 3 months ago \| parent [-]
		Yeah, I was looking at Home Assistant as well, but it doesn't feel real-time, likely because its transcription stage is separate from the inference.