dylan604 17 hours ago

"ah, you hesitated" no more so than on every single other question.

the delay for the GPT to process a response is very unnerving. I find it worse than when the news is interviewing a remote site with a delay between responses. maybe if the eyes had LEDs to indicate activity rather than it just sitting there??? waiting for a GPT to do its thing is always going to force a delay, especially when pushing the request to the cloud for a response.

also, "GPT-4o continuously listens to speech through the audio stream," is going to be problematic

jszymborski 17 hours ago | parent | next [-]

I wonder how well suited some of the smaller LLMs like Qwen 0.6B would be to this... it doesn't sound like a super complicated task.

I also feel like you can train a model on this task by using the zero-shot performance of larger models to create a dataset, making something very zippy.
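
A rough sketch of that distillation idea, with a hypothetical `big_model_zero_shot` stub standing in for the expensive large-model call (the label names and matching rule are made up for illustration):

```python
import json

def big_model_zero_shot(utterance):
    # Stand-in for the large model's zero-shot judgment; in practice this
    # would be an API call returning its label for the utterance.
    return "wake" if "hey robot" in utterance.lower() else "ignore"

def build_distillation_set(utterances):
    """Label raw utterances with the big model to train a small, zippy one."""
    return [{"text": u, "label": big_model_zero_shot(u)} for u in utterances]

samples = build_distillation_set([
    "Hey robot, what's the weather?",
    "I was just talking to my friend.",
])
print(json.dumps(samples, indent=2))
```

The resulting JSONL-style records are exactly the shape a small classifier's fine-tuning loop would consume.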

accrual 16 hours ago | parent [-]

I wondered similar. Perhaps a local model cached in a 16GB or 24GB graphics card would perform well too. It would have to be a quantized/distilled model, but maybe sufficient, especially with some additional training as you mentioned.

jszymborski 16 hours ago | parent | next [-]

If Qwen 0.6B is suitable, then it could fit in 576MB of VRAM[0].

https://huggingface.co/unsloth/Qwen3-0.6B-unsloth-bnb-4bit
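
The 576MB figure passes a quick back-of-envelope check: at 4 bits per parameter the weights alone come in well under it, leaving headroom for KV cache, activations, and runtime overhead (rough arithmetic, ignoring all of those):

```python
# Back-of-envelope VRAM for a 0.6B-parameter model quantized to 4 bits.
# Weights only; KV cache and activations add on top of this.
params = 0.6e9
bytes_per_param = 0.5              # 4 bits = half a byte
weights_mb = params * bytes_per_param / 2**20
print(f"{weights_mb:.0f} MiB for weights alone")
```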

otabdeveloper4 14 hours ago | parent | prev [-]

16GB is way overkill for this.

accrual 17 hours ago | parent | prev | next [-]

> also, "GPT-4o continuously listens to speech through the audio stream," is going to be problematic

This seems like a good place to leverage a wake word library, perhaps openWakeWord or porcupine. Then the user could wake the device before sending the prompt off to an endpoint.

It could even have a resting or snoozing animation, then have it perk up when the wake word triggers. Eerie to view, I'm sure...

https://github.com/dscripka/openWakeWord

https://github.com/Picovoice/porcupine
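
The gating idea above can be illustrated with a toy energy threshold standing in for a real wake-word model; an actual build would call openWakeWord's or Porcupine's detector on each audio frame instead:

```python
import math

# Toy stand-in for a wake-word model: gate audio on RMS energy so nothing
# is streamed to the cloud until the mic actually hears something.
# The threshold and frame sizes here are arbitrary illustration values.
THRESHOLD = 0.1

def rms(frame):
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def awake(frame):
    return rms(frame) >= THRESHOLD

quiet = [0.01] * 160      # 10 ms of near-silence at 16 kHz
loud = [0.5, -0.5] * 80   # 10 ms of speech-like energy
print(awake(quiet), awake(loud))
```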

datameta 16 hours ago | parent | next [-]

This also saves enough energy to make a wireless version of the device feasible.


phh 3 hours ago | parent | prev | next [-]

Kyutai's Unmute has great latency, but requires a fast, small-ish, non-thinking, non-tool-using LLM. What I'm currently working on is merging both worlds: use the small LLM for an instant response, which will basically just repeat what you said to show it understood, and have a big LLM do the work in the background, feeding information back to the small LLM to explain intermediate steps.
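
A minimal sketch of that two-tier flow with asyncio, where both model calls are stubs and the function names and latencies are invented for illustration:

```python
import asyncio

async def small_llm_ack(user_text):
    # Instant acknowledgement: just echo the request back.
    return f"Got it, you asked about: {user_text}"

async def big_llm_answer(user_text):
    await asyncio.sleep(0.2)  # pretend heavy reasoning / tool use
    return f"Full answer for '{user_text}'"

async def converse(user_text):
    replies = []
    big_task = asyncio.create_task(big_llm_answer(user_text))  # start big model early
    replies.append(await small_llm_ack(user_text))  # user hears this immediately
    replies.append(await big_task)                  # arrives when ready
    return replies

replies = asyncio.run(converse("the weather"))
print(replies)
```

The point of `create_task` is that the big model starts working before the acknowledgement is even spoken, so its latency overlaps with the small model's filler response.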

endymion-light 2 hours ago | parent [-]

This is the key aspect for future development of models: a small, instant model, ideally on device, that funnels through to a larger model for deeper reasoning.

justusthane 16 hours ago | parent | prev | next [-]

> the delay for the GPT to process a response is very unnerving

I'm not sure I agree. The way the tentacle stops moving and shoots upright when you start talking to it gives me the intuitive impression that it's paying attention and thinking. Pretty cute!

dylan604 16 hours ago | parent [-]

it's the "thinking" frozen state while it uploads and waits for a GPT response that is unnerving. if the eyes did something to indicate progress is being made, then it would remove the desire to ask it if it is working or something. the last thing I want to be is that PM asking for a status update, but some indication it was actually processing the request would be ideal. even if there was a new animation with the tail like having it spinning or twirling like the ubiquitous spinner to show that something is happening

the snap to attention is a good example of it showing you feedback. the frozen state makes me wonder if it is doing anything or not
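
The "show activity while waiting" idea is just a background worker plus an animation loop on the main thread; a minimal sketch with a stubbed GPT call (the latency value and frame characters are placeholders):

```python
import itertools
import threading
import time

def slow_gpt_call(result):
    time.sleep(0.3)  # stand-in for network + inference latency
    result.append("response")

frames_shown = []
result = []
worker = threading.Thread(target=slow_gpt_call, args=(result,))
worker.start()
for frame in itertools.cycle("|/-\\"):
    if not worker.is_alive():
        break
    frames_shown.append(frame)  # a robot would drive eye LEDs / the tail here
    time.sleep(0.05)
worker.join()
print(result[0], len(frames_shown) > 0)
```

On hardware, the body of the loop would update the eyes or tail instead of collecting frames, but the structure is the same: the animation keeps running for exactly as long as the request is in flight.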

lsaferite 14 hours ago | parent [-]

Back when Anki (the robotics company) was building Cozmo, a *lot* of thought was put into making it expressive about everything that was going on. It really did a good job of making it feel "alive", for lack of a better word.

tetha 15 hours ago | parent | prev | next [-]

It clearly needs eyebrows like Johnny 5.

https://www.youtube.com/watch?v=l0zmCUVB0Yw

nebulous1 8 hours ago | parent | prev | next [-]

> "ah, you hesitated" no more so than on every single other question.

It was longer; I think almost twice as long. It generally took about 2 seconds to respond, but about 4 seconds for that one.

micromacrofoot 14 hours ago | parent | prev [-]

beyond the prototyping phase, which hosted models make very easy, there's little reason this couldn't use a very small optimized model on device... it would be significantly faster/safer in an end product (but significantly less flexible for prototyping)