MediaSquirrel 8 hours ago

More data -> better, faster on-device models

The actual plan was to distill Gemini 2.5 Pro into the best on-device voice dictation model.

Pretty sure it would have worked. Alas.

nomel 8 hours ago | parent [-]

Reasons for running local aside...

What practical latency difference do you see between on-device inference and, say, Whisper in streaming mode over the internet? Comparable? It seems internet latency would be mostly negligible (assuming reasonable internet/cell coverage), or at least compensated for by the higher-end hardware on the other side.

MediaSquirrel 6 hours ago | parent [-]

depends on the model!

If you run a smaller Distil-Whisper variant AND you optimize the decoder to run on the Apple Neural Engine, you can get latency down to ~300ms without any backend infra.
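To make the tradeoff concrete, here's a back-of-envelope latency comparison. All the numbers are illustrative assumptions, not measurements: ~300ms on-device comes from the comment above, while the cloud inference time and network round-trip are hypothetical placeholders.

```python
def end_to_end_latency_ms(inference_ms: float, network_rtt_ms: float = 0.0) -> float:
    """Total perceived latency: model inference plus any network round trip."""
    return inference_ms + network_rtt_ms

# On-device: ANE-optimized Distil-Whisper, no network hop (figure from the thread).
on_device = end_to_end_latency_ms(inference_ms=300)

# Cloud: hypothetical faster server GPU, but you pay the round trip.
cloud = end_to_end_latency_ms(inference_ms=150, network_rtt_ms=120)

print(on_device, cloud)  # with these made-up numbers the two are comparable
```

The point being: once server-side inference is fast enough, the network RTT dominates, and on a flaky cell connection that term is both larger and far more variable than anything on-device.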

The issue is that the smaller models tend to suck, which is why the fine-tuning is valuable.

My hypothesis is that you can distill a giant model like Gemini into a tiny Distil-Whisper model.
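For anyone unfamiliar with what "distill" means here: the standard recipe (Hinton-style knowledge distillation) trains the small student to match the big teacher's temperature-softened output distribution, not just the hard labels. A minimal stdlib-only sketch of the soft-label objective — the function names and toy logits are mine, not from any real training pipeline:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher T exposes more of the
    # teacher's "dark knowledge" in the non-argmax classes.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) over the softened distributions:
    # zero when the student matches the teacher, positive otherwise.
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(t, s) if p > 0)
```

In practice you'd minimize this per-token over the teacher's transcripts (or its token distributions) with a real autodiff framework; the sketch just shows the objective.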

But it depends on the machine you are running, which is why local AI is a PITA.