dileeparanawake 6 days ago

This is cool. For on-device models, are there any plans for models that use MoE in relatively resource-constrained setups (I'm thinking MBP M1, 16 GB RAM)? I'm using LM Studio, but all the Gemma models (MLX) seem to crash; surprisingly, I did manage to get gpt-oss 20b working (slowly) on my MBP.

I find performance in resource-constrained environments interesting.

In particular, I'm trying to find decent code models (as an on-device backup), but also TTS and speech-to-text applications.

canyon289 6 days ago

We're constantly evaluating architectures, trying to assess what will work well in the open ecosystem. It's quite a vibrant space, and I'm glad you have one option that works. For this model in particular, we evaluated a couple of options before choosing a dense architecture because of its simplicity and finetunability.

For the other Gemma models, some of the smaller sizes should work on your laptop when quantized. Do quantized Gemma 1B and 4B not work? They should fit within your memory constraints. I use Ollama on low-powered devices with 8 GB of RAM or less, and the models load.
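If it's useful, here's a minimal sketch of how I'd check that, assuming the `ollama` Python client and the `gemma3:1b`/`gemma3:4b` tags (the default Ollama tags are 4-bit quantized, so the weights alone are roughly 1 GB for the 1B and about 3 GB for the 4B, plus KV cache overhead):

```python
# Minimal sketch: run a quantized Gemma model through the Ollama Python
# client. Assumes Ollama is running and the model has been pulled first,
# e.g. `ollama pull gemma3:1b`.
import ollama

response = ollama.chat(
    model="gemma3:1b",  # swap in "gemma3:4b" if you have the headroom
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response["message"]["content"])
```

If that loads fine but LM Studio's MLX builds crash, that points at the runtime rather than your hardware.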

For TTS, a colleague at Hugging Face made this bedtime story generator that runs entirely in the browser.

https://huggingface.co/spaces/webml-community/bedtime-story-...
https://www.youtube.com/watch?v=ds95v-Aiu5E&t

Be forewarned, though: this is not a good coding model out of the box. It could likely be trained into an autocompletion LLM, but with a 32k context window and its smaller size, it's not going to be refactoring entire codebases the way Jules/Gemini and other larger models can.
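To make the autocomplete idea concrete, here's a rough sketch of the shape that finetuning could take, assuming TRL's SFTTrainer and CodeGemma-style fill-in-the-middle tokens; the model id, special tokens, and toy dataset are placeholder assumptions, not a validated recipe:

```python
# Hypothetical sketch of autocomplete-style finetuning with TRL's SFTTrainer
# on fill-in-the-middle (FIM) formatted snippets. Check the model card for
# the special tokens the checkpoint actually supports.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy FIM-style examples: the model learns to produce the middle span
# given the surrounding prefix and suffix.
examples = [
    {"text": "<|fim_prefix|>def add(a, b):\n    <|fim_suffix|>\n<|fim_middle|>return a + b"},
]
dataset = Dataset.from_list(examples)

trainer = SFTTrainer(
    model="google/gemma-3-270m",  # placeholder small checkpoint
    train_dataset=dataset,
    args=SFTConfig(output_dir="gemma-autocomplete"),
)
trainer.train()
```

Even then, the 32k window caps how much repository context you can fit into a single completion request.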