| ▲ | Show HN: Gemma 4 Multimodal Fine-Tuner for Apple Silicon (github.com) |
| 171 points by MediaSquirrel 12 hours ago | 22 comments |
| About six months ago, I started working on a project to fine-tune Whisper locally on my M2 Ultra Mac Studio on a limited compute budget. I got into it. The problem at the time was that I had 15,000 hours of audio data in Google Cloud Storage, and there was no way all of it would fit on my local machine, so I built a system to stream data from GCS to my machine during training. Then Gemma 3n came out, so I added support for that. Kinda went nuts, tbh. Then I put it on the shelf. When Gemma 4 came out a few days ago, I dusted the project off, cleaned it up, broke the Gemma part out from the Whisper fine-tuning, and added support for Gemma 4. I'm presenting it here today for you to play with, fork, and improve upon. One thing I've learned so far: it's very easy to OOM when you fine-tune on longer sequences! My Mac Studio has 64GB of RAM, so I run out of memory constantly. Anywho, given how much interest there is in Gemma 4, and frankly the fact that you can't really do audio fine-tuning with MLX, that's the real reason this exists (in addition to my personal interest). I would have preferred to use MLX and not have had to build this, but here we are. Welcome to my little side quest. I hope you have as much fun using it as I had making it. -Matt |
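The stream-from-GCS idea can be sketched as a shuffle-buffer iterator that downloads one sample at a time, so the full 15,000 hours never has to sit on local disk. This is an illustration of the pattern, not the repo's actual code; the `fetchers` here are plain callables, where in practice each would wrap a GCS blob download (e.g. `blob.download_as_bytes` from the google-cloud-storage client).

```python
import random
from typing import Callable, Iterable, Iterator

def stream_samples(fetchers: Iterable[Callable[[], bytes]],
                   buffer_size: int = 4,
                   seed: int = 0) -> Iterator[bytes]:
    """Yield samples in shuffled order while holding at most
    `buffer_size` downloaded items in memory at once."""
    rng = random.Random(seed)
    buf = []
    for fetch in fetchers:
        buf.append(fetch())  # download lazily, one sample at a time
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    while buf:  # drain whatever is left in the buffer
        yield buf.pop(rng.randrange(len(buf)))

# Simulated "bucket": in practice these callables would be GCS downloads.
clips = [bytes([i]) for i in range(10)]
out = list(stream_samples((lambda c=c: c) for c in clips))
assert sorted(out) == clips  # every clip seen exactly once
```

The shuffle buffer is the usual compromise for streaming training data: you can't do a global shuffle without the whole dataset on disk, so you approximate it within a small in-memory window.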
|
| ▲ | LuxBennu 12 hours ago | parent | next [-] |
| I run whisper large-v3 on an m2 max 96gb and even with just inference the memory gets tight on longer audio, can only imagine what fine-tuning looks like. Does the 64gb vs 96gb make a meaningful difference for gemma 4 fine-tuning or does it just push the oom wall back a bit? Been wanting to try local fine-tuning on apple silicon but the tooling gap has kept me on inference only so far. |
| |
| ▲ | weitendorf 8 hours ago | parent | next [-] | | Hey, I was literally just working on this today (I was racing ahead on an audio FT myself, but OP beat me by a few hours). For audio inference, definitely try running your input through VAD first to drop junk data, as one of several possible preprocessing steps before sending the audio to the large model. You can check out how I did it here: https://github.com/accretional/vad/blob/main/pkg/vad/vad.go I was using https://huggingface.co/onnx-community/pyannote-segmentation-... because with ONNX I could run it on Intel servers with vectorized instructions, locally on my Mac, AND in-browser with transformers.js. VAD is absurdly time-effective (on the order of 10 seconds to segment an hour of audio) and reduces the false-positive rate/cost of transcription and multimodal inference: you can pass small bits of segmented audio into another model that specializes in that, then encode the result as text before passing it to the expensive model. | | |
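To make the segment-then-transcribe idea concrete, here is a toy energy-threshold VAD, not the pyannote model the commenter actually used: it marks frames whose RMS exceeds a threshold as voiced and merges runs of voiced frames into (start, end) spans. All parameter values are illustrative.

```python
import math

def vad_segments(samples, sr=16000, frame_ms=30, threshold=0.02):
    """Toy energy-based VAD: return (start_s, end_s) spans whose
    per-frame RMS exceeds `threshold`, merging adjacent voiced frames."""
    n = int(sr * frame_ms / 1000)  # samples per frame
    voiced = []
    for i in range(0, len(samples) - n + 1, n):
        frame = samples[i:i + n]
        rms = (sum(x * x for x in frame) / n) ** 0.5
        voiced.append(rms > threshold)
    segments, start = [], None
    for idx, v in enumerate(voiced):
        if v and start is None:
            start = idx
        elif not v and start is not None:
            segments.append((start * frame_ms / 1000, idx * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, len(voiced) * frame_ms / 1000))
    return segments

# 1 s of silence, 1 s of "speech", 1 s of silence at 16 kHz
sig = [0.0] * 16000 + [0.5 * math.sin(i / 10) for i in range(16000)] + [0.0] * 16000
print(vad_segments(sig))  # one segment covering roughly (1.0, 2.0)
```

A learned segmenter like pyannote is far more robust than an energy gate (it survives background noise and low-volume speech), but the downstream plumbing, dropping silence and feeding only voiced spans to the expensive model, looks the same.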
| ▲ | MediaSquirrel 8 hours ago | parent [-] | | Great minds think alike! Also, I had a huge head start, as I spent a month or two working on this in September 2025, shelved it and dusted it back off this weekend. | | |
▲ | weitendorf 8 hours ago | parent [-] | | Excellent work still; your repo is much more robust and fleshed out, and I'm just beelining straight to audio LoRA without really knowing what I'm doing, as this is my first time attempting a ~real ML training project. I think in https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... and https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... and https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... you have a superset of the various kludges I have in my fine-tuning repo; I'm going to study this and do what I can to learn from it. Really appreciate you sharing it here, and I'm definitely interested in swapping notes if you are. Probably the biggest thing that came out of this exercise for us was realizing that Apple actually ships some really powerful local inference/data-processing tools; they're just marketed toward application developers, so a lot of them fly under the radar. We just published https://github.com/accretional/macos-vision to make Apple's local OCR, image segmentation, foreground masking, facial analysis, classification, and video tracking accessible via CLI, and hopefully more common in ML and data workloads. Hopefully you or someone else can get some use out of it. I definitely will from yours! | | |
| ▲ | MediaSquirrel 6 hours ago | parent [-] | | Look inside here: https://github.com/mattmireles/gemma-tuner-multimodal/tree/m... Here’s the trick: use Gemini Pro deep research to create “Advanced Hacker’s Field Guide for X” where X is the problem that you are trying to solve. Ask for all the known issues, common bugs, unintuitive patterns, etc. Get very detailed if you want. Then feed that to Claude / Codex / Cursor. Basically, create a cheat sheet for your AI agents. This will unlock a whole new level of capability. I’m @mattmireles on Twitter — feel free to DM me. |
|
|
| |
| ▲ | MediaSquirrel 12 hours ago | parent | prev | next [-] | | Memory usage grows quadratically with sequence length, so using shorter sequences during fine-tuning prevents memory blowups. On my 64GB machine, I'm limited to input sequences of about 2,000 tokens, given that my average output for the fine-tuning task is around 1,000 tokens (~3k tokens total). | |
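A back-of-envelope for the quadratic term. The layer/head counts below are made-up round numbers, not Gemma 4's actual config, and fused kernels like FlashAttention avoid materializing these score matrices at all, but it shows why naive attention makes sequence length bite so hard:

```python
def attn_score_bytes(seq_len, n_layers=32, n_heads=16, bytes_per=2):
    """Naive attention materializes one seq_len x seq_len score matrix
    per head per layer (fp16 here), so memory grows with seq_len**2."""
    return seq_len ** 2 * n_heads * n_layers * bytes_per

for s in (1000, 2000, 4000):
    print(s, round(attn_score_bytes(s) / 2**30, 1), "GiB")
```

Doubling the sequence length quadruples this term, which is why a jump from ~3k to ~6k total tokens can be the difference between fitting in 64GB and OOMing, on top of weights, gradients, optimizer state, and activations.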
| ▲ | zozbot234 6 hours ago | parent | next [-] | | Shouldn't FlashAttention address the quadratic increase in memory footprint wrt. fine-tuning/training? I'm also pretty sure the quadratic cost does not apply to pure inference, due to how KV caching works. | |
| ▲ | LuxBennu 9 hours ago | parent | prev [-] | | Ah, that makes sense; quadratic scaling is brutal. So with 96GB I'd probably get somewhere around 4-5k total sequence length before hitting the wall, which is still pretty limiting for anything multimodal. Do you do any gradient checkpointing, or is that not worth the speed tradeoff at these sizes? | | |
| |
| ▲ | MediaSquirrel 8 hours ago | parent | prev [-] | | re: Whisper v3 -- how is this possible? Whisper has a 30s context window. You have to chunk it. |
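The chunking that Whisper's 30s context forces can be sketched at the sample-index level. This is a toy helper, not from any repo, and it leaves out merging the overlapping transcripts back together at the boundaries:

```python
def chunk_spans(n_samples, sr=16000, window_s=30.0, overlap_s=2.0):
    """Split an audio buffer into Whisper-sized windows.
    Returns (start, end) sample indices; consecutive windows overlap
    by `overlap_s` so words at a boundary appear in both chunks."""
    win = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)
    spans = []
    start = 0
    while start < n_samples:
        spans.append((start, min(start + win, n_samples)))
        if start + win >= n_samples:
            break
        start += step
    return spans

spans = chunk_spans(90 * 16000)  # a 90 s clip at 16 kHz
```

The overlap is what makes the stitching problem tractable: each boundary word is transcribed twice, and a merge pass can pick one copy instead of losing words cut mid-syllable.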
|
|
| ▲ | conception 9 hours ago | parent | prev | next [-] |
| I’m pretty excited about the Edge Gallery iOS app with Gemma 4 on it, but it seems like they hobbled it: no access to Intents, and you have to write custom plugins for web search, etc. Does anyone have a favorite way to run these usefully? ChatMCP works pretty well but only supports models via API. |
|
| ▲ | craze3 12 hours ago | parent | prev | next [-] |
| Nice! I've been wanting to try local audio fine-tuning. Hopefully it works with music vocals too |
|
| ▲ | mandeepj 7 hours ago | parent | prev | next [-] |
| > I had 15,000 hours of audio data
Do you really need that much data for fine-tuning? |
| |
| ▲ | MediaSquirrel 6 hours ago | parent [-] | | More data -> better, faster on-device models The actual plan was to distill Gemini 2.5 Pro into the best on-device voice dictation model. Pretty sure it would have worked. Alas. | | |
| ▲ | nomel 6 hours ago | parent [-] | | Reasons for running local aside... What is the practical latency difference you see between on-device and, say, whisper, in streaming mode, over the internet? Comparable? Seems that internet latency would be mostly negligible (assuming reasonable internet/cell coverage), or at least compensated for by the higher end hardware on the other side? | | |
| ▲ | MediaSquirrel 3 hours ago | parent [-] | | Depends on the model! If you run a smaller distil-whisper variant AND optimize the decoder to run on the Apple Neural Engine, you can get latency down to ~300ms without any backend infra. The issue is that the smaller models tend to suck, which is why the fine-tuning is valuable. My hypothesis is that you can distill a giant model like Gemini into a tiny Whisper model. But it depends on the machine you're running, which is why local AI is a PITA. |
|
|
|
|
| ▲ | neonstatic 7 hours ago | parent | prev | next [-] |
| Just a heads up that I found NVIDIA Parakeet to be way better than Whisper: faster, uses less compute, the output is better, and there are more output options. I am using parakeet-mlx from the command line. Check it out! |
| |
| ▲ | MediaSquirrel 3 hours ago | parent [-] | | Yeah, it came out after I started on my project last year. Only issue is that you can't fine-tune it on Apple Silicon. |
|
|
| ▲ | dsabanin 12 hours ago | parent | prev | next [-] |
| Thanks for doing this. Looks interesting, I'm going to check it out soon. |
| |
|
| ▲ | yousifa 12 hours ago | parent | prev | next [-] |
| This is super cool, will definitely try it out! Nice work |
|
| ▲ | pivoshenko 11 hours ago | parent | prev [-] |
| nice! |