whinvik 6 days ago

Curious: are there real-world use cases where people have fine-tuned such tiny models and put them into production?

itake 6 days ago | parent | next [-]

My job uses tiny models to decide when to escalate to bigger models. The tiny model provides a label, and if it’s high confidence, we escalate to ChatGPT to confirm.
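
Roughly, the pattern looks like this (a minimal sketch, not our actual stack; the models, labels, and threshold are placeholders):

    # Cascade: a tiny local classifier first, a big model to confirm
    # high-confidence labels. Model names and threshold are illustrative.
    from transformers import pipeline
    from openai import OpenAI

    classifier = pipeline("text-classification",
                          model="distilbert-base-uncased-finetuned-sst-2-english")
    client = OpenAI()

    def label_text(text: str, threshold: float = 0.9) -> str:
        result = classifier(text)[0]  # e.g. {"label": "POSITIVE", "score": 0.97}
        if result["score"] < threshold:
            return result["label"]  # low confidence: keep the cheap local answer
        # High confidence: escalate to the larger model to confirm.
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"Text: {text}\nProposed label: {result['label']}\n"
                                  "Answer YES if the label fits, otherwise NO."}],
        )
        confirmed = reply.choices[0].message.content.strip().upper().startswith("YES")
        return result["label"] if confirmed else "NEEDS_REVIEW"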

I also want to try this with language detection. Existing open-source ML models have weaknesses with mixed-language text, text length, or domain limitations in the underlying training data (like being trained on Bible translations).

deepsquirrelnet 6 days ago | parent | prev | next [-]

I’m not sure what I’d use them for, except maybe tag generation? Encoder models of this size usually outperform tiny decoder LLMs by a wide margin on the tasks where they overlap.

dismalaf 6 days ago | parent [-]

I'm making an app where literally all I want an LLM to do is generate tags. This model has failed with flying colours: it takes forever to parse anything and doesn't follow instructions.

Edit - I should add that the model I'm currently using is Gemini Flash Lite through the Gemini API. It's a really good combination of fast, instruction-following, correct for what I want, and cost-effective. I'd still love a small open model that can run on the edge, though.
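
For context, the whole integration is only about this much code (a sketch assuming the google-genai Python SDK; the exact Flash Lite model id is a placeholder):

    # Tag generation via the Gemini API. Sketch assuming the google-genai
    # SDK; the model id below is a placeholder for whichever Flash Lite
    # version you have access to.
    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")

    def generate_tags(text: str) -> list[str]:
        prompt = ("Return 3-5 short topic tags for the text below as a "
                  "comma-separated list and nothing else.\n\n" + text)
        response = client.models.generate_content(
            model="gemini-2.0-flash-lite", contents=prompt)
        return [t.strip() for t in response.text.split(",") if t.strip()]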

coder543 6 days ago | parent | next [-]

I'm pretty sure you're supposed to fine-tune the Gemma 3 270M model to actually get good results out of it: https://ai.google.dev/gemma/docs/core/huggingface_text_full_...

Use a large model to generate outputs that you're happy with, then use those same inputs (including the prompt) and outputs to teach the 270M model what you want from it.
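
A minimal sketch of that distillation-style fine-tune, assuming TRL's SFTTrainer and a JSONL file of prompt/output pairs you collected from the large model (paths and hyperparameters are placeholders):

    # Fine-tune Gemma 3 270M on (prompt, large-model output) pairs.
    # Each JSONL record: {"messages": [{"role": "user", "content": ...},
    #                                  {"role": "assistant", "content": ...}]}
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    dataset = load_dataset("json", data_files="distilled_pairs.jsonl",
                           split="train")

    trainer = SFTTrainer(
        model="google/gemma-3-270m-it",
        train_dataset=dataset,
        args=SFTConfig(output_dir="gemma-270m-tuned", num_train_epochs=3),
    )
    trainer.train()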

deepsquirrelnet 6 days ago | parent | prev | next [-]

Oof. I also had it refuse a completely harmless instruction for “safety”. So that’s another dimension of issues with operationalizing it.

thegeomaster 6 days ago | parent | prev [-]

Well, Gemini Flash Lite is at least one, and likely two, orders of magnitude larger than this model.

dismalaf 6 days ago | parent [-]

That's fair, but one can dream of simply running a useful LLM on CPU on your own server to simplify your app and save costs...

TrueDuality 5 days ago | parent | prev | next [-]

We're currently running ~30 Llama 3.1 models, each with a different fine-tuned LoRA adapter for its specific task. There was some initial pain as we refined the prompts, but it has been stable and we've been happy with it for a while.

Since the Qwen3 0.6B model came out, we've been training those instead. We can't quite compare apples to apples: we now have a better, deeper training dataset, built from the pathological and exceptional cases that came out of our production environment. Right now those models look to be at about parity with our existing stack on quality, and quite a bit faster.

I'm going to try running one of our training regimens with this model and see how it compares. We're not running models quite this small yet, but it wouldn't surprise me if we could.
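
For anyone curious, the per-task adapter setup looks roughly like this (a sketch with Hugging Face PEFT; the base model, target modules, and rank are illustrative, not our exact config):

    # One small base model, many task-specific LoRA adapters. Only the
    # adapter weights are trained; each adapter is saved separately and
    # swapped onto the shared base at serving time.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora)
    model.print_trainable_parameters()

    # ... fine-tune on the task's dataset, then:
    model.save_pretrained("adapters/task-name")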

marcyb5st 6 days ago | parent | prev | next [-]

I built a reranker for a RAG system using a tiny model. After candidate generation (i.e., vector search + BM25) and business-logic filters/ACL checks, the remaining chunks went through a model that, given the user query, decided whether each chunk was really relevant. That hit production, but once model context sizes grew, that piece was discarded, since passing everything through yielded better results and better prices (the fact that input-token prices went down also played a role, I'm sure).

So only for a while, but it still counts :)
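
Conceptually the relevance filter was just this (a generic sketch using a small cross-encoder from sentence-transformers; not the exact model or threshold we ran):

    # Rerank retrieved chunks against the user query; keep only chunks the
    # model scores as relevant. Model id and threshold are placeholders.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def filter_chunks(query: str, chunks: list[str],
                      threshold: float = 0.0) -> list[str]:
        scores = reranker.predict([(query, c) for c in chunks])
        return [c for c, s in zip(chunks, scores) if s > threshold]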

nevir 6 days ago | parent | prev | next [-]

IIRC, Android (at least on Pixel devices) uses fine-tuned Gemma model(s) for some on-device assistant features.

cyanydeez 6 days ago | parent | prev [-]

9gag.com commenter