azath92 8 hours ago

Just to prompt thought on this exact question, since I'm interested in answers:

I just ran a benchmark against Haiku on a very simple document classification task that we currently farm out to Haiku in parallel. Very naive setup: same prompt, same API (AWS Bedrock). A few of the small models are a pretty good match, and could easily be run locally or cheaply via a hosted provider. The "how much data and how much improvement" question is one I don't have good intuition for anymore; I don't even have an order-of-magnitude guess on those two axes.

Here are raw numbers to spark discussion:

| Model | DocType% | Year% | Subject% | In $/MTok |
|---------------|----------|-------|----------|-----------|
| llama-70b | 83 | 98 | 96 | $0.72 |
| gpt-oss-20b | 83 | 97 | 92 | $0.07 |
| ministral-14b | 84 | 100 | 90 | $0.20 |
| gemma-4b | 75 | 93 | 91 | $0.04 |
| glm-flash-30b | 83 | 93 | 90 | $0.07 |
| llama-1b | 47 | 90 | 58 | $0.10 |

Percents are doc type (categorical), year, and subject-name match against Haiku; it just uses the first 4 pages.
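For anyone curious how those percentages are computed, here's a minimal sketch of the scoring: per-field agreement between a small model's labels and Haiku's labels, treating Haiku as the reference. The label dicts and field names below are hypothetical, not the author's actual schema.

```python
# Sketch: per-field agreement % against a reference labeler (e.g. Haiku).
# Field names and sample labels are made up for illustration.

def agreement(reference: list[dict], candidate: list[dict], fields: list[str]) -> dict:
    """Percent of documents where candidate matches reference, per field."""
    scores = {}
    for field in fields:
        matches = sum(
            1
            for ref, cand in zip(reference, candidate)
            if str(ref.get(field, "")).strip().lower()
            == str(cand.get(field, "")).strip().lower()
        )
        scores[field] = round(100 * matches / len(reference), 1)
    return scores

haiku_labels = [
    {"doc_type": "invoice", "year": "2021", "subject": "Acme Corp"},
    {"doc_type": "report", "year": "2019", "subject": "Q3 Review"},
]
small_model_labels = [
    {"doc_type": "invoice", "year": "2021", "subject": "Acme Corp"},
    {"doc_type": "letter", "year": "2019", "subject": "Q3 Review"},
]

print(agreement(haiku_labels, small_model_labels, ["doc_type", "year", "subject"]))
# {'doc_type': 50.0, 'year': 100.0, 'subject': 100.0}
```

Exact string match (after case/whitespace normalization) is the simplest choice; fuzzier matching on the subject field would change the numbers.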

In the old world where these were my own in-house models, I'd be interested in seeing whether I could lift those numbers with training, but I haven't done that with the new LLMs in a while. Keen to get even a finger in the air if possible.

Can easily generate tens of thousands of examples.

Might try myself, but always keen for an opinion.

_edit for table formatting_

arkmm 4 hours ago | parent | next [-]

You can fine-tune a small LLM on a few thousand examples in just a few hours for a few dollars. It can be a bit tricky to host, but if you share a rough idea of the volume and whether this needs to be real-time or batched, I could list some of the tradeoffs you'd think about.

Source: consulted for a few companies helping them fine-tune a bunch of LLMs. Typical categorical / data-extraction use cases saw ~10x fewer errors at 100x lower inference cost than using the OpenAI models at the time.

faxmeyourcode 6 hours ago | parent | prev | next [-]

Labeling or categorization tasks like this are the bread and butter of small fine-tuned models, especially if you need outputs in a specific JSON format or whatever.

I did an experiment with very simple SFT on Mistral 7B, and it was extremely good at converting receipt images into structured JSON outputs with only 1,000 examples. The difficulty is getting a diverse enough set of examples, evaluating, etc.

If you have great data with simple input output pairs, you should really give it a shot.
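The "simple input/output pairs" part is mostly data plumbing. As a sketch of what that looks like, here's one way to turn labeled pairs into the chat-style "messages" JSONL that most SFT stacks (e.g. TRL's SFTTrainer, or hosted fine-tuning APIs) accept. The system prompt, field names, and sample pair are hypothetical.

```python
import json

# Sketch: convert (document text, label) pairs into chat-format JSONL
# records for supervised fine-tuning. Schema and examples are made up.

SYSTEM = "Classify the document. Reply with JSON: doc_type, year, subject."

def to_sft_record(doc_text: str, label: dict) -> dict:
    """One training example in the common 'messages' format."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": doc_text},
            {"role": "assistant", "content": json.dumps(label)},
        ]
    }

pairs = [
    ("ACME CORP INVOICE 2021 ...", {"doc_type": "invoice", "year": "2021", "subject": "Acme Corp"}),
]

# One JSON object per line = one training example.
for text, label in pairs:
    print(json.dumps(to_sft_record(text, label)))
```

Serializing the label as a JSON string in the assistant turn is what teaches the model the output format; the diversity problem the parent mentions lives entirely in how varied `pairs` is.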

airstrike 7 hours ago | parent | prev | next [-]

if you add 2 spaces at the start of the line, you turn it into a code block

  like this
andai 6 hours ago | parent [-]

  | Model | DocType% | Year% | Subject% | In $/MTok |

  |----------------|----|-----|----|-------|

  | llama-70b -----| 83 |  98 | 96 | $0.72 |

  | gpt-oss-20b ---| 83 |  97 | 92 | $0.07 |

  | ministral-14b -| 84 | 100 | 90 | $0.20 |

  | gemma-4b ------| 75 |  93 | 91 | $0.04 |

  | glm-flash-30b -| 83 |  93 | 90 | $0.07 |

  | llama-1b ------| 47 |  90 | 58 | $0.10 |