AJRF 6 days ago

I've got a very real-world use case for DistilBERT: learning how to label WordPress articles. It's one of those things where tagging is kind of valuable, but not valuable enough to spend loads on compute for.

The great thing is I have enough data (100k+ examples) to fine-tune on and run a meaningful classification report over. The data is very diverse, and while the labels aren't totally evenly distributed, I can deal with the imbalance with a few tricks.
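The main trick is inverse-frequency class weights in the loss. Roughly this (untested sketch; WeightedTrainer is just my name for the usual Trainer subclass, and y is the list of training label ids):

    import numpy as np
    import torch
    import torch.nn.functional as F
    from sklearn.utils.class_weight import compute_class_weight
    from transformers import Trainer

    # Inverse-frequency weights computed from the training labels
    w = compute_class_weight("balanced", classes=np.unique(y), y=y)

    class WeightedTrainer(Trainer):
        def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
            labels = inputs.pop("labels")
            outputs = model(**inputs)
            weight = torch.tensor(w, dtype=torch.float, device=outputs.logits.device)
            # Weighted cross-entropy so rare labels count for more
            loss = F.cross_entropy(outputs.logits, labels, weight=weight)
            return (loss, outputs) if return_outputs else loss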

Can't wait to swap it out for this and see the changes in the scores. Will report back.

minimaxir 6 days ago | parent | next [-]

ModernBERT may be a better base model if you're fine-tuning for a specific use case: https://huggingface.co/blog/modernbert
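Untested, but it should be close to a drop-in with Transformers, something like (num_labels=8 is just an example, and ModernBERT needs a fairly recent transformers release):

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_id = "answerdotai/ModernBERT-base"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=8)
    # From here it's the same Trainer fine-tuning loop you'd use for DistilBERT.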

diwank 6 days ago | parent [-]

Also, Ettin is a new favorite and a solid alternative: https://huggingface.co/jhu-clsp/ettin-encoder-1b

I'd encourage you to give SetFit a try: aggressively deduplicate your training set, find the top ~2500 clusters per label, and use SetFit to train a multilabel classifier on that.
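Rough sketch of that pipeline (untested; assumes a recent setfit, texts is your deduped list[str] and labels_mh is an (n, n_tags) multi-hot numpy array, 2500 is the per-label cluster budget and the encoder choice is up to you):

    import numpy as np
    from datasets import Dataset
    from sklearn.cluster import KMeans
    from sentence_transformers import SentenceTransformer
    from setfit import SetFitModel, Trainer

    # Embed once, then pick one representative example per cluster per label
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(texts, normalize_embeddings=True)

    keep = set()
    for t in range(labels_mh.shape[1]):
        idx = np.flatnonzero(labels_mh[:, t])
        k = min(2500, len(idx))
        km = KMeans(n_clusters=k).fit(emb[idx])
        for c in range(k):  # keep the example nearest each centroid
            members = idx[km.labels_ == c]
            d = np.linalg.norm(emb[members] - km.cluster_centers_[c], axis=1)
            keep.add(int(members[d.argmin()]))

    keep = sorted(keep)
    model = SetFitModel.from_pretrained(
        "sentence-transformers/paraphrase-mpnet-base-v2",
        multi_target_strategy="one-vs-rest",  # multilabel head
    )
    train_ds = Dataset.from_dict(
        {"text": [texts[i] for i in keep], "label": labels_mh[keep].tolist()}
    )
    Trainer(model=model, train_dataset=train_ds).train()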

Either way- would love to know what worked for you! :)

ramoz 6 days ago | parent | prev | next [-]

Please provide updates when you have them.

weird-eye-issue 6 days ago | parent | prev [-]

It's going to perform badly unless you have very few tags and they're easy to classify.

AJRF 6 days ago | parent [-]

You can solve this by training a model per taxonomy, then wrapping the individual models in a wrapper model that outputs joint probabilities. The largest number of labels I have in any one taxonomy is 8.
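Something like this (sketch; the checkpoint names are placeholders for my fine-tuned per-taxonomy classifiers):

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # One fine-tuned classifier per taxonomy
    CHECKPOINTS = {"topic": "me/bert-topic", "audience": "me/bert-audience"}

    class TaxonomyEnsemble:
        def __init__(self, checkpoints):
            self.parts = {
                name: (AutoTokenizer.from_pretrained(ckpt),
                       AutoModelForSequenceClassification.from_pretrained(ckpt).eval())
                for name, ckpt in checkpoints.items()
            }

        @torch.no_grad()
        def predict(self, text):
            probs = {}
            for name, (tok, model) in self.parts.items():
                logits = model(**tok(text, return_tensors="pt", truncation=True)).logits
                probs[name] = F.softmax(logits, dim=-1).squeeze(0)
            # Treating taxonomies as independent, the joint probability of any
            # label combination is the product of the per-taxonomy probabilities.
            return probs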