Why does Microsoft keep releasing models trained on synthetic data? Is it possible their contract with OpenAI won't let them do anything else?

I would think Microsoft, of all companies, would want to be working on their own LLM behind the scenes, even if they're relying on OpenAI for the bulk of their work.

Meta seems to be the only US company releasing big 'open source' models, while Chinese companies continue to release many completely open source LLMs.

vineyardmike 7 hours ago

I don’t think anything in their contract strictly prevents it. I think they’re just trying not to “waste” resources competing to build another expensive foundation model. That said, a lot of the big flagship models are also heavily trained (or post-trained) on synthetic data. Microsoft has done a lot of application-specific fine-tuning research.

It makes sense for this model in particular to be trained on synthetic data, though. It’s explicitly trained to control a computer, and I doubt there’s enough public training data for this use case.

I suspect that Chinese models are largely forced to open source as a trust building step because of general China-phobia in the west. There are tons of stellar LLMs available from major US companies if you’re just using an API. Open sourcing is also a convenient marketing and differentiation opportunity. Some of the companies behind the bigger “agentic” models have started to offer a cheap subscription alternative to US companies. If they build up a big enough business, I wouldn’t be surprised if they stop open sourcing right away.

	fisf 22 minutes ago

		> I suspect that Chinese models are largely forced to open source as a trust building step because of general China-phobia in the west.

		The obvious bias of the models, when it comes to Chinese politics and history, certainly does not help here.

jillesvangurp an hour ago

It's a cost- and time-saving measure. Human labeling is hard to scale and takes time. With synthetic data, they can train faster and cheaper, which speeds up the pace at which they produce new models and run experiments with new types of models. Grok is doing similar things. It's smart.

freehorse 2 hours ago

My guess is that it is safer for them to use only synthetic data, as they then have less to worry about, such as people using the models for erotic roleplay and the like.

dev_hugepages 2 hours ago

They're not very skilled

yousif_123123 8 hours ago

Perhaps they want to be able to run them on mobile hardware they release?

	pogue 7 hours ago

		I can definitely see them wanting models that can run locally on Windows computers or Surface tablets, although their focus seems to be sticking Copilot into absolutely anything and everything possible. But why synthetic data models? Other companies have made small-parameter models, but they don't seem to keep them up to date (correct me if I'm wrong).

Mars008 6 hours ago

> Why does Microsoft keep releasing models trained on synthetic data?

Why not? That's the way to go. In some domains the only way to go.