I keep seeing these "sovereign" LMs time and time again. In Sweden we had GPT-SW3 (https://www.ai.se/en/project/gpt-sw3), and it was the same story there. Instead of burning money on "sovereign" claims, national research labs should focus on building on top of solid baselines (like Qwen/Kimi) and fine-tuning frontier models with real agentic utility that can be applied across actual use cases and widely used by their people, basically for free. Nations should mirror what Cursor has done with Composer 2.5, for example.

TJSomething 3 hours ago

If open frontier models start closing up and states start imposing more export controls on AI services and hardware, it might be good to ensure the supply chain is there to reproduce the SotA, or even a couple of generations behind it.

627467 an hour ago

Do we know for sure how much of a national corpus of knowledge (like Dutch) goes into these "global" models and how that affects "localized" model biases? What's wrong with specialized models?

thevinter 4 hours ago

And what happens once the "solid baselines" become unavailable for one reason or another?

zozbot234 3 hours ago

You keep building on the last available version? Fine-tuning is a whole lot cheaper, easier and more useful than pretraining a model from scratch. It's a complete no-brainer.

rapidfl 2 hours ago

> You keep building on the last available version?

Yes, but a sovereign can allocate some resources and a few people to stay in the loop at a first-principles level. No need to wait for a rug pull. Of course, it cannot compete with the frontier labs, but it is good to have researchers and professors "in-house". LLMs are here for the long term.

ozim 3 hours ago

Seems like you don’t understand.

You take the current version and build on top of it. You have the weights.

You might not get some n+1 version at some point, but the version n you have will still most likely be much better than whatever you come up with by burning the goodwill money of people who believe in "sovereignty".

You are not getting ahead in this game by being "true to your local values"; capital expenditure is insane in this game.

mschuster91 3 hours ago

Kimi and Qwen come out of China, which means that their training material may be biased, e.g. in relation to Taiwan [1]. In addition, there is no way to determine what input went into the training, whether it was properly licensed, whether it was legal (e.g. not contaminated by CSAM), or how the human component of RLHF was sourced. In US models, for example, stories about exploitation like [2] have been floating around for years.

Assuming we Europeans finally get our act together, I think it is better for our long-term future (and for addressing the ethical problems) if we manage to build a baseline of training input and data ourselves, from scratch, with everything being ethically sourced.

Oh, and while we're at it, the EU has 24 official languages plus a host of minority languages. Most LLMs focus on the English, German, French and Chinese languages, but everything else is... left behind at best. A European model with actual funding and proper data sources might be able to significantly narrow that gap.

[1] https://www.taiwannews.com.tw/news/6245677

[2] https://www.theguardian.com/technology/2024/apr/16/techscape...

dr_dshiv 2 hours ago

OCR error rates are somewhere north of 8%. That will hurt model quality!

gnerd00 an hour ago

> Most LLMs focus on the English, German, French and Chinese languages, but everything else is... left behind at best.

That is not true, so please read before forming an opinion. The French Mistral project, for example, shipped seven+ years ago with 140 languages. Language translation was the first LLM task, from 2015.

siva7 2 hours ago

Uh, some would say it's easy to determine what input went into the training for Kimi and Qwen, since they were caught stealing it from American labs. Some cultural cliches may never change.

janc_ an hour ago

It's well known that all commercial models are based on stolen content. That doesn't mean there is no filtering/censoring, just that the censoring likely depends on where it's happening…

ignoramous an hour ago

> since they were caught stealing it from American labs. Some cultural cliches may never change.

Has a formal lawsuit been brought? Given that Anthropic & OpenAI are being dragged through the courts for copyright violation (or stealing, as you'd call it, if the companies involved were culturally Chinese) by newspapers, publishing houses, etc., one would think they'd pass on some of that medicine to Alibaba, which does have business entities registered in the US.