Kimi and Qwen come out of China, which means that their training material may be biased, e.g. on topics relating to Taiwan [1]. In addition, there is no way to determine what input went into the training, whether it was properly licensed, whether it was legal (e.g. not contaminated by CSAM), or how the human component of RLHF was sourced. For US models, for example, stories about exploitation like [2] have been floating around for years.

Assuming we Europeans finally get our act together, I think it is better for our long-term future (and for addressing the ethical problems) if we manage to build a baseline of training input and data ourselves, from scratch, with everything being ethically sourced.

Oh and, while we're at it, the EU has 24 official languages plus a host of minority languages. Most LLMs focus on the English, German, French and Chinese languages, but everything else is... left behind at best. A European model with actual funding and proper data sources might be able to significantly narrow that gap.

[1] https://www.taiwannews.com.tw/news/6245677

[2] https://www.theguardian.com/technology/2024/apr/16/techscape...

dr_dshiv 2 hours ago | parent | next [-]

OCR error rates are somewhere north of 8%... that will hurt model quality!

gnerd00 an hour ago | parent | prev | next [-]

> Most LLMs focus on the English, German, French and Chinese languages, but everything else is... left behind at best.

That is not true, so please read up before forming an opinion. The French Mistral project shipped seven+ years ago with 140 languages, for example. Language translation was the first LLM task, from 2015.

siva7 2 hours ago | parent | prev [-]

Uh, some would say it's easy to determine what input went into the training for Kimi and Qwen... since they were caught stealing it from American labs. Some cultural cliches may never change.

	janc_ an hour ago | parent | next [-]

		It's well-known that all commercial models are based on stolen content. That doesn't mean there is no filtering/censoring, just that the censoring likely depends on where it's happening…

	ignoramous an hour ago | parent | prev [-]

		> since they were caught stealing it from American labs. Some cultural cliches may never change.

		Has a formal lawsuit been brought? Given that Anthropic & OpenAI are being dragged through the courts for copyright violation (or stealing, as you'd call it, if the companies involved were culturally Chinese) by newspapers, publishing houses etc., one would think they'd pass on some of that medicine to Alibaba, which does have business entities registered in the US.