KK7NIL 4 hours ago

What's your evidence for that?

And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming), why not go for English or a mixture of languages, which is essentially what they did by starting with EuroLLM?

embedding-shape 4 hours ago

Evidence? Not so much; I didn't realize I was defending a PhD thesis here.

I speak Spanish, have talked with people who only speak Portuguese (either variant), and have also talked with Portuguese people about how they see their language compared with Brazilian Portuguese, and vice versa. So, basically, it's based on vibes and experience.

> And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming), why not go for English

I'm not sure how many languages you speak or have encountered in the wild, but some languages are VERY different from each other, some are a bit different, and others are basically the same with minor differences. Doing what I describe is easier for languages that are similar than for languages that are very different, for what I hope are obvious reasons.

KK7NIL 3 hours ago

> I'm not sure how many languages you speak or have encountered in the wild, but some languages are VERY different from each other, some are a bit different, and others are basically the same with minor differences.

I'm a dual citizen of Portugal and Brazil and I live in the US now, so that's my linguistic background. (Also studied bits of French, Russian, Latin and Greek.)

> Doing what I describe is easier for languages that are similar than for languages that are very different, for what I hope are obvious reasons.

Not only are your reasons not obvious, but your conclusion is actually wrong.

If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals), it might actually make more sense to train it in any other language BUT Brazilian Portuguese (say, English), then fine-tune it for European Portuguese.

LLMs have been shown to be very good at generalizing across languages (the transformer architecture literally comes from work on machine translation, IIRC).
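
To make the "fine-tune it for European Portuguese" step concrete, here's a rough sketch of that recipe with HuggingFace transformers: take an English-pretrained base model and continue causal-LM training on a pt-PT corpus. The model name and corpus file below are placeholders I picked for illustration, not anything the team actually used:

    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    # Stand-in English-pretrained base model; any causal LM would do.
    base = "EleutherAI/pythia-160m"
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers ship without a pad token
    model = AutoModelForCausalLM.from_pretrained(base)

    # Hypothetical European Portuguese corpus, one document per line.
    ds = load_dataset("text", data_files={"train": "pt_pt_corpus.txt"})["train"]

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    ds = ds.map(tokenize, batched=True, remove_columns=["text"])

    # mlm=False -> plain next-token (causal) language modeling objective.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    args = TrainingArguments(
        output_dir="pt-pt-finetune",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=2e-5,
    )

    Trainer(
        model=model,
        args=args,
        train_dataset=ds,
        data_collator=collator,
    ).train()

Whether the result leans pt-PT or keeps its English habits is exactly the empirical question we're arguing about; the cross-lingual transfer results suggest the base language matters less than you'd think.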

embedding-shape 2 minutes ago

> If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals)

Oh, I wasn't aware that was their goal. It would certainly be intuitive to avoid Brazilian Portuguese if that's the case, although I'm still not sure it actually makes sense to avoid it 100% for pre-training even if you're trying to avoid Brazilian bias; you can "skew" things pretty heavily in post-training if you so wish.

Where can I read more about this goal? It doesn't seem to be mentioned in the submission article, aside from a short off-hand remark about one of the benchmarks, so I'm guessing there's some resource where they talk more about this specifically?