KK7NIL 3 hours ago

> I'm not sure how many languages you speak or encountered in the wild before, but some languages are VERY different from each other, some are a bit different and others are basically the same with some differences.

I'm a dual citizen of Portugal and Brazil and I live in the US now, so that's my linguistic background. (Also studied bits of French, Russian, Latin and Greek.)

> Doing what I describe for languages that are similar is easier than languages that are very different, for what I hope are obvious reasons.

Not only are your reasons not obvious, but your conclusion is actually wrong.

If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals), it might actually make more sense to train it in any language BUT Brazilian Portuguese (say, English), and then fine-tune it for European Portuguese.

LLMs have been shown to be very good at generalizing across languages (the transformer architecture originally comes from work on machine translation, IIRC).

embedding-shape 2 minutes ago

> If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals)

Oh, I wasn't aware that was their goal. Avoiding Brazilian Portuguese would certainly be intuitive in that case, although I'm still not sure it makes sense to avoid it entirely in pre-training even when you're trying to avoid Brazilian bias: you can "skew" things pretty heavily in post-training if you so wish.
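
To illustrate what I mean by "skewing" in post-training, here's a deliberately toy analogy (not how real LLM fine-tuning works): a unigram count model stands in for a learned distribution, "pre-trained" on a BR-heavy mix and then "fine-tuned" on EU-PT data. The corpus sizes and the word pair (ônibus/autocarro, written without accents here) are made up for illustration.

```python
from collections import Counter

# Toy "model": word counts standing in for a learned preference over
# dialect-specific vocabulary. Purely illustrative numbers.
pretrain_corpus = ["onibus"] * 90 + ["autocarro"] * 10   # BR-heavy pre-training mix
finetune_corpus = ["autocarro"] * 100                    # EU-PT post-training data

model = Counter(pretrain_corpus)

def top_word(counts):
    # The word the "model" prefers, i.e. the mode of its distribution.
    return counts.most_common(1)[0][0]

print(top_word(model))   # before fine-tuning: "onibus" (BR-biased)

# "Fine-tune": continue accumulating counts on the EU-PT data,
# shifting the distribution toward the target dialect.
model.update(finetune_corpus)
print(top_word(model))   # after fine-tuning: "autocarro" (EU-PT preferred)
```

The point is just that the pre-training bias doesn't have to survive post-training if the post-training data outweighs it on the dimensions you care about.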

Where can I read more about this goal? It doesn't seem to be mentioned in the submitted article beyond a short off-hand remark about one of the benchmarks, so I'm guessing there's some other resource where they discuss it specifically?