swiftcoder 4 hours ago

It is definitely an interesting problem, because Portugal is a small enough country that the actual total corpus of available texts in (non-Brazilian) Portuguese is potentially problematic.

fy20 3 hours ago | parent | next [-]

European Portuguese is the 13th most populous language in Europe. Not that small, there are many other European languages in use that are much smaller.

https://en.wikipedia.org/wiki/List_of_languages_by_number_of...

SkeuomorphicBee 29 minutes ago | parent | next [-]

What makes Portugal's situation unique is that it is a small population that is eclipsed in models by the bigger weights of the much bigger population of Brazil.

Yes, there are much smaller European countries, but those are generally the only source of truth for their specific language, so the context of an LLM query in that language steers the LLM towards facts from that country. For example, if I ask a big generic LLM something in Latvian, it will most likely answer with something relevant to Latvia. But Portugal, being the much smaller user of its language, has the somewhat unique problem that if I ask a generic model something in Portuguese, it will probably answer with something related to Brazil instead of Portugal.

Maybe the UK and Spain have somewhat similar struggles, but I suspect that neither has it as bad as Portugal in that regard.

augusto-moura 2 hours ago | parent | prev | next [-]

It is pretty small when considering content output. It is only 11 million people, and only a fraction of them will be writing something that could be used in training datasets. If you look at the countries by scientific contribution, for example [1], Portugal is in 28th position, while Brazil is in 14th with more than double the number of contributions.

Don't get me wrong, it is definitely impressive given Portugal's actual size, but I believe there's a hard limit set by population and size that will be difficult to cross.

[1]: https://en.wikipedia.org/wiki/List_of_countries_by_number_of...

depaulagu 3 hours ago | parent | prev [-]

> European Portuguese is the 13th most populous language in Europe

that's not impressive

senko 2 hours ago | parent [-]

Hello from 23rd

embedding-shape 4 hours ago | parent | prev [-]

I don't think so. Portugal the country might be small, with a small population, but there are ~250 million "Lusophones" (native Portuguese speakers), making it the fifth-most spoken native language in the world. I'd hardly call that small :) And before everyone screams: yes, European Portuguese is different from Brazilian Portuguese, but they're still both Portuguese and their speakers understand each other, so it's not like the text from one cannot be used to train a model for the other, or vice-versa.

All in all, I don't think that's a major issue here.

swiftcoder 4 hours ago | parent | next [-]

The authors are pretty clearly trying to draw only from European Portuguese sources - I feel like there's a fairly widespread attitude here that the language is being overwhelmed by the sheer number of Brazilian speakers (which there is obviously at least some truth to).

I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).

madaxe_again 4 hours ago | parent | prev | next [-]

Man, there’s an attitude up here in Trás-os-Montes that the rest of Portugal has spoken unrecognisable trash for a century. It took me years to realise I’d learned hilariously antique Portuguese by moving there.

Then again, if you go to Miranda do Douro, they’ll say the rest of Portugal has been talking nonsense for the last 700 years, so the purists at least always have their convents to retreat to if they so choose.

philipwhiuk 3 hours ago | parent | prev [-]

> I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).

That's easy to say when you're not on the other end of US defaultism.

augusto-moura 2 hours ago | parent | next [-]

To be fair, it is only natural: Portuguese itself only came to be because the Roman Empire conquered the Lusitanian lands [1], a lot of English comes from Norman French via the Norman conquest [2], the Americas didn't speak European languages until 500 years ago or so [3], etc.

Given enough time, all languages will change, and some of them will change because of major political shifts or conquests.

[1]: https://en.wikipedia.org/wiki/Paleohispanic_languages

[2]: https://en.wikipedia.org/wiki/Influence_of_French_on_English

[3]: https://en.wikipedia.org/wiki/Indigenous_languages_of_the_Am...

swiftcoder an hour ago | parent | prev [-]

> That's easy to say when you're not on the other end of US defaultism.

I mean, I’m a Brit who lived a long time in the US, so that’s a dynamic with which I am rather familiar

evandrofisico 29 minutes ago | parent | prev | next [-]

Portugal has a growing xenophobic attitude towards immigrants, especially Brazilians, and this is reflected in linguistic prejudice.

They are concerned about Portuguese children learning to "speak Brazilian", because a lot more video content is produced in Brazil than in Portugal, and things like movies, videogames and software in general are available in Brazilian localization/adaptation first.

embedding-shape 19 minutes ago | parent [-]

We have the same thing happening, on multiple levels, here too. First, some Spanish parents are afraid their children aren't listening to and watching enough Spanish media. Then, additionally, some Catalan parents are afraid their children don't get to use Catalan in school, so they don't become proficient enough to use it in society.

darkwater 4 minutes ago | parent [-]

The Catalan situation is completely different and unrelated: Catalan is a completely different language, and it is not endangered (with or without scare quotes, as you prefer) by an ex-colony that became independent. Actually, many Catalans would like to be such an ex-colony.

mghackerlady 3 hours ago | parent | prev | next [-]

Right, but most of those speak Brazilian Portuguese. There's so much less European Portuguese text that it becomes impossible for a model not to speak Brazilian Portuguese unless it is trained in a way that ignores Brazilian sources.

KK7NIL 4 hours ago | parent | prev | next [-]

The whole point of this project is to have an LLM that speaks European Portuguese, not Brazilian Portuguese.

embedding-shape 4 hours ago | parent [-]

Right, and my point is that if you use 80% Brazilian Portuguese during base model training + 20% European Portuguese as post-training, you pretty much get exactly that, except with a ton more available training data.
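
To make the arithmetic behind that split concrete, here is a rough sketch. The function name, the dictionary keys and the 1B-token figure are all made up for illustration and are not taken from the project; it only shows how much Brazilian Portuguese data an 80/20 split implies for a given amount of European Portuguese data.

```python
def two_stage_budget(pt_pt_tokens: int, base_pct: int = 80) -> dict:
    """Token budget per stage for a base/post-training split.

    pt_pt_tokens: European Portuguese tokens available for post-training
                  (hypothetical figure supplied by the caller).
    base_pct:     share of the total budget spent on Brazilian Portuguese
                  base training, as a whole percentage.
    """
    if not 0 < base_pct < 100:
        raise ValueError("base_pct must be between 1 and 99")
    post_pct = 100 - base_pct
    # Scale the scarce European Portuguese data up to the full budget.
    pt_br_tokens = pt_pt_tokens * base_pct // post_pct
    return {
        "base_pt_br": pt_br_tokens,
        "post_pt_pt": pt_pt_tokens,
        "total": pt_br_tokens + pt_pt_tokens,
    }


# Assumed example: 1B European Portuguese tokens for post-training.
budget = two_stage_budget(1_000_000_000)
print(budget)
```

Under those assumptions, every token of European Portuguese is paired with four tokens of Brazilian Portuguese, which is where the "ton more training data" comes from.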

KK7NIL 4 hours ago | parent [-]

What's your evidence for that?

And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming) why not go for English or a mixture of languages, which is essentially what they did by starting with EuroLLM?

embedding-shape 4 hours ago | parent [-]

Evidence? Not so much, I didn't realize I was defending a PhD thesis here.

I speak Spanish, and have talked with people who only speak Portuguese, of either variant, and have also talked with Portuguese people about how they see their language compared with Brazilian Portuguese, and vice-versa. So basically it's based on vibes and experience.

> And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming) why not go for English

I'm not sure how many languages you speak or encountered in the wild before, but some languages are VERY different from each other, some are a bit different and others are basically the same with some differences. Doing what I describe for languages that are similar is easier than languages that are very different, for what I hope are obvious reasons.

KK7NIL 3 hours ago | parent [-]

> I'm not sure how many languages you speak or encountered in the wild before, but some languages are VERY different from each other, some are a bit different and others are basically the same with some differences.

I'm a dual citizen of Portugal and Brazil and I live in the US now, so that's my linguistic background. (Also studied bits of French, Russian, Latin and Greek.)

> Doing what I describe for languages that are similar is easier than languages that are very different, for what I hope are obvious reasons.

Not only are your reasons not obvious, your conclusion is actually wrong.

If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals), it might actually make more sense to train it in any other language BUT Brazilian Portuguese (say, English), then fine-tune it for European Portuguese.

LLMs have been shown to be very good at generalizing across languages (the transformer architecture literally comes from work on machine translation, IIRC).

embedding-shape 4 minutes ago | parent [-]

> If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals)

Oh, I wasn't aware that was their goal. It would certainly be intuitive to avoid Brazilian Portuguese if that's the case, although I'm still not sure it actually makes sense to avoid it 100% for pre-training even if you're trying to avoid Brazilian bias; you can "skew" things pretty heavily in post-training if you so wish.

Where can I read more about this goal? It doesn't seem to be mentioned in the submission article, apart from a short off-hand remark about one of the benchmarks, so I'm guessing there is some other resource where they talk about it more specifically?

madaxe_again 4 hours ago | parent | prev [-]

Mutually intelligible, yes, but far from perfectly so. I speak both, as a native anglophone, and the difference is less “US vs British English” than “Guyanese English vs British English”. Fundamental points of grammar differ, and the spoken rhythm and syllabic stress differ (poetry does not translate well between them), never mind just vocabulary. Continental Portuguese people tend to find it easier to understand brasileiros than vice versa, largely due to mostly one-way cultural exports, but trying to roll both into a single model would create a creole at best.

embedding-shape 4 hours ago | parent [-]

I agree, they're not the same. But they're far closer to each other than languages that don't come from the same family.