| ▲ | embedding-shape 4 hours ago |
| I don't think so. Portugal the country might be small, with a small population, but there are ~250 million "Lusophones" (native Portuguese speakers), making it the fifth-most spoken native language in the world. I'd hardly call that small :) And before everyone screams: yes, European Portuguese is different from Brazilian Portuguese, but they're still both Portuguese and their speakers understand each other, so it's not like text from one cannot be used to train a model for the other, or vice versa. All in all, I don't think that's a major issue here. |
|
| ▲ | swiftcoder 4 hours ago | parent | next [-] |
| The authors are pretty clearly trying to draw only from European Portuguese sources. I feel like there's a fairly widespread attitude here that the language is being overwhelmed by the sheer number of Brazilian speakers (which there is obviously at least some truth to). I don't personally feel that preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English). |
| |
| ▲ | madaxe_again 4 hours ago | parent | prev | next [-] | | Man, there’s an attitude up here in Trás-os-Montes that the rest of Portugal has spoken unrecognisable trash for a century. It took me years to realise I’d learned hilariously antique Portuguese by moving there. Then again, if you go to Miranda do Douro, they’ll say the rest of Portugal has been talking nonsense for the last 700 years, so the purists at least always have their concents to retreat to if they so choose. | |
| ▲ | philipwhiuk 3 hours ago | parent | prev [-] | | > I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (anymore than it is productive for Brits to be prickly about the meteoric rise of US English). That's easy to say when you're not on the other end of US defaultism. | | |
|
|
| ▲ | evandrofisico 28 minutes ago | parent | prev | next [-] |
| Portugal has a growing xenophobic attitude towards immigrants, especially Brazilians, and this is reflected in linguistic prejudice. There are concerns about Portuguese children learning to "speak Brazilian", because a lot more video content is produced in Brazil than in Portugal, and things like movies, video games and software in general are available in Brazilian localization/adaptation first. |
| |
| ▲ | embedding-shape 18 minutes ago | parent [-] | | We have the same thing happening, on multiple levels, here too. First, some Spanish parents are afraid their children aren't listening to and watching enough Spanish media. Then additionally, some Catalan parents are afraid their children don't get to use Catalan in school, so they don't become proficient enough to use it in society. | | |
| ▲ | darkwater 3 minutes ago | parent [-] | | The Catalan situation is completely different and unrelated: Catalan is a completely different language, and it is not endangered (with or without scare quotes, as you prefer) by an ex-colony that became independent. Actually, many Catalans would like to be such an ex-colony. |
|
|
|
| ▲ | mghackerlady 3 hours ago | parent | prev | next [-] |
| Right, but most of those speak Brazilian Portuguese. There's so much less European Portuguese text that it becomes impossible for a model not to speak Brazilian Portuguese unless it is trained in a way that ignores Brazilian sources. |
|
| ▲ | KK7NIL 4 hours ago | parent | prev | next [-] |
| The whole point of this project is to have an LLM that speaks European Portuguese, not Brazilian Portuguese. |
| |
| ▲ | embedding-shape 4 hours ago | parent [-] | | Right, and my point is that if you use 80% Brazilian Portuguese during base model training + 20% European Portuguese as post-training, you pretty much get exactly that, except with a ton more available training data. | | |
| ▲ | KK7NIL 4 hours ago | parent [-] | | What's your evidence for that? And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming) why not go for English or a mixture of languages, which is essentially what they did by starting with EuroLLM? | | |
| ▲ | embedding-shape 4 hours ago | parent [-] | | Evidence? Not so much, I didn't realize I was defending a PhD thesis here. I speak Spanish, and have talked with people who only speak Portuguese, of either variant, and have also talked with Portuguese people about how they see their language compared with Brazilian Portuguese, and vice versa. So basically based on vibes and experience. > And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming) why not go for English I'm not sure how many languages you speak or have encountered in the wild, but some languages are VERY different from each other, some are a bit different, and others are basically the same with some differences. Doing what I describe for languages that are similar is easier than for languages that are very different, for what I hope are obvious reasons. | | |
| ▲ | KK7NIL 3 hours ago | parent [-] | | > I'm not sure how many languages you speak or encountered in the wild before, but some languages are VERY different from each other, some are a bit different and others are basically the same with some differences. I'm a dual citizen of Portugal and Brazil and I live in the US now, so that's my linguistic background. (Also studied bits of French, Russian, Latin and Greek.) > Doing what I describe for languages that are similar is easier than languages that are very different, for what I hope are obvious reasons. Not only are your reasons not obvious, your conclusion is actually wrong. If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals), it might actually make more sense to train it in any other language BUT Brazilian Portuguese (say, English), then fine-tune it for European Portuguese. LLMs have been shown to be very good at generalizing across languages (the transformer architecture literally comes from work on translators IIRC). | | |
| ▲ | embedding-shape 4 minutes ago | parent [-] | | > If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals) Oh, I wasn't aware that was their goal. It would certainly be intuitive to avoid Brazilian Portuguese if that's the case, although I'm still not sure it actually makes sense to avoid it 100% for pre-training even if you're trying to avoid Brazilian bias; you can "skew" things pretty heavily in post-training if you so wish. Where can I read more about this goal? It doesn't seem to be mentioned in the submission article, apart from a short offhand remark about one of the benchmarks, so I'm guessing there is some resource where they talk about this more specifically? |
|
|
|
|
|
|
| ▲ | madaxe_again 4 hours ago | parent | prev [-] |
| Mutually intelligible, yes, but far from perfectly so. I speak both, as a native anglophone, and the difference is less “US vs British English” and more “Guyanese English vs British English”. Like, fundamental points of grammar differ, the spoken rhythm and syllabic stress differ (poetry does not translate well between them), never mind just vocabulary. Continental Portuguese people tend to find it easier to understand brasileiros than vice versa, largely due to mostly one-way cultural exports, but to try to roll both into a single model would create a creole at best. |
| |
| ▲ | embedding-shape 4 hours ago | parent [-] | | I agree, they're not the same. But they're far closer to each other than languages that don't come from the same family. |
|