Learning Languages with the Help of Algorithms(johndcook.com)
80 points by ibobev 7 days ago | 41 comments
xg15 3 days ago | parent | next [-]

The language learning premise in this post is a bit ridiculous - if I started with the goal of learning a language and ended up worrying about the asymptotic complexity of my automated k-book recommendation algorithm for arbitrary values of k, then I think I should worry about a serious case of procrastination.

But the algorithms are interesting, so I think a better title would have been "why submodular NP hard problems are cool" or something similar.
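
For reference, the k-book selection the article describes is an instance of maximum weighted coverage, and the usual heuristic is greedy: repeatedly take the book that adds the most not-yet-covered word weight. A minimal Python sketch, assuming each book is already reduced to a set of words; `greedy_book_cover`, the toy `books` and the `word_weight` values are made up for illustration and are not taken from the article.

```python
def greedy_book_cover(books, word_weight, k):
    """Pick up to k books greedily by marginal weighted word coverage.

    books: dict mapping title -> set of words in that book (hypothetical input).
    word_weight: dict mapping word -> weight, e.g. a corpus frequency.
    """
    covered = set()
    chosen = []
    remaining = dict(books)
    for _ in range(k):
        best_title, best_gain = None, 0.0
        for title, vocab in remaining.items():
            # Only words not covered yet contribute to a book's gain.
            gain = sum(word_weight.get(w, 0.0) for w in vocab - covered)
            if gain > best_gain:
                best_title, best_gain = title, gain
        if best_title is None:  # no remaining book adds any weight
            break
        chosen.append(best_title)
        covered |= remaining.pop(best_title)
    return chosen, covered


# Toy data, purely illustrative.
books = {
    "A": {"the", "cat", "sat"},
    "B": {"the", "dog", "ran"},
    "C": {"cat", "dog"},
}
word_weight = {"the": 10, "cat": 3, "dog": 3, "sat": 1, "ran": 1}
chosen, covered = greedy_book_cover(books, word_weight, 2)
```

Because each pick is scored only on words not yet covered, a book's gain can only shrink as more books are chosen, which is the diminishing-returns (submodular) property in action.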

vunderba 3 days ago | parent | next [-]

Agreed - it's a bit of a ridiculous premise. Honestly you'd be better served picking up some proper Graded Readers [1] in the foreign language.

[1] https://tadoku.org/japanese/en/graded-readers-en

cjohnson318 3 days ago | parent | prev | next [-]

The thing about language is that words have a weird distribution. The most common 100 words show up in every single sentence, but then tons of "common" words show up statistically almost never. Like, "octopus" is a common word that is only going to be useful if you're talking to a marine biologist, or a three year old that's obsessed with octopuses, otherwise you're hardly ever going to use that word. There's a lot of words like that. "Spine" of a book? It's probably not "spine" in your target language.
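
One rough way to check that intuition on your own corpus is to measure how much of the running text the top N words account for. A minimal sketch, assuming a naive `\w+` tokenizer; `tokenize`, `coverage_of_top_n` and the sample sentence are made-up names and toy data, not a real corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def coverage_of_top_n(tokens, n):
    """Fraction of all tokens accounted for by the n most frequent words."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)


# Toy sample, purely illustrative.
sample = "the cat saw the octopus and the dog saw the cat"
tokens = tokenize(sample)
top2 = coverage_of_top_n(tokens, 2)
```

On a real corpus you would plot this for N = 100, 1000, 10000 and so on to see how quickly the curve flattens.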

crossroadsguy 3 days ago | parent | prev [-]

How would one go about dealing with that kind of procrastination? Or is it not handling distraction?

pessimizer 3 days ago | parent | next [-]

https://hillaryrettigproductivity.com/the-seven-secrets-of-t...

"Procrastination, perfectionism and writer's block are not moral flaws; nor are they caused by laziness, lack of discipline or lack of commitment. They are habits rooted in fear and scarcity - and the great news is that once we start alleviating our fears and resourcing ourselves abundantly, our procrastination and related problems are often remarkably easily solved."

It's directed at writers, but it's really for all perfectionists.

xg15 3 days ago | parent | prev [-]

Well, I'm sure you could build an amazing anti-procrastination app that has pluggable anti-procrastination strategies and uses the multi-armed bandit algorithm, as well as an RL-trained RNN to discover your personal, optimal schedule of anti-procrastination interventions, while automatically prompting an LLM to devise new strategies as soon as the old ones begin to lose their effectiveness and also giving you an option to post your anti-procrastination progress online and watch the anti-procrastination achievements of your friends or invite them to go on a virtual anti-procrastination quest together...

defvar 3 days ago | parent | prev | next [-]

This reminds me of "Thousand Character Text" (千字文), which is a Chinese poem that has been used as a primer for teaching Chinese characters to children from the sixth century onward. It contains exactly one thousand characters, each used only once, arranged into 250 lines of four characters apiece and grouped into four-line rhyming stanzas to facilitate easy memorization.

See Also: https://en.wikipedia.org/wiki/Thousand_Character_Classic

srean 3 days ago | parent | prev | next [-]

My language learning problem is slightly different and quite under-served because my motivation is not the most common one.

I want to learn so that I can read/understand publications in mathematics in a foreign language, mostly Swedish, French, German. (*) For this exercise, the typical apps do not help much.

(*) I would have liked to add Latin and Greek too but that's mostly a pipe dream.

Reading old mathematicians and scholars I have realized something that runs quite counter to the common perception we have, especially in my country.

That common perception is that school kids, especially in middle and high school, are overwhelmingly burdened by the sheer volume of subject matter to learn at a very young age. But when I look at educated teenagers from the 17th and 18th centuries who went on to become mathematicians or scholars, they were immensely well read at a very young age. I understand this is a biased sample, but many of these people, Newton for example, were ordinary folks (socio-economically speaking).

Hamilton (I concede that one cannot compare Hamilton with a typical modern teenager) was already quite fluent in thirteen languages in his pre-teens. Apart from the usual suspects, he knew Arabic, Hebrew, Farsi, Sanskrit, Hindi, Marathi.

This might sound atypical but this was not unheard of. One of the poets in my language was fluent in Hebrew, Greek, Italian, French, Latin, Sanskrit, Telugu, Tamil, Bengali, English.

jwrallie 3 days ago | parent | next [-]

That's quite different from what most folks are looking for when learning a new language, but I guess some techniques can apply just fine. You would have to take the lead and find and prepare your own study material.

Something like collecting phrases from these books, loading them into SRS, collecting YouTube videos of natives discussing the material you are into, extracting the sound and listening to several hours of it for immersion... That is basically the way I learn, but focusing on different material.

With LLMs, it is much easier to create your own study material nowadays, as you can ask to translate, break down and explain things as you go.

credit_guy 3 days ago | parent | prev | next [-]

I find your problem to be much easier than the problem faced by a typical language learner. Mathematics literature uses quite a small subset of a language, and in many languages math authors use a lot of equations, which are quite language agnostic.

One way to tackle this problem is to just get started with an LLM. You ask ChatGPT, for example, to translate for you, and then you try to figure out which word corresponds to which and keep going. After a while you will need the LLM's help less and less.

Who am I to tell you this? I only read math in my native language, and in French and English. But once I wanted to do some calculations using Gauss's Theorema Egregium, so, out of curiosity, I picked up both the English translation of Gauss's original publication, and the original Latin text. I was able to understand sufficient Latin to figure out what Gauss was saying and to find out that the English translation has a bug.

srean 2 days ago | parent [-]

Great to know. Many here suggested LLMs. I have not tried them before for this problem. I assumed they wouldn't work well with old languages that do not dominate the web.

DyslexicAtheist 3 days ago | parent | prev | next [-]

What makes Latin difficult in your context? My focus isn't math, and fwiw I found many very good, free, entry-level[1] self-study[2] books (Hans Orberg and others), and even Latin podcasts. There is even a fun Latin track on some of the popular language learning apps.

[1] https://archive.org/details/conspectus-grammaticus-familia-r...

[2] https://latinitium.com/best-books-for-learning-latin/

srean 3 days ago | parent [-]

Thanks for the links.

As for difficulty, well, even English is not my first language. So Latin would be quite a stretch for me.

What makes things more difficult (this is not specific to Latin) is that maths and physics have their own language. Domain-specific words, such as curvature, torsion, divergence, curl, force, power, action, moment, momentum, do not translate in a way that is linguistically obvious.

vunderba 3 days ago | parent | prev | next [-]

Good luck! This kind of reminds me of how Bobby Fischer at a relatively young age learned Russian for the explicit purpose of being able to keep up with the best Chess manuals and periodicals - a great deal of which were coming out of the Soviet Union.

agentcoops 3 days ago | parent | prev | next [-]

I understand both your historical question and the more concrete practical one. Separating reading comprehension as a skill from all the other discrete functions of a language is very straight-forward [1] and, in fact, there are some good analog resources for this.

For French: Sandberg and Tatham, French for Reading

For German: Jannach, German for Reading Knowledge

I've used both and swear they're magic, especially if you're trying to learn to read in a scientific domain that you're already a specialist in (versus literature).

Once you've sort of "learned the game" it isn't very hard to do a similar process for other languages on your own. Then, my main recommendation is to take a text you're deeply familiar with in your native language or English that exists in X other language and just go ahead and start reading it with a dictionary. It starts slow, but progress is very very fast if you stick with it, especially compared to learning to speak or even just listen to a language.

For life reasons, I've found myself having to learn Danish, so I'll let you know if I figure out any good resources for Scandinavian languages.

[1] The only downside I've encountered is trying to later learn to speak a language I had been reading for a while where overcoming the sort of "fictitious phonetics" that existed in my head proved problematic.

vidarh 3 days ago | parent | next [-]

My last year of German, I brought it up a grade by reading Faust in parallel in German and a century old Danish translation... I'm Norwegian, and the old Danish translation was a decent mid-point between Norwegian and German to let me get through the German without having to resort to the dictionary very often.

I think, for Danish, if your German is decent, look to older, more formal Danish books you can also find in German, or maybe try to find work in both Danish and Low German / Plattdeutsch and see if it forms a good midpoint for you.

Dutch might possibly also form a decent parallel - the combination of my Norwegian, German and English means I can slog my way through more formal Dutch reasonably well without ever having tried to learn it.

agentcoops 3 days ago | parent [-]

Thanks for the tips! It's really interesting going back to that point of indistinction in these languages where even the proximity to old English is so strong. I've been hoping to try and read some of the old sagas in the process, but that's probably a bit too far back -- I'll have to see if there are some good century old translations thereof, as you suggest.

The phonetics, on the other hand, are presenting some challenges...

vidarh 3 days ago | parent [-]

Yeah, the phonetics can be far harder. I struggle to even make out discrete words in spoken Dutch despite finding it quite easy to read.

For Danish, it's so similar to Norwegian it's a lot easier, but there's an old Norwegian joke that Danish is just Norwegian spoken with a potato in your mouth... To us, Danish sounds like they're failing to enunciate every single sound...

Incidentally, pronunciation got a lot easier for me when I started looking at the mouth placement of natives when speaking. Just watching and copying mouth placement and movements has fixed so many pronunciation issues for me that no amount of listening and repeating could address.

srean 3 days ago | parent | prev | next [-]

That's very encouraging, thanks. I hear that Norwegian and Swedish aren't very hard for an English speaker. All the best for your next language.

Apparently I was good at picking up languages other than my mother tongue, as a child (4yrs). But now those same languages that I apparently was fluent in appear quite incomprehensible, like first contact incomprehensible.

agentcoops 3 days ago | parent [-]

Yeah, language learning is such a wild thing. I have a five month old son now and I'm just getting to see the process of language acquisition firsthand. He'll be going to early school (eventually) in Danish, but probably won't end up living here forever, so curious if it'll stick with him through life or not.

What's your mother tongue out of curiosity?

Good luck with your studies!

srean 2 days ago | parent [-]

It's Bengali.

bawis 3 days ago | parent | prev [-]

Are there other books in other languages with the same idea (reading comprehension)? Do you think they are worth reading even for readers not interested in those specific languages but in learning techniques to apply?

agentcoops 3 days ago | parent [-]

I'm sure there are, but I can personally attest to the quality of those two. The French one in particular is so astonishingly well executed that I would recommend looking at it if you were interested in techniques. The "magic" of it is in the composition: each chapter, you read a little description of a new grammatical concept, you work through various related sentences with help, and suddenly you're reading a whole French text composed of those past sentences and able to answer comprehension-related questions. It just builds and builds like that to great complexity.

What always makes learning to read easier is that time is completely in your control. The principle is pretty straightforward: if you have enough time and patience, you can read anything (with a dictionary, grammar book etc) and the more you read in that language the less time it starts to take. These books basically just bootstrap the process.

I mentioned it above, but the other way is if you have a book or article you really know well in your mother tongue that exists in a language you want to learn, just patiently try and read it in that language. I think programmers actually have a bit of an advantage in this, as it's really just pattern recognition -- and it isn't that different from trying to understand a program in a language you haven't worked with before.

bawis 3 days ago | parent | prev [-]

Interesting, you are learning languages for math publications but you didn't include Russian? Unless of course you are a native speaker (or you are from an ex-Soviet country).

srean 3 days ago | parent [-]

That's a great point because a lot of the maths literature I am interested in is actually in Russian (optimization, probability).

Thankfully there is the "Translations of Mathematical Monographs" book series

https://bookstore.ams.org/mmono

I had just resigned myself to the fact that I will probably never be able to learn Russian. At an optimistic best, perhaps French and Swedish only, if at all.

DiskoHexyl 2 days ago | parent | prev | next [-]

In every hobby there are 2 groups of people: those who enjoy the actual act of doing the thing, and those who enjoy the tooling (equipment, methodologies, discussions around the thing etc).

Not saying that one is inherently more worthy than the other, but no surprise: the first group is usually better at actually _doing_ the thing.

manx 3 days ago | parent | prev | next [-]

I fully agree with this approach! 5 years ago I built a prototype to execute that same concept of language covering. But instead of just using words, I used n-grams. It is trained on subtitles to model spoken language, combined with sqlite in the browser to get the next sentence with the most impact.

github here: https://github.com/fdietze/ravioli

prototype deployed here: https://raviolio.web.app/

yorwba 3 days ago | parent | prev | next [-]

> People have many ways to learn a language, different for each person. Suppose you wanted to improve your vocabulary by reading books in that language. To get the most impact, you’d like to pick books that cover as many common words in the language as possible.

I think the article is just using this as a hook to introduce the submodularity of the maximum weighted cover problem. But I'll talk about a different way of using the same collection of books to learn a language that I think is better.

First of all, you'll probably want to take into account which words you already know, instead of just removing stopwords. If a book uses lots of common words, but you already know them, you're not learning much.

Secondly, no matter how much or how little you already know, you're unlikely to find a book that fits your level well. If you're just beginning to learn the language, no matter which book you pick, the very first sentence will be full of new words, but most of those will be rare ones that you won't encounter again until much later. If on the other hand you already have a very good command of the language, you might be able to breeze through entire chapters and only pick up a handful of new words. (If your primary goal is to enjoy books rather than achieving mastery of the language, this is of course perfectly fine.)

So what I do is split the entire collection into sentences, and for each word from most common to least, pick a small number of sentences using this word, ideally without also having much rarer words, try to read and understand them all, and then use the most suitable sentence to make an Anki flashcard. It's much easier to find a sentence at the right level than an entire book.

It can be a bit weird to learn about the plot of a book piecemeal out of order, especially if multiple books are mixed together, but I think it's an interesting experience.

The same principle can also be applied to recordings from Mozilla Common Voice: https://commonvoice.mozilla.org/en/datasets I like to use them for dictation exercises in Anki, where the card plays a recording and I type in what I thought I heard to check whether I got it right.
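
Outside Anki, the "type what you heard and compare" check can be sketched with the standard library's `difflib`. This is only an illustration under stated assumptions: `dictation_diff` is a made-up helper, words are split on whitespace and lowercased, and punctuation is not handled.

```python
import difflib


def dictation_diff(expected, typed):
    """Compare a transcript with what the learner typed, word by word.

    Returns (similarity ratio, list of (tag, expected_words, typed_words))
    for every span that does not match.
    """
    exp = expected.lower().split()
    got = typed.lower().split()
    matcher = difflib.SequenceMatcher(a=exp, b=got, autojunk=False)
    errors = [
        (tag, exp[i1:i2], got[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), errors


# Toy example: one misheard word.
ratio, errors = dictation_diff("the cat sat on the mat", "the cat sat on a mat")
```

Comparing word lists rather than characters keeps the report readable: each entry says which words were replaced, dropped or added.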

monkeywork 3 days ago | parent [-]

Do you have an automated method of doing the filtering, or is this all manual?

yorwba 3 days ago | parent [-]

The sorting is automated.

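  from collections import Counter, defaultdict

  # Assumes `sentences` (a list of strings) and `words(s)` (a tokenizer) are defined elsewhere.
  word_count = Counter(w for s in sentences for w in words(s))

  # Index: word -> every sentence containing it (set() avoids listing a sentence twice).
  sentences_by_word = defaultdict(list)
  for s in sentences:
    for w in set(words(s)):
      sentences_by_word[w].append(s)

  # Sort key: the sentence's word frequencies, rarest first. With reverse=True,
  # sentences whose rarest word is still relatively common come out on top.
  sentence_sort_key = lambda s: sorted(word_count[w] for w in set(words(s)))

  for w, _ in word_count.most_common():
    candidates = sorted(sentences_by_word[w], key=sentence_sort_key, reverse=True)[:5]
    for c in candidates:
      print(w, ':', c)
    input()  # wait for Enter before moving on to the next word

(Add epicycles for defining what a word is, what a sentence is, ensure the candidate sentences have varying lengths, keep track of which words and sentences were already seen...)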

The final step of choosing one sentence and turning it into an Anki flashcard is manual.

josefrichter 3 days ago | parent | prev | next [-]

I am fiddling with some language learning utilities myself. Can anyone recommend some relatively simple ways of tracking users' knowledge of a given language? Something like having a sorted frequency list of words/phrases/concepts, and tracking how many times each word has been seen vs used correctly vs used incorrectly, etc.?

I believe this does not have to be perfect, simplicity is preferred. But it should be just enough for an LLM to take a glimpse and estimate users' level in given language.
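
One simple shape for this, as a sketch rather than a recommendation: keep three counters per word and summarise them per frequency band, so the summary stays short enough to paste into a prompt. `WordStats`, `KnowledgeTracker` and the "correct more often than incorrect" rule are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class WordStats:
    seen: int = 0
    correct: int = 0
    incorrect: int = 0


class KnowledgeTracker:
    def __init__(self, frequency_list):
        # frequency_list: words ordered from most to least common.
        self.rank = {w: i for i, w in enumerate(frequency_list)}
        self.stats = {}

    def record(self, word, event):
        """event is one of 'seen', 'correct', 'incorrect'."""
        if event not in ("seen", "correct", "incorrect"):
            raise ValueError(f"unknown event: {event}")
        s = self.stats.setdefault(word, WordStats())
        setattr(s, event, getattr(s, event) + 1)

    def summary(self, bucket_size=1000):
        """Map frequency band -> (words judged known, words tracked)."""
        buckets = {}
        for word, s in self.stats.items():
            if word not in self.rank:
                continue  # word is not in the frequency list
            band = self.rank[word] // bucket_size
            known, tracked = buckets.get(band, (0, 0))
            # Assumed rule: "known" means used correctly more often than not.
            buckets[band] = (known + int(s.correct > s.incorrect), tracked + 1)
        return dict(sorted(buckets.items()))


# Toy usage with a three-word frequency list.
tracker = KnowledgeTracker(["the", "cat", "octopus"])
for word, event in [("the", "correct"), ("the", "correct"),
                    ("cat", "incorrect"), ("octopus", "seen")]:
    tracker.record(word, event)
report = tracker.summary(bucket_size=2)
```

The per-band tuples are deliberately coarse: a handful of "known / tracked" pairs is easier for an LLM (or a human) to read than thousands of per-word rows.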

kiru_io 3 days ago | parent | next [-]

I developed a few utilities to help me track the words and expressions I know in a language (and also see which words in other languages are missing). Tried to port it to an app [0], but it's not perfect yet.

[0] https://apps.apple.com/us/app/ai-anki-learning-fluentread/id...

rsanek 20 hours ago | parent [-]

Careful using the Anki name, the original author of the app recently registered a trademark.

> Anki is a registered trademark of Ankitects Pty Ltd.

https://apps.ankiweb.net/

fny 3 days ago | parent | prev [-]

You can give them a series of questions from hardest to easiest and based on where they fail according to your metric you place them.

pessimizer 3 days ago | parent | prev | next [-]

This is a bad way to go about it. You want to consume more material, and you want each piece of it to have the least impact on your vocabulary.

So maybe looking for high-frequency words is good, but only high-frequency words that you know. So the most coverage of the most high-frequency words would be very bad. To get the most coverage of the most high-frequency words, they'd have to be used at a lower frequency than they normally are, with less of the repetition in natural contexts that enables the learner to build meaning. Unless the books were longer, which makes the concept of concentrating common vocabulary in very few books degenerate (just read two 2000-page books!).

Reading a bunch of stuff with a concentrated dose of tons of words you don't know will leave you with absolutely no retention. If you know every word but one in e.g. a chapter, you'll probably remember that word forever. The concept is called comprehensible input - you set unfamiliar things in a background of familiar things.
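
As a rough illustration of that idea, one could score passages by the share of tokens the learner already knows and keep only those that are mostly familiar but not entirely. This is a sketch under stated assumptions: `known_ratio` and `comprehensible` are made-up helpers, the tokenizer is a naive `\w+` split, and the 0.95 default threshold is an arbitrary placeholder, not an established cutoff.

```python
import re


def known_ratio(text, known_words):
    """Share of tokens in `text` that appear in `known_words`."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in known_words for t in tokens) / len(tokens)


def comprehensible(passages, known_words, min_ratio=0.95):
    """Keep passages that are mostly known but still contain something new."""
    kept = []
    for passage in passages:
        ratio = known_ratio(passage, known_words)
        if min_ratio <= ratio < 1.0:  # 1.0 would mean nothing new to learn
            kept.append(passage)
    return kept


# Toy data, purely illustrative.
known = {"the", "cat", "sat", "on", "mat"}
passages = [
    "the cat sat on the mat",        # fully known
    "the cat sat on the octopus",    # one new word
    "octopus tentacles everywhere",  # nothing known
]
picked = comprehensible(passages, known, min_ratio=0.8)
```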

If you want a book with the most unfamiliar vocabulary, it's called a dictionary. It contains all of the most commonly used words, and the least commonly used, too.

In fact, maybe this makes sense if you're going to be locked in a cell for 10 years, you want to learn a language starting from zero*, and only get to have a pocket dictionary and two other books (with a size limit.) You might want to have sample natural sentences for as many of the best words to know as you could.

The real algorithmic language learning trick is to write books that are interesting that use the fewest words (which would inevitably be the most important words to use to communicate but not necessarily the most common words that natives use to communicate), and introduce new, useful words at a steady rate. That seems like how Capretz put together French in Action. It's also graded readers: I still remember the moment I realized that I could not only understand what was happening in the basic graded reader I'd accidentally picked up on a whim, but also I was interested in finding out what was going to happen next. It's been downhill from there.

-----

[*] or maybe from one? You would have to have some familiarity with the script, and it had better be a phonetic one. Otherwise, this would be just learning how to read a language. No English, no French, no Portuguese, no Chinese... although having poetry books might help, because you can be surer of vowel similarities and syllable breaks. Poetry books are not dense, however, and might bump against any size limit. And the vocabulary would be weird and not representative.

ipnon 4 days ago | parent | prev [-]

There are many apps that have utilized formal methods in an attempt to teach languages as optimally as possible. But Duolingo is still the leader in language learning. Why? Language learning is an emotional process. Every word you can bring to mind likely has some specific memories tied to it, from another time and place. So even though Duolingo is far from optimal in terms of how and when to present new items to learn, it is close to optimal in vibes, and apparently in the market of language learning this is what consumers prioritize over all else. I believe it is for good reason. Whoever displaces Duolingo will do so not because they teach more efficiently, but because they improve on embedding particular emotions and sentiments into the lessons.

xdfgh1112 4 days ago | parent | next [-]

Duolingo isn't even language learning. It's closer to tiktok, it produces dopamine without actually teaching very much at all.

Turns out that most consumers just want to feel like they're learning a language instead of doing the actual work, or in extreme cases, literally only care about maintaining their streak or leaderboard score.

xg15 3 days ago | parent [-]

Agreeing with you that Duolingo seems more like a nudging/psychological manipulation testbed with a thin veneer of language learning on top to provide legitimacy.

But what makes you think that this is because "most consumers just want" it that way? The whole effect of dopamine hits is to manipulate what users believe they "want". But you cannot claim to be working in the interests of your users after you manipulated them.

I.e. if a user installed Duolingo because they genuinely wanted to learn the language and then got sidetracked by all the gamification stuff, I don't think you can say they "really" just wanted to play games the whole time.

(Duolingo is walking a fine line here, which was probably the reason they picked language learning in the first place: Because in that field, users really do want a certain degree of nudging and manipulation, to help them keep up with the tedious process of frequent repetition.

That was sort of the official value proposition of Duolingo and I think the reason why many users installed it. It's also why many of the nudging strategies work at all, because they can assume a cooperating user.

But if you use the app, you can see that it frequently tries to push beyond that mutually agreed purpose: Trying to upsell you to the paid version, invite friends, take part in global leaderboard challenges, etc., all of which have very little to do with language learning.)

adastra22 4 days ago | parent | prev | next [-]

You are making the mistake of assuming that the largest market / largest user base app is also doing the most language instruction.

Duolingo is one of the worst apps out there for language learning, and its users are not practicing useful language skills. It’s a gamified system that feels like language learning, without actually having any substance.

some_guy_nobel 4 days ago | parent | prev | next [-]

Or it just requires the lowest effort. Or it is the most gamified language learning app.

Or ...

wiseowise 3 days ago | parent | prev [-]

Nonsense. By what metrics do you consider it “the leader”? Popularity forced by marketing? I don’t know a single serious language learner that swears by Duolingo. My gf, who spent at least 100 days on Duolingo, migrated to Babbel.