> People have many ways to learn a language, different for each person. Suppose you wanted to improve your vocabulary by reading books in that language. To get the most impact, you’d like to pick books that cover as many common words in the language as possible.

I think the article is just using this as a hook to introduce the submodularity of the maximum weighted cover problem. But I'll talk about a different way of using the same collection of books to learn a language that I think is better.

First of all, you'll probably want to take into account which words you already know, instead of just removing stopwords. If a book uses lots of common words, but you already know them, you're not learning much.

Secondly, no matter how much or how little you already know, you're unlikely to find a book that fits your level well. If you're just beginning to learn the language, no matter which book you pick, the very first sentence will be full of new words, but most of those will be rare ones that you won't encounter again until much later. If on the other hand you already have a very good command of the language, you might be able to breeze through entire chapters and only pick up a handful of new words. (If your primary goal is to enjoy books rather than achieving mastery of the language, this is of course perfectly fine.)

So what I do is split the entire collection into sentences. Then, for each word from most common to least common, I pick a small number of sentences that use this word, ideally ones that don't also contain much rarer words. I try to read and understand them all, and then use the most suitable sentence to make an Anki flashcard. It's much easier to find a sentence at the right level than an entire book.

It can be a bit weird to learn about the plot of a book piecemeal out of order, especially if multiple books are mixed together, but I think it's an interesting experience.

The same principle can also be applied to recordings from Mozilla Common Voice (https://commonvoice.mozilla.org/en/datasets). I like to use them for dictation exercises in Anki, where the card plays a recording and I type in what I thought I heard to check whether I got it right.
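As a rough illustration of the dictation setup, here is a minimal sketch that turns a Common Voice transcript file into rows for an Anki import. It assumes the dataset's TSV has `path` and `sentence` columns and that the audio clips are copied into Anki's media folder separately. The function name `dictation_rows` is made up for this example.

```python
import csv
import io

def dictation_rows(tsv_file):
    """Yield (audio_field, transcript) pairs, one per recording.

    Assumes a tab-separated file with `path` and `sentence` columns.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # [sound:...] is the tag Anki uses to play an audio file on a card.
        yield f"[sound:{row['path']}]", row["sentence"]

# Tiny in-memory stand-in for a real transcript file.
sample = io.StringIO("path\tsentence\nclip_1.mp3\tHello there.\n")
rows = list(dictation_rows(sample))
```

Each pair could then be written out as one tab-separated line, with the transcript going into a field that the card template checks against the typed answer.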

monkeywork 4 days ago

Do you have an automated method of doing the filtering, or is this all manual?

yorwba 4 days ago

The sorting is automated.

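  from collections import Counter, defaultdict

  # Assumed to exist: `sentences` (a list of strings) and `words(s)` (the words in sentence s).

  # How often each word occurs in the whole collection.
  word_count = Counter(w for s in sentences for w in words(s))

  # Index: word -> sentences containing it (set() avoids listing a sentence twice).
  sentences_by_word = defaultdict(list)
  for s in sentences:
    for w in set(words(s)):
      sentences_by_word[w].append(s)

  # Word frequencies in ascending order, so the rarest word of a sentence is compared first.
  sentence_sort_key = lambda s: sorted(word_count[w] for w in set(words(s)))

  # Go from the most common word to the least common one.
  for w, _ in word_count.most_common():
    # reverse=True puts the sentences whose rarest words are most common first.
    candidates = sorted(sentences_by_word[w], key=sentence_sort_key, reverse=True)[:5]
    for c in candidates:
      print(w, ':', c)
    input()  # wait for Enter before moving on to the next word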

(Add epicycles for defining what a word is and what a sentence is, ensuring the candidate sentences have varying lengths, keeping track of which words and sentences were already seen...)
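To give an idea of what a few of those epicycles might look like, here is a minimal sketch, not the actual script. The regex-based `split_sentences` and `words` are naive assumptions that only suit languages written with spaces and Western punctuation. `pick_candidates` and `known_words` are hypothetical names. Varying the sentence lengths is left out.

```python
import re
from collections import Counter

def split_sentences(text):
    # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(sentence):
    # Naive assumption: a word is a lowercase run of letters,
    # optionally joined by apostrophes (e.g. "don't").
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", sentence.lower())

def pick_candidates(sentences, known_words, limit=5):
    """Yield (word, candidate sentences) from most to least common word.

    Skips words in `known_words` and sentences already shown for an
    earlier word.
    """
    word_count = Counter(w for s in sentences for w in words(s))
    seen_sentences = set()
    for w, _ in word_count.most_common():
        if w in known_words:
            continue
        pool = [s for s in sentences
                if w in words(s) and s not in seen_sentences]
        # Same ordering as above: sentences whose rarest words are
        # most common come first.
        pool.sort(key=lambda s: sorted(word_count[x] for x in set(words(s))),
                  reverse=True)
        picked = pool[:limit]
        seen_sentences.update(picked)
        if picked:
            yield w, picked

sentences = split_sentences("The cat sat. The cat ran! A dog barked?")
result = list(pick_candidates(sentences, known_words={"the"}))
```

Treating every displayed sentence as seen is just one possible policy. It keeps cards from repeating, at the cost of skipping rarer words whose only sentences were already shown.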

The final step of choosing one sentence and turning it into an Anki flashcard is manual.