yorwba 5 hours ago

1. Take a large collection of text, e.g. from https://opus.nlpl.eu/corpora-search/zh-CN&en

2. Split into sentences and tokenize sentences into words, e.g. using https://github.com/fxsjy/jieba

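3. Count how often each word appears across the whole collection, then sort the sentences by the frequency of their least common word, in descending order, so that sentences made up entirely of common words come first.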

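4. Use binary search to find a location in the sorted collection where the difficulty feels about right: read a sentence from the middle, then move toward the start if it is too hard or toward the end if it is too easy.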

Of course this gives you a collection of disjointed sentences, but you can always go to the original file and look at the surrounding context when you find an interesting or confusing one.
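
A rough sketch of steps 2 to 4 in Python, in case it helps. The function names (rank_by_rarest_word, find_level) and the too_hard callback are made up for illustration. The tokenizer defaults to whitespace splitting so the snippet runs without extra packages; for Chinese you would presumably pass something like jieba.lcut instead.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def rank_by_rarest_word(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]] = str.split,  # assumption: swap in e.g. jieba.lcut for Chinese
) -> List[Tuple[int, str]]:
    """Return (score, sentence) pairs, easiest first.

    The score of a sentence is the corpus-wide count of its rarest word.
    """
    tokenized = [(s, tokenize(s)) for s in sentences]
    # Skip sentences that produce no tokens (e.g. empty lines).
    tokenized = [(s, words) for s, words in tokenized if words]

    # Step 3a: count how often each word appears in the whole collection.
    counts = Counter(w for _, words in tokenized for w in words)

    # Step 3b: score each sentence by its least common word, sort descending.
    scored = [(min(counts[w] for w in words), s) for s, words in tokenized]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def find_level(ranked: List[Tuple[int, str]], too_hard: Callable[[str], bool]) -> int:
    """Step 4: binary search for the first position judged too hard.

    Assumes the judgement is roughly monotonic along the ranking, which real
    reading difficulty only approximates.
    """
    lo, hi = 0, len(ranked)
    while lo < hi:
        mid = (lo + hi) // 2
        if too_hard(ranked[mid][1]):
            hi = mid  # too hard: look among the easier sentences
        else:
            lo = mid + 1  # fine: look among the harder sentences
    return lo
```

In practice too_hard would be you reading the sentence and answering yes or no, e.g. `lambda s: input(s + " too hard? [y/n] ") == "y"`, and you would then read sentences around the returned index.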