

	yorwba 5 hours ago
		1. Take a large collection of text, e.g. from https://opus.nlpl.eu/corpora-search/zh-CN&en
		2. Split into sentences and tokenize sentences into words, e.g. using https://github.com/fxsjy/jieba
		3. Count how often each word appears and sort sentences by descending frequency of the least common word.
		4. Use binary search to find a location in the sorted collection of sentences where the difficulty feels about right.

		Of course this gives you a collection of disjointed sentences, but you can always go to the original file and look at the surrounding context when you find an interesting or confusing one.
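For anyone who wants to try it, steps 3 and 4 above could look roughly like the sketch below. This is only one way to read the recipe, not a tested pipeline. The function names `rank_by_rarest_word` and `find_level` and the `too_hard` callback are made up for illustration. The default tokenizer is plain whitespace splitting so the snippet runs without extra packages; for Chinese you would presumably pass something like `jieba.lcut` instead. The binary search also assumes that your "too hard" judgement is roughly monotonic along the sorted list, which real text will only approximate.

```python
from collections import Counter
from typing import Callable, Iterable, List


def rank_by_rarest_word(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]] = str.split,  # assumption: swap in jieba.lcut for Chinese
) -> List[str]:
    """Sort sentences so that those whose rarest word is most frequent come first."""
    tokenized = [(s, tokenize(s)) for s in sentences]
    # Drop sentences that produce no tokens (e.g. empty lines).
    tokenized = [(s, words) for s, words in tokenized if words]

    # Step 3a: count how often each word appears across the whole collection.
    freq = Counter(w for _, words in tokenized for w in words)

    # Step 3b: sort by descending frequency of each sentence's least common word.
    tokenized.sort(key=lambda pair: min(freq[w] for w in pair[1]), reverse=True)
    return [s for s, _ in tokenized]


def find_level(ranked: List[str], too_hard: Callable[[str], bool]) -> int:
    """Step 4: binary search for the first index where sentences start to feel too hard.

    `too_hard` is a hypothetical callback; in practice it is you reading the
    sentence and answering yes or no.
    """
    lo, hi = 0, len(ranked)
    while lo < hi:
        mid = (lo + hi) // 2
        if too_hard(ranked[mid]):
            hi = mid  # the boundary is at mid or earlier
        else:
            lo = mid + 1  # the boundary is after mid
    return lo
```

In interactive use, `too_hard` could simply print the sentence and read a y/n answer with `input()`. Sentences just before the returned index should then be about the right difficulty.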