Remix clone Hacker News

	▲	vintermann 4 hours ago
		Levenshtein distance is rarely the similarity measure you need. Words usually mean something, and what you usually need is the distance in meaning. As usual, an example from my genealogy hobby: many sites let you upload your family tree as a GEDCOM file and compare it to other people's trees or to a public tree. Most of them use Levenshtein distance on names to judge similarity, and it's terrible. Anne Nilsen and Anne Olsen could be the same person, right? No! These tools are unfortunately useless to me because they give so many false positives. These days, an embedding model is the way to go. Even a small, bad embedding model is better than Levenshtein distance if you care about the meaning of the string.
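
To make the Anne Nilsen / Anne Olsen case concrete, here is a minimal sketch of plain Levenshtein distance in Python. The `similarity` helper and its normalization by the longer string's length are assumptions for illustration, not what any particular genealogy site does. The "Anne Nilsne" typo is also an invented example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalized score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# Two different surnames are only two edits apart.
print(levenshtein("Anne Nilsen", "Anne Olsen"))
# A transposition typo of the same name is also two edits apart.
print(levenshtein("Anne Nilsen", "Anne Nilsne"))
print(round(similarity("Anne Nilsen", "Anne Olsen"), 2))
```

Both pairs come out at distance 2, so by this measure a different surname is indistinguishable from a typo, which is one way to see where the false positives come from.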
	▲	jppittma 2 hours ago | parent
		It depends on whether you're trying to correct for typos or do something semantic. Also, embedding distance is much, much more expensive to compute.