| ▲ | sminchev 2 hours ago | |
A few years ago, before the AI boom, I needed to build a de-duplication app as a PoC: compare millions of contact records quickly and find the duplicates. The client's existing approach took, in the best case, a full day to compare everything and generate a report. What we built was a combination of a big data engine (Apache Spark), a few comparison algorithms like Levenshtein distance, and ML. AI wasn't even treated as an option for such things back then! :)

We used Apache Spark to apply the static algorithms first. If we got a confident result, less than 10% similarity or more than 90% similarity, we treated that as a sure sign that the records were distinct or duplicates. Records that fell somewhere in the middle we sent to machine learning libraries for analysis; of course, some training was needed to build up a statistical basis. Anything still too hard to analyze automatically went into a report for a human touch ;)

We got relatively good results. It was a Scala-based app, as far as I remember :) Now with AI it's much easier... And boring! :D No complexities, no challenges.
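For anyone curious what that thresholding step might look like, here is a minimal Spark/Scala sketch, not the original app's code: it assumes a single "name" column, a normalized Levenshtein similarity, and the 10%/90% cutoffs mentioned above; all names and data are illustrative.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    // Sketch of the thresholding step: pair up candidate contact records,
    // score them with Levenshtein, and bucket the pairs by confidence.
    object DedupSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("dedup-poc")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Toy contact data; the real input would be millions of rows.
        val contacts = Seq(
          (1L, "John Smith"),
          (2L, "Jon Smith"),
          (3L, "Maria Petrova")
        ).toDF("id", "name")

        // Self-join to form candidate pairs. A real pipeline would block or
        // bucket records first to avoid the full cross product.
        val pairs = contacts.as("a")
          .crossJoin(contacts.as("b"))
          .where($"a.id" < $"b.id")

        // Normalized similarity: 1 - edit distance / length of the longer string.
        val scored = pairs.withColumn(
          "similarity",
          lit(1.0) - levenshtein($"a.name", $"b.name") /
            greatest(length($"a.name"), length($"b.name")).cast("double")
        )

        // Bucket by the confidence thresholds described above; the middle band
        // would go to an ML model, and anything still ambiguous to a human report.
        val labeled = scored.withColumn(
          "verdict",
          when($"similarity" >= 0.9, "duplicate")
            .when($"similarity" <= 0.1, "distinct")
            .otherwise("send_to_ml")
        )

        labeled.show(truncate = false)
        spark.stop()
      }
    }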
| ▲ | arnorhs 2 hours ago | parent [-] | |
That's an interesting story, but I'm really at a loss for how this relates to the post you are commenting on.