Ask HN: Better approach for plagiarism detection in self-hosted LMS?

		1 points by pigon1002 8 hours ago
I'm building an open-source LMS and added plagiarism detection using OpenSearch's more_like_this query plus character n-grams for similarity scoring.

Basically when a student submits an answer, I search for similar answers from other students on the same question. Works decently but feels a bit hacky - just reusing the search engine I already had.

Current setup:

    search = cls.search().filter(
        "nested",
        path="answers",
        query={"term": {"answers.question_id": str(question_id)}}
    )

    search = search.query(
        "nested",
        path="answers",
        query={
            "more_like_this": {
                "fields": ["answers.answer"],
                "like": text,
                "min_term_freq": 1,
                "minimum_should_match": "1%",
            }
        },
    )

    # get top 10, then re-rank in Python
    def normalize(t):
        return re.sub(r"\s+", "", t.strip())

    def char_ngrams(t, n=3):
        return set(t[i:i+n] for i in range(len(t)-n+1))

    norm_text = normalize(text)
    text_ngrams = char_ngrams(norm_text)

    for hit in response.hits:
        norm_answer = normalize(hit.answer)
        answer_ngrams = char_ngrams(norm_answer)

        intersection = len(text_ngrams & answer_ngrams)
        union = len(text_ngrams | answer_ngrams)
        ratio = int((intersection / union) * 100)

        if ratio >= 60:
            pass  # flag as similar

Constraints:

- Self-hosted only, no external APIs
- Few thousand students
- Want simple operations, already running OpenSearch anyway

Questions:

- Is this approach reasonable or am I missing something obvious?
- What do other self-hosted systems use? Checked Moodle docs but their plagiarism plugins mostly call external services
- Anyone tried lightweight ML models for this that don't need GPU?

The search engine approach works but curious if there's a better way that fits our constraints.
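
For anyone who wants to try the re-ranking step on its own, here is a minimal standalone sketch of the character n-gram Jaccard scoring described in the post. The names similarity_percent and flag_similar are hypothetical, not from the original code. The sketch assumes the candidate answers have already been fetched from OpenSearch as plain strings. It also guards against an empty union, which the excerpt in the post does not do.

```python
import re


def normalize(text):
    """Remove all whitespace so spacing differences do not affect the score."""
    return re.sub(r"\s+", "", text)


def char_ngrams(text, n=3):
    """Return the set of character n-grams (empty when len(text) < n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity_percent(a, b, n=3):
    """Jaccard similarity of character n-gram sets as an integer from 0 to 100.

    Hypothetical helper: uses integer arithmetic and returns 0 when neither
    text yields any n-grams, instead of dividing by zero.
    """
    grams_a = char_ngrams(normalize(a), n)
    grams_b = char_ngrams(normalize(b), n)
    union = grams_a | grams_b
    if not union:
        return 0
    return (100 * len(grams_a & grams_b)) // len(union)


def flag_similar(text, candidates, threshold=60):
    """Return (candidate, score) pairs at or above threshold, best first.

    `candidates` is assumed to be a list of answer strings for the same
    question, e.g. the top hits from the more_like_this query.
    """
    scored = [(c, similarity_percent(text, c)) for c in candidates]
    flagged = [(c, s) for c, s in scored if s >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    submitted = "the quick brown fox"
    others = ["the  quick brown fox", "something else entirely"]
    for answer, score in flag_similar(submitted, others):
        print(score, answer)
```

Two design choices worth noting, both assumptions rather than anything from the post: integer floor division avoids float truncation (for example, int((29 / 100) * 100) evaluates to 28 in Python), and answers shorter than n characters after normalization always score 0, so very short answers would need separate handling such as an exact-match check.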