Remix clone Hacker News


	▲	logicprog 7 hours ago
		For LLM scrapers, it doesn't even matter whether LLMs could understand the raw text, because it's extremely easy to just strip junk Unicode characters. It's literally a single regex, and that kind of sanitization regex is something they should already be using. I'd use one by default if I were writing a scraper.
	▲	layer8 5 hours ago
		There are no “junk” Unicode characters. There are just nonsensical combinations of characters. Stripping out characters blindly is not a solution, because you have no way of knowing what was intended.
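
To make the disagreement concrete, here is a minimal Python sketch of the kind of single-regex sanitizer described above, together with inputs where blind stripping changes the meaning. The character set in `INVISIBLE` and the name `strip_invisible` are assumptions chosen for illustration; neither commenter specified a pattern.

```python
import re

# Hypothetical sanitization pattern (an assumption, not from the thread):
# zero-width characters, bidi controls, invisible operators, and the BOM.
INVISIBLE = re.compile("[\u200b-\u200f\u202a-\u202e\u2060-\u2064\ufeff]")


def strip_invisible(text: str) -> str:
    """Remove every character matched by INVISIBLE, with no regard to context."""
    return INVISIBLE.sub("", text)


# Case 1: a zero-width space hidden inside a word. Stripping recovers the word.
obfuscated = "pass\u200bword"
cleaned = strip_invisible(obfuscated)

# Case 2: an emoji sequence (man, ZWJ, woman, ZWJ, girl) that renders as one
# family glyph. Stripping the zero-width joiners leaves three separate emoji.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
family_stripped = strip_invisible(family)

# Case 3: Persian text, where the zero-width non-joiner (U+200C) is part of
# standard spelling. Stripping it makes the letters join incorrectly.
persian = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"
persian_stripped = strip_invisible(persian)

print(repr(cleaned))
print(len(family), len(family_stripped))
print(len(persian), len(persian_stripped))
```

The first case is the one the regex is meant for. The other two show the objection: the same code points carry meaning in some contexts, so a context-free pattern cannot tell obfuscation from intended text.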