> text obfuscation against LLM scrapers

Nice! But we already filter this stuff before pretraining.

Including RTL-LTR flips, character substitutions, etc.? I think Unicode is vast enough that it’s possible to evade any filter and still look text-like enough to the end user. And how could you possibly know whether it’s really a Greek question mark or someone just trying to mess with your AI?
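
To make that concrete, here's a rough sketch of what I imagine such a pre-filter looks like. This is my guess, not anyone's actual pipeline: `normalize_for_filter` is a made-up name, and the choice of NFKC plus dropping every format character is an assumption. It only uses Python's standard `unicodedata` module.

```python
import unicodedata


def normalize_for_filter(text: str) -> str:
    """Hypothetical pre-filter: NFKC-normalize, then drop format (Cf) characters."""
    # NFKC folds canonical and compatibility equivalents,
    # e.g. U+037E GREEK QUESTION MARK becomes U+003B SEMICOLON.
    folded = unicodedata.normalize("NFKC", text)
    # General category "Cf" covers bidi overrides/isolates and zero-width characters.
    # Deliberately blunt: this also removes ZWJ/ZWNJ, which some scripts need.
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")


greek_qm = "really\u037e"        # ends with GREEK QUESTION MARK
flipped = "\u202eelppa\u202c"    # RIGHT-TO-LEFT OVERRIDE ... POP DIRECTIONAL FORMATTING
fullwidth = "\uff21\uff29"       # fullwidth "AI"
cyrillic = "p\u0430ypal"         # second letter is CYRILLIC SMALL LETTER A

for sample in (greek_qm, flipped, fullwidth, cyrillic):
    print(ascii(sample), "->", ascii(normalize_for_filter(sample)))
```

The Greek question mark and the fullwidth letters fold away, so those are the easy cases. But stripping the override just leaves the reversed string the reader never saw, and the Cyrillic "а" passes through untouched because normalization doesn't cross scripts. You'd need something like a confusables table on top, and then you're back to guessing intent.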

	zahlman an hour ago
		I assume that anyone trying to "filter" the text could just render it and then OCR it.
	Sabinus 12 hours ago
		Ultimately the AI will just learn those tokens are basically the same thing. You'll just be slowing its learning down by some (probably tiny) amount.