cabirum 7 months ago

How do I allow "stępień" while detecting Zalgo-isms?

egypturnash 7 months ago | parent | next [-]

Zalgo is largely the result of abusing combining modifiers. Declare that any string with more than n combining modifiers in a row is invalid.

n=1 is probably a reasonable falsehood to believe about names until someone points out that language X regularly has multiple combining modifiers in a row. At that point you can bump n up to around the maximum number of combining modifiers language X is likely to have, add a special case that says "this is probably language X, so we don't look for Zalgos", or just give up: put some Zalgo in your test corpus, start looking for places where it breaks things, and fix whatever breaks in a way that isn't funny.
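
A minimal sketch of that rule in Python, assuming "combining modifier" means anything in Unicode general category M (mark) and max_run is the n being discussed:

    import unicodedata

    def looks_zalgo(s: str, max_run: int = 2) -> bool:
        # Reject strings with more than max_run combining marks in a row.
        run = 0
        for ch in s:
            if unicodedata.category(ch).startswith("M"):
                run += 1
                if run > max_run:
                    return True
            else:
                run = 0
        return False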

ahazred8ta 7 months ago | parent | next [-]

N=2 is common in Việt Nam (vowel sound + tonal pitch).

anttihaapala 7 months ago | parent [-]

Yet Vietnamese can be written in Unicode without any combining characters whatsoever - in NFC normalization each character is one code point - just like the U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW in your example.
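
For illustration, a quick Python check of the code-point counts (NFC keeps ệ as the single code point U+1EC7; NFD splits it into e plus two combining marks):

    import unicodedata

    s = "Việt"
    print(len(unicodedata.normalize("NFC", s)))  # 4 code points
    print(len(unicodedata.normalize("NFD", s)))  # 6: e + U+0323 dot below + U+0302 circumflex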

cryptonector 7 months ago | parent [-]

egypturnash's point was about limiting glyph complexity. You could canonically decompose, then look for more than N (say, N=3) combining codepoints in a row and reject if any such run is found. Canonical forms have nothing to do with actual glyph complexity, but conceptually[0] normalizing first might be a good first step.

[0] I say conceptually because you might implement a form-insensitive Zalgo detector that looks at each non-combining codepoint, looks it up in the Unicode database to find how many combining codepoints it would take if canonically decomposed, calls that `n`, then counts all the following combining codepoints, and rejects if the total exceeds `N`. This approach is close to optimal: most of the time, most characters in most strings don't decompose to more than one codepoint, and even when they do you save the cost of allocating a buffer to normalize into, along with the associated memory stores.
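
A rough Python rendering of that footnote. Python's unicodedata doesn't expose decomposition lengths directly, so this sketch normalizes one character at a time; a real implementation would read the decomposition data straight from the Unicode character database to avoid those small allocations:

    import unicodedata

    def zalgo_suspect(s: str, N: int = 3) -> bool:
        count = 0
        for ch in s:
            if unicodedata.category(ch).startswith("M"):
                count += 1  # explicit combining codepoint in the input
            else:
                # marks this base character would carry after canonical
                # decomposition, e.g. U+1EC7 contributes 2
                count = sum(1 for c in unicodedata.normalize("NFD", ch)
                            if unicodedata.category(c).startswith("M"))
            if count > N:
                return True
        return False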

zvr 7 months ago | parent | prev [-]

I can point out that Greek needs n=2: an accent plus a breathing mark.

seba_dos1 7 months ago | parent | prev | next [-]

There's nothing special about "Stępień": it has no combining characters, just the usual diacritics that have their own codepoints in the Basic Multilingual Plane (U+0119 and U+0144). I bet there are names out there that would make this harder, but this isn't one of them.

cryptonector 7 months ago | parent [-]

If you decompose it, though, it uses combining codepoints. Still nothing special.
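
A quick demonstration in Python:

    import unicodedata

    name = "Stępień"                    # NFC: 7 code points
    nfd = unicodedata.normalize("NFD", name)
    print(len(name), len(nfd))          # 7 9: NFD adds U+0328 ogonek and U+0301 acute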

KPGv2 7 months ago | parent | prev | next [-]

I could answer your question better if I knew why you need to detect Zalgo-isms.

account42 7 months ago | parent [-]

Because they are an attack vector. They can be used to hide important information, since they overflow their bounding box (clipping solves that, but then you need to clip everywhere it matters), and they can slow text renderers to a crawl.

dpassens 7 months ago | parent | prev | next [-]

Why do you need to detect Zalgo-isms, and why is it so important that it's worth forcing people to misspell their names?

zootboy 7 months ago | parent | prev | next [-]

For the unaware (including myself): https://en.wikipedia.org/wiki/Zalgo_text

If you really think you need to programmatically detect and reject these (I'm dubious), there is probably a reasonable limit on the number of diacritics per character.

https://stackoverflow.com/a/11983435
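
Along those lines, a sketch using the third-party regex module (the stdlib re module has no \p{..} property classes); the threshold of three nonspacing marks per base character is an assumption, not something taken from the linked answer:

    import regex  # pip install regex

    zalgo_run = regex.compile(r"\p{Mn}{3,}")  # 3+ nonspacing marks in a row

    def is_zalgo(s: str) -> bool:
        return bool(zalgo_run.search(s))

Normalizing to NFD before the search would make the count form-insensitive.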

tobyhinloopen 7 months ago | parent | prev [-]

We have a whitelist of allowed characters, which is a pretty big list.

I think we based it on Lodash's _.deburr source code. If deburr's output is a-z plus some common symbols, the value passes (and we store the original value).

https://www.geeksforgeeks.org/lodash-_-deburr-method/
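
A rough Python analogue of that check, not Lodash itself. Note that stripping marks after NFKD isn't a perfect stand-in for deburr: Lodash maps 'ł' to 'l' via its own table, for example, while NFKD leaves it alone. The allowed-symbol set here is a guess:

    import re
    import unicodedata

    ALLOWED = re.compile(r"[a-zA-Z .,'-]*")

    def deburr(s: str) -> str:
        # strip combining marks after compatibility decomposition
        nfkd = unicodedata.normalize("NFKD", s)
        return "".join(c for c in nfkd if not unicodedata.combining(c))

    def passes_whitelist(original: str) -> bool:
        # validate the deburred form, but store the original value
        return bool(ALLOWED.fullmatch(deburr(original)))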