egypturnash a day ago

Zalgo is largely the result of abusing combining modifiers. Declare that any string with more than n combining modifiers in a row is invalid.

n=1 is probably a reasonable falsehood to believe about names until someone points out that language X regularly has multiple combining modifiers in a row. At that point you can bump n up to around the maximum number of combining modifiers language X is likely to have, add a special case to say "this is probably language X so we don't look for Zalgos", or just give up: put some Zalgo in your test corpus, start looking for places where it breaks things, and fix whatever breaks in a way that isn't funny.
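
A minimal sketch of that rule in Python, assuming "combining modifier" means any code point in Unicode general category Mn or Me (that mapping is my assumption, and the names `longest_combining_run` and `looks_like_zalgo` are made up for illustration). It does no normalization, so it only sees combining marks that are literally present in the string:

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # Assumption: treat nonspacing (Mn) and enclosing (Me) marks as
    # "combining modifiers". Spacing marks (Mc) are deliberately excluded.
    return unicodedata.category(ch) in ("Mn", "Me")


def longest_combining_run(text: str) -> int:
    """Length of the longest run of consecutive combining marks in text."""
    longest = run = 0
    for ch in text:
        run = run + 1 if is_combining(ch) else 0
        longest = max(longest, run)
    return longest


def looks_like_zalgo(text: str, n: int = 1) -> bool:
    """True if text has more than n combining marks in a row."""
    return longest_combining_run(text) > n
```

With n=1 this accepts "e" followed by a single combining acute, but rejects a decomposed Vietnamese "ệ" (two marks in a row), which is exactly the kind of report that would make you raise n.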

ahazred8ta 19 hours ago | parent | next [-]

N=2 is common in Việt Nam. (vowel sound + tonal pitch)

anttihaapala 18 hours ago | parent [-]

Yet Vietnamese can be written in Unicode without any combining characters whatsoever - in NFC each letter is a single code point - just like the U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW in your example.
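
This is easy to check with Python's standard `unicodedata` module. The helper name `code_points` is made up for illustration; the sketch just lists the code points of a string under a given normalization form:

```python
import unicodedata


def code_points(text: str, form: str) -> list[str]:
    """Code points of text after normalizing to form ("NFC" or "NFD")."""
    normalized = unicodedata.normalize(form, text)
    return [f"U+{ord(ch):04X}" for ch in normalized]


print(unicodedata.name("\u1ec7"))
# LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
print(code_points("\u1ec7", "NFC"))  # ['U+1EC7']
print(code_points("\u1ec7", "NFD"))  # ['U+0065', 'U+0323', 'U+0302']
```

So the same letter is zero combining characters in NFC and two in NFD, which is why a detector that counts raw combining characters gives form-dependent answers.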

cryptonector 16 hours ago | parent [-]

u/egypturnash's point was about limiting glyph complexity. You could canonically decompose, then look for more than N (say, N=3) combining codepoints in a row and reject if any are found. Canonical forms have nothing to do with actual glyph complexity, but conceptually[0] normalizing first might be a good first step.

[0] I say conceptually because you might implement a form-insensitive Zalgo detector that looks at each non-combining codepoint, looks it up in the Unicode database to find how many combining codepoints it would produce if canonically decomposed (call that `n`), then counts all the following combining codepoints on top of `n`, and rejects if the total exceeds `N`. This approach is fairly efficient because most of the time most characters in most strings don't decompose to more than one codepoint, and even when they do you save the cost of allocating a buffer to normalize into and the associated memory stores.
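
A rough sketch of such a form-insensitive detector in Python, under a few assumptions of mine: "combining codepoint" means general category Mn or Me, the per-character lookup is done by recursively expanding `unicodedata.decomposition` (canonical mappings only) and caching the result, and the names `_decompose`, `_mark_profile`, and `exceeds_limit` are invented for illustration. The input string itself is never normalized into a new buffer:

```python
import unicodedata
from functools import lru_cache


def _is_mark(ch: str) -> bool:
    # Assumption: nonspacing (Mn) and enclosing (Me) marks count as combining.
    return unicodedata.category(ch) in ("Mn", "Me")


@lru_cache(maxsize=None)
def _decompose(ch: str) -> str:
    """Full canonical decomposition of one code point (unordered)."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return ch  # no mapping, or compatibility-only: leave as is
    return "".join(_decompose(chr(int(cp, 16))) for cp in mapping.split())


@lru_cache(maxsize=None)
def _mark_profile(ch: str) -> tuple[bool, int]:
    """(is this code point made only of marks?, trailing marks it contributes)."""
    full = _decompose(ch)
    trailing = 0
    for c in reversed(full):
        if not _is_mark(c):
            break
        trailing += 1
    return trailing == len(full), trailing


def exceeds_limit(text: str, limit: int = 3) -> bool:
    """True if any base character carries more than limit combining marks,
    counting marks hidden inside precomposed characters."""
    run = 0
    for ch in text:
        only_marks, trailing = _mark_profile(ch)
        # A mark extends the current run; a base character restarts it at n.
        run = run + trailing if only_marks else trailing
        if run > limit:
            return True
    return False
```

Precomposed and decomposed spellings get the same verdict: `exceeds_limit("\u1ec7\u0301", 2)` and `exceeds_limit("e\u0323\u0302\u0301", 2)` are both `True` (three marks on one base), while `exceeds_limit("Vi\u1ec7t Nam", 2)` is `False`.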

zvr 21 hours ago | parent | prev [-]

I can point out that Greek needs n=2: for accent and breathing.