cabirum a day ago

How do I allow "stępień" while detecting Zalgo-isms?

egypturnash a day ago | parent | next

Zalgo is largely the result of abusing combining modifiers. Declare that any string with more than n combining modifiers in a row is invalid.

n=1 is probably a reasonable falsehood to believe about names until someone points out that language X regularly has multiple combining modifiers in a row. At that point you can bump n up to around the maximum number of combining modifiers language X is likely to have, add a special case to say "this is probably language X, so we don't look for Zalgos", or just give up: put some Zalgo in your test corpus, start looking for places where it breaks things, and fix whatever breaks in a way that isn't funny.
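
Concretely, a minimal sketch of that rule in Python (the names and default threshold are mine; it treats anything in Unicode category M as a combining mark):

    import unicodedata

    def has_zalgo_run(s: str, n: int = 1) -> bool:
        """True if s contains a run of more than n combining marks."""
        run = 0
        for ch in s:
            if unicodedata.category(ch).startswith("M"):  # Mn/Mc/Me
                run += 1
                if run > n:
                    return True
            else:
                run = 0
        return False

This counts marks as they appear in the input; normalize to NFD first if you want the marks hidden inside precomposed letters to count too.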

ahazred8ta 19 hours ago | parent | next

N=2 is common in Việt Nam (vowel sound + tonal pitch).

anttihaapala 19 hours ago | parent

Yet Vietnamese can be written in Unicode without any combining characters whatsoever: in NFC normalization each character is a single code point, just like U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW in your example.
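
To illustrate (a quick check you can run in Python):

    import unicodedata

    word = "Việt"
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", word)])
    # ['U+0056', 'U+0069', 'U+1EC7', 'U+0074'] -- one code point per letter
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", word)])
    # ['U+0056', 'U+0069', 'U+0065', 'U+0323', 'U+0302', 'U+0074']
    # -- decomposed, ệ is e + dot below + circumflex (canonical order)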

cryptonector 16 hours ago | parent

u/egypturnash's point was about limiting glyph complexity. You could canonically decompose, then look for more than N (say, N=3) combining codepoints in a row and reject if any are found. Canonical forms have nothing to do with actual glyph complexity, but conceptually[0] normalizing is a good first step.

[0] I say conceptually because you might implement a form-insensitive Zalgo detector that looks at each non-combining codepoint, looks it up in the Unicode database to find how many combining codepoints it would yield under canonical decomposition, calls that `n`, then counts all the following combining codepoints from there; if the total exceeds `N`, it rejects. This approach is close to optimal because most characters in most strings don't decompose to more than one codepoint, and even when they do, you save the cost of allocating a buffer to normalize into and the associated memory stores.
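
A rough Python sketch of that form-insensitive detector (names are mine; unicodedata.decomposition only returns one level of decomposition, so the lookup recurses, and a production version would precompute a table instead):

    import unicodedata

    def implied_marks(ch: str) -> int:
        """Combining codepoints ch would yield under full canonical
        decomposition, looked up without actually normalizing."""
        if unicodedata.category(ch).startswith("M"):
            return 1
        d = unicodedata.decomposition(ch)
        if not d or d.startswith("<"):  # none, or compatibility-only
            return 0
        return sum(implied_marks(chr(int(cp, 16))) for cp in d.split())

    def reject_zalgo(s: str, N: int = 3) -> bool:
        run = 0
        for ch in s:
            if unicodedata.category(ch).startswith("M"):
                run += 1                 # an explicit combining codepoint
            else:
                run = implied_marks(ch)  # marks implied by the base char
            if run > N:
                return True
        return False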

zvr 21 hours ago | parent | prev

I can point out that Greek needs n=2: accent plus breathing mark.

seba_dos1 a day ago | parent | prev | next

There's nothing special about "Stępień": it has no combining characters, just the usual diacritics as precomposed codepoints in the Basic Multilingual Plane (U+0119 and U+0144). I bet there are some names out there that would make it harder, but this isn't one of them.

cryptonector 16 hours ago | parent

If you decompose it, it uses combining codepoints. Still nothing special.
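
Concretely, in Python:

    import unicodedata

    name = "Stępień"
    print([f"U+{ord(c):04X}" for c in name])
    # ['U+0053', 'U+0074', 'U+0119', 'U+0070', 'U+0069', 'U+0065', 'U+0144']
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", name)])
    # now ę is e + U+0328 COMBINING OGONEK and ń is n + U+0301 COMBINING ACUTE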

KPGv2 a day ago | parent | prev | next

I could answer your question better if I knew why you need to detect Zalgo-isms.

dpassens 19 hours ago | parent | prev | next

Why do you need to detect Zalgo-isms and why is it so important that you want to force people to misspell their names?

tobyhinloopen 18 hours ago | parent | prev | next

We have a whitelist of allowed characters, which is a pretty big list.

I think we based it on Lodash's deburr source code: if deburr's output is a-z plus some common symbols, the input passes (and we store the original value).

https://www.geeksforgeeks.org/lodash-_-deburr-method/
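
(For anyone without Lodash: a rough Python approximation of that check is below. It is only an approximation; deburr uses an explicit character map, which also folds letters like "ł" that have no Unicode decomposition, and the allowed-symbol set here is made up.)

    import re
    import unicodedata

    ALLOWED = re.compile(r"^[a-z' .,-]+$", re.IGNORECASE)

    def passes_whitelist(original: str) -> bool:
        # deburr-ish: decompose, then drop the combining marks
        folded = unicodedata.normalize("NFKD", original)
        folded = "".join(c for c in folded
                         if not unicodedata.category(c).startswith("M"))
        return bool(ALLOWED.match(folded))

    # passes_whitelist("Stępień") -> True (folds to "Stepien"); store the original
    # passes_whitelist("Łukasz") -> False here, though deburr's map folds ł to l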

zootboy a day ago | parent | prev

For the unaware (including myself): https://en.wikipedia.org/wiki/Zalgo_text

If you really think you need to programmatically detect and reject these (I'm dubious), there is probably a reasonable limit on the number of diacritics per character.

https://stackoverflow.com/a/11983435
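
If you do go that route, one sketch of such a per-character cap: Python's stdlib re doesn't support \p{M}, so this uses the third-party regex module, and the threshold is arbitrary (see the discussion of n upthread):

    import unicodedata
    import regex  # pip install regex; stdlib re lacks \p{M}

    ZALGO = regex.compile(r"\p{M}{4,}")  # 4+ marks in a row is suspicious

    def looks_zalgo(s: str) -> bool:
        # NFD first so marks hidden in precomposed letters count too
        return bool(ZALGO.search(unicodedata.normalize("NFD", s)))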