cabirum a day ago
How do I allow "stępień" while detecting Zalgo-isms?
egypturnash a day ago
Zalgo is largely the result of abusing combining modifiers. Declare that any string with more than n combining modifiers in a row is invalid. n=1 is probably a reasonable falsehood to believe about names until someone points out that language X regularly has multiple combining modifiers in a row. At that point you can bump n up to roughly the maximum number of combining modifiers language X is likely to have, add a special case that says "this is probably language X, so we don't look for Zalgos", or just give up: put some Zalgo in your test corpus, start looking for places where it breaks things, and fix whatever breaks in a way that isn't funny.
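A minimal Python sketch of the "more than n combining modifiers in a row" rule, using only the standard `unicodedata` module. The function names and the default `limit=1` are illustrative assumptions, not anything from a library. It counts only nonspacing (`Mn`) and enclosing (`Me`) marks and normalizes to NFC first, so a letter plus a single mark that has a precomposed form does not count toward the run.

```python
import unicodedata

def max_combining_run(text: str) -> int:
    """Length of the longest run of consecutive combining marks in text."""
    longest = current = 0
    for ch in text:
        # Mn = nonspacing mark, Me = enclosing mark (assumed scope of "combining modifiers")
        if unicodedata.category(ch) in ("Mn", "Me"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

def looks_like_zalgo(text: str, limit: int = 1) -> bool:
    """True if text has more than `limit` combining marks in a row after NFC."""
    # NFC folds letter+mark pairs into single code points where a precomposed form exists.
    composed = unicodedata.normalize("NFC", text)
    return max_combining_run(composed) > limit
```

With this sketch, `looks_like_zalgo("stępień")` is false whether the input arrives precomposed or decomposed, while a base letter followed by a long stack of marks is flagged. Raising `limit` is the "bump n" option described above.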
seba_dos1 a day ago
There's nothing special about "Stępień": in its usual precomposed form it has no combining characters, just ordinary diacritics that have their own codepoints in the Basic Multilingual Plane (U+0119 and U+0144). I bet there are some names out there that would make it harder, but this isn't one.
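A quick way to check this yourself, sketched in Python with the standard `unicodedata` module (the helper name `combining_marks` is made up for illustration). It also shows a caveat: the same name in decomposed (NFD) form does contain combining marks, so a detector should probably normalize to NFC before counting.

```python
import unicodedata

# "Stępień" written with escapes so the literal is unambiguously precomposed.
name = "St\u0119pie\u0144"

def combining_marks(text: str) -> list:
    """Code points in text that have a nonzero canonical combining class."""
    return [f"U+{ord(ch):04X}" for ch in text if unicodedata.combining(ch)]

nfc = unicodedata.normalize("NFC", name)  # precomposed: ę and ń are single code points
nfd = unicodedata.normalize("NFD", name)  # decomposed: e + ogonek, n + acute

print(len(nfc), combining_marks(nfc))
print(len(nfd), combining_marks(nfd))
```

The NFC form has seven code points and no combining marks. The NFD form has nine, including U+0328 (combining ogonek) and U+0301 (combining acute accent).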
KPGv2 a day ago
I could answer your question better if I knew why you need to detect Zalgo-isms.
dpassens 19 hours ago
Why do you need to detect Zalgo-isms and why is it so important that you want to force people to misspell their names?
tobyhinloopen 18 hours ago
We have a whitelist of allowed characters, which is a pretty big list. I think we based it on the source code of Lodash's deburr. If deburr's output is a-z and some common symbols, it passes (and we store the original value).
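A rough Python approximation of that idea, for illustration only. This is not Lodash's `deburr`: it strips diacritics by NFD decomposition rather than a lookup table, so letters with no canonical decomposition (such as "Ł") are rejected here. The `ALLOWED` set and function names are hypothetical.

```python
import string
import unicodedata

# Hypothetical allow-list: ASCII letters plus a few symbols common in names.
ALLOWED = set(string.ascii_letters + " -'.")

def strip_marks(text: str) -> str:
    """Decompose to NFD and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def is_acceptable_name(name: str) -> bool:
    """Accept a non-empty name whose stripped form uses only allowed characters."""
    return bool(name) and all(ch in ALLOWED for ch in strip_marks(name))
```

As in the comment above, the caller would store the original string and use the stripped form only for the check. Note that this sketch on its own does not reject Zalgo text on ASCII base letters, because stacked marks are stripped before the check; it would need to be paired with a limit on consecutive marks.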
zootboy a day ago
For the unaware (including myself): https://en.wikipedia.org/wiki/Zalgo_text If you really think you need to programmatically detect and reject these (I'm dubious), there is probably a reasonable limit on the number of diacritics per character.