Remix.run Logo
ddulaney 14 hours ago

There's at least one major exception to this: Unicode normalization.

It's possible for the same logical character to have two different sets of code points (for example, a-with-umlaut as a single character, vs a followed by umlaut combining diacritic). Related, distinguishing between the "a" character in Latin, Greek, Cyrillic, and the handful of other times it shows up throughout Unicode.

This comes up in at least 3 ways:

1. A usability issue. It's not always easy to predict which identical-looking variant is produced by different input methods, so users enter the identical-looking characters on different devices but get an account not found error.

2. A security issue. If some of your backend systems handle these kinds of characters differently, that can cause all kinds of weird bugs, some of which can be exploited.

3. An abuse issue. If it's possible to create accounts with the same-looking name as others that aren't the same account, there can be vectors for impersonation, harassment, and other issues.

So you have to make a policy choice about how to handle this problem. The only things that I've seen work are either restricting the allowed characters (often just to printable ASCII) or being very clear and strict about always performing one of the standard Unicode transformations. But doing that transformation consistently across a big codebase has some real challenges: in particular, it can change based on Unicode version, and guaranteeing that all potential services use the same Unicode version is really non-trivial. So lots of people make the (sensible) choice not to deal with it.

But yeah, agreed that parenthesis should be OK.

speleding 4 hours ago | parent | next [-]

Something we just ran in to: There are two UTF-8 codepoints for the @ character, the normal one and "Full width At Sign U+FF20". It took a lot of head scratching to understand why several Japanese users could not be found with their email address when I was seeing their email right there in the database.

teddyh 3 hours ago | parent [-]

There are actually two more: U+FE6B and U+E0040.

tugu77 8 hours ago | parent | prev [-]

[dead]