numpad0 3 days ago

What system takes UTF-8 for usernames? Everyone knows that all programmatically manipulated and/or evaluated identifiers, including login usernames and passwords, need to be in ASCII - not even ISO-8859-1, just plain old ASCII. Unicode generally doesn't work for those purposes. A username in the sense of a friendly display string is fine, but for a username in the sense of a system login, anything outside ASCII is a no-go.

I mean, I don't even know whether my keyboard software produces consistent UTF-8 for the exact same intended visual representation outside the ASCII range, let alone across different operating systems and configurations, or over time. Or vice versa: whether the binary I leave behind will consistently correspond to how future Unicode interpretation AIs read it.

... speaking of consistency, neither the article nor RFC 9839 mentions IVS situations or the NFC/NFD/NFKC/NFKD normalization problem as explicitly in or out of scope. Overall it feels like this RFC is missing the entire "Purpose" section, except for a vague notion of there being non-character code points.

zzo38computer 2 days ago | parent | next [-]

For passwords, you might not need to care about the character encoding, since they are not going to be displayed anyways. You should allow any password, and the maximum length should not be too short.

For usernames, I think your point is valid; you might restrict usernames to a subset of ASCII (not arbitrary ASCII; e.g. you might disallow spaces and some punctuation), or use numeric user IDs, while the display name might be less restricted. (In some cases (probably uncommon) you might also use a different character set than ASCII if that is desirable for your application, but Unicode is not a good way to do it.)
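As a rough illustration of the "subset of ASCII" idea, here is a minimal Python sketch. The specific policy (3 to 32 characters, lowercase letters, digits, `_` and `-`, starting with a letter) and the names `USERNAME_RE` and `is_valid_username` are assumptions made up for the example, not something any particular system requires.

```python
import re

# Hypothetical policy: a lowercase ASCII letter followed by 2-31 characters
# drawn from lowercase ASCII letters, digits, '_' and '-'.
USERNAME_RE = re.compile(r"[a-z][a-z0-9_-]{2,31}")


def is_valid_username(name: str) -> bool:
    """Return True if `name` fits the hypothetical ASCII-only login policy."""
    # fullmatch() requires the whole string to match, so a trailing newline
    # or any non-ASCII character causes rejection.
    return USERNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["amelie", "Am\u00e9lie", "jo hn", "a_b-1"]:
        print(repr(candidate), is_valid_username(candidate))
```

The display name would be stored separately and left unrestricted; only the login identifier goes through this check.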

(I also think that Unicode is not good; it is helpful for many applications to have i18n (although you should be aware of which parts should use it and which shouldn't), but Unicode is not a good way to do it.)

numpad0 2 days ago | parent [-]

> For passwords, you might not need to care about the character encoding, since they are not going to be displayed anyways.

That would be reasonable if there were a strict 1:1 correspondence between intended text and binary representations, but there isn't. Unicode has equivalents of British and American spellings, and users have no control over which one they get: precomposed vs. combining characters, variation selectors, etc. Making it the developer's obligation to ensure it all normalizes into one canonical password string is unreasonable, and just falling back to ASCII is much more reasonable.
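To make the mismatch concrete, here is a small Python sketch using the standard `unicodedata` module. The sample strings and the helper name `canonical_bytes` are illustrative assumptions, and NFC is just one possible choice of normalization form. It shows that precomposed and combining spellings of the same visible text differ as bytes, that NFC reconciles that particular pair, and that NFC leaves a variation selector in place.

```python
import unicodedata

# Two spellings of the same visible text (illustrative sample data).
precomposed = "Am\u00e9lie"   # 'é' as the single code point U+00E9
decomposed = "Ame\u0301lie"   # 'e' followed by combining acute accent U+0301

# A kanji with and without a variation selector (U+E0101 chosen for illustration).
plain_kanji = "\u93ae"
variant_kanji = "\u93ae\U000e0101"


def canonical_bytes(password: str) -> bytes:
    """Normalize to NFC, then encode as UTF-8 (one possible canonical form)."""
    return unicodedata.normalize("NFC", password).encode("utf-8")


if __name__ == "__main__":
    # The raw byte sequences differ even though the text looks identical.
    print(precomposed.encode("utf-8") == decomposed.encode("utf-8"))
    # After NFC normalization the two spellings agree.
    print(canonical_bytes(precomposed) == canonical_bytes(decomposed))
    # NFC does not strip variation selectors, so these still differ.
    print(canonical_bytes(plain_kanji) == canonical_bytes(variant_kanji))
```

So normalization handles the precomposed/combining case but not every source of divergence, which is the part that makes the obligation open-ended.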

I guess everyone using alphanumeric sequences for every identifier is somewhat imperialistic in a sense, but it's close to the least controversial of the general cultural imperialism problems. It's probably okay to leave it unsolved for a century or two.

Timwi 3 days ago | parent | prev [-]

This is such a provincial attitude, wanting to prohibit people from using perfectly normal names like Amélie or Jürgen or Ольга to log in just because you as a programmer can't be bothered to deal with a numerical ID instead.

numpad0 3 days ago | parent | next [-]

You can have such names as Amélie or Jürgen or Ольга or 𠮷野 or 鎮󠄁 and have them displayed on account management screens; you just can't use them for login IDs, because there are no guarantees that those blobs can be reproduced in the future or will be consistent with what they were at the time of entry.

Unicode is that bad.

account42 2 days ago | parent | prev [-]

No, it's an entirely practical attitude, unlike outrage culture.