zarzavat a day ago

Presumably there aren't any people with control characters in their name, for example.

cobbzilla a day ago | parent | next [-]

Watch as someone names themselves the bell character, “^G” (ASCII code 7) [1]

When they meet people, they tell them their name is unpronounceable, it’s the sound of a PC speaker from the late 20th century, but you can call them by their preferred nickname “beep”.

In paper and online forms they are probably forced to go by the name “BEL”.

[1] https://en.wikipedia.org/wiki/Bell_character

emmelaich 19 hours ago | parent | next [-]

Or Derek <wood dropping on desk>

https://www.youtube.com/watch?v=hNoS2BU6bbQ

Polizeiposaune 16 hours ago | parent | next [-]

The interaction brings to mind Grzegorz Brzęczyszczykiewicz:

https://www.youtube.com/watch?v=AfKZclMWS1U

(from the Polish comedy film "How I Unleashed World War II")

pavel_lishin 18 hours ago | parent | prev [-]

I thought this was going to be a link to the Key & Peele sketch: https://youtu.be/gODZzSOelss?t=180

Izkata 10 hours ago | parent | prev | next [-]

It's not exactly a bell, but there are clicks: https://en.wikipedia.org/wiki/Click_consonant

https://www.reddit.com/r/Damnthatsinteresting/comments/1614k...

RobotToaster 7 hours ago | parent | prev [-]

I can finally change my name to something that represents my personality: ^G^C

https://en.wikipedia.org/wiki/End-of-Text_character

ValentinA23 a day ago | parent | prev | next [-]

คุณ สมชาย

This name, "คุณสมชาย" (Khun Somchai, a common Thai name), appears normal but has a Zero Width Space (U+200B) between "คุณ" (Khun, a title like Mr./Ms.) and "สมชาย" (Somchai, a given name).

In scripts like Thai and Chinese, where words are written without spaces, invisible characters can be inserted to signal word boundaries or provide a hint to text processing systems.
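To make the invisible character visible, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `find_format_chars` is made up for illustration, and the name is built with an explicit `\u200b` escape so the zero width space cannot get lost in copy/paste.

```python
import unicodedata

def find_format_chars(text):
    """Return (index, code point, name) for every Cf (format) character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

# "Khun" + ZERO WIDTH SPACE + "Somchai"; the escape keeps the invisible character explicit.
name = "คุณ\u200bสมชาย"

for index, code_point, char_name in find_format_chars(name):
    print(index, code_point, char_name)
```

This only reports where such characters are; whether to keep or strip them is a separate decision.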

Saigonautica 12 hours ago | parent | next [-]

This reminds me of a few Thai colleagues who ended up with a legal first name of "Mr." (period included), probably as a result of this.

Buying them plane tickets to attend meetings and so on proved fairly difficult.

pwdisswordfishz a day ago | parent | prev [-]

But C0 and C1 control codes are out, probably.

lmm 14 hours ago | parent | prev | next [-]

> Presumably there aren't any people with control characters in their name, for example.

Of course there are. If you commit to supporting everything anyone wants to do, people will naturally test the boundaries.

The biggest fallacy programmers believe about names is that getting name support 100% right matters. Real engineers build something that works well enough for enough of the population and ship it, and if that's not US-ASCII only then it's usually pretty close to it.

pwdisswordfishz a day ago | parent | prev | next [-]

Or unpaired surrogates. Or unassigned code points. Or fullwidth characters. Or "mathematical bold" characters. Though the latter two should probably be solved with NFKC normalization instead.

chrismorgan 13 hours ago | parent [-]

> Or unpaired surrogates.

That’s just an invalid Unicode string, then. Unicode strings are sequences of Unicode scalar values, not code points.

> unassigned code points

Ah, the tyranny of Unicode version support. I was going to suggest that it could be reasonable to check all code points are assigned at data ingress time, but then you urgently need to make sure that your ingress system always supports the latest version of Unicode. As soon as some part of the system goes depending on old Unicode tables, some data processing may go wrong!

How about Private Use Area? You could surely reasonably forbid that!

> fullwidth characters

I’m not so comfortable with halfwidth/fullwidth distinctions, but couldn’t fullwidth characters be completely legitimate?

(Yes, I’m happy to call mathematical bold, fraktur, &c. illegitimate for such purposes.)

> solved with NFKC normalization

I’d be very leery of doing this on storage; compatibility normalisations are fine for equivalence testing, things like search and such, but they are lossy, and I’m not confident that the lossiness won’t affect legitimate names. I don’t have anything specific in mind, just a general apprehension.
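A rough Python sketch of that split, assuming you keep the name exactly as entered and use NFKC only to build a comparison key. The helper names `nfkc` and `names_match` are made up for illustration; `unicodedata.normalize` is the standard library call.

```python
import unicodedata

def nfkc(text):
    """Compatibility-normalize text; intended for comparison keys, not for storage."""
    return unicodedata.normalize("NFKC", text)

def names_match(a, b):
    """Compare two names by their NFKC keys, leaving the stored originals untouched."""
    return nfkc(a) == nfkc(b)

fullwidth = "\uff21\uff4e\uff4e\uff41"                  # fullwidth "Anna"
math_bold = "\U0001d400\U0001d427\U0001d427\U0001d41a"  # mathematical bold "Anna"

print(nfkc(fullwidth), nfkc(math_bold))
print(names_match(fullwidth, "Anna"))

# The same normalization also rewrites other compatibility characters,
# e.g. the single "fi" ligature U+FB01 becomes two letters. That is the lossiness.
print(len("\ufb01"), len(nfkc("\ufb01")))
```

The ligature is only a demonstration that information is discarded; it is not a claim about any particular real name.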

eyelidlessness a day ago | parent | prev | next [-]

That sounds like a reasonable assumption, but probably not strictly correct.

samatman 17 hours ago | parent | prev | next [-]

It's safe to reject Cc, Cn, and Cs. You should probably reject Co as well, even though elves can't input their names if you do that.

Don't reject Cf. That's asking for trouble.

chrismorgan 13 hours ago | parent [-]

Explanation for those not accustomed, based on <https://www.unicode.org/reports/tr44/#GC_Values_Table> (with my own commentary):

Cc: Control, a C0 or C1 control code. (Definitely safe to reject.)

Cn: Unassigned, a reserved unassigned code point or a noncharacter. (Safe to reject if you keep up to date with Unicode versions; but if you don’t stay up to date, you risk blocking legitimate characters defined more recently, for better or for worse. The fixed set of 66 noncharacters is definitely safe to reject.)

Cs: Surrogate, a surrogate code point. (I’d put it stronger: you must reject these, it’s wrong not to.)

Co: Private_Use, a private-use character. (About elf names, I’m guessing samatman is referring to Tolkien’s Tengwar writing system, as assigned in the ConScript Unicode Registry to U+E000–U+E07F. There has long been a concrete proposal for inclusion in Unicode’s Supplementary Multilingual Plane <https://www.unicode.org/roadmaps/smp/>, from time to time it gets bumped along, and since fairly recently the linked spec document is actually on unicode.org, not sure if that means something.)

Cf: Format, a format control character. (See the list at <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[...>. You could reject a large number of these, but some are required by some scripts, such as ZERO WIDTH NON-JOINER in Indic scripts.)
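For anyone who wants to try this, a minimal Python sketch of the category check with the standard `unicodedata` module. The names `REJECTED_CATEGORIES`, `rejected_chars` and `is_acceptable` are made up for illustration, and what counts as Cn depends on the Unicode tables bundled with your Python build, which `unicodedata.unidata_version` reports.

```python
import unicodedata

# General categories rejected per the advice above; Cf (format) is deliberately absent.
REJECTED_CATEGORIES = {"Cc", "Cn", "Cs", "Co"}

def rejected_chars(name):
    """List (code point, category) for each character whose category is rejected."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in name
        if unicodedata.category(ch) in REJECTED_CATEGORIES
    ]

def is_acceptable(name):
    """True when no character in name falls in a rejected category."""
    return not rejected_chars(name)

# Cn results are only as current as these tables.
print("Unicode tables:", unicodedata.unidata_version)

print(rejected_chars("beep\x07"))   # the bell character is Cc
print(is_acceptable("a\u200cb"))    # ZERO WIDTH NON-JOINER is Cf, so it passes
```

This is a blocklist only; it says nothing about length limits, normalization, or the Cf characters you might still want to restrict.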

baruchel 19 hours ago | parent | prev | next [-]

Mandatory reference: https://xkcd.com/327/

kijin a day ago | parent | prev [-]

Challenge accepted, I'll try to put a backspace and a null byte in my firstborn's name. Hope I don't get swatted for crashing the government servers.