Muromec 14 hours ago

You can treat names as byte blobs for as long as you don't use them for their purpose -- naming people.

Suppose you have a unicode blob of my name in your database and there is a problem and you need to call me and say hi. Would your customer representative be able to pronounce my name somewhat correctly?

>I think there're very few exceptions to this, probably something law-related, or if you have to interact with some legacy service.

What is a few exceptions for you is the entirety of the service for others. At the very least you interact with the legacy software of payment systems, which have some ideas about what names should be.

kmoser 13 hours ago | parent | next [-]

> Would your customer representative be able to pronounce my name somewhat correctly?

Are you implying the CSR's lack of familiarity with the pronunciation of your name means your name should be stored/rendered incorrectly?

Muromec 11 hours ago | parent [-]

Quite the opposite actually. I want it stored correctly and in a way that both me and CSR can understand and so it can be used to interface with other systems.

I don’t, however, know which Unicode subset to use, because you didn’t tell me in the signup form. I have many options, all of them correct, but I don’t know whether your CSR can read Ukrainian Cyrillic, or whether you know what the vocative case is and will avoid it when interfacing with the government CA, which expects the nominative.

ACS_Solver 6 hours ago | parent | next [-]

I think you're touching on another problem, which is that we as users rarely know why the form wants a name. Is it to be used in emails, or for sending packages, or for talking to me?

My language also has a separate vocative case, but I live in a country that has no concept of it and just vestiges of a case system. I enter my name in the nominative, which then of course looks weird if I get emails/letters from them later - they have no idea they should use the vocative. If I knew the form was just for sending me emails, I'd maybe enter my name in the vocative.

Engineers, or UX designers, or whoever does this, like to pretend names are simple. They're just not (obligatory reference to the "falsehoods about names" article). There are many distinct cases for why you may want my name and they may all warrant different input.

- Name to use in letters or emails. It doesn't matter if a CSR can pronounce this if it's used in writing, it should be a name I like to see in correspondence. Maybe it's in a script unfamiliar to most CSRs, or maybe it's just a vocative form.

- Name for verbal communication. Just about anything could be appropriate depending on the circumstances. Maybe an anglicized name I think your company will be able to pronounce, maybe a name in a non-Latin script if I expect it to be understood here, maybe a name in a Latin-extended script if I know most people will still say it reasonably well intuitively. But it could also be an entirely different name from the written one if I expect the written one to be butchered.

- Name for package deliveries. If I'm ordering a package from abroad, I want my name (and address) written in my local convention - I don't care if the vendor can't read it, first the package will make its way to my country using the country and postal code identifiers, and then it should have info that makes sense to the local logistics companies, not to the seller's IT system.

- Legal name because we're entering a contract or because my ID will be checked later on for some reason.

- Machine-readable legal name for certain systems like airlines. For most of the world's population, this is not the same as the legal name but of course English-language bias means this is often overlooked.
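The purposes listed above could be modelled as separate fields instead of one "name" column. What follows is only a minimal sketch: `NameRecord`, its field names, and `name_for` are hypothetical, and the example values are simply what a user might type in, not something the code derives.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical purposes, mirroring the list above.
PURPOSES = ("correspondence", "spoken", "delivery", "legal", "machine_readable")


@dataclass(frozen=True)
class NameRecord:
    """One user-supplied name per purpose. Nothing is derived automatically."""
    correspondence: str                      # name the user likes to see in writing
    spoken: Optional[str] = None             # what a CSR should say on a call
    delivery: Optional[str] = None           # as printed on a label, local convention
    legal: Optional[str] = None              # exactly as on the ID document
    machine_readable: Optional[str] = None   # e.g. the form an airline system expects

    def name_for(self, purpose: str) -> Optional[str]:
        """Return the name for a purpose, or None if the user never gave one.

        None means "ask the user"; it deliberately does not fall back
        to a different field and guess.
        """
        if purpose not in PURPOSES:
            raise ValueError(f"unknown purpose: {purpose!r}")
        return getattr(self, purpose)


# Example values as a user might enter them (illustrative only).
record = NameRecord(correspondence="Stępień", machine_readable="STEPIEN")
```

Here `record.name_for("spoken")` returns `None`, which tells the caller that nobody asked the user how the name should be said, so it has to be asked rather than inferred.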

dgfitz 8 hours ago | parent | prev [-]

In this specific case, it seems like your concerns are a hypothetical, no?

swiftcoder 8 hours ago | parent [-]

Not really, no. A lot of us only really have to deal with English-adjacent input (i.e. European languages that share the majority of character forms with English, or cultures that explicitly Anglicise their names when dealing with English folks).

As soon as you have to deal with users with a radically different alphabet/input-method, the wheels tend to come off. Can your CSR reps pronounce names written in Chinese logographs? In Arabic script? In the Hebrew alphabet?

cowsandmilk 7 hours ago | parent [-]

You can analyze the name and direct a case to a CSR who can handle it. May be unrealistic for a 1-2 person company, but every 20+ person company I’ve worked at has intentionally hired CSRs with different language abilities.

Muromec 6 hours ago | parent | next [-]

First off, no, you can't infer language preference from a name. Even a reasonable and well-meaning assumption about my name, on a good day, only makes me sad and irritated.

And even if you could, I can't tell whether you actually do it by looking at what your signup form asks me to input.

michaelt 7 hours ago | parent | prev [-]

A requirement to do that is an extremely broad definition of "treat strings as opaque blobs most of the time" IMHO :)

arghwhat 6 hours ago | parent | prev | next [-]

> Suppose you have a unicode blob of my name in your database and there is a problem and you need to call me and say hi. Would your customer representative be able to pronounce my name somewhat correctly?

You cannot pronounce the name regardless of whether it is written in ASCII. Pronouncing a name requires at the very least knowledge of the language it originated in, and attempts at reading it with an English pronunciation can range from incomprehensible to outright offensive.

The only way to correctly deal with a name that you are unfamiliar with the pronunciation of is to ask how it is pronounced.

You must store and operate on the person's name as is. Requiring a name modified, or modifying it automatically, is unacceptable - in many cases legal names must be represented accurately as your records might be used for e.g. tax or legal reasons later.

Muromec 3 hours ago | parent | next [-]

>You must store and operate on the person's name as is. Requiring a name modified, or modifying it automatically, is unacceptable

But this is simply not true in practice and at times it's just plain wrong in theory too. The in practice part is trivially discoverable in the real world.

As to in theory -- I do in fact want a properly functioning service to use my name in a vocative case (which requires modifying it automatically or having a dictionary of names) in their communications that are sent in my native language. Not doing that is plainly grammatically wrong and borderline impolite. In fact I use services that do it just right. I also don't want to have to specify the correct version myself, as it's trivially derivable through established rules of the languages.

arghwhat an hour ago | parent [-]

Sure, there are sites that mistreat names in ways you describe, but that does not make it correct.

> I do in fact want a properly functioning service to use my name in a vocative case. ... I also don't want to have to specify the correct version myself, as it's trivially derivable through established rules of the languages.

There would be nothing to discuss if this was trivial.

> Not doing that is plainly grammatically wrong and borderline impolite.

Do you know what's more than borderline impolite? Getting someone's name wrong, or even claiming that their legal name is invalid and thereby making it impossible for them to sign up.

If getting a name right and using a grammatical form are mutually exclusive, there is no argument to be had about which to prioritize.

throw_a_grenade 2 hours ago | parent | prev [-]

Sorry to nitpick, but you underestimated: "many cases" is really "all cases", no exception, because under GDPR you have the right to correct your data (this is about a legal name, so obviously covered). So if a user requests under GDPR art. 16 that his/her name be represented in a way that matches an ID card or whatever legal document, then you either do it, or you pay a fine and then you do it.

That a particular technical solution is incapable of storing it in the preferred way is not an excuse. EBCDIC is incompatible with GDPR: https://news.ycombinator.com/item?id=28986735

kgeist 9 hours ago | parent | prev | next [-]

>Would your customer representative be able to pronounce my name somewhat correctly?

Typical input validation doesn't really solve the problem. For instance, I could enter my name as 'Vrdtpsk,' which is a perfectly valid ASCII string that passes all validation rules, but no one would be able to pronounce it correctly. I believe the representative (if on a call) should simply ask the customer how they would like to be addressed. Unless we want to implement a whitelist of allowed names for customers to choose from...
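To make the point concrete, here is a small sketch of a "typical" ASCII-only rule. The pattern and the function name `passes_ascii_validation` are assumptions made up for illustration, not any particular site's rule.

```python
import re

# Hypothetical "typical" rule: ASCII letters, then letters, spaces,
# hyphens or apostrophes.
ASCII_NAME = re.compile(r"[A-Za-z][A-Za-z '\-]*")


def passes_ascii_validation(name: str) -> bool:
    """True if the whole string matches the ASCII-only pattern."""
    return ASCII_NAME.fullmatch(name) is not None


accepted = passes_ascii_validation("Vrdtpsk")   # valid ASCII, unpronounceable
rejected = passes_ascii_validation("Stępień")   # a real name, not ASCII
```

The rule accepts "Vrdtpsk" and rejects "Stępień", so it filters on character set and says nothing about whether anyone can pronounce what was entered.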

manarth 8 hours ago | parent [-]

Derek would like a word.

https://www.youtube.com/watch?v=hNoS2BU6bbQ

Intermernet 7 hours ago | parent | prev | next [-]

Many Japanese companies require an alternative name entered in half-width kana to alleviate this exact problem. Unfortunately, most Japanese websites have a million other UX problems that overshadow this clever solution.

arghwhat 6 hours ago | parent [-]

This is a problem specific to languages using Chinese characters where most only know some characters and therefore might not be able to read a specific one. Furigana (which is ultimately what you're providing in a separate field here) is often used as a phonetic reading aid, but still requires you to know Japanese to read and pronounce it correctly.

The only generic solution I can think of would be IPA notation, but it would be entirely unreasonable to expect someone to know the IPA for their name, just as it would be unreasonable to expect a random third party to know how to read IPA and replicate the sounds it describes.

red_admiral 4 hours ago | parent | prev | next [-]

> Would your customer representative be able to pronounce my name somewhat correctly?

If the user is Chinese and the CSR is not - probably no, and that's not a Unicode issue.

hobs 12 hours ago | parent | prev | next [-]

Absolutely not - do not build anything based on "would your CSR be able to pronounce" something - that's an awful bar - most CSRs can't pronounce my name - would I be excluded from your database?

Seriously, what are you going for here?

Muromec 11 hours ago | parent [-]

That’s the most basic consideration for names, unless you only show it to the user themselves — other people have to be able to read it at least somehow.

Which is why the bag-of-Unicode-bytes approach is as wrong as telling Stęphań he has an invalid name.

hobs 10 hours ago | parent | next [-]

Absolutely not. There's no way to understand what a source user's reading capability is. There's no way to understand how a person will pronounce their name by simply reading it; this only works for common names.

soco 7 hours ago | parent | prev [-]

And here we go again, engineers expecting the world to behave in a way that fits their framework du jour. Unfortunately, the real world doesn't care about our engineering bubble and goes on with life - where you can be called !xóõ Kxau or ꦱꦭꦪꦤ or X Æ A-12.

Muromec 3 hours ago | parent [-]

I can be called what I want, and in fact I have a perfectly reasonable name that fits neither ASCII nor the FN+LN convention. The thing is, your website accepting whatever UTF-8 blob my name can be serialized to today, without actually understanding it, makes my life worse, not better.

hobs 2 hours ago | parent [-]

No, it allows an exact representation of your name, it doesn't do anything to your life.

If you don't like your name, either change it or go complain to your parents. They might tell you that your cultural reference point is more important than some person being able to read your name off of a computer screen.

If you want to store a phonetic name for the destination speaker that's not a bad idea, but a name is a name is a name. It is your unique identifier, do not munge it.

Muromec 2 hours ago | parent [-]

But it does affect my life in a way you refuse to understand. That's the problem -- there isn't a true canonical representation of a name (any name really) that fits all practical purposes. Storing a bag of bytes to display back to the user is the easiest of practical purposes, and suggesting a practice that solves only that is worse than rejecting Stępień: it's a refusal to understand the complexities, which eventually leads to doing the wrong thing and failing your user without even telling them.

>It is your unique identifier, do not munge it.

It's not a good identifier either. Nobody uses names as identifiers at any scale that matters for computers. You can't assume they don't have collisions, you can't tell whether two bags of bytes identify the same person or two different people, and they aren't even immutable and sometimes are accidentally mutable.
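One narrow, mechanical instance of the "two bags of bytes" problem: the same visible name can be stored as different code point sequences. The sketch below uses Python's standard `unicodedata.normalize`; the helper name `same_text` is made up for illustration.

```python
import unicodedata

# The same visible name, written two ways.
composed = "St\u0119pie\u0144"        # precomposed letters, 7 code points
decomposed = "Ste\u0328pien\u0301"    # base letters + combining marks, 9 code points


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization.

    This only answers "is this the same text?". It says nothing about
    whether two records refer to the same person.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_equal = composed == decomposed                 # False: different code points
normalized_equal = same_text(composed, decomposed)  # True: same text after NFC
```

Normalization removes this particular mismatch, but it does not make "Stępień" equal to "Stepien", and it does nothing about collisions, name changes, or different scripts.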

soco an hour ago | parent [-]

Then where is the problem? If the support can read Polish they will pronounce your name properly, if they're from India they will mess it up, so why should we have different expectations? Nobody will identify you by name anyway; they will ask what to call you (chatbots do this already) and then use all kinds of IDs and PINs and whatnot for proper identification. So we are talking here about a complexity that nobody actually needs, not even you. So let the name be saved and displayed in the nice native way, and you as a programmer make sure you don't go Bobby Tables with the strings.
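For the "don't go Bobby Tables" part, a minimal sketch using Python's standard `sqlite3` module with a parameterized query. The `users` table and the `store_name` helper are hypothetical.

```python
import sqlite3


def store_name(conn: sqlite3.Connection, name: str) -> None:
    """Insert the name as data via a bound parameter, never via string formatting."""
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

names = ["Stępień", "Robert'); DROP TABLE users;--", "ꦱꦭꦪꦤ"]
for n in names:
    store_name(conn, n)

# Every name comes back exactly as entered, and the table still exists.
stored = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]
```

Because the name is passed as a bound parameter, the quote and semicolon in the second entry are stored as plain text, and the non-ASCII names round-trip unchanged.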

Muromec 17 minutes ago | parent [-]

>if they're from India they will mess it up

Or not be able to read it at all.

>Then where is the problem?

Since you don't indicate for what purpose my name is stored, which may actually be display only, any of the following can happen:

A name as entered in your system is compared to a name entered in a different system, or you interface (maybe indirectly and unknowingly) with a system using different constraints or a different script, maybe imposed by its jurisdiction. As a result, the intended operation does not go through.

This may happen in an indirect way that is invisible to you -- e.g. you produce an artifact, say an invoice, or issue a payment card using $script a, which I will only later figure out I can't use, because it's expected to be in $script b, or even worse, to be in $script a presumed to match the $script b they have on record. One of the non-obvious ways it can fail is when you try to determine whether two names in the same script are actually the same in order to infer a family relationship, or something else that you should not do anyway.

It may also happen within your system, in a way your CSR will deny is possible.

That's on the more severe side, which means I will not try to use my name in any rendering that doesn't match the MRZ of my identity document. That was probably the opposite of what you intended by allowing an arbitrary bag of bytes to be entered. No, this is not a problem I made up because I'm bored; it's a real thing.

On the less severe side, not understanding names is a failure in the i18n department, because you can't support my language properly without understanding how my name should be changed when you address me, when you simply show it next to the user icon, and when you describe relations between me and objects and people. If you can't do proper i18n and a different provider can, you may lose me as a customer, because your attitude is presumed to be "everyone can just use ASCII and English". Yes, there are people who actually get it right, because they put effort into this human aspect.

On the mildly annoying but inconsequential side, people also have a habit of trying to infer gender from names despite having gender clearly marked in their system.

benatkin 10 hours ago | parent | prev [-]

> Would your customer representative be able to pronounce my name somewhat correctly?

Worst case, just drop to hexadecimal.