kgeist 14 hours ago

My rule of thumb is to treat strings as opaque blobs most of the time. The only validation I'd always enforce is some sane length limit, to prevent users from shoving entire novels inside. If you treat your strings as opaque blobs, and use UTF-8, most internationalization problems go away. IMHO, input validation is often an attempt to solve a problem from the wrong side. Say, when XSS or SQL injections are found on a site, I've seen people's first reaction be to validate user input by looking for "special symbols", or to add a whitelist of allowed characters, instead of simply escaping strings right before rendering HTML (and modern frameworks do it automatically), or using parameterized queries if it's SQL. If a user wants to call themselves "alert('hello')", why not? Why the arbitrary limits? I think there're very few exceptions to this, probably something law-related, or if you have to interact with some legacy service.
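
To make the "escape late, parameterize queries" point concrete, here is a minimal sketch using only the Python standard library. The `users` table, the helper names, and the hostile sample name are made up for illustration; a real app would normally lean on its framework's templating and ORM instead.

```python
import html
import sqlite3


def store_name(conn: sqlite3.Connection, name: str) -> None:
    # The "?" placeholder sends the value separately from the SQL text,
    # so the name is never parsed as part of the statement.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def render_greeting(name: str) -> str:
    # Escape only at the moment the string is placed into HTML.
    return "<p>Hello, " + html.escape(name) + "!</p>"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# Hypothetical hostile "name", stored byte-for-byte as entered.
hostile = "Robert'); DROP TABLE users;-- <script>alert('hello')</script>"
store_name(conn, hostile)

stored = conn.execute("SELECT name FROM users").fetchone()[0]
page = render_greeting(stored)
print(page)
```

The stored value is identical to what the user typed; only the rendered page contains the escaped form.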

rafram 11 hours ago | parent | next [-]

Sanitizing your strings immediately before display is all well and good until you need to pass them to some piece of third-party software that is very dumb and doesn’t sanitize them. You’ll argue that it’s the vendor’s fault, but the vendor will argue that nobody else allows characters like that in their name inputs!

See the Companies House XSS injection situation, where their rationale for forcing a business to change its name was that others using their database could be vulnerable: https://www.theregister.com/2020/10/30/companies_house_xss_s...

arkh 6 hours ago | parent | next [-]

You sanitize at the frontier of what your code controls.

Sending data to a database: parametrized queries to sanitize as it is leaving your control.

Sending to display to the user: sanitize for the browser

Sending to an API: sanitize for whatever rules the API has

Sending to a legacy system: sanitize for it

Writing a file to the system: sanitize the path

The common point is you don't sanitize before you have to send it somewhere. And the advantage of this method is that you limit the chances of getting bit by reflected injections. You interrogate some API you don't control, you may just get malicious content, but you sanitize when sending it so all is good. Because you're sanitizing on output and not on input.
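
A rough sketch of what "sanitize at the frontier" could look like: one raw value, encoded differently for each destination at the moment it leaves. The function names are hypothetical, and the filename rule (keep only the last path component) is just one possible policy, not a complete path-safety check.

```python
import html
import json
from pathlib import PurePosixPath
from urllib.parse import quote


def for_html(value: str) -> str:
    # Browser output: HTML-escape.
    return html.escape(value)


def for_url_segment(value: str) -> str:
    # URL path segment: percent-encode everything, including "/".
    return quote(value, safe="")


def for_json(value: str) -> str:
    # JSON API: let the serializer handle quoting.
    return json.dumps({"name": value})


def for_filename(value: str) -> str:
    # File system: keep only the final path component (assumed policy).
    name = PurePosixPath(value.replace("\\", "/")).name
    if name in ("", ".", ".."):
        raise ValueError("not usable as a file name")
    return name


raw = "<b>Bob</b>/../x?y=1"  # stored as-is; encoded per sink below
print(for_html(raw))
print(for_url_segment(raw))
print(for_json(raw))
print(for_filename("../../etc/passwd"))
```

The raw value is never modified in storage, so data coming back from an API you don't control gets the same treatment on its way out.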

shaky-carrousel 2 hours ago | parent [-]

Be liberal in what you accept, and conservative in what you send.

afiori 6 hours ago | parent | prev | next [-]

Forbidding users from using your service to propagate "little Bobby Tables" pseudo-pranks is likely a good choice.

The choice is different if like most apps you are almost only a data sink, but if you are also a data source for others it pays to be cautious.

dcow 6 hours ago | parent [-]

I think it’s more of an ethical question than anything. There will always be pranksters and there will never be perfect input validation for names. So who do you oppress? The people with uncommon names? Or the pranksters? I happen to think that if you do your job right, the pranksters aren’t really a problem. So why oppress those with less common names?

afiori 5 hours ago | parent [-]

I am not saying to only allow [a-zA-Z ]+ in names; what I am saying is that it is OK to block names like "'; drop table users;" or "<script src="https://bad.site.net/"></script>" if part of your business is to distribute that data to other consumers.

dcow 5 hours ago | parent [-]

And I’m arguing, rhetorically, what if your name produces a syntax error—or worse means something semantically devious—in the query language I’m using? Not all problems look like script tags and semicolons.

foldr 3 hours ago | parent [-]

It's a question of intent. There aren't any hard and fast rules, but if someone has chosen their company name specifically in order to cause problems for other people using your service, then it's reasonable to make them change it.

rob74 7 hours ago | parent | prev [-]

> but the vendor will argue that nobody else allows characters like that in their name inputs

...and maybe they will even link to this page to support that statement! But, seeing that most of the pages are German, I bet they do accept the usual German "special" letters (ÄÖÜß) in names?

ddulaney 14 hours ago | parent | prev | next [-]

There's at least one major exception to this: Unicode normalization.

It's possible for the same logical character to have two different code point sequences (for example, a-with-umlaut as a single character, vs a followed by an umlaut combining diacritic). A related problem is distinguishing between the "a" character in Latin, Greek, Cyrillic, and the handful of other places it shows up throughout Unicode.

This comes up in at least 3 ways:

1. A usability issue. It's not always easy to predict which identical-looking variant is produced by different input methods, so users enter the identical-looking characters on different devices but get an account not found error.

2. A security issue. If some of your backend systems handle these kinds of characters differently, that can cause all kinds of weird bugs, some of which can be exploited.

3. An abuse issue. If it's possible to create accounts with the same-looking name as others that aren't the same account, there can be vectors for impersonation, harassment, and other issues.

So you have to make a policy choice about how to handle this problem. The only things that I've seen work are either restricting the allowed characters (often just to printable ASCII) or being very clear and strict about always performing one of the standard Unicode transformations. But doing that transformation consistently across a big codebase has some real challenges: in particular, it can change based on Unicode version, and guaranteeing that all potential services use the same Unicode version is really non-trivial. So lots of people make the (sensible) choice not to deal with it.
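
A small sketch of the "always apply one standard transformation" option, using Python's `unicodedata` module. Choosing NFC here is an assumption for illustration, not a recommendation over the other forms, and the `canonical` helper name is made up.

```python
import unicodedata


def canonical(text: str) -> str:
    # Apply one normalization form consistently before comparing or storing.
    return unicodedata.normalize("NFC", text)


composed = "\u00e4"     # a-with-umlaut as a single code point
decomposed = "a\u0308"  # "a" followed by COMBINING DIAERESIS

print(composed == decomposed)                        # raw strings differ
print(canonical(composed) == canonical(decomposed))  # equal once normalized

# Normalization does not merge look-alikes from different scripts.
latin_a = "a"
cyrillic_a = "\u0430"
print(canonical(latin_a) == canonical(cyrillic_a))

# The Unicode data version this interpreter was built with; services on
# different versions may disagree about newly assigned characters.
print(unicodedata.unidata_version)
```

This covers the usability case in point 1, but the impersonation case in point 3 needs a separate confusables policy, since Latin and Cyrillic "a" remain distinct after normalization.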

But yeah, agreed that parentheses should be OK.

speleding 4 hours ago | parent | next [-]

Something we just ran into: there are two Unicode code points for the @ character, the normal one and "Fullwidth Commercial At", U+FF20. It took a lot of head scratching to understand why several Japanese users could not be found with their email address when I was seeing their email right there in the database.

teddyh 3 hours ago | parent [-]

There are actually two more: U+FE6B and U+E0040.
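
For what it's worth, here is a sketch of how the at-sign variants behave under compatibility normalization. Using NFKC to build a lookup key is an assumption on my part (it is lossy, so you would keep the original string as well), and `email_lookup_key` is a made-up helper name.

```python
import unicodedata


def email_lookup_key(address: str) -> str:
    # NFKC folds compatibility variants (fullwidth, small forms) to the
    # plain character. Use it for lookup keys, not as the stored value.
    return unicodedata.normalize("NFKC", address)


fullwidth = "user\uff20example.com"  # U+FF20 FULLWIDTH COMMERCIAL AT
small = "user\ufe6bexample.com"      # U+FE6B SMALL COMMERCIAL AT
tag = "user\U000e0040example.com"    # U+E0040 TAG COMMERCIAL AT

for candidate in (fullwidth, small, tag):
    key = email_lookup_key(candidate)
    print(ascii(candidate), "->", ascii(key))
```

U+FF20 and U+FE6B both fold to a plain "@". U+E0040 is a tag character with no compatibility mapping, so it passes through NFKC unchanged and needs separate handling.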

wvh 5 hours ago | parent | prev | next [-]

Because you don't want to ever store bad data. There's no point to that; it will just create annoying situations and potential security risks. And the best place to catch bad data is when the user is still present so they can be made aware of the issue (in case they care and are able to solve it). Once they're gone, it becomes nearly impossible and/or very expensive to check what they meant.

Muromec 13 hours ago | parent | prev | next [-]

You can treat names as byte blobs for as long as you don't use them for their purpose -- naming people.

Suppose you have a unicode blob of my name in your database and there is a problem and you need to call me and say hi. Would your customer representative be able to pronounce my name somewhat correctly?

>I think there're very few exceptions to this, probably something law-related, or if you have to interact with some legacy service.

What is a few exceptions for you is the entirety of the service for others. At the very least you interact with the legacy software of payment systems, which have some ideas about what names should be.

kmoser 13 hours ago | parent | next [-]

> Would your customer representative be able to pronounce my name somewhat correctly?

Are you implying the CSR's lack of familiarity with the pronunciation of your name means your name should be stored/rendered incorrectly?

Muromec 11 hours ago | parent [-]

Quite the opposite actually. I want it stored correctly and in a way that both me and CSR can understand and so it can be used to interface with other systems.

I don’t however know which Unicode subset to use, because you didn’t tell me in the signup form. I have many options, all of them correct, but I don’t know whether your CSR can read Ukrainian Cyrillic and whether you can tell what vocative case is and not use that when interfacing with the government CA which expects nominative.

ACS_Solver 6 hours ago | parent | next [-]

I think you're touching on another problem, which is that we as users rarely know why the form wants a name. Is it to be used in emails, or for sending packages, or for talking to me?

My language also has a separate vocative case, but I live in a country that has no concept of it and just vestiges of a case system. I enter my name in the nominative, which then of course looks weird if I get emails/letters from them later - they have no idea to use the vocative. If I knew the form is just for sending me emails, I'd maybe enter my name in the vocative.

Engineers, or UX designers, or whoever does this, like to pretend names are simple. They're just not (obligatory reference to the "falsehoods about names" article). There are many distinct cases for why you may want my name and they may all warrant different input.

- Name to use in letters or emails. It doesn't matter if a CSR can pronounce this if it's used in writing, it should be a name I like to see in correspondence. Maybe it's in a script unfamiliar to most CSRs, or maybe it's just a vocative form.

- Name for verbal communication. Just about anything could be appropriate depending on the circumstances. Maybe an anglicized name I think your company will be able to pronounce, maybe a name in a non-Latin script if I expect it to be understood here, maybe a name in a Latin-extended script if I know most people will still say it reasonably well intuitively. But it could also be an entirely different name from the written one if I expect the written one to be butchered.

- Name for package deliveries. If I'm ordering a package from abroad, I want my name (and address) written in my local convention - I don't care if the vendor can't read it, first the package will make its way to my country using the country and postal code identifiers, and then it should have info that makes sense to the local logistics companies, not to the seller's IT system.

- Legal name because we're entering a contract or because my ID will be checked later on for some reason.

- Machine-readable legal name for certain systems like airlines. For most of the world's population, this is not the same as the legal name but of course English-language bias means this is often overlooked.

dgfitz 8 hours ago | parent | prev [-]

In this specific case, it seems like your concerns are a hypothetical, no?

swiftcoder 8 hours ago | parent [-]

Not really, no. A lot of us only really have to deal with English-adjacent input (i.e. European languages that share the majority of character forms with English, or cultures that explicitly Anglicise their names when dealing with English folks).

As soon as you have to deal with users with a radically different alphabet/input-method, the wheels tend to come off. Can your CSR reps pronounce names written in Chinese logographs? In Arabic script? In the Hebrew alphabet?

cowsandmilk 7 hours ago | parent [-]

You can analyze the name and direct a case to a CSR who can handle it. May be unrealistic for a 1-2 person company, but every 20+ person company I’ve worked at has intentionally hired CSRs with different language abilities.

Muromec 6 hours ago | parent | next [-]

First off, no, you can't infer language preference from a name. The reasonable and well-meaning assumption about my name on a good day only makes me sad and irritated.

And even if you could, I don't know if you actually do it by looking at what your signup form asks me to input.

michaelt 6 hours ago | parent | prev [-]

A requirement to do that is an extremely broad definition of "treat strings as opaque blobs most of the time" IMHO :)

arghwhat 6 hours ago | parent | prev | next [-]

> Suppose you have a unicode blob of my name in your database and there is a problem and you need to call me and say hi. Would your customer representative be able to pronounce my name somewhat correctly?

You cannot pronounce the name regardless of whether it is written in ASCII. Pronouncing a name requires at the very least knowledge of the language it originated in, and attempts at reading it with an English pronunciation can range from incomprehensible to outright offensive.

The only way to correctly deal with a name that you are unfamiliar with the pronunciation of is to ask how it is pronounced.

You must store and operate on the person's name as is. Requiring a name modified, or modifying it automatically, is unacceptable - in many cases legal names must be represented accurately as your records might be used for e.g. tax or legal reasons later.

Muromec 2 hours ago | parent | next [-]

>You must store and operate on the person's name as is. Requiring a name modified, or modifying it automatically, is unacceptable

But this is simply not true in practice and at times it's just plain wrong in theory too. The in practice part is trivially discoverable in the real world.

As to in theory -- I do in fact want a properly functioning service to use my name in a vocative case (which requires modifying it automatically or having a dictionary of names) in their communications that are sent in my native language. Not doing that is plainly grammatically wrong and borderline impolite. In fact I use services that do it just right. I also don't want to know to specify the correct version myself, as it's trivially derivable through established rules of the languages.

arghwhat an hour ago | parent [-]

Sure, there are sites that mistreat names in ways you describe, but that does not make it correct.

> I do in fact want a properly functioning service to use my name in a vocative case. ... I also don't want to know to specify the correct version myself, as it's trivially derivable through established rules of the languages.

There would be nothing to discuss if this was trivial.

> Not doing that is plainly grammatically wrong and borderline impolite.

Do you know what's more than borderline impolite? Getting someone's name wrong, or even claiming that their legal name is invalid and thereby making it impossible for them to sign up.

If getting a name right and using a grammatical form are mutually exclusive, there is no argument to be had about which to prioritize.

throw_a_grenade 2 hours ago | parent | prev [-]

Sorry to nitpick, but you underestimated: "many cases" is really "all cases", no exception, because under GDPR you have right to correct your data (this is about legal name, so obviously covered). So if user requests under GDPR art. 16 that his/her name is to be represented in a way that matches ID card or whatever legal document, then you either do it, or you pay a fine and then you do it.

That a particular technical solution is incapable of storing it in the preferred way is not an excuse. EBCDIC is incompatible with GDPR: https://news.ycombinator.com/item?id=28986735

kgeist 8 hours ago | parent | prev | next [-]

>Would your customer representative be able to pronounce my name somewhat correctly?

Typical input validation doesn't really solve the problem. For instance, I could enter my name as 'Vrdtpsk,' which is a perfectly valid ASCII string that passes all validation rules, but no one would be able to pronounce it correctly. I believe the representative (if on a call) should simply ask the customer how they would like to be addressed. Unless we want to implement a whitelist of allowed names for customers to choose from...

manarth 8 hours ago | parent [-]

Derek would like a word.

https://www.youtube.com/watch?v=hNoS2BU6bbQ

Intermernet 7 hours ago | parent | prev | next [-]

Many Japanese companies require an alternative name entered in half width kana to alleviate this exact problem. Unfortunately, most Japanese websites have a million other UX problems that overshadow this clever solution to the problem.

arghwhat 6 hours ago | parent [-]

This is a problem specific to languages using Chinese characters where most only know some characters and therefore might not be able to read a specific one. Furigana (which is ultimately what you're providing in a separate field here) is often used as a phonetic reading aid, but still requires you to know Japanese to read and pronounce it correctly.

The only generic solution I can think of would be IPA notation, but it would be entirely unreasonable to expect someone to know the IPA for their name, just as it would be unreasonable to expect a random third party to know how to read IPA and replicate the sounds it described.

red_admiral 4 hours ago | parent | prev | next [-]

> Would your customer representative be able to pronounce my name somewhat correctly?

If the user is Chinese and the CSR is not - probably no, and that's not a Unicode issue.

hobs 12 hours ago | parent | prev | next [-]

Absolutely not - do not build anything based on "would your CSR be able to pronounce" something - that's an awful bar - most CSRs can't pronounce my name - would I be excluded from your database?

Seriously, what are you going for here?

Muromec 10 hours ago | parent [-]

That’s the most basic consideration for names, unless you only show it to the user themselves — other people have to be able to read it at least somehow.

Which is why the bag of Unicode bytes approach is as wrong as telling Stęphań he has an invalid name.

hobs 10 hours ago | parent | next [-]

Absolutely not. There's no way to understand what a source user's reading capability is. There's no way to understand how a person will pronounce their name by simply reading it, this only works for common names.

soco 6 hours ago | parent | prev [-]

And here we go again, engineers expecting the world should behave fitting their framework du jour. Unfortunately, the real world doesn't care about our engineering bubble and goes on with life - where you can be called !xóõ Kxau or ꦱꦭꦪꦤ or X Æ A-12.

Muromec 3 hours ago | parent [-]

I can be called what I want, and in fact I have a perfectly reasonable name that fits neither ASCII nor the FN+LN convention. The thing is, your website accepting whatever UTF-8 blob my name can be serialized to today, without actually understanding it, makes my life worse, not better.

hobs 2 hours ago | parent [-]

No, it allows an exact representation of your name, it doesn't do anything to your life.

If you don't like your name, either change it or go complain to your parents. They might tell you that your cultural reference point is more important than some person being able to read your name off of a computer screen.

If you want to store a phonetic name for the destination speaker that's not a bad idea, but a name is a name is a name. It is your unique identifier, do not munge it.

Muromec 2 hours ago | parent [-]

But it does affect my life in a way you refuse to understand. That's the problem -- there isn't a true canonical representation of a name (any name really) that fits all practical purposes. Storing a bag of bytes to display back to the user is the easiest of the practical purposes, and suggesting a practice that solves only that is worse than rejecting Stępień: it's a refusal to understand the complexities, which eventually leads to doing the wrong thing and failing your user without even telling them.

>It is your unique identifier, do not munge it.

It's not a good identifier either. Nobody uses names as identifiers at any scale that matters for computers. You can't assume they don't have collisions, you can't tell whether two bags of bytes identify the same person or two different, they aren't even immutable and sometimes are accidentally mutable.

soco an hour ago | parent [-]

Then where is the problem? If the support can read Polish they will pronounce your name properly, if they're from India they will mess it up, why should we have different expectations? Nobody will identify you by name anyway, they will ask what to call you (chatbots do this already) and then use all kinds of IDs and PINs and whatnot for proper identification. So we are talking here about a complexity that nobody actually needs, not even you. So let the name be saved and displayed in the nice native way, and you as a programmer make sure you don't go Bobby Tables with the strings.

Muromec 4 minutes ago | parent [-]

>if they're from India they will mess it up

Or not be able to read it at all.

>Then where is the problem?

Problems usually happen when a name as entered in your system is compared to a name entered in a different system, or when you interface (maybe indirectly and unknowingly) with a system using different constraints or a different script, maybe imposed by their jurisdiction.

This may happen in an indirect way that is invisible to you -- e.g. you produce an artifact, say an invoice, or issue a payment card using $script a, which I will only later figure out I can't use, because it's expected to be in $script b, or even worse, to be in $script b and presumed to match the $script a they have on record. One of the non-obvious ways it can fail is when you try to determine whether two names in the same script are actually the same, to infer a family relationship or something else that you should not do anyway.

It may happen within your system in a way your CSR will deny is possible as well.

That's on the more severe side, which means I will not try to use the name in any rendering that doesn't match the MRZ of my identity document. Which was probably the opposite of what you intended by allowing an arbitrary bag of bytes to be entered. No, that is not a problem I made up because I'm bored; it's a thing.

On the less severe side, not understanding names is a failure in the i18n department, because you can't support my language properly without understanding how my name should be changed when you address me, when you simply show it near a user icon, and when you describe relations between me and objects and people. If you can't do proper i18n and a different provider can, you may lose me as a customer, because your attitude is presumed to be "everyone can just use ASCII and English". Yes, people exist that actually get it right because they put effort into this human aspect.

On the mildly annoying but inconsequential side, people also have a habit of trying to infer gender based on names despite having gender clearly marked in their system.

benatkin 10 hours ago | parent | prev [-]

> Would your customer representative be able to pronounce my name somewhat correctly?

Worst case, just drop to hexadecimal.

beagle3 5 hours ago | parent | prev | next [-]

You do need to use a canonical representation, or you will have two distinct blobs that look exactly the same, tricking other users of the data (other posters in a forum, customer service people in a company, etc)

JodieBenitez 5 hours ago | parent | prev | next [-]

> or if you have to interact with some legacy service.

Which happens almost every day in the real world.

lyu07282 4 hours ago | parent | prev | next [-]

> The only validation I'd always enforce is some sane length limit, [..]

Venture into the abyss of UTF-8 and behold the madness of multibyte characters. Diacritics dance devilishly upon characters, deceiving your simple count. Think a letter is but a single entity? Fools! Combining characters lurk in the shadows, binding invisibly, elongating the uninitiated's count into chaos. Every attempt to enumerate the true length of a string in UTF-8 conjures a specter of complications. Behold, a single glyph, yet multiple bytes cackle beneath, a multitude of codepoints coalesce in arcane unison. It is beautiful t he final snuffing of the lie s of Man ALL IS LOST ALL I S LOST the pony he comes he comes he comes the ich or permeates all MY FACE MY FACE ᵒh god no NO NOOO O NΘ stop the an * gles are n ot real ZALGΌ IS TOƝȳ THE PO NY HE COMES
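
A less dramatic illustration of the same point, as a sketch: three different "lengths" of one visible letter. Counting code points with a zero combining class is only a crude stand-in for real grapheme segmentation (UAX #29), which the Python standard library does not provide. The `measurements` helper is hypothetical.

```python
import unicodedata


def measurements(text: str) -> dict:
    return {
        # Number of Unicode code points.
        "code_points": len(text),
        # Number of bytes once encoded as UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
        # Code points that are not combining marks (rough "visible" count).
        "non_combining": sum(
            1 for ch in text if unicodedata.combining(ch) == 0
        ),
    }


# One base letter with nine stacked combining marks.
zalgo_like = "Z" + "\u0301\u0308\u0323" * 3
print(measurements(zalgo_like))
```

So a "sane length limit" first has to say which of these numbers it is limiting.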

77pt77 14 hours ago | parent | prev [-]

> If you treat your strings as opaque blobs, and use UTF-8, most internationalization problems go away

This is laughably naive.

So many things can go wrong.

Strings are not arrays of bytes.

There is a price to pay if someone doesn't understand that or chooses to ignore it.

shakna 9 hours ago | parent | next [-]

> Strings are not arrays of bytes.

That very much depends on the language that you are using. In some, they are.

hughesjj 14 hours ago | parent | prev | next [-]

RTL go brrr

rpigab 6 hours ago | parent [-]

RTL is so much fun, it's the gift that keeps on going. When I first encountered it I thought, OK, maybe some junior web app developers will sometimes forget that it exists and a fun bug or two will get into production. But it's everywhere: Windows, GNU/Linux, automated emails. It can make malware harder for users to detect in Windows, because you can hide the .exe at the beginning of the filename, etc.

Here it is today in GNOME 46.0, after so many years, this should say "selected": https://github.com/user-attachments/assets/306737fb-6b01-467... In previous GNOME versions it would mess up even more text in the file properties window.

Here's an article about it, but I couldn't find the more interesting blogpost about RTL: https://krebsonsecurity.com/2011/09/right-to-left-override-a...
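
A minimal sketch of how one might flag the override trick, assuming the policy is simply "refuse explicit bidi control characters in file names". The sample file name is made up, and ordinary right-to-left text is deliberately left alone.

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}


def has_bidi_controls(text: str) -> bool:
    # True if any character is an explicit bidi formatting control.
    return any(
        unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES for ch in text
    )


# U+202E (RIGHT-TO-LEFT OVERRIDE) makes the tail display reversed, so this
# .exe can be shown looking like it ends in ".pdf".
suspicious = "invoice\u202efdp.exe"
print(has_bidi_controls(suspicious))
print(has_bidi_controls("report.pdf"))
print(has_bidi_controls("\u05e9\u05dc\u05d5\u05dd.pdf"))  # plain Hebrew name
```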

lelandbatey 10 hours ago | parent | prev [-]

And yet when stored on any computer system, that string will be encoded using some number of bytes. Which you can set a limit on even though you cannot cut, delimit, or make any other inference about that string from the bytes without doing some kind of interpretation. But the bytes limit is enough for the situation the OP is talking about.
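
As a sketch of that byte limit: measure the encoded size and reject over-long input rather than truncating, since cutting at an arbitrary byte offset can split a multi-byte character. The limit value and the function name are assumptions.

```python
def within_byte_limit(text: str, max_bytes: int) -> bool:
    # Compare the UTF-8 encoded size against the limit; no interpretation
    # of the content is needed beyond encoding it.
    try:
        encoded = text.encode("utf-8")
    except UnicodeEncodeError:
        # Lone surrogates cannot be encoded as UTF-8; treat as invalid.
        return False
    return len(encoded) <= max_bytes


print(within_byte_limit("a" * 10, 10))       # 10 bytes
print(within_byte_limit("\u00e9" * 10, 10))  # 10 code points, 20 bytes
```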