ninkendo 4 days ago

It seems like most of these are handled by just rejecting invalid UTF-8 byte sequences (ideally, erroring out altogether) when interpreting a string as UTF-8. I mean, unpaired surrogates, or any surrogates for that matter, are already illegal as UTF-8 byte sequences. Any competent language that uses UTF-8 for strings should already be returning errors when given such sequences.
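As a rough sketch of what that rejection looks like in Rust: the standard library's `std::str::from_utf8` refuses a surrogate written in the usual 3-byte pattern. The helper names `parse_strict` and `three_byte_pattern` are made up for illustration.

```rust
/// Hypothetical helper: interpret raw bytes as UTF-8, returning an error
/// for any ill-formed sequence instead of silently accepting it.
fn parse_strict(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Hypothetical helper: write a 16-bit value in the 3-byte UTF-8 bit pattern
/// WITHOUT checking whether it is a surrogate. Only meaningful for values
/// in 0x0800..=0xFFFF, which is all this sketch needs.
fn three_byte_pattern(unit: u16) -> [u8; 3] {
    [
        0xE0 | (unit >> 12) as u8,
        0x80 | ((unit >> 6) & 0x3F) as u8,
        0x80 | (unit & 0x3F) as u8,
    ]
}
```

With these, `parse_strict(&three_byte_pattern(0xD800))` is an `Err`, while the same pattern for a non-surrogate such as U+20AC parses fine.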

The list of code points which are problematic (non-printing, etc) is IMO much more useful and nontrivial. But it’d be useful to treat those as a separate concept from plain-old illegal UTF-8 byte sequences.

doug_durham 4 days ago

That seems reasonable. That choice should be up to the application implementer, not a lower-level, more general-purpose library. I haven't run into any JSON parsers meant only for usernames.

account42 2 days ago

> Any competent language that uses UTF-8 for strings should already be returning errors when given such sequences.

No, they shouldn't, because that's how you get file managers that can't manage files.

ninkendo 2 days ago

The file manager wouldn’t use the “string” type to hold file names, if it’s written properly. Languages like Rust have things like OsString as separate from String for just this reason.

If you have a type that says “my contents are valid UTF-8”, then you should reject invalid UTF-8 when populating it, obviously. Why would it work any other way? If you need a type that can hold arbitrary byte sequences, use a type that can hold arbitrary byte sequences.
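A minimal sketch of that split, assuming a Unix target (where `OsString` can be built from arbitrary bytes via `std::os::unix::ffi::OsStringExt`). The helper names `file_name_from_bytes` and `as_text` are hypothetical.

```rust
use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt; // Unix-only assumption: names are arbitrary bytes

/// Hypothetical helper: wrap raw file-name bytes with no UTF-8 validation.
fn file_name_from_bytes(raw: Vec<u8>) -> OsString {
    OsString::from_vec(raw)
}

/// Hypothetical helper: convert to `String` only when the name is valid UTF-8.
/// On failure the original `OsString` is handed back unchanged, so nothing is lost.
fn as_text(name: OsString) -> Result<String, OsString> {
    name.into_string()
}
```

So a name like `a\xFFb` can live in an `OsString` and be passed back to the OS as-is, while `String::from_utf8` on the same bytes returns an error.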

account42 a day ago

This is an unrealistic expectation. Local file names are just one example of many where you need to deal with UTF-8ish data that you should interpret as UTF-8 for display but pass along unmangled to other systems. Storing all that data twice and duplicating all relevant operations is both inefficient and will introduce more bugs as the two strings get out of sync. The gains from enforcing strict UTF-8 validation are minimal while the downsides are many - not the least of which is intentionally breaking forward compatibility with future Unicode versions that may extend what is valid.
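For illustration only, one way to "interpret as UTF-8 for display but pass along unmangled" in Rust is to store the bytes once and derive a lossy view on demand with `String::from_utf8_lossy`. The `RawName` type below is a hypothetical sketch, not something from an existing library.

```rust
use std::borrow::Cow;

/// Hypothetical type: "UTF-8ish" data stored exactly once, as raw bytes.
struct RawName(Vec<u8>);

impl RawName {
    /// Text for display only: ill-formed sequences are replaced with U+FFFD.
    /// Borrows the stored bytes when they are already valid UTF-8.
    fn display(&self) -> Cow<'_, str> {
        String::from_utf8_lossy(&self.0)
    }

    /// The exact, unmodified bytes to hand to other systems.
    fn as_bytes(&self) -> &[u8] {
        &self.0
    }
}
```

The lossy text is never written back, so the bytes that go to rename or delete calls are the ones that came in.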

It's also not what happens in practice. File managers that cannot rename or delete some files because they are unnecessarily "smart" about interpreting strings passed to them are very much how things have worked out in reality.