integralid 4 days ago

I'm not certain... On one hand I agree that some characters are problematic (or invalid) - like unpaired surrogates. But the worst case scenario is imo when people designing data structures and protocols start to feel the need to disallow arbitrary classes of characters, even properly escaped.

In the example, username validation is the job of another layer. For example, I want to make sure the username is shorter than 60 characters, has no emojis or zalgo text, and yes, no null bytes, and return a proper error from the API. I don't want my JSON parsing to fail at a completely different layer, before that validation even runs.

And for usernames some classes are obviously bad, as explained. But what if I send text files that actually use those weird tabs? I expect things that work in my language's utf8 "string" type to be encodable. Even more importantly, I see plenty of use cases for the null byte, and it is in fact often seen in JSON in the wild.

On the other hand, if we have to use a restricted set of "normal" Unicode characters, having a standard feels useful - better than everyone creating their own mini standard. So I think I like the idea, just don't buy the argumentation or examples in the blog post.

csande17 4 days ago | parent | next [-]

Yeah, I feel like the only really defensible choices you can make for string representation in a low-level wire protocol in 2025 are:

- "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

- "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"

- "Potentially ill-formed UTF-8", aka "an array of bytes", aka "the Go string type"

- Any of the above, plus "no U+0000", if you have to interface with a language/library that was designed before people knew what buffer overflow exploits were

mort96 4 days ago | parent | next [-]

> - "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"

I thought WTF-8 was just "UTF-8, but without the restriction against encoding unpaired surrogates"? Windows and Java and JavaScript all use "possibly ill-formed UTF-16" as their string type, not WTF-8.

layer8 4 days ago | parent | next [-]

Also known as UCS-2: https://www.unicode.org/faq/utf_bom.html#utf16-11

Surrogate pairs were only added with Unicode 2.0 in 1996, at which point Windows NT and Java already existed. The fact that those continue to allow unpaired surrogate characters is in part due to backwards compatibility.

account42 2 days ago | parent | next [-]

No, UCS-2 decoding would convert all surrogates into individual code points, but this isn't how "WTF-16" systems like Windows behave - paired surrogates get decoded into a combined code point.

da_chicken 4 days ago | parent | prev [-]

Yeah, people forget that Windows and Java appear to be less compliant, but the reality is that they moved on i18n before anybody else did, so the standard they follow is older.

Linux got to adopt UTF-8 because they just stuck their head in the sand and stayed on ASCII well past the time they needed to change. Even now, a lot of programs only support ASCII character streams.

mananaysiempre 4 days ago | parent | prev | next [-]

WTF-8 is more or less the obvious thing to use when NT/Java/JavaScript-style WTF-16 needs to fit into a UTF-8-shaped hole. And yes, it's UTF-8, except that you can encode surrogates, except that those surrogates can't form a valid pair (in that case you use the normal UTF-8 encoding of the code point designated by the pair).

(Some people instead encode each WTF-16 surrogate independently, regardless of whether it participates in a valid pair or not, yielding a UTF-8-like but UTF-8-incompatible-beyond-U+FFFF thing usually called CESU-8. We don't talk about those people.)
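
To make the rule concrete, here is a minimal sketch (mine, not code from the WTF-8 spec) of converting a WTF-16 code-unit sequence to WTF-8: valid pairs collapse into one supplementary code point, while lone surrogates get the generalized three-byte encoding.

  def wtf16_to_wtf8(units):
      out = bytearray()
      i = 0
      while i < len(units):
          u = units[i]
          if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
              # Valid surrogate pair: combine into one supplementary code point.
              cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
              i += 2
          else:
              # BMP code unit or lone surrogate: keep as-is.
              cp = u
              i += 1
          # 'surrogatepass' lets Python emit the 3-byte pattern for lone surrogates.
          out += chr(cp).encode('utf-8', 'surrogatepass')
      return bytes(out)
  
  # The valid pair becomes the 4-byte sequence for U+1F600; the trailing
  # lone surrogate stays as a 3-byte ED A0 80 sequence.
  print(wtf16_to_wtf8([0xD83D, 0xDE00, 0xD800]).hex())  # 'f09f9880eda080'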

layer8 4 days ago | parent [-]

The parent’s point was that “potentially ill-formed UTF-16" and "WTF-8" are inherently different encodings (16-bit word sequence vs. byte sequence), and thus not “aka”.

csande17 4 days ago | parent [-]

Although they're different encodings, the thing that they are encoding is exactly the same. I kinda wish I could edit "string representation" to "modeling valid strings" or something in my original comment for clarity...

layer8 4 days ago | parent [-]

By that logic, you could say ‘“UTF-8” aka “UTF-32”’, since they are encoding the same value space. But that’s just wrong.

deathanatos 4 days ago | parent [-]

The type is the same, i.e., if you look at a type as an infinite set of values, they are the same infinite set. Yes, their in-memory representations might differ, but it means all values in one exist in the other, and only those, so conversions between them are infallible.

So in your last example, UTF-8 & UTF-32 are the same type, containing the same infinite set of values, and — of course — one can convert between them infallibly.

But you can't encode arbitrary Go strings in WTF-8 (some are not representable), you can't encode arbitrary Python strings in UTF-8 or WTF-8 (n.b. that upthread is wrong about Python being equivalent to Unicode scalars/well-formed UTF-*.) and attempts to do so might error. (E.g., `.encode('utf-8')` in Python on a `str` can raise.)

account42 2 days ago | parent | prev | next [-]

Yes, they use WTF-16, not WTF-8, but WTF-8 is a compatible encoding.

zahlman 4 days ago | parent | prev [-]

I've always taken "WTF-8" to mean that someone had mistakenly interpreted UTF-8 data as being in Latin-1 (or some other code page) and UTF-8 encoded it again.

deathanatos 4 days ago | parent | next [-]

No, WTF-8[1] is a precisely defined format (that isn't that).

If you imagine a format that can encode JavaScript strings containing unpaired surrogates, that's WTF-8. (Well-formed WTF-8 is the same type as a JS string, though with a different representation.)

(Though that would have been a cute name for the UTF-8/latin1/UTF-8 fail.)

[1]: https://simonsapin.github.io/wtf-8/

Izkata 4 days ago | parent [-]

GP is right about the original meaning, author of that page acknowledges hijacking it here: https://news.ycombinator.com/item?id=9611710

zahlman 3 days ago | parent [-]

When I posted that, I was honestly projecting from my own use. I think I may have independently thought of the term on Stack Overflow prior to koalie's tweet, but it's not the easiest thing (by design) to search for comments there (and that's assuming they don't get deleted, which they usually should).

(On review, it appears that the thread mentions much earlier uses...)

Izkata 3 days ago | parent [-]

I did the search because I have a similar memory. I'd place it in the early 2000s, before StackOverflow existed, around when people were first switching from latin1 and Windows-1251 and others to UTF-8 on the web, and browsers would often pick the wrong encoding; IE had a submenu where you could tell it which one to use on the page. WTF-8 was a thing because occasionally none of these options would work, because the layers server-side would be misconfigured and cause the double (or more, if user input was involved) encoding. It was also used just generally to complain about UTF-8 breaking everything as it was slowly being introduced.

chrismorgan 4 days ago | parent | prev | next [-]

That thing was occasionally called WTF-8, but not often—it was normally called “double UTF-8” (if given a name at all).

In the last few years, the name has become very popular with Simon Sapin’s definition.

LocalH 2 days ago | parent | next [-]

Say "double UTF-8" out loud ;)

jibal 4 days ago | parent | prev [-]

"if given a name at all"

https://en.wikipedia.org/wiki/Mojibake

zahlman 3 days ago | parent [-]

This describes a broader concept.

alright2565 4 days ago | parent | prev | next [-]

> "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

Can you elaborate more on this? I understood the Python string to be UTF-32, with optimizations where possible to reduce memory use.

csande17 4 days ago | parent | next [-]

I could be mistaken, but I think Python cares about making sure strings don't include any surrogate code points that can't be represented in UTF-16 -- even if you're encoding/decoding the string using some other encoding. (Possibly it still lets you construct such a string in memory, though? So there might be a philosophical dispute there.)

Like, the basic code points -> bytes in memory logic that underlies UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]. But UTF-16 can't because the first sequence is a surrogate pair. So if your language applies the restriction that strings can't contain surrogate code points, it's basically emulating the UTF-16 worldview on top of whatever encoding it uses internally. The set of strings it supports is the same as the set of strings a language that does use well-formed UTF-16 supports, for the purposes of deciding what's allowed to be represented in a wire protocol.
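
For what it's worth, a quick Python REPL check of the in-memory distinction (whether that counts as valid Unicode is exactly what the replies below get into):

  >>> s_pair = '\ud83d\ude00'    # two surrogate code points
  >>> s_char = '\U0001f600'      # one scalar value, U+1F600
  >>> s_pair == s_char, len(s_pair), len(s_char)
  (False, 2, 1)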

MyOutfitIsVague 4 days ago | parent | next [-]

You're somewhat mistaken, in that "UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]." You're right that the encoding on a raw level is technically capable of this, but it is actually forbidden in Unicode. Those are invalid codepoints.

Using those codepoints makes for invalid Unicode, not just invalid UTF-16. Rust, which uses utf-8 for its String type, also forbids unpaired surrogates. `let illegal: char = 0xDEADu32.try_into().unwrap();` panics.

It's not that these languages emulate the UTF-16 worldview, it's that UTF-16 has infected and shaped all of Unicode. No code points are allowed that can't be unambiguously represented in UTF-16.

edit: This cousin comment has some really good detail on Python in particular: https://news.ycombinator.com/item?id=44997146

csande17 4 days ago | parent [-]

The Unicode Consortium has indeed published documents recommending that people adopt the UTF-16 worldview when working with strings, but it is not always a good idea to follow their recommendations.

zahlman 4 days ago | parent | prev [-]

You're not wrong; I gave more detail in a direct reply https://news.ycombinator.com/item?id=44997146 .

stuartjohnson12 4 days ago | parent | prev | next [-]

> "WTF-8", aka "the JavaScript string type"

This sequence of characters is a work of art.

wging 4 days ago | parent [-]

For more details: https://simonsapin.github.io/wtf-8/

dcrazy 4 days ago | parent | prev | next [-]

Why didn’t you include “Unicode Scalars”, aka “well-formed UTF-8”, aka “the Swift string type?”

Either way, I think the bitter lesson is that a parser really can't rely on the well-formedness of a Unicode string over the wire. Practically speaking, all wire formats are potentially ill-formed until parsed into a non-wire format (or rejected by the same parser).

csande17 4 days ago | parent | next [-]

IMO if you care about surrogate code points being invalid, you're in "designing the system around UTF-16" territory conceptually -- even if you then send the bytes over the wire as UTF-8, or some more exotic/compressed format. Same as how "potentially ill-formed UTF-16" and WTF-8 have the same underlying model for what a string is.

dcrazy 4 days ago | parent [-]

The Unicode spec itself is designed around UTF-16: the block of code points that surrogate pairs would map to is reserved for that purpose and explicitly given "no interpretation" by the spec. [1] An implementation has to choose how to behave if it encounters one of these reserved code points in e.g. a UTF-8 string: Throw an encoding error? Silently drop the character? Convert it to an Object Replacement character?

[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

duckerude 4 days ago | parent [-]

RFC 3629 says surrogate codepoints are not valid in UTF-8. So if you're decoding/validating UTF-8 it's just another kind of invalid byte sequence like a 0xFF byte or an overlong encoding. AFAIK implementations tend to follow this. (You have to make a choice but you'd have to make that choice regardless for the other kinds of error.)

If you run into this when encoding to UTF-8, then your source data isn't valid Unicode, and what to do depends on what it really is, if not proper Unicode. If you can validate at other boundaries then you won't have to deal with it there.
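
As a concrete illustration in Python (one implementation's choices, not a general rule): the three-byte pattern for U+D800 is rejected like any other invalid byte sequence unless you explicitly opt in to passing surrogates through:

  >>> b'\xed\xa0\x80'.decode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
  >>> b'\xed\xa0\x80'.decode('utf-8', 'surrogatepass')
  '\ud800'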

account42 2 days ago | parent [-]

> You have to make a choice but you'd have to make that choice regardless for the other kinds of error.

If you don't actively make a choice then decoding à la WTF-8 comes naturally. Anything else is going to need additional branches.

layer8 4 days ago | parent | prev [-]

There is no disagreement that what you can receive over the wire can be ill-formed. There is disagreement about what to reject when it is first parsed at a point where it is known that it should be representing a Unicode string.

OCTAGRAM 3 days ago | parent | prev | next [-]

Seed7 uses UTF-32. The Ada standard library has UTF-32 for I/O data and file names. Ada is the kind of language where almost nothing ever disappears from the standard library, so 8-bit and UTF-16 I/O and/or file names are all still there.

zahlman 4 days ago | parent | prev [-]

>"Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

"the Python string type" is neither "UTF-16" nor "well-formed", and there are very deliberate design decisions behind this.

Since Python 3.3 with the introduction of https://peps.python.org/pep-0393/ , Python does not use anything that can be called "UTF-16" regardless of compilation options. (Before that, in Python 2.2 and up the behaviour was as in https://peps.python.org/pep-0261/ ; you could compile either a "narrow" version using proper UTF-16 with surrogate pairs, or a "wide" version using UTF-32.)

Instead, now every code point is represented as a separate storage element (as they would be in UTF-32) except that the allocated memory is dynamically chosen from 1/2/4 bytes per element as needed. (It furthermore sets a flag for 1-byte-per-element strings according to whether they are pure ASCII or if they have code points in the 128..255 range.)
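
A rough way to observe the storage-width choice from a REPL (I'm only asserting the relative ordering here; exact byte counts differ across CPython versions and platforms):

  >>> import sys
  >>> one_byte, two_byte, four_byte = 'a' * 1000, '\u0394' * 1000, '\U0001f600' * 1000
  >>> # Same length, but the widest code point picks 1-, 2- or 4-byte storage:
  >>> sys.getsizeof(one_byte) < sys.getsizeof(two_byte) < sys.getsizeof(four_byte)
  True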

Meanwhile, `str` can store surrogates even though Python doesn't use them normally; errors will occur at encoding time:

  >>> x = '\ud800\udc00'
  >>> x
  '\ud800\udc00'
  >>> print(x)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
They're even disallowed for an explicit encode to utf-16:

  >>> x.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\ud800' in position 0: surrogates not allowed
But this can be overridden:

  >>> x.encode('utf-16-le', 'surrogatepass')
  b'\x00\xd8\x00\xdc'
Which subsequently allows for decoding that automatically interprets surrogate pairs:

  >>> y = x.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
  >>> y
  '𐀀'
  >>> len(y)
  1
  >>> ord(y)
  65536
Storing surrogates in `str` is used for smuggling in binary data. For example, the runtime does it so that it can try to interpret command line arguments as UTF-8 by default, but still allow arbitrary (non-null) bytes to be passed (since that's a thing on Linux):

  $ cat cmdline.py 
  #!/usr/bin/python
  
  import binascii, sys
  for arg in sys.argv[1:]:
      abytes = arg.encode(sys.stdin.encoding, 'surrogateescape')
      ahex = binascii.hexlify(abytes)
      print(ahex.decode('ascii'))
  $ ./cmdline.py foo
  666f6f
  $ ./cmdline.py 日本語
  e697a5e69cace8aa9e
  $ ./cmdline.py $'\x01\x00\x02'
  01
  $ ./cmdline.py $'\xff'
  ff
  $ ./cmdline.py ÿ
  c3bf
It does this by decoding with the same 'surrogateescape' error handler that the above diagnostic needs when re-encoding:

  >>> b'\xff'.decode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
  >>> b'\xff'.decode('utf-8', 'surrogateescape')
  '\udcff'
Joker_vD 4 days ago | parent | prev | next [-]

Seriously, please don't use C0 (except for LF and, I grudgingly concede, HT) and C1 characters in your plain text files. I understand that you may want to store some "ANSI coloring markup" (it's not "VT100 colors" — the VT series was monochrome until the VT525 of 1994), sure, but then it's arguably not plain text anymore, is it? It's in a text markup format of sorts, not unlike Markdown, only one that uses a different encoding that dips into the C0 range. Just because your favourite output device can display it prettily when you cat your data into it doesn't really mean it's plain text.

Yes, I do realize that there are a lot of text markup formats that encode into plain text, for better interoperability.

cesarb 4 days ago | parent | next [-]

> Seriously, please don't use C0 (except for LF and, I cede grudgingly, HT) and C1 characters in your plain text files.

It is (or, at least, used to be) common to have FF (form feed) characters in plain text files, as a signal for your (dot matrix) printer to advance to the next page. So I'd add at least FF to that list.

afiori 3 days ago | parent [-]

I think we can deprecate dot-matrix-printer-specific advice with regard to what we consider plain text today.

Aaron2222 3 days ago | parent | prev [-]

The DEC VT241 from 1984 had colour.

https://terminals-wiki.org/wiki/index.php/DEC_VT240

https://www.1000bit.it/ad/bro/digital/DECVT240.pdf

TheRealPomax 4 days ago | parent | prev | next [-]

I think you missed the part where the RFC is about which Unicode characters are bad for protocols and data formats, and thus which ones you should avoid when designing those from now on, with an RFC to consult to know which ones those are. It has nothing to do with "what if I have a file with X" or "what if I want Y in usernames"; it's about "what should I do if I want a normal, well-behaved, unicode-text-based protocol or data format".

It's not about JSON, or the web, those are just example vehicles for the discussion. The RFC is completely agnostic about what thing the protocols or data formats are intended for, as long as they're text based, and specifically unicode text based.

So it sounds like you misread the blog post, and what you should do now is read the RFC. It's short. You can cruise through https://www.rfc-editor.org/rfc/rfc9839.html in a few minutes and see it's not actually about what you're focussing on.

justin66 3 days ago | parent | prev | next [-]

> But the worst case scenario is imo when people designing data structures and protocols start to feel the need to disallow arbitrary classes of characters, even properly escaped.

This seems like an extremely sheltered person’s point of view. I’m sure the worst case scenario involves a software defect in the parser or whatever and some kind of terrible security breach…

numpad0 3 days ago | parent | prev | next [-]

What system takes UTF-8 for usernames? Everyone knows that all programmatically manipulated and/or evaluated identifiers, including login usernames and passwords, need to be in ASCII - not even ISO-8859-1, just plain old ASCII. Unicode generally doesn't work for those purposes. Username as in a friendly display string is fine, but for username as in system login, the entire non-ASCII encoding is a no-go.

I mean, I don't even know whether my keyboard software is consistent in the UTF-8 it produces for the exact same intended visual representation outside the ASCII range, let alone across different operating systems and configurations, or over time. Or vice versa: whether the bytes I leave behind now will consistently correspond to the same interpretation in future Unicode implementations.

... speaking of consistency, neither the article nor RFC 9839 mentions IVS situations or the NFC/NFD/NFKC/NFKD normalization problem as explicitly in or out of scope. Overall it feels like this RFC is missing the entire "Purpose" section, apart from a vague notion that there are non-character code points.

zzo38computer 2 days ago | parent | next [-]

For passwords, you might not need to care about the character encoding, since they are not going to be displayed anyways. You should allow any password, and the maximum length should not be too short.

For usernames, I think your point is valid; you might restrict usernames to a subset of ASCII (not arbitrary ASCII; e.g. you might disallow spaces and some punctuation), or use numeric user IDs, while the display name might be less restricted. (In some cases (probably uncommon) you might also use a different character set than ASCII if that is desirable for your application, but Unicode is not a good way to do it.)

(I also think that Unicode is not good; it is helpful for many applications to have i18n (although you should be aware what parts should use it and what shouldn't), but Unicode is not a good way to do it.)

numpad0 2 days ago | parent [-]

> For passwords, you might not need to care about the character encoding, since they are not going to be displayed anyways.

That would be reasonable if there were a strict 1:1 correspondence between intended text and binary representations, but there isn't. Unicode has the equivalent of British and American spellings, and users have no control over which gets used. Precomposed vs. combining characters, variant selectors, etc. Making it the developer's obligation to ensure it all normalizes into a canonical password string is unreasonable; just falling back to ASCII is much more reasonable.

I guess everyone using alphanumeric sequences for every identifier is somewhat imperialistic in a sense, but it's close to the least controversial of general cultural imperialism problems. It's probably okay to leave it to be solved for a century or two.
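
To make the precomposed-vs-combining point concrete, a quick Python sketch (NFC is used here purely as an example of picking one canonical form):

  >>> import hashlib, unicodedata
  >>> a = '\u00e9'       # precomposed "é"
  >>> b = 'e\u0301'      # "e" + combining acute accent; renders the same
  >>> a == b
  False
  >>> hashlib.sha256(a.encode('utf-8')).digest() == hashlib.sha256(b.encode('utf-8')).digest()
  False
  >>> unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)
  True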

Timwi 3 days ago | parent | prev [-]

This is such a provincial attitude, wanting to prohibit people from using perfectly normal names like Amélie or Jürgen or Ольга to log in just because you as a programmer can't be bothered to deal with a numerical ID instead.

numpad0 3 days ago | parent | next [-]

You can have such names as Amélie or Jürgen or Ольга or 𠮷野 or 鎮󠄁 and have them displayed on account management screens; you just can't use them for login IDs, because there is no guarantee that those blobs can be reproduced in the future or will be consistent with what was entered at the time.

Unicode is that bad.

account42 2 days ago | parent | prev [-]

No, it's an entirely practical attitude, unlike outrage culture.

singpolyma3 4 days ago | parent | prev | next [-]

Why ban emoji in username?

account42 2 days ago | parent | next [-]

You typically want a bijection from rendered glyphs to binary representation, and restricting to ASCII is the most straightforward way to achieve that.

afiori 3 days ago | parent | prev | next [-]

My reasoning for it would be that they can be very keyboard specific and might require more normalisation than other character classes.

If I had to make a specific choice, I would probably whitelist the most common emojis, for some definition of common, and allow those.

pas 4 days ago | parent | prev | next [-]

I think for usernames it's fine; where a bit of restraint makes sense is billing/shipping/legal-ish data.

numpad0 3 days ago | parent | prev [-]

Because usernames and passwords MUST be in the ASCII range?

Timwi 3 days ago | parent [-]

No.

numpad0 3 days ago | parent [-]

How no? UTF-8 strings have no single canonical binary representation, nor canonical typing sequences that correspond to the intended text. Which means they can't be hashed and compared for authentication purposes. No?

TacticalCoder 4 days ago | parent | prev | next [-]

> In the example, username validation is a job of another layer. For example I want to make sure username is shorter than 60 characters, has no emojis or zalgo text, and yes, no null bytes, and return a proper error from the API. I don't want my JSON parsing to fail on completely different layer pre-validation.

Usernames are a bad example. Because at the point you mention, you may as well only allow a subset of visible ASCII. Which a lot of sites do, and that works perfectly fine.

But for stuff like family names you have to restrict so many things, otherwise you'll have little-bobby-zalgo-with-hangul-modifiers wreaking havoc.

Unicode is the problem. And workarounds are sadly needed due to the clusterfuck that Unicode is.

Like TFA shows. Like any single homograph attack using Unicode characters shows.

If Unicode was good, it wouldn't regularly be frontpage of HN.

CharlesW 4 days ago | parent | prev [-]

> I like the idea, just don't buy the argumentation or examples in the blog post.

Which ones, and why? Tim and Paul collectively have around 100,000X the experience with this than most people do, so it'd be interesting to read substantive criticism.

It seems like you think this standard is JSON-specific?

doug_durham 4 days ago | parent [-]

I thought the question was pretty substantive. What layer in the code stack should make the decisions about what characters to allow? I had exactly the same question. If the library declares that it will filter out certain subsets, then that allows me to choose a different library if needed. I would hate to have this RFC implemented blindly just because it's a standard.

CharlesW 4 days ago | parent | next [-]

> What layer in the code stack should make the decisions about what characters to allow?

I was responding to the parent's empty sniping as gently as I could, but the answer to your (good) question has nothing to do with this RFC specifically. It's something that people doing sanitization/validation/serialization have had to learn.

The answer to your question is that you make decisions like this as a policy in your business layer/domain, and then you enforce it (consistently) in multiple places. For example, usernames might be limited to lowercase letters, numbers, and dashes so they're stable for identity and routing, while display names generally have fewer limitations so people can use accented characters or scripts from different languages. The rules live in the business/domain layer, and then you use libraries to enforce them everywhere (your API, your database, your UI, etc.).
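
As a sketch of that shape (the names and the exact character policy here are invented for illustration, not taken from the RFC or the article):

  import re
  
  # Domain-layer policy, defined once: lowercase letters, digits and dashes, 1-60 chars.
  USERNAME_RE = re.compile(r'[a-z0-9-]{1,60}')
  
  def validate_username(name: str) -> bool:
      # Enforced at every boundary that accepts a username (API, DB layer, UI).
      return USERNAME_RE.fullmatch(name) is not None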

vintermann 4 days ago | parent | prev [-]

> What layer in the code stack should make the decisions about what characters to allow?

OK, but where does it get decided what even counts as a character? Should that be in the same layer? Even within a single system, there may be different sensible answers to that.