| ▲ | mort96 4 days ago |
| > - "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type" I thought WTF-8 was just, "UTf-8, but without the restriction to not encode unpaired surrogates"? Windows and Java and JavaScript all use "possibly ill-formed UTF-16" as their string type, not WTF-8. |
|
| ▲ | layer8 4 days ago | parent | next [-] |
| Also known as UCS-2: https://www.unicode.org/faq/utf_bom.html#utf16-11 Surrogate pairs were only added with Unicode 2.0 in 1996, at which point Windows NT and Java already existed. The fact that those continue to allow unpaired surrogate characters is in part due to backwards compatibility.
| |
| ▲ | account42 2 days ago | parent | next [-] |
| No, UCS-2 decoding would convert all surrogates into individual code points, but this isn't how "WTF-16" systems like Windows behave: paired surrogates get decoded into a combined code point. | |
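| A small sketch of that distinction in Python (the strict-UCS-2 view is reconstructed by hand here, since Python's built-in utf-16 codec already combines pairs):

      data = b'\x3d\xd8\x00\xde'              # code units 0xD83D 0xDE00, little-endian
      print(data.decode('utf-16-le'))         # UTF-16/WTF-16 view: the pair combines into U+1F600
      units = [int.from_bytes(data[i:i+2], 'little') for i in range(0, len(data), 2)]
      print([hex(u) for u in units])          # a strict UCS-2 view: two separate code points
                                              # ['0xd83d', '0xde00']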
| ▲ | da_chicken 4 days ago | parent | prev [-] |
| Yeah, people forget that Windows and Java appear to be less compliant, but the reality is that they moved on i18n before anybody else did, so their standard is older. Linux got to adopt UTF-8 because they just stuck their head in the sand and stayed on ASCII well past the time they needed to change. Even now, a lot of programs only support ASCII character streams. |
|
|
| ▲ | mananaysiempre 4 days ago | parent | prev | next [-] |
| WTF-8 is more or less the obvious thing to use when NT/Java/JavaScript-style WTF-16 needs to fit into a UTF-8-shaped hole. And yes, it's UTF-8 except you can also encode surrogates, provided those surrogates don't form a valid pair (if they do, use the normal UTF-8 encoding of the codepoint designated by that pair). (Some people instead encode each WTF-16 surrogate independently, regardless of whether it participates in a valid pair, yielding a UTF-8-like but UTF-8-incompatible-beyond-U+FFFF thing usually called CESU-8. We don't talk about those people.)
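| A rough sketch of that encoding rule in Python (the helper name wtf16_to_wtf8 is mine, not from any library):

      def wtf16_to_wtf8(units):
          # Encode a sequence of 16-bit code units as WTF-8: a valid surrogate
          # pair becomes the normal 4-byte UTF-8 form of the combined code point,
          # while a lone surrogate is encoded as its own 3-byte sequence.
          out, i = bytearray(), 0
          while i < len(units):
              u = units[i]
              if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                  cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                  out += chr(cp).encode('utf-8')
                  i += 2
              else:
                  out += chr(u).encode('utf-8', 'surrogatepass')  # lone surrogate or ordinary BMP unit
                  i += 1
          return bytes(out)

      print(wtf16_to_wtf8([0xD83D, 0xDE00]).hex())  # f09f9880 (CESU-8 would give eda0bdedb880)
      print(wtf16_to_wtf8([0xD800]).hex())          # eda080, a lone surrogate kept as such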
| |
| ▲ | layer8 4 days ago | parent [-] | | The parent’s point was that “potentially ill-formed UTF-16" and "WTF-8" are inherently different encodings (16-bit word sequence vs. byte sequence), and thus not “aka”. | | |
| ▲ | csande17 4 days ago | parent [-] | | Although they're different encodings, the thing that they are encoding is exactly the same. I kinda wish I could edit "string representation" to "modeling valid strings" or something in my original comment for clarity... | | |
| ▲ | layer8 4 days ago | parent [-] | | By that logic, you could say ‘“UTF-8” aka “UTF-32”’, since they are encoding the same value space. But that’s just wrong. | | |
| ▲ | deathanatos 4 days ago | parent [-] | | The type is the same, i.e., if you look at a type as an infinite set of values, they are the same infinite set. Yes, their in-memory representations might differ, but it means all values in one exist in the other, and only those, so conversion between them is infallible. So in your last example, UTF-8 & UTF-32 are the same type, containing the same infinite set of values, and — of course — one can convert between them infallibly. But you can't encode arbitrary Go strings in WTF-8 (some are not representable), and you can't encode arbitrary Python strings in UTF-8 or WTF-8 (n.b. upthread is wrong about Python being equivalent to Unicode scalars/well-formed UTF-*), and attempts to do so might error. (E.g., `.encode('utf-8')` in Python on a `str` can raise.)
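| A quick Python 3 illustration of that last parenthetical (a str can hold a lone surrogate, but strict UTF-8 encoding of it raises):

      s = '\ud800'                               # a valid Python str containing a lone surrogate
      try:
          s.encode('utf-8')                      # strict UTF-8: rejected
      except UnicodeEncodeError as e:
          print(e)
      print(s.encode('utf-8', 'surrogatepass'))  # b'\xed\xa0\x80', the WTF-8-style bytes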
|
|
|
|
|
| ▲ | account42 2 days ago | parent | prev | next [-] |
| Yes, they use WTF-16, not WTF-8, but WTF-8 is a compatible encoding.
|
| ▲ | zahlman 4 days ago | parent | prev [-] |
| I've always taken "WTF-8" to mean that someone had mistakenly interpreted UTF-8 data as being in Latin-1 (or some other code page) and UTF-8 encoded it again. |
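| For illustration, that double-encoding failure looks like this in Python (assuming a Latin-1 misinterpretation in the middle):

      s = 'é'
      once  = s.encode('utf-8')                       # b'\xc3\xa9'
      twice = once.decode('latin-1').encode('utf-8')  # b'\xc3\x83\xc2\xa9'
      print(twice.decode('utf-8'))                    # "Ã©", the familiar mojibake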
| |
| ▲ | deathanatos 4 days ago | parent | next [-] | | No, WTF-8[1] is a precisely defined format (that isn't that). If you imagine a format that can encode JavaScript strings containing unpaired surrogates, that's WTF-8. (Well-formed WTF-8 is the same type as a JS string, though with a different representation.) (Though that would have been a cute name for the UTF-8/latin1/UTF-8 fail.) [1]: https://simonsapin.github.io/wtf-8/ | |
| ▲ | Izkata 4 days ago | parent [-] | | GP is right about the original meaning, author of that page acknowledges hijacking it here: https://news.ycombinator.com/item?id=9611710 | | |
| ▲ | zahlman 3 days ago | parent [-] | | When I posted that, I was honestly projecting from my own use. I think I may have independently thought of the term on Stack Overflow prior to koalie's tweet, but it's not the easiest thing (by design) to search for comments there (and that's assuming they don't get deleted, which they usually should). (On review, it appears that the thread mentions much earlier uses...) | | |
| ▲ | Izkata 3 days ago | parent [-] | | I did the search because I have a similar memory. I'd place it in the early 2000s, before Stack Overflow existed, around when people were first switching from latin1 and Windows-1251 and others to UTF-8 on the web; browsers would often pick the wrong encoding, and IE had a submenu where you could tell it which one to use on the page. "WTF-8" was a thing because occasionally none of those options would work: the layers server-side would be misconfigured and cause the double (or more, if it involved user input) encoding. It was also used just in general to complain about UTF-8 breaking everything as it was slowly being introduced.
|
|
| |
| ▲ | chrismorgan 4 days ago | parent | prev | next [-] | | That thing was occasionally called WTF-8, but not often—it was normally called “double UTF-8” (if given a name at all). In the last few years, the name has become very popular with Simon Sapin’s definition. | | | |
|