gavinsyancey 7 months ago

WTF-8 is actually a real encoding, used to carry the unpaired surrogates of ill-formed UTF-16 through systems that otherwise deal in UTF-8: https://simonsapin.github.io/wtf-8/

bjackman 7 months ago | parent | next [-]

I believe this is what Rust OsStrings are under the hood on Windows.
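
A minimal sketch of what that buys you, assuming a Windows target (from_wide and encode_wide are the std extension traits in std::os::windows::ffi):

    // A lone surrogate coming from the OS round-trips through OsString,
    // because the internal representation (WTF-8) can hold it.
    #[cfg(windows)]
    fn main() {
        use std::ffi::OsString;
        use std::os::windows::ffi::{OsStrExt, OsStringExt};

        // "a", unpaired high surrogate, "b" -- ill-formed as UTF-16.
        let units: Vec<u16> = vec![0x0061, 0xD800, 0x0062];
        let s: OsString = OsString::from_wide(&units);

        // Conversion to &str fails because of the lone surrogate...
        assert!(s.to_str().is_none());

        // ...but the original UTF-16 units come back out unchanged.
        let back: Vec<u16> = s.encode_wide().collect();
        assert_eq!(units, back);
    }

    #[cfg(not(windows))]
    fn main() {}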

extraduder_ire 7 months ago | parent [-]

Which I assume stands for "Windows-Transformation-Format-8(bits)".

mmoskal 7 months ago | parent [-]

From the abstract:

"WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair."
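
Concretely, "encodes surrogate code points" just means applying UTF-8's ordinary 3-byte pattern to a range strict UTF-8 refuses to emit. A minimal sketch (encode_surrogate is my name, not the spec's):

    // Encode a surrogate code point (U+D800..U+DFFF) with the ordinary
    // 3-byte UTF-8 bit layout; strict UTF-8 rejects this range, WTF-8 doesn't.
    fn encode_surrogate(cp: u32) -> [u8; 3] {
        assert!((0xD800..=0xDFFF).contains(&cp));
        [
            0xE0 | (cp >> 12) as u8,         // 1110xxxx
            0x80 | ((cp >> 6) & 0x3F) as u8, // 10xxxxxx
            0x80 | (cp & 0x3F) as u8,        // 10xxxxxx
        ]
    }

    fn main() {
        // The whole surrogate block maps to ED A0 80 ..= ED BF BF.
        assert_eq!(encode_surrogate(0xD800), [0xED, 0xA0, 0x80]);
        assert_eq!(encode_surrogate(0xDFFF), [0xED, 0xBF, 0xBF]);
    }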

hedora 7 months ago | parent [-]

Can you still assume the bytes 0x00 and 0xFF are not present in the string (like in UTF-8)?

int_19h 7 months ago | parent [-]

Yes. The only difference between UTF-8 and WTF-8 is that the latter does not reject otherwise-valid UTF-8 byte sequences that correspond to code points in the range U+D800 to U+DFFF; those encode as ED A0 80 through ED BF BF, so the set of bytes that can appear is unchanged. (Which means that, in practice, a lot of things that say they are UTF-8 are actually WTF-8.)

account42 7 months ago | parent [-]

Not really, since you are unlikely to end up with unpaired surrogates outside of UTF-16 unless you explicitly implement a WTF-16 decoder; most other things will error out or remove/replace the garbage data when converting to another encoding.

And if you convert valid UTF-16 by interpreting it as UCS-2 without checking for invalid code points, you will end up with either valid UTF-8 or something that isn't even valid WTF-8, since that encoding disallows encoding paired surrogates individually (see the sketch below).

WTF-16 is something that occurs naturally. WTF-8 isn't.
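
The pairing rule, as a sketch (wtf16_to_wtf8 is my name; a hand-rolled conversion in the spirit of the spec's description, not code from it):

    // WTF-16 -> WTF-8: a valid surrogate pair is combined into one
    // supplementary code point (4-byte encoding); a lone surrogate is
    // passed through (3-byte encoding). Encoding the two halves of a
    // valid pair separately would not be well-formed WTF-8.
    fn wtf16_to_wtf8(units: &[u16]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut i = 0;
        while i < units.len() {
            let u = units[i] as u32;
            let cp = if (0xD800..0xDC00).contains(&u)
                && units.get(i + 1).map_or(false, |&n| (0xDC00..0xE000).contains(&(n as u32)))
            {
                let lo = units[i + 1] as u32;
                i += 2;
                0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
            } else {
                i += 1;
                u
            };
            match cp {
                0..=0x7F => out.push(cp as u8),
                0x80..=0x7FF => out.extend([
                    0xC0 | (cp >> 6) as u8,
                    0x80 | (cp & 0x3F) as u8,
                ]),
                0x800..=0xFFFF => out.extend([
                    0xE0 | (cp >> 12) as u8,
                    0x80 | ((cp >> 6) & 0x3F) as u8,
                    0x80 | (cp & 0x3F) as u8,
                ]),
                _ => out.extend([
                    0xF0 | (cp >> 18) as u8,
                    0x80 | ((cp >> 12) & 0x3F) as u8,
                    0x80 | ((cp >> 6) & 0x3F) as u8,
                    0x80 | (cp & 0x3F) as u8,
                ]),
            }
        }
        out
    }

    fn main() {
        // A valid pair (U+1F600) becomes one 4-byte sequence, not two
        // 3-byte halves; a lone high surrogate gets its own 3 bytes.
        assert_eq!(wtf16_to_wtf8(&[0xD83D, 0xDE00]), [0xF0, 0x9F, 0x98, 0x80]);
        assert_eq!(wtf16_to_wtf8(&[0xD83D]), [0xED, 0xA0, 0xBD]);
    }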

ptx 7 months ago | parent | prev [-]

Yeah, that had me confused for a bit. And you would never use "charset=wtf-8" (as in the title for this page) because the spec says:

"Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet."

account42 7 months ago | parent [-]

Some specs like to claim things that are out of their jurisdiction. A format spec has no say over where that format is used. It's best to ignore such hubris.

And in this particular case it doesn't even make sense, because the entire point is to round-trip WTF-16. If that requires one "system" to communicate WTF-8 with another "system" (which is really an arbitrary boundary), then so be it. And anything that expects UTF-8 will need to deal with "invalid" data in a use-case-dependent way anyway.
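
And the decode direction that makes the round trip total, as a hedged sketch (wtf8_to_wtf16 is my name; input assumed to be well-formed WTF-8):

    // WTF-8 -> WTF-16: BMP code points (lone surrogates included) become
    // one u16; supplementary code points are re-split into a surrogate
    // pair. Composed with an encoder like the one above, this is the
    // identity on WTF-16 data.
    fn wtf8_to_wtf16(bytes: &[u8]) -> Vec<u16> {
        let mut out = Vec::new();
        let mut i = 0;
        while i < bytes.len() {
            // Decode one generalized-UTF-8 sequence from its lead byte.
            let b = bytes[i] as u32;
            let (cp, len) = match b {
                0x00..=0x7F => (b, 1),
                0xC0..=0xDF => ((b & 0x1F) << 6 | (bytes[i + 1] as u32 & 0x3F), 2),
                0xE0..=0xEF => ((b & 0x0F) << 12
                    | (bytes[i + 1] as u32 & 0x3F) << 6
                    | (bytes[i + 2] as u32 & 0x3F), 3),
                _ => ((b & 0x07) << 18
                    | (bytes[i + 1] as u32 & 0x3F) << 12
                    | (bytes[i + 2] as u32 & 0x3F) << 6
                    | (bytes[i + 3] as u32 & 0x3F), 4),
            };
            i += len;
            if cp < 0x10000 {
                out.push(cp as u16); // BMP scalar or lone surrogate
            } else {
                // Re-split supplementary code points into a surrogate pair.
                let v = cp - 0x10000;
                out.push(0xD800 | (v >> 10) as u16);
                out.push(0xDC00 | (v & 0x3FF) as u16);
            }
        }
        out
    }

    fn main() {
        // "a" + lone high surrogate (ED A0 80) + "b" survives the trip.
        assert_eq!(wtf8_to_wtf16(&[0x61, 0xED, 0xA0, 0x80, 0x62]),
                   [0x0061, 0xD800, 0x0062]);
    }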