CodesInChaos a day ago

> WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.

> Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

I strongly disagree with that part. When you need to be able to serialize every possible Windows filename, WTF-8 is a great choice. This could be a backup tool, or an NTFS driver for Linux.

I also think Rust's serde should always serialize OsString as a byte string, using WTF-8 on Windows, instead of the system-dependent union of u16/u8 sequences it currently uses.

Rygian a day ago | parent | next [-]

The way I read the "Intended Audience", I think the use cases you mention are non-goals for WTF-8:

> There is no and will not be any encoding label [ENCODING] or IANA charset alias [CHARSETS] for WTF-8.

The goal is to ensure WTF-8 remains fully contained, so that ill-formed strings don't end up processed by systems that expect well-formed strings.

If you need to serialize every possible Windows filename, then you must also own the corresponding deserializer (i.e., make your solution self-contained), and you cannot expect users to work with the serialized contents using tools you do not control.
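To make the "convert at the boundary" idea concrete, here is a minimal sketch of a lossy WTF-8 to UTF-8 conversion. The function name `wtf8_to_utf8_lossy` is made up for illustration, and the sketch assumes the input is already valid WTF-8; it leans on Python's `surrogatepass` error handler, which is more permissive than a real WTF-8 decoder would be.

```python
def wtf8_to_utf8_lossy(data: bytes) -> bytes:
    """Convert WTF-8 bytes to well-formed UTF-8 (hypothetical helper).

    Assumes `data` is valid WTF-8. Each unpaired surrogate is replaced
    with U+FFFD, so the conversion is lossy for ill-formed input.
    """
    # 'surrogatepass' lets 3-byte surrogate sequences decode into
    # surrogate code points instead of raising UnicodeDecodeError.
    text = data.decode("utf-8", "surrogatepass")
    cleaned = "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )
    # After the replacement, strict UTF-8 encoding always succeeds.
    return cleaned.encode("utf-8")


# Well-formed text passes through unchanged.
print(wtf8_to_utf8_lossy("caf\u00e9".encode("utf-8")))
# A lone surrogate (ED A0 80) becomes U+FFFD (EF BF BD).
print(wtf8_to_utf8_lossy(b"a\xed\xa0\x80b"))
```

Anything downstream of this function only ever sees well-formed UTF-8, at the cost of no longer being able to recover the original name.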

RedShift1 a day ago | parent | prev [-]

Which characters are not available in UTF-8 that warrant using WTF-8?

badmintonbaseba a day ago | parent | next [-]

Invalid UTF-16 with unpaired surrogates. Or rather, WTF-8 is an alternate encoding of arbitrary sequences of 16-bit code units (UCS-2, loosely speaking). The subset that is valid UTF-16 encodes to valid UTF-8 when encoded with WTF-8. The encoding is invertible: valid UTF-8 decodes to valid UTF-16, and any other valid WTF-8 byte sequence decodes to a code-unit sequence containing unpaired surrogates.
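A rough sketch of that encoding direction, written from scratch rather than taken from any library: the name `encode_wtf8` is hypothetical, and the input is assumed to be a list of integers in the range 0..0xFFFF.

```python
def encode_wtf8(units):
    """Encode 16-bit code units (possibly ill-formed UTF-16) as WTF-8.

    Hypothetical helper for illustration; `units` is a sequence of ints
    in the range 0..0xFFFF.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A proper surrogate pair becomes one supplementary code point,
            # which then encodes exactly as it would in UTF-8 (4 bytes).
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # A BMP scalar value, or an unpaired surrogate kept as-is.
            cp = u
            i += 1
        # 'surrogatepass' emits a lone surrogate as a 3-byte sequence
        # instead of raising UnicodeEncodeError.
        out += chr(cp).encode("utf-8", "surrogatepass")
    return bytes(out)


print(encode_wtf8([0xD83D, 0xDE00]).hex())  # paired: same bytes as UTF-8
print(encode_wtf8([0xD800]).hex())          # unpaired high surrogate
print(encode_wtf8([0xDE00, 0xD83D]).hex())  # wrong order: two lone surrogates
```

The pairing step is the important part: encoding each half of a valid pair separately would produce a byte sequence that WTF-8 explicitly disallows.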

chrismorgan a day ago | parent | prev [-]

Just read the abstract:

> WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.

RedShift1 a day ago | parent [-]

Ok, but in practice, what does this mean for the characters? Are there certain characters unavailable?

chrismorgan a day ago | parent | next [-]

It’s the unpaired surrogate code points. That’s the whole thing. It’s about encoding ill-formed UTF-16, which is distressingly common in the real world.

numpad0 a day ago | parent | prev [-]

Broken emojis? Apparently there are known issues where some frameworks split Unicode text at the wrong boundaries; maybe the author saw that compound into a deeper mess.

masklinn a day ago | parent [-]

It’s not just broken emoji, it’s straight-up broken content: UTF-8 cannot represent unpaired surrogates.
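This is easy to check in Python, whose `str` type happens to allow lone surrogate code points while its strict UTF-8 codec rejects them. The helper names below are made up for the example.

```python
def utf8_encodable(s: str) -> bool:
    """Return True if `s` can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def utf8_decodable(b: bytes) -> bool:
    """Return True if `b` is well-formed UTF-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


lone = "\ud800"  # an unpaired high surrogate

print(utf8_encodable("caf\u00e9"))       # True
print(utf8_encodable(lone))              # False: no UTF-8 form exists
print(utf8_decodable(b"\xed\xa0\x80"))   # False: the WTF-8 bytes for U+D800
```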

WTF-8 is necessary for Rust’s compatibility with Windows filesystems (it underlies OsString on Windows), since file names there are sequences of UTF-16 code units and thus may contain unpaired surrogates.
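For the direction back to code units, here is a small Python sketch (not Rust's actual implementation; `decode_wtf8` is a hypothetical name). It assumes valid WTF-8 input and does not reject a surrogate pair encoded as two 3-byte sequences, which a strict WTF-8 decoder should.

```python
def decode_wtf8(data: bytes):
    """Decode WTF-8 bytes into a list of 16-bit code units.

    Hypothetical helper; assumes `data` is valid WTF-8.
    """
    # 'surrogatepass' turns 3-byte surrogate sequences into surrogate
    # code points rather than raising UnicodeDecodeError.
    text = data.decode("utf-8", "surrogatepass")
    units = []
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary code point: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
        else:
            # BMP code point, including any unpaired surrogate.
            units.append(cp)
    return units


# An ordinary name, an emoji, and a name containing a lone surrogate.
print([hex(u) for u in decode_wtf8(b"a.txt")])
print([hex(u) for u in decode_wtf8(b"\xf0\x9f\x98\x80")])
print([hex(u) for u in decode_wtf8(b"x\xed\xa0\x80")])
```

The third case is the one that matters for file names: the lone 0xD800 survives the trip, which plain UTF-8 could not offer.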