| ▲ | RedShift1 a day ago |
| Which characters are not available in UTF-8 that warrant using WTF-8? |
|
| ▲ | badmintonbaseba a day ago | parent | next [-] |
| Invalid UTF-16 with unpaired surrogates. Or rather, WTF-8 is an alternate encoding of UCS-2. The subset of UCS-2 that is valid UTF-16 encodes to valid UTF-8 when encoded with WTF-8. The encoding is invertible: valid UTF-8 decodes to valid UTF-16, and any other well-formed WTF-8 byte sequence decodes to UCS-2 that contains unpaired surrogates. |
|
| ▲ | chrismorgan a day ago | parent | prev [-] |
| Just read the abstract: > WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired. |
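To make the abstract concrete, here is a minimal Python sketch of the encoding direction. The function name `wtf8_encode` is hypothetical, and the sketch leans on Python's `surrogatepass` error handler to produce the three-byte form of a lone surrogate; treat it as an illustration under those assumptions, not as the reference algorithm from the spec.

```python
def wtf8_encode(units):
    """Encode 16-bit code units (possibly ill-formed UTF-16) as WTF-8 bytes.

    Hypothetical helper, for illustration only.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Well-formed pair: combine into one supplementary code point,
            # which then encodes as ordinary 4-byte UTF-8.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP scalar value or unpaired surrogate: encode the unit itself.
            cp = u
            i += 1
        # "surrogatepass" lets the UTF-8 codec emit 3 bytes for a surrogate.
        out += chr(cp).encode("utf-8", "surrogatepass")
    return bytes(out)


paired = wtf8_encode([0xD83D, 0xDE00])  # U+1F600 as a proper surrogate pair
lone = wtf8_encode([0x61, 0xD83D])      # "a" followed by an unpaired lead surrogate
```

Here `paired` is plain UTF-8 (`b"\xf0\x9f\x98\x80"`), while `lone` is `b"a\xed\xa0\xbd"`, which a strict UTF-8 decoder rejects. That difference is the "superset" part of the abstract.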
| |
| ▲ | RedShift1 a day ago | parent [-] |
| Ok, but in practice, what does this mean for the characters? Are there certain characters unavailable? |
|
| ▲ | chrismorgan a day ago | parent | next [-] |
| It’s the unpaired surrogate code points. That’s the whole thing. It’s about encoding ill-formed UTF-16, which is distressingly common in the real world. |
|
| ▲ | numpad0 a day ago | parent | prev [-] |
| Broken emojis? Apparently there are known issues where some frameworks split Unicode text at the wrong boundaries; maybe the author saw that turn into a deeper mess. |
|
| ▲ | masklinn a day ago | parent [-] |
| It’s not just broken emoji, it’s straight-up broken content: UTF-8 cannot represent unpaired surrogates. WTF-8 is necessary for Rust’s compatibility with Windows filesystems (it underlies OsString on Windows), as e.g. file names are sequences of UTF-16 code units (and thus may contain unpaired surrogates). |
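The failure is easy to reproduce. Below is a small Python sketch, used purely for illustration (it does not show Rust's `OsString` internals). The file name and the helper `to_strict_utf8` are made up, and Python's `surrogatepass` handler stands in for a real WTF-8 codec; it is more lenient than WTF-8 because it does not reject surrogate pairs written as two three-byte sequences.

```python
# A hypothetical Windows file name as raw UTF-16 code units:
# "a", a lone lead surrogate (0xD800), then "b".
units = [0x0061, 0xD800, 0x0062]
raw = b"".join(u.to_bytes(2, "little") for u in units)

# Python can hold the lone surrogate in a str if told to pass it through.
name = raw.decode("utf-16-le", "surrogatepass")


def to_strict_utf8(s):
    """Return the UTF-8 bytes of s, or None if s has no valid UTF-8 form."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


strict = to_strict_utf8(name)                    # no UTF-8 form exists
relaxed = name.encode("utf-8", "surrogatepass")  # WTF-8-style bytes
round_trip = relaxed.decode("utf-8", "surrogatepass")
```

`strict` comes back as `None`, while `relaxed` is `b"a\xed\xa0\x80b"` and `round_trip` equals the original name. A path type has to survive exactly this kind of name without losing data.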
|
|
|