| ▲ | dcrazy 4 days ago |
| Why didn’t you include “Unicode Scalars”, aka “well-formed UTF-8”, aka “the Swift string type”? Either way, I think the bitter lesson is that a parser really can’t rely on the well-formedness of a Unicode string over the wire. Practically speaking, all wire formats are potentially ill-formed until parsed into a non-wire format (or rejected by that same parser). |
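For instance, a minimal Python sketch of that "reject at the boundary" idea (my illustration, not anything from the article):

    def parse_wire_string(raw: bytes) -> str:
        # bytes.decode enforces RFC 3629 well-formedness: encoded surrogates,
        # overlong sequences, and stray bytes like 0xFF all raise here.
        return raw.decode("utf-8")

    parse_wire_string(b"caf\xc3\xa9")    # returns 'café'
    # parse_wire_string(b"caf\xe9")      # would raise UnicodeDecodeError

Everything downstream of a function like this can assume well-formedness; the wire bytes themselves never can.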
|
| ▲ | csande17 4 days ago | parent | next [-] |
IMO if you care about surrogate code points being invalid, you're in "designing the system around UTF-16" territory conceptually -- even if you then send the bytes over the wire as UTF-8, or some more exotic/compressed format. Same as how "potentially ill-formed UTF-16" and WTF-8 have the same underlying model for what a string is.
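Python is a handy illustration of that shared model, since its str type permits lone surrogates (a sketch of mine, not from the parent comment):

    lone = "\ud800"  # unpaired high surrogate: a legal Python str

    # Strict UTF-8 refuses it:
    # lone.encode("utf-8")  # UnicodeEncodeError: surrogates not allowed

    # "surrogatepass" emits exactly the WTF-8-style byte sequence:
    wtf8 = lone.encode("utf-8", "surrogatepass")          # b'\xed\xa0\x80'
    assert wtf8.decode("utf-8", "surrogatepass") == lone  # lossless round trip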
| ▲ | dcrazy 4 days ago | parent [-] |
The Unicode spec itself is designed around UTF-16: the block of code points that surrogate pairs would map to is reserved for that purpose and explicitly given “no interpretation” by the spec. [1] An implementation has to choose how to behave if it encounters one of these reserved code points in e.g. a UTF-8 string: Throw an encoding error? Silently drop the character? Substitute U+FFFD Replacement Character?

[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...
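For what it's worth, all three of those choices map onto Python's decode error handlers (a sketch using the encoded-surrogate bytes ED A0 80):

    bad = b"\xed\xa0\x80"  # U+D800 naively encoded as if it were a scalar value

    # bad.decode("utf-8")            # 1. throw: UnicodeDecodeError
    bad.decode("utf-8", "ignore")    # 2. silently drop: ''
    bad.decode("utf-8", "replace")   # 3. substitute: '\ufffd\ufffd\ufffd'

(Three replacement characters rather than one, because the decoder resynchronizes after each invalid byte, per Unicode's "maximal subpart" recommendation.)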
| ▲ | duckerude 4 days ago | parent [-] |
RFC 3629 says surrogate code points are not valid in UTF-8. So if you're decoding/validating UTF-8, a surrogate is just another kind of invalid byte sequence, like a 0xFF byte or an overlong encoding, and AFAIK implementations tend to follow this. (You have to make a choice about how to handle the error, but you'd have to make that choice regardless for the other kinds of error.) If you run into this when encoding to UTF-8, then your source data isn't valid Unicode, and what to do depends on what that data really is, if not proper Unicode. If you can validate at other boundaries, then you won't have to deal with it there.
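For example, in Python all three kinds of ill-formed input surface through the same error path (a quick sketch):

    for raw in (b"\xed\xa0\x80",  # encoded surrogate U+D800
                b"\xff",          # 0xFF can never appear in UTF-8
                b"\xc0\xaf"):     # overlong encoding of '/'
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as e:
            print(raw, "->", e.reason)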
| ▲ | account42 2 days ago | parent [-] |
> You have to make a choice but you'd have to make that choice regardless for the other kinds of error.
If you don't actively make a choice, then decoding à la WTF-8 comes naturally. Anything else is going to need additional branches.
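To make the extra branch concrete, here is a sketch of the surrogate check in a hand-rolled decoder for a three-byte sequence (overlong/range checks omitted; real decoders typically fold this into their byte-range tables rather than a separate branch):

    def decode_three_byte(b0: int, b1: int, b2: int) -> int:
        cp = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
        if 0xD800 <= cp <= 0xDFFF:  # the branch a WTF-8 decoder omits
            raise ValueError(f"surrogate U+{cp:04X} is ill-formed UTF-8")
        return cp

    decode_three_byte(0xE2, 0x82, 0xAC)    # 0x20AC, i.e. '€'
    # decode_three_byte(0xED, 0xA0, 0x80)  # raises; drop the check and you get U+D800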
|
| ▲ | layer8 4 days ago | parent | prev [-] |
There is no disagreement that what you receive over the wire can be ill-formed. The disagreement is about what to reject when the data is first parsed, at the point where it is known that it should represent a Unicode string. |