ape4 4 days ago

Seems like libraries that serialize to JSON should have an option to filter out these bad characters.

layer8 4 days ago | parent | next [-]

No. As the RFC notes: “Silently deleting an ill-formed part of a string is a known security risk. Responding to that risk, Section 3.2 of [UNICODE] recommends dealing with ill-formed byte sequences by signaling an error or replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER).”

I would almost always go for “signaling an error”.
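To make the two options from the quote concrete, here is a minimal Go sketch, not taken from the RFC or its reference library. The names `strictString`, `lenientString` and `errIllFormed` are made up for illustration; only `utf8.ValidString` and `strings.ToValidUTF8` are real standard-library calls.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// errIllFormed is a hypothetical sentinel error used only in this sketch.
var errIllFormed = errors.New("ill-formed UTF-8")

// strictString takes the "signal an error" route: ill-formed input is rejected.
func strictString(s string) (string, error) {
	if !utf8.ValidString(s) {
		return "", errIllFormed
	}
	return s, nil
}

// lenientString takes the "replace" route: each run of invalid bytes
// becomes a single U+FFFD instead of being silently deleted.
func lenientString(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	in := "a\xffb" // 0xFF can never appear in well-formed UTF-8

	if _, err := strictString(in); err != nil {
		fmt.Println("strict:", err)
	}
	fmt.Printf("lenient: %q\n", lenientString(in))
}
```

Either way the bad byte stays visible to the caller, as an error or as a replacement character, which is the point of the quoted recommendation.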

Manfred 4 days ago | parent | prev | next [-]

My experience writing Unicode-related libraries is that people don't use features when you have to explain why and when to use them. I assume that's why Tim puts the emphasis on "working on something new".

CharlesW 4 days ago | parent | prev | next [-]

This RFC and its Go-language reference library are designed to be used by existing libraries that do serialization/sanitization/validation. This is hot off the press, so I'm sure Tim would appreciate it if you'd let your favorite library know it exists.

nikolayasdf123 4 days ago | parent [-]

Interesting. Isn't that just `unicode.IsPrint(r rune)` in Go? https://pkg.go.dev/unicode#IsPrint

xdennis 4 days ago | parent | prev [-]

How is Unicode in any way related to JSON? JSON should just encode whatever dumb data someone wants to transport.

Unicode validation/cleanup should be done separately because it's needed in multiple places, not just JSON.

layer8 4 days ago | parent | next [-]

The contents of JSON strings don’t admit random binary data. You need to use an encoding like Base64 for that purpose.

recursive 4 days ago | parent | prev | next [-]

JSON is text. If you're not going to use Unicode in the representation of your text, you'll need some other way.

dcrazy 4 days ago | parent | next [-]

The current JSON spec mandates UTF-8, but practically speaking encoding is a higher-level concept. I suspect there are many server implementations that will respect the charset parameter of the Content-Type header in a POST request containing JSON.

ninkendo 4 days ago | parent | prev [-]

So?

All the letters in this string are “just text”:

    "\u0000\u0089\uDEAD\uD9BF\uDFFF"

JSON itself allows putting sequences of escape characters in the string that don’t unescape to valid Unicode. That’s fine, because the strings aren’t required to represent any particular encoding: it’s up to a layer higher than JSON to be opinionated about that.
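To see what one real parser does with that exact string, here is a sketch using Go's `encoding/json`. The `decodeRunes` helper is a made-up name. Go's documented behavior is to accept the document and replace invalid UTF-16 surrogates with U+FFFD rather than return an error; other parsers may choose differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeRunes parses a JSON string literal and returns its code points.
func decodeRunes(doc string) ([]rune, error) {
	var s string
	if err := json.Unmarshal([]byte(doc), &s); err != nil {
		return nil, err
	}
	return []rune(s), nil
}

func main() {
	// The JSON text itself is plain ASCII; only the escapes are unusual.
	doc := `"\u0000\u0089\uDEAD\uD9BF\uDFFF"`

	runes, err := decodeRunes(doc)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	for _, r := range runes {
		fmt.Printf("U+%04X\n", r)
	}
}
```

The lone `\uDEAD` comes back as U+FFFD, while the `\uD9BF\uDFFF` pair decodes to U+7FFFF, a noncharacter that the parser passes through untouched.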

I wouldn’t want my shell’s pipeline buffers to reject data they don’t like, so why should a JSON serializer?

recursive 4 days ago | parent [-]

I actually agree, now that I understand what you're talking about.

zzo38computer 4 days ago | parent | prev [-]

JSON (unfortunately) requires strings to be Unicode. (JSON has other problems too, but Unicode is one of them.)