zahlman | 4 days ago
>"Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type" "the Python string type" is neither "UTF-16" nor "well-formed", and there are very deliberate design decisions behind this. Since Python 3.3 with the introduction of https://peps.python.org/pep-0393/ , Python does not use anything that can be called "UTF-16" regardless of compilation options. (Before that, in Python 2.2 and up the behaviour was as in https://peps.python.org/pep-0261/ ; you could compile either a "narrow" version using proper UTF-16 with surrogate pairs, or a "wide" version using UTF-32.) Instead, now every code point is represented as a separate storage element (as they would be in UTF-32) except that the allocated memory is dynamically chosen from 1/2/4 bytes per element as needed. (It furthermore sets a flag for 1-byte-per-element strings according to whether they are pure ASCII or if they have code points in the 128..255 range.) Meanwhile, `str` can store surrogates even though Python doesn't use them normally; errors will occur at encoding time:
They're even disallowed for an explicit encode to utf-16:
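Roughly (same caveat about the exact message text):

    >>> '\ud800'.encode('utf-16')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-16' codec can't encode character '\ud800' in position 0: surrogates not allowed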
But this can be overridden:
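For instance with the 'surrogatepass' error handler (a sketch; note that a two-surrogate literal really is two separate code points in the str, not U+1F600):

    >>> pair = '\ud83d\ude00'      # two surrogate code points, not the emoji itself
    >>> len(pair)
    2
    >>> pair.encode('utf-16-le', 'surrogatepass')
    b'=\xd8\x00\xde'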
Which subsequently allows for decoding that automatically interprets surrogate pairs:
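e.g., continuing the sketch, a plain UTF-16 decode of those bytes reassembles the pair into the astral character:

    >>> roundtripped = '\ud83d\ude00'.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
    >>> roundtripped, len(roundtripped)
    ('😀', 1)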
Storing surrogates in `str` is used for smuggling in binary data. For example, the runtime does it so that it can try to interpret command line arguments as UTF-8 by default, but still allow arbitrary (non-null) bytes to be passed (since that's a thing on Linux):
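A rough illustration from a shell (assumes bash's $'…' quoting for a raw 0xFF byte and a UTF-8 locale; traceback abridged):

    $ python3 -c 'import sys; print(ascii(sys.argv[1]))' $'\xff'
    '\udcff'
    $ python3 -c 'import sys; print(sys.argv[1].encode("utf-8"))' $'\xff'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed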
It does this by decoding with the same 'surrogateescape' error handler that's needed to re-encode the value shown in the diagnostic above.
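A minimal sketch of the round trip:

    >>> b'\xff'.decode('utf-8', 'surrogateescape')   # what the runtime does to argv
    '\udcff'
    >>> '\udcff'.encode('utf-8', 'surrogateescape')  # recovers the original byte
    b'\xff'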