zahlman | 4 days ago
>"Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type" "the Python string type" is neither "UTF-16" nor "well-formed", and there are very deliberate design decisions behind this. Since Python 3.3 with the introduction of https://peps.python.org/pep-0393/ , Python does not use anything that can be called "UTF-16" regardless of compilation options. (Before that, in Python 2.2 and up the behaviour was as in https://peps.python.org/pep-0261/ ; you could compile either a "narrow" version using proper UTF-16 with surrogate pairs, or a "wide" version using UTF-32.) Instead, now every code point is represented as a separate storage element (as they would be in UTF-32) except that the allocated memory is dynamically chosen from 1/2/4 bytes per element as needed. (It furthermore sets a flag for 1-byte-per-element strings according to whether they are pure ASCII or if they have code points in the 128..255 range.) Meanwhile, `str` can store surrogates even though Python doesn't use them normally; errors will occur at encoding time:
They're even disallowed for an explicit encode to utf-16:
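Roughly (same caveat about the exact message text):

    >>> '\ud800'.encode('utf-16')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-16' codec can't encode character '\ud800' in position 0: surrogates not allowed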
But this can be overridden:
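For instance with the 'surrogatepass' error handler (a sketch; note that a two-surrogate literal really is two separate code points in the str, not U+1F600):

    >>> pair = '\ud83d\ude00'      # two surrogate code points, not the emoji itself
    >>> len(pair)
    2
    >>> pair.encode('utf-16-le', 'surrogatepass')
    b'=\xd8\x00\xde'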
Which subsequently allows for decoding that automatically interprets surrogate pairs:
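e.g., continuing the sketch, a plain UTF-16 decode of those bytes reassembles the pair into the astral character:

    >>> roundtripped = '\ud83d\ude00'.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
    >>> roundtripped, len(roundtripped)
    ('😀', 1)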
Storing surrogates in `str` is used for smuggling in binary data. For example, the runtime does it so that it can try to interpret command line arguments as UTF-8 by default, but still allow arbitrary (non-null) bytes to be passed (since that's a thing on Linux):
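A rough illustration from a shell (assumes bash's $'…' quoting for a raw 0xFF byte and a UTF-8 locale; traceback abridged):

    $ python3 -c 'import sys; print(ascii(sys.argv[1]))' $'\xff'
    '\udcff'
    $ python3 -c 'import sys; print(sys.argv[1].encode("utf-8"))' $'\xff'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed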
It does this by decoding with the same 'surrogateescape' error handler that's needed to re-encode the value shown in the diagnostic above.
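A minimal sketch of the round trip:

    >>> b'\xff'.decode('utf-8', 'surrogateescape')   # what the runtime does to argv
    '\udcff'
    >>> '\udcff'.encode('utf-8', 'surrogateescape')  # recovers the original byte
    b'\xff'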