| ▲ | zahlman 8 hours ago |
There is no caching of a "utf-8 representation". You may check for example:
Generally, the only reason this would happen implicitly is for I/O; actual operations on the string operate directly on the internal representation.

Python uses either 8, 16, or 32 bits per character according to the maximum code point found in the string; uint8 is thus used for all strings representable in Latin-1, not just ASCII. (It does have other optimizations for ASCII strings.)

The reason Windows is stuck with UTF-16 is quite easy to understand: backwards compatibility. Those APIs were introduced before there were supplementary Unicode planes, so "UTF-16" could be equated with UCS-2; the surrogate-pair logic was then bolted on top of that. Basically the same thing happened in Java.
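The per-character sizing is easy to observe from Python itself. A quick sketch (the exact numbers assume CPython's compact string representation from PEP 393; `bytes_per_char` is just an illustrative helper, not a real API):

```python
import sys

def bytes_per_char(ch):
    # Compare two lengths of the same string kind so the fixed object
    # header cancels out, leaving the per-character storage width.
    return (sys.getsizeof(ch * 200) - sys.getsizeof(ch * 100)) // 100

print(bytes_per_char("a"))           # ASCII        -> 1 byte/char
print(bytes_per_char("\u00e9"))     # Latin-1 'é'  -> still 1 byte/char
print(bytes_per_char("\u20ac"))     # BMP '€'      -> 2 bytes/char
print(bytes_per_char("\U0001F600")) # astral emoji -> 4 bytes/char

# Surrogate pairs: in UTF-16, code points beyond U+FFFF take two
# 16-bit code units, i.e. four bytes.
print(len("\U0001F600".encode("utf-16-le")))  # -> 4
```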
| ▲ | cloudbonsai 5 hours ago | parent [-] |
> There is no caching of a "utf-8 representation".

No, there certainly is. This is documented in the official API documentation:
In particular, Python's Unicode object (PyUnicodeObject) contains a field named utf8. This field is populated when PyUnicode_AsUTF8AndSize() is first called and reused thereafter. You can check the exact code I'm talking about here: https://github.com/python/cpython/blob/main/Objects/unicodeo... Is it clear enough?
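The cache is even observable from pure Python, because str.__sizeof__ counts the cached buffer once it exists. A CPython-only sketch using ctypes to call the C API directly (the exact size delta assumes the current __sizeof__ accounting):

```python
import ctypes
import sys

# Build a non-ASCII string at runtime so nothing has pre-populated its cache.
s = "caf" + chr(0xE9)  # 'café'; internal storage is Latin-1, not UTF-8
before = sys.getsizeof(s)

# PyUnicode_AsUTF8 fills the object's utf8 field on first use and
# returns a pointer to the cached, NUL-terminated buffer.
ctypes.pythonapi.PyUnicode_AsUTF8.restype = ctypes.c_char_p
ctypes.pythonapi.PyUnicode_AsUTF8.argtypes = [ctypes.py_object]
utf8 = ctypes.pythonapi.PyUnicode_AsUTF8(s)

after = sys.getsizeof(s)
print(utf8)            # the UTF-8 bytes of 'café'
print(after - before)  # cached buffer: len(utf8) + 1 for the NUL terminator
```

Note that str.encode("utf-8") does not populate this cache; it builds a fresh bytes object each time. The field only fills in when C code asks for the UTF-8 view via PyUnicode_AsUTF8 / PyUnicode_AsUTF8AndSize.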