Remix.run Logo
sheept 13 hours ago

UTF-32 allows for constant time character accesses, which means that mystr[i] isn't O(n). Most other languages can only provide constant time access for code units.

msl 5 hours ago | parent [-]

UTF-32 allows for constant time access to code points. Neither UTF-8 nor UTF-16 can do the same (there are 2 to the power of 20 valid code points, though not all are in use).

While most characters might be encodable as a single code point, Python does not normalize strings, so there is no guarantee that even relatively normal characters are actually stored as single code points.

Try this in Python:

  s = "a\u0308"
  print(s)
  print(s[0])
You will see:

  ä
  a