Mlller | 5 days ago
The article all but equivocates between “Rather Useless” and “unambiguously the worst”. Python 3 seems more coherent to me than the article's argument:

1. Python 3 plainly distinguishes between a string and a sequence of bytes. The built-in `len` gives the most straightforward count: for any set or sequence of items, it counts the number of those items.

2. For a sequence of bytes, it counts the number of bytes. Encoding this face-palming half-pale male hodgepodge as UTF-8 yields 17 bytes. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F".encode(encoding="utf-8")) == 17`.

3. After bytes, the most basic entities are Unicode code points. A Python 3 string is a sequence of Unicode code points, so for a string `len` should give the number of code points. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 5`.

Anything more is, and should be, beyond the purview of the simple built-in `len`:

4. Grapheme clusters are complicated and nearly as arbitrary as code points. Hence there are “legacy grapheme clusters” – the grapheme clusters of older Unicode versions, because they changed – and “tailored grapheme clusters”, which may be needed “for specific locales and other customizations”, and of course the default “extended grapheme clusters”, which are only “a best-effort approximation” of “what a typical user might think of as a ‘character’”. Cf. https://www.unicode.org/reports/tr29. Of course, there are very few use cases for knowing the number of code points, but are there really many more for the number (NB: the number) of grapheme clusters? In any case, the excellent https://pypi.org/project/regex/ module supports “Matching a single grapheme \X”. So:
5. The space a sequence of code points will occupy on the screen: certainly useful, but at least dependent on the typeface used for rendering, and hence certainly beyond the purview of a simple function.
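The byte/code-point/grapheme distinction above can be checked directly. A minimal sketch; the grapheme count at the end assumes the third-party `regex` module is installed (`pip install regex`), and the exact cluster count can vary with the Unicode version the module ships with:

```python
# The face-palming man with a skin-tone modifier: five code points
# (emoji + modifier + ZWJ + male sign + variation selector) that render
# as a single "character".
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

# len on a Python 3 str counts Unicode code points.
print(len(s))                    # 5

# len on bytes counts bytes; in UTF-8 that is 4 + 4 + 3 + 3 + 3 = 17.
print(len(s.encode("utf-8")))    # 17

# Counting grapheme clusters needs the third-party regex module, whose
# \X matches one extended grapheme cluster; with a current Unicode
# version the whole ZWJ sequence is one cluster.
try:
    import regex
    print(len(regex.findall(r"\X", s)))
except ImportError:
    pass  # regex not installed; the stdlib re module has no \X
```

Note that only the first two counts are guaranteed by the language itself; the third depends on which revision of UAX #29 the installed `regex` build implements.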