Mlller 5 days ago

The article nearly equates “Rather Useless” with “unambiguously the worst”. Python3 seems more coherent to me than the article's argument:

1. Python3 plainly distinguishes between a string and a sequence of bytes. The function `len`, as a built-in, gives the most straightforward count: for any set or sequence of items, it counts the number of these items.

2. For a sequence of bytes, it counts the number of bytes. Taking this face-palming half-pale male hodgepodge and encoding it according to UTF-8, we get 17 bytes. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F".encode(encoding = "utf-8")) == 17`.

3. After bytes, the most basic entities are Unicode code points. A Python3 string is a sequence of Unicode code points. So for a Python3 string, `len` should give the number of Unicode code points. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 5`.
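To make points 2 and 3 concrete, here is a minimal sketch that lists each code point of that string together with its UTF-8 byte count. The helper name `utf8_breakdown` is my own, not something from the article, and it uses only the standard library's `unicodedata`:

```python
import unicodedata

# The emoji sequence from points 2 and 3.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

def utf8_breakdown(text):
    """Return (code point, name, UTF-8 byte count) for each code point in text."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),  # fallback if this Python's Unicode data lacks a name
            len(ch.encode("utf-8")),
        )
        for ch in text
    ]

rows = utf8_breakdown(s)
for code_point, name, nbytes in rows:
    print(code_point, name, nbytes)

# len on the str counts code points; len on the bytes counts bytes.
assert len(s) == len(rows) == 5
assert sum(nbytes for _, _, nbytes in rows) == len(s.encode("utf-8")) == 17
```

The byte counts per code point come out as 4, 4, 3, 3, 3, which is where the 17 comes from.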

Anything more is and should be beyond the purview of the simple built-in `len`:

4. Grapheme clusters are complicated and nearly as arbitrary as code points, hence there are “legacy grapheme clusters” – the grapheme clusters of older Unicode versions, because they changed – and “tailored grapheme clusters”, which may be needed “for specific locales and other customizations”, and of course the default “extended grapheme clusters”, which are only “a best-effort approximation” to “what a typical user might think of as a ‘character’.” Cf. https://www.unicode.org/reports/tr29

Of course, there are very few use cases for knowing the number of code points, but are there really many more for the number (NB: the number) of grapheme clusters?

Anyway, the great module https://pypi.org/project/regex/ supports “Matching a single grapheme \X”. So:

  len(regex.findall(r"\X", "\U0001F926\U0001F3FC\u200D\u2642\uFE0F")) == 1

5. The space a sequence of code points will occupy on the screen: certainly useful, but it depends at least on the typeface used for rendering and is hence certainly beyond the purview of a simple function.
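
As a rough illustration of point 5, here is a sketch of the kind of "simple function" one might be tempted to write for terminal columns. Everything in it is an assumption of mine: `naive_columns` is a hypothetical helper, and the rule of thumb (2 columns for East Asian Wide/Fullwidth characters, 0 for combining marks and format characters, 1 otherwise) is not how any particular terminal or font actually measures text:

```python
import unicodedata

def naive_columns(text):
    """Guess a terminal column count per code point (assumed rule of thumb, not a real layout)."""
    total = 0
    for ch in text:
        # Combining marks (Mn, Me) and format characters (Cf) are assumed to take no column.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide / Fullwidth characters are assumed to take two columns.
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

print(naive_columns("abc"))
print(naive_columns("\u65E5\u672C"))  # two CJK ideographs
print(naive_columns("\U0001F926\U0001F3FC\u200D\u2642\uFE0F"))
```

For the face-palm sequence this sums the widths of the visible parts separately, whereas a typeface that has the joined glyph draws a single image. So the result is at best a guess, which is rather the point.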