>We’ve seen four different lengths so far:

Number of UTF-8 code units (17 in this case)

Number of UTF-16 code units (7 in this case)

Number of UTF-32 code units or Unicode scalar values (5 in this case)

Number of extended grapheme clusters (1 in this case)

We would not have this problem if we all agreed to return the number of bytes instead.

Edit: My mistake. There would still be inconsistency between different encodings. My point is, if we all decided to report the number of bytes that the string used instead of the number of printable characters, we would not have the inconsistency between languages.

curtisf 7 days ago | parent | next [-]

"number of bytes" is dependent on the text encoding.

UTF-8 code units _are_ bytes, which is one of the things that makes UTF-8 very nice and why it has won.
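To make the dependence on encoding concrete, here is a minimal Python sketch. It assumes the string under discussion is the facepalm emoji sequence U+1F926 U+1F3FC U+200D U+2642 U+FE0F, which matches the counts quoted above; the names `code_units`, `unit_counts`, and `byte_counts` are illustrative only. The grapheme cluster count is left out to keep the sketch free of third-party dependencies.

```python
# Assumption: the string is the emoji sequence
# U+1F926 U+1F3FC U+200D U+2642 U+FE0F (facepalm + skin tone + ZWJ + male sign + VS16).
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"


def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Count code units by encoding the text and dividing by the unit width in bytes."""
    return len(text.encode(encoding)) // unit_size


# The "-le" variants are used so that no byte order mark is prepended.
unit_counts = {
    "utf-8": code_units(s, "utf-8", 1),
    "utf-16": code_units(s, "utf-16-le", 2),
    "utf-32": code_units(s, "utf-32-le", 4),
}

# The byte count differs per encoding, even though the text is the same.
byte_counts = {
    "utf-8": len(s.encode("utf-8")),
    "utf-16": len(s.encode("utf-16-le")),
    "utf-32": len(s.encode("utf-32-le")),
}

# Python's own len() counts code points, which for this string equals the
# number of Unicode scalar values.
scalar_values = len(s)

print(unit_counts)    # {'utf-8': 17, 'utf-16': 7, 'utf-32': 5}
print(byte_counts)    # {'utf-8': 17, 'utf-16': 14, 'utf-32': 20}
print(scalar_values)  # 5
```

Only for UTF-8 do the code unit count and the byte count coincide, so "number of bytes" still yields three different answers for the same string.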

	ivanjermakov 6 days ago | parent [-]
		I would say Unicode has won, but not UTF-8. UTF-16 is also widely used due to its efficiency on Asian text.
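The size difference behind this claim can be checked with a short Python sketch. The sample strings and the helper name `encoded_sizes` are made up for illustration; real documents mix scripts and markup, so the ratio will vary.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes for UTF-8 and UTF-16 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


# Illustrative samples: 8 Japanese characters and 12 ASCII characters.
japanese = "日本語のテキスト"
english = "English text"

# Kanji and kana in the Basic Multilingual Plane take 3 bytes each in UTF-8
# but 2 bytes each in UTF-16.
print(encoded_sizes(japanese))  # {'utf-8': 24, 'utf-16': 16}

# ASCII takes 1 byte per character in UTF-8 but 2 bytes in UTF-16.
print(encoded_sizes(english))   # {'utf-8': 12, 'utf-16': 24}
```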

charcircuit 7 days ago | parent | prev | next [-]

>Number of extended grapheme clusters (1 in this case)

Only if you are using a new enough version of Unicode. If you were using an older version, it would be more than 1. As new Unicode updates come out, the number of grapheme clusters a string has can change.

minebreaker 7 days ago | parent | prev | next [-]

> We would not have this problem if we all agreed to return the number of bytes instead.

I don't understand. It depends on the encoding, doesn't it?

com2kid 7 days ago | parent | prev | next [-]

How would that help? Languages that use UTF-8, UTF-16, and UTF-32 would still report different numbers.

jibal 7 days ago | parent | prev | next [-]

> if we all decided to report the number of bytes that the string used instead of the number of printable characters

But that isn't the same across all languages, or even across all implementations of the same language.

baq 7 days ago | parent | prev [-]

When I'm reading text on a screen, I very much am not reading bytes. This is obvious when you actually think about what 'text encoding' means.

account42 7 days ago | parent [-]

You're not reading Unicode code points either, though. Your computer uses bytes, and you read glyphs, which roughly correspond to Unicode extended grapheme clusters. Anything in between might look like the correct solution at first but is the wrong abstraction for almost everything.

	baq 7 days ago | parent [-]
		You are right, but this just drives the point home.