naikrovek 4 days ago

I think maybe you've forgotten about the rune type. Rune does make assumptions.

[]rune is for sequences of Unicode code points. rune is an alias for int32. string, I think, is an alias for []byte.

TheDong 3 days ago | parent [-]

`string` is not an alias for []byte.

Consider:

    s := string([]byte{226, 150, 136, 226, 150, 136})
    for i, chr := range s {
      fmt.Printf("%d = %v\n", i, chr)
      // note: chr is the decoded code point (9608 both times), not the byte s[i] (226)
    }

How many times does that loop over 6 bytes iterate? The answer is it iterates twice, with i=0 and i=3.

There are also quite a few standard APIs that behave weirdly if a string is not valid UTF-8, which wouldn't be the case if it were just a bag of bytes.

naikrovek 15 hours ago | parent [-]

Go programmers (and `range`) assume that a string is always valid UTF-8, but the language does not guarantee it. The contents of a string are still just bytes, as in a []byte. `range` sees the `string` type and has special handling for strings that it does not have when it ranges over []byte. Recall that `string` and []byte are distinct types even when they hold the same bytes, so the language is free to treat them differently.
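
A small sketch of that difference, ranging over the same six bytes once as a string and once as a []byte. The helper names `stringIndexes` and `byteIndexes` are just for this example.

```go
package main

import "fmt"

// stringIndexes returns the index seen on each iteration when ranging over a string.
// range decodes UTF-8 here, so there is one step per code point.
func stringIndexes(s string) []int {
	var idx []int
	for i := range s {
		idx = append(idx, i)
	}
	return idx
}

// byteIndexes returns the index seen on each iteration when ranging over a []byte.
// No decoding happens, so there is one step per byte.
func byteIndexes(b []byte) []int {
	var idx []int
	for i := range b {
		idx = append(idx, i)
	}
	return idx
}

func main() {
	raw := []byte{226, 150, 136, 226, 150, 136} // U+2588 twice, encoded as UTF-8

	fmt.Println(stringIndexes(string(raw))) // [0 3]
	fmt.Println(byteIndexes(raw))           // [0 1 2 3 4 5]

	// With bytes that are not valid UTF-8, range still works: each bad byte
	// is yielded as U+FFFD and the index advances by one.
	fmt.Println(stringIndexes("\xff\xfe")) // [0 1]
}
```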

A couple of quotes from the Go Blog by Rob Pike:

> It’s important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

> Besides the axiomatic detail that Go source code is UTF-8, there’s really only one way that Go treats UTF-8 specially, and that is when using a for range loop on a string.

Both from https://go.dev/blog/strings

If you need guaranteed UTF-8, check for it with the functions in unicode/utf8, such as utf8.ValidString. Using `string` alone is not sufficient unless you make sure you only put UTF-8 into those strings.

If you put valid UTF-8 into a string, you can be sure that the string holds valid UTF-8, but if someone else puts data into a string, and you assume that it is valid UTF-8, you may have a problem because of that assumption.
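
One way to avoid relying on that assumption is to check strings where they enter your code. This is only a sketch: `requireUTF8`, `sanitizeUTF8`, and `errInvalidUTF8` are hypothetical names, and whether to reject or repair bad input depends on your application. `utf8.ValidString` and `strings.ToValidUTF8` are the standard library calls doing the work.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for this example.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// requireUTF8 rejects any string that is not valid UTF-8.
func requireUTF8(s string) (string, error) {
	if !utf8.ValidString(s) {
		return "", errInvalidUTF8
	}
	return s, nil
}

// sanitizeUTF8 repairs instead of rejecting: each run of invalid bytes
// is replaced with a single U+FFFD.
func sanitizeUTF8(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	inputs := []string{"héllo", "a\xff\xfeb"} // second one holds two invalid bytes
	for _, in := range inputs {
		if _, err := requireUTF8(in); err != nil {
			fmt.Printf("%q rejected: %v; sanitized: %q\n", in, err, sanitizeUTF8(in))
			continue
		}
		fmt.Printf("%q accepted\n", in)
	}
}
```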