naikrovek 15 hours ago

Go programmers (and `range`) assume that a string is always valid UTF-8, but the language makes no such guarantee. Under the hood a string is just a read-only sequence of bytes. `range` sees the `string` type and decodes UTF-8 as it iterates, which it does not do when it ranges over a []byte. Recall that `string` and `[]byte` are distinct types, so they are never treated as the same type without an explicit conversion.

A couple quotes from the Go Blog by Rob Pike:

> It’s important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

> Besides the axiomatic detail that Go source code is UTF-8, there’s really only one way that Go treats UTF-8 specially, and that is when using a for range loop on a string.

Both from https://go.dev/blog/strings
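
To make the `range` difference concrete, here is a minimal sketch. The string literal and the helper name `decodeWithRange` are made up for illustration; the behaviour shown (an invalid byte coming out of `range` as U+FFFD) is what the language spec describes for ranging over a string.

```go
package main

import "fmt"

// decodeWithRange (hypothetical helper) collects the runes that a
// for-range loop yields when iterating over s.
func decodeWithRange(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	// "a", then a lone 0xff byte (never valid in UTF-8), then "b".
	// The compiler accepts this; nothing checks the contents.
	s := "a\xffb"

	// Ranging over the string decodes UTF-8. The invalid byte is
	// reported as the replacement character U+FFFD, one byte wide.
	for i, r := range s {
		fmt.Printf("string index %d: %U\n", i, r)
	}

	// Ranging over the same data as []byte does no decoding at all.
	for i, b := range []byte(s) {
		fmt.Printf("byte index %d: 0x%02x\n", i, b)
	}

	fmt.Println(len(decodeWithRange(s)), "runes,", len(s), "bytes")
}
```

Note that once the invalid byte has been turned into U+FFFD by `range`, you can no longer tell it apart from a real U+FFFD that was in the input.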

If you need a guarantee that something is valid UTF-8, check it with the functions in unicode/utf8 (`utf8.ValidString`, for example). Using `string` alone is not sufficient unless you make sure you only ever put UTF-8 into those strings.
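
A rough sketch of what that check could look like at the point where outside data enters your program. `sanitize` is a hypothetical helper name, and replacing bad bytes with U+FFFD is just one policy I picked for the example; rejecting the input outright is equally reasonable. `utf8.ValidString` and `strings.ToValidUTF8` are both in the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize (hypothetical helper) returns s unchanged when it is valid
// UTF-8. Otherwise it replaces each run of invalid bytes with a single
// U+FFFD replacement character.
func sanitize(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	inputs := []string{
		"héllo",        // valid UTF-8
		"a\xff\xfeb",   // two invalid bytes in a row
	}
	for _, in := range inputs {
		fmt.Printf("valid=%t sanitized=%q\n", utf8.ValidString(in), sanitize(in))
	}
}
```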

If you put valid UTF-8 into a string, you can be sure that the string holds valid UTF-8, but if someone else puts data into a string, and you assume that it is valid UTF-8, you may have a problem because of that assumption.