| ▲ | happytoexplain a day ago |
| > poorly designed String API
|
| Nope nope nope. I have to agree strongly with my sibling commenter. Every other language gets it horribly wrong. In app dev (Swift's primary use case), strings are most often semantically sequences of graphemes. And, if you care at all about computer science, array subscripting must be O(1). Swift does the right thing for both requirements. Beautiful. OK, yes, maybe they should add a native `nthCharacter(n:)`, but that's nitpicking. It's a one-liner to add yourself. |
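| For illustration, a minimal sketch of the kind of one-liner alluded to above. `nthCharacter(n:)` is not a standard library API, and this particular implementation is an assumption:

```swift
extension String {
    /// Returns the n-th Character (extended grapheme cluster), or nil if out of range.
    /// Note: Characters are variable-width, so this walks the string in O(n), not O(1).
    func nthCharacter(n: Int) -> Character? {
        guard n >= 0,
              let i = index(startIndex, offsetBy: n, limitedBy: endIndex),
              i < endIndex
        else { return nil }
        return self[i]
    }
}

let s = "cafe\u{301}"               // "café": the last Character is e + U+0301 (two code points)
print(s.nthCharacter(n: 3) ?? "-")  // "é", a single grapheme cluster
print(s.count)                      // 4
```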
|
| ▲ | tialaramex a day ago | parent | next [-] |
| I don't think Rust gets this horribly wrong. &str is some bytes which we've agreed are UTF-8 encoded text. So it's not a sequence of graphemes, though it does promise that it could be interpreted that way, and it is a sequence of bytes, but not just any bytes.
|
| In Rust, "AbcdeF"[1] isn't a thing, it won't compile, but "AbcdeF"[1..=1] says we want the UTF-8 substring from byte 1 through byte 1, and that compiles, and it works because that string does have a valid UTF-8 substring there: it's "b". However, it'll panic if we try "€300"[1..=1], because that's no longer a valid UTF-8 substring; that's nonsense.
|
| For app dev this is too low level, but it's nice to have a string abstraction that's at home on a small embedded device, where it doesn't matter whether I can interpret flags, or an emoji with appropriate skin tones, or whatever else, as a distinct single grapheme in Unicode, but where we would still like to do a bit better than "Only ASCII works in this device" in 2025. |
| |
| ▲ | Someone a day ago | parent | next [-] |
|
| > I don't think Rust gets this horribly wrong
|
| > In Rust "AbcdeF"[1] isn't a thing, it won't compile, but "AbcdeF"[1..=1] says we want the UTF-8 substring starting from byte 1 through to byte 1 and that compiles, and it'll work because that string does have a valid UTF-8 substring there, it's "b" -- However it'll panic if we try to "€300"[1..=1]
|
| I disagree. IMO, an API that uses byte offsets to take substrings of Unicode code points (or even larger units?) is already a bad idea, but then having it panic when the byte offsets don't happen to fall on code point / (extended) grapheme cluster boundaries? How are you supposed to use that when, as you say, "we would like to do a bit better than 'Only ASCII works in this device' in 2025"? I see there's a better API that doesn't panic (https://doc.rust-lang.org/std/primitive.str.html#method.get), but IMO that still isn't as nice as Swift's choice, because it still uses byte offsets. |
| ▲ | tialaramex a day ago | parent [-] |
|
| > How are you supposed to use that [...]?
|
| It's often the case that we know where the substring we want starts and ends, so this operation makes sense; because we know there's a valid substring there, it won't panic. For example, if we know there are literal colons at bytes 17 and 39 in our string foo, then foo[18..39] is the UTF-8 text from bytes 18 to 38 inclusive, i.e. the string between those colons.
|
| One source of confusion here is not realising that UTF-8 is a self-synchronising encoding. There are a lot of tricks that are correct and fast with UTF-8 but would be a disaster in other multi-byte encodings, or if (which is never the case in Rust) this weren't actually a UTF-8 string. |
| |
| ▲ | zzo38computer a day ago | parent | prev [-] |
|
| You can do better than "only ASCII works in this device", and making the default string type Unicode is the wrong way to do that. For some applications you might not need to interpret text at all, or you might need to interpret only ASCII text even if the text is not necessarily purely ASCII; other times you will want to do other things. But Unicode is not a very good character set (there are others, but what is appropriate will depend a lot on the specific application; sometimes none are appropriate). Even if you are using Unicode, you still don't need a Unicode string type, and you don't need it to check for valid UTF-8 on every string operation by default, because that will result in inefficiency. |
| ▲ | tialaramex 19 hours ago | parent [-] |
|
| In 1995 what you describe isn't crazy. Who knows if this "Unicode" will go anywhere. In 2005 it's rather old-fashioned. There's lots of 8859-1 and cp1252 out there but people aren't making so much of it, and Unicode aka 10646 is clearly the future. In 2015 it's a done deal. Here we are in 2025. Stop treating non-Unicode text as anything other than an aberration. You don't need checks "for every string operation". You need a properly designed string type. |
|
|
|
| ▲ | ks2048 a day ago | parent | prev [-] |
| I think using "extended grapheme clusters" (EGC) (rather than code points or bytes) is a good idea. But why not let you do "x[:2]" (or "x[0..<2]") to get a String with the first two EGCs? (Maybe better yet, make that return "String?".) |
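| For reference, a sketch of how that reads with what Swift offers today: prefix(_:) counts Characters (i.e. EGCs), while integer subscripts are deliberately absent. The example string is illustrative:

```swift
let s = "🇨🇦abc"                     // the flag is two code points but one Character

let firstTwo = String(s.prefix(2))  // "🇨🇦a", counted in extended grapheme clusters
// let bad = s[0..<2]               // does not compile: String has no Int-based subscript
```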
| |
| ▲ | ezfe a day ago | parent | next [-] |
|
| Because that would imply that String is a random-access collection. You cannot index into a String in constant time, so the API doesn't let you use array-style subscripting. If you know it's safe to do, you can get a representation as a list of UInt8 and then index into that. |
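| As a concrete sketch of the two routes described here (the example string is borrowed from the Rust subthread above):

```swift
let s = "€300"

// Route 1: bytes. Array is a random-access collection, so integer indexing is O(1).
let bytes = Array(s.utf8)           // [0xE2, 0x82, 0xAC, 0x33, 0x30, 0x30]
print(bytes[1])                     // 130 (0x82), a continuation byte in the middle of "€"

// Route 2: Characters (grapheme clusters). Indexing goes through String.Index and is O(n).
let i = s.index(s.startIndex, offsetBy: 1)
print(s[i])                         // "3", the second Character
```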
| ▲ | zzo38computer a day ago | parent | prev [-] |
|
| I disagree. I think it should be indexed by bytes. One reason is what the other comment explains about indexing not being constant-time (which is a significant reason); the other is that it restricts the string type to Unicode (which has its own problems) and to specific versions of Unicode, and can potentially cause problems when a different version of Unicode is in use. A separate library can be used to deal with code points and/or EGCs if that is important for a specific application; these features should not be inherent to the string type. |
| ▲ | novok a day ago | parent | next [-] |
|
| In practice, that is tiring as hell: verbose, awkward, unintuitive, requiring index types tied to a specific string instance just to do numeric indexing, and a whole bunch of other unnecessary ceremony not required in other languages. We don't care that it takes longer, we all know that; we still need to do a bunch of string operations anyway, and doing the equivalent thing is way worse in Swift than in pretty much any other language. |
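| To make the complaint concrete, this is roughly the ceremony being described for a simple slice (a sketch; the string and offsets are illustrative):

```swift
let s = "Hello, world"

// Indices are opaque and tied to this particular string instance.
let start = s.index(s.startIndex, offsetBy: 7)
let end = s.index(start, offsetBy: 5)
print(s[start..<end])               // "world"
```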
| ▲ | ks2048 a day ago | parent | prev [-] |
|
| I don't think you can separate String from Unicode - that's what a "String" is in Swift. |
| ▲ | zzo38computer a day ago | parent [-] |
|
| In Swift (and in other programming languages) String does use Unicode, but I think it would probably be better if it didn't. But even when there is a Unicode string type, I still think it probably should not be indexed by grapheme clusters, in my opinion; I explained some of the reasons for this above. |
|
|
|