| ▲ | senfiaj 6 hours ago |
| I wonder, why not use a string buffer paired with its length? For example, use a struct that holds a char pointer and two ints (occupied length + total buffer capacity), almost like C++'s std::string. The null terminator really sucks: it's potentially insecure and often slow. |
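A minimal sketch of the struct described above, assuming hypothetical names (`strbuf`, `strbuf_append`, `strbuf_free`) and `size_t` fields in place of the two ints. It is one possible layout, not a standard type:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-tracked string buffer; all names are illustrative. */
typedef struct {
    char  *data; /* heap buffer, not null-terminated */
    size_t len;  /* bytes currently in use */
    size_t cap;  /* bytes allocated */
} strbuf;

/* Appends n bytes from src, growing the buffer geometrically.
   Returns 0 on success, -1 on overflow or allocation failure. */
static int strbuf_append(strbuf *sb, const char *src, size_t n)
{
    if (n == 0)
        return 0;
    if (n > SIZE_MAX - sb->len)
        return -1;
    size_t need = sb->len + n;
    if (need > sb->cap) {
        size_t new_cap = sb->cap ? sb->cap : 16;
        while (new_cap < need) {
            if (new_cap > SIZE_MAX / 2) { new_cap = need; break; }
            new_cap *= 2;
        }
        char *p = realloc(sb->data, new_cap);
        if (p == NULL)
            return -1;
        sb->data = p;
        sb->cap = new_cap;
    }
    /* The stored length says where to write; no scan for a terminator. */
    memcpy(sb->data + sb->len, src, n);
    sb->len = need;
    return 0;
}

/* Releases the buffer and resets the struct to the empty state. */
static void strbuf_free(strbuf *sb)
{
    free(sb->data);
    sb->data = NULL;
    sb->len = 0;
    sb->cap = 0;
}
```

Starting from `strbuf sb = {0};`, each append costs time proportional to the bytes added, and the length is always available without a scan.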
|
| ▲ | maxlybbert 2 hours ago | parent | next [-] |
| It's definitely possible. And common, at least in some projects. The only real drawback is that sloppiness will lead to multiple slightly different nonstandard string types in the same project. |
|
| ▲ | bnolsen 4 hours ago | parent | prev | next [-] |
| That's called a fat pointer. Null-terminated C strings are behind the majority of memory errors out there. |
|
| ▲ | WalterBright 4 hours ago | parent | prev | next [-] |
| Wonder no longer! https://dlang.org/spec/arrays.html#dynamic-arrays and https://dlang.org/spec/arrays.html#strings and for C: https://digitalmars.com/articles/C-biggest-mistake.html |
|
| ▲ | GalaxyNova 6 hours ago | parent | prev | next [-] |
| Yes, I have seen it happen a few times: `strlen` gets called in a loop and silently turns O(N) into O(N^2). |
| |
| ▲ | jkrejcha 5 hours ago | parent | next [-] | | Reminds me of an article[1] whose author cut GTA Online loading times by 70% after finding that strlen was effectively getting called for every character in a string. [1]: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times... | | |
| ▲ | sweetjuly 5 hours ago | parent [-] | | I remember reading this blog post when it was first published, but the subsequent updates are better than I would've ever expected this to turn out. Worth checking it out again if you've seen it before :) |
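A small illustration of the `strlen`-in-a-loop pattern mentioned above, using hypothetical function names. The loop body writes to the string, so a compiler generally cannot assume the length is unchanged and hoist the call out of the loop condition:

```c
#include <ctype.h>
#include <string.h>

/* Quadratic: strlen() rescans the whole string on every iteration. */
static void upper_slow(char *s)
{
    for (size_t i = 0; i < strlen(s); i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}

/* Linear: the length is computed once and reused. */
static void upper_fast(char *s)
{
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}
```

Both functions produce the same result; only the number of passes over the string differs. With a stored length the second form is the only one you would naturally write.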
| |
| ▲ | senfiaj 6 hours ago | parent | prev | next [-] | | Exactly, you can't write clean, concise code when working with C strings. Almost every C string manipulation carries cognitive load: "Is the buffer big enough (including the null terminator), or should I reallocate it?", "I need to keep the offset from the last concat to make the next concats performant", "Umm, should I put the null terminator at i or i + 1?"... It really sucks, it's akin to death by a thousand cuts. | |
| ▲ | sgerenser 4 hours ago | parent | prev [-] | | Joel Spolsky coined the term “Shlemiel the Painter’s Algorithm” for this type of thing back in 2001: https://www.joelonsoftware.com/2001/12/11/back-to-basics/ |
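A sketch of the "keep the offset from the last concat" bookkeeping discussed above, assuming a hypothetical helper `join_words`. Repeated `strcat` calls would rescan the destination from the start each time; tracking the end pointer avoids that, at the price of the manual size arithmetic the thread complains about:

```c
#include <string.h>

/* Joins `count` words with single spaces into `out` (capacity `cap` bytes,
   including the terminator). Returns 0 on success, -1 if it does not fit.
   `end` always points at the current terminator, so nothing is rescanned. */
static int join_words(char *out, size_t cap,
                      const char *const *words, size_t count)
{
    if (cap == 0)
        return -1;
    char *end = out;
    *end = '\0';
    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(words[i]);
        size_t sep = (i > 0) ? 1 : 0;
        size_t remaining = cap - (size_t)(end - out);
        /* Need sep + n bytes plus one byte for the terminator. */
        if (n + sep >= remaining)
            return -1;
        if (sep)
            *end++ = ' ';
        memcpy(end, words[i], n + 1); /* copies the terminator too */
        end += n;
    }
    return 0;
}
```

Note the off-by-one hazard in the capacity check: "one two three" is 13 characters and needs a 14-byte buffer.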
|
|
| ▲ | none_to_remain 6 hours ago | parent | prev | next [-] |
| The size overhead of that is 2*sizeof(int) while the overhead of null termination is sizeof(char). If I remember the standard right, the former is worse by at least sizeof(char), and usually more in practice. This used to matter, sometimes still does. |
| |
| ▲ | kgeist 5 hours ago | parent | next [-] | | I would assume the difference is mostly negligible in practice, since the allocator rounds the allocated size up to at least a multiple of the word size anyway (for alignment and simpler bookkeeping). You can also use a variable-length encoding in the header to use 1 byte for most cases, similar to how UTF-8 does it: if the most significant bit is not set, we assume a 7-bit encoding, which can represent string lengths up to 127 using 1 byte, which is probably 99% of strings. | |
| ▲ | senfiaj 6 hours ago | parent | prev [-] | | Well, I'm not saying to always use it, but if the string is big enough, the overhead of 2 ints becomes vanishingly small in relative terms. For generic dynamically sized strings it probably has more advantages than disadvantages. But in any case, sure, if every single byte matters or some structure requires a specific memory layout, then fine. I just don't think these cases are the majority. Keep in mind that cached lengths can also improve performance, since you don't have to recalculate string lengths. |
|
|
| ▲ | MBCook 4 hours ago | parent | prev | next [-] |
| A lot of them are strings coming from or going to user space, right? So wouldn’t you have to do constant conversions? |
|
| ▲ | chiph 6 hours ago | parent | prev [-] |
| Pascal did/does this, but eventually someone wants a string longer than the size portion can handle. Or wants the number of characters not the number of bytes. |
| |
| ▲ | jerf 5 hours ago | parent | next [-] | | I wasn't a programmer in those days, so I don't know if there's some other major concern that would kill this, but I sometimes wonder whether we could have / should have used variable-length integers. That is, something like: strings of 0-127 bytes get a one-byte length prefix, 128 - 16383 get two bytes of prefix, and the probably-rare 16384 - 2097151 strings would end up with three, though proportionally by that point it's hardly anything. Or you could use the UTF-8 mechanism for packing the bytes, though that costs more and probably doesn't get us anything we'd have cared about in the 1980s or 1990s. It's a bit of extra code, yes. Not necessarily all that much, but some. On average it is only slightly more expensive than null termination, and considered as a proportion of the size of the strings themselves it's hardly anything. It's probably better than the strings getting hard-limited to 0-255, though, which was quite frequently a user-visible quirk. | |
| ▲ | Johanx64 4 hours ago | parent | prev [-] | | Dude, every sane language out there does this. Just generally with a 4-byte prefix. Null-terminated stuff has always been backwards-compat stuff. Pascal strings (historically, and the reason people even remember this being an issue) were limited to 255 chars; beyond that you had to use a different string type. You might still want raw pointers for all sorts of low-level stuff, but you almost never want null-terminated strings for anything but back-compat. They're one of the worst things ever, even on memory-constrained systems. |
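A sketch of the variable-length length prefix jerf describes, assuming a base-128 layout (7 payload bits per byte, high bit set when more bytes follow, least significant group first). The function names are hypothetical, and this is one possible encoding rather than anything a historical Pascal or C implementation used:

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes len as a base-128 varint into out (needs room for up to 10 bytes).
   Returns the number of bytes written. */
static size_t varlen_encode(uint64_t len, unsigned char *out)
{
    size_t i = 0;
    while (len >= 0x80) {
        out[i++] = (unsigned char)((len & 0x7F) | 0x80); /* more bytes follow */
        len >>= 7;
    }
    out[i++] = (unsigned char)len; /* final byte, high bit clear */
    return i;
}

/* Decodes a varint from at most avail bytes. Returns the number of bytes
   consumed, or 0 if the input is truncated or longer than 10 bytes. */
static size_t varlen_decode(const unsigned char *in, size_t avail,
                            uint64_t *len)
{
    uint64_t v = 0;
    for (size_t i = 0; i < avail && i < 10; i++) {
        v |= (uint64_t)(in[i] & 0x7F) << (7 * i);
        if ((in[i] & 0x80) == 0) {
            *len = v;
            return i + 1;
        }
    }
    return 0;
}
```

With this layout the prefix sizes match the ranges in the comment: one byte up to 127, two up to 16383, three up to 2097151.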
|