mrlonglong 9 hours ago

The zero terminated string is, I think, computing's biggest mistake. Pascal style strings were much safer.

BobbyTables2 2 hours ago | parent | next [-]

Partly agree but there would have been squabbling on the data type of the size, unless it was variable length. The latter would have had other issues too.

For a while, 16bit would probably have seemed too extravagant. Now 32bit would probably seem too small.

For a “strongly typed” language, C is pretty damn loose where it would have mattered.

DarkUranium an hour ago | parent | next [-]

I like the D approach where arrays are just `struct { size_t length; T* ptr; }` internally --- and strings are just arrays of `immutable(char)`.

It has a big advantage over the Pascal approach in that you can do zero-copy slicing, since the length is separate from the actual data.

And `size_t` makes perfect sense for the length here. If your strings are longer than the address space (which `size_t` technically isn't, but is practically very strongly correlated to it), then you're going to have a problem regardless of the number of bits for the length anyway.
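For readers who think in C, the length-plus-pointer layout described above might be sketched roughly like this. It is only an illustration, not D's actual implementation: the names `str_slice`, `slice_from_cstr`, `slice_sub` and `slice_equals` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical C analogue of a D-style slice: the length lives next to the
   pointer, not in front of the character data. */
typedef struct {
    size_t length;
    const char *ptr;
} str_slice;

/* Wrap an existing NUL-terminated string; the strlen cost is paid once. */
static str_slice slice_from_cstr(const char *s) {
    str_slice out = { strlen(s), s };
    return out;
}

/* Zero-copy sub-slice [begin, end): only the pointer and length change,
   the underlying bytes are shared with the parent slice. */
static str_slice slice_sub(str_slice s, size_t begin, size_t end) {
    assert(begin <= end && end <= s.length);
    str_slice out = { end - begin, s.ptr + begin };
    return out;
}

/* Compare a slice with a C string; the slice itself needs no terminator. */
static int slice_equals(str_slice s, const char *cstr) {
    size_t n = strlen(cstr);
    return s.length == n && memcmp(s.ptr, cstr, n) == 0;
}
```

Taking `slice_sub(whole, 7, 12)` of a slice over "hello, world" yields a view of "world" that points into the original buffer, which is the zero-copy property a length-prefixed Pascal string cannot offer.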

poly2it an hour ago | parent | prev [-]

No, there would not have been and this is most likely not the reason. size_t exists for precisely this use case. It has existed since C89.

smackeyacky 3 hours ago | parent | prev | next [-]

Zero terminated strings were the basis for an awful lot of useful software. Calling them the biggest mistake in computing is a bit OTT.

I haven’t programmed anything Pascal related for 30+ years but I dimly remember thinking at the time that I wished the string system wasn’t so hard to use.

asdfasgasdgasdg 3 hours ago | parent | next [-]

That useful software would not have been less useful if the strings in it were represented as size + buf.

ComputerGuru 2 hours ago | parent | prev [-]

That argument isn’t valid. The argument would be “this string design enabled a whole lot of useful software” but that’s a different matter. (And it could very well be the case.)

layer8 4 hours ago | parent | prev | next [-]

Almost as bad as newline-terminated lines. ;)

dmazzoni 2 hours ago | parent | prev | next [-]

255 characters ought to be enough for everybody, right?

bsder 6 hours ago | parent | prev | next [-]

Zero terminated string is a special case of sentinel value termination.

And sentinel value terminations make a lot of sense when you have punch cards and fixed length records that you need to carve into pieces.

Nobody expected any decisions they were making in the 1960s and 1970s to have any bearing on computing a half-century later. They all expected to have their mistakes long papered over by smarter people at some point.

But we ALL make the mistake of underestimating inertia.

jackbucks 8 hours ago | parent | prev | next [-]

It was definitely an interesting way to allocate pointers. I once had a very large project where devs didn't understand this, and I resolved hundreds or more off-by-one errors and memory overwrites in C due to this feature.

But at the same time, I think blaming the software was kind of a cop-out. Devs were in a hurry and simply didn't respect the rules. Given today's software engineers at large, nerfing programming languages so they can't destroy things might not be a bad idea. But AI will nerf everything.

fragmede 8 hours ago | parent [-]

why is AI gonna nerf everything? sure it could be used as the easy button, but I just spent two hours this morning learning about the neuroscience of how memory works in the brain that I didn't mean to and now I want to run studies on how memory works.

Why do you assume that AI is gonna nerf everything?

AnimalMuppet 6 hours ago | parent [-]

AGI might. AI? No way.

See, AI was trained on existing data - on all that existing C code out there (sure, and also on all the papers and articles saying what was wrong with that C code). Those bugs are in the training data, and often not marked as bugs. So when AI generates C code, is it going to avoid making the mistakes that human code made? No, it's going to generate the kind of code it was trained on. How could it be otherwise?

That's not going to nerf anything.

deathanatos 4 hours ago | parent | next [-]

> See, AI was trained on existing data - on all that existing C code out there (sure, and also on all the papers and articles saying what was wrong with that C code). Those bugs are in the training data, and often not marked as bugs. So when AI generates C code, is it going to avoid making the mistakes that human code made? No, it's going to generate the kind of code it was trained on. How could it be otherwise?

The generalization of this is why I think all these AI companies writing blog posts where the marketing department is just jer—ranting endlessly about how AI will improve itself into the singularity is just crazy talk. They generate a random statistically likely output, and the most statistically likely output is mid. Exceptional outputs — the ones that wow us or move the needle are exactly that, unlikely. AGI is sci-fi, and LLMs will not change that.

You can see the same effect when AI emits bash, too, and especially so since most bash is terrible, and most users of bash do not put in the effort to learn bash and its foibles. So it outputs what most people write, which is not great.

AnimalMuppet 4 hours ago | parent [-]

It still could happen, if they had a way to judge the exceptional outputs from the mid and terrible ones. But I'm not sure they have that...

ComputerGuru 2 hours ago | parent [-]

I'm far from an AI fanatic, but I would argue training it on GitHub PRs and general software patches already provides that. Instead of just seeing the static snapshot, it sees “this code was replaced by this (hopefully better) code”.

CamperBob2 5 hours ago | parent | prev [-]

When's the last time you saw a decent coding model create a buffer-overflow bug while trying to use C strings?

Serious question. Anyone else seen this happen in the last 12-18 months? If so, which model and version were you using?

smj-edison 16 minutes ago | parent | next [-]

I use Zig, which has slices, so so far none. But man, it can't get ref counting right to save its life. There have been remarkably few times it's gotten it right on the first try. My codebase considers OOM recoverable, so it keeps forgetting to clean up memory when OOM is raised. Even in the happy path though it still messes up ref counting. I use Kimi k2.6.

smackeyacky 3 hours ago | parent | prev | next [-]

I had Claude write a bit of stupid C# the other day that had an off by one string truncate. Surprised the hell out of me.

krupan 3 hours ago | parent | prev | next [-]

How many people are writing C code with LLMs? I get the impression it's mostly JavaScript web apps

CamperBob2 2 hours ago | parent [-]

All the time. C, C++, occasionally some VHDL or Verilog.

macintux 4 hours ago | parent | prev [-]

Would you even know? Serious question. The volume of code the models can produce, the subtle ways these bugs can manifest (or even only manifest when under attack), it seems like they would be easy to overlook.

CamperBob2 4 hours ago | parent [-]

I have a habit of getting GPT 5.5 to review everything Opus writes for me, and vice versa. The model in the reviewer role frequently finds things I overlooked myself. Occasionally in parts of the code I wrote.

No modern LLM has found any buffer overflow bugs in parts of my code that originated from another LLM. Again, though, they have found one or two that were my fault.

dietr1ch 8 hours ago | parent | prev | next [-]

I think it was NULL itself. It was a long way until we realised we don't want invalid values and could use the type system to help us use special values safely.

jkrejcha 6 hours ago | parent | next [-]

The problem here is that null is kinda a consequence of the intentional design of the type system itself. In this way, I do think that null was discovered, rather than invented. Remember, C is a kinda "portable assembler" so the constructs in it are based relatively closely on how low level data structures are mapped out in memory.

This is, and continues to be, an incredibly useful feature that makes C and C structs immensely useful concepts. Part of that does need an invalid value[1]. NULL is convenient for this and although there are some very weird JavaScript-trinity-meme-style consequences for this[2], it's such a useful concept that basically all languages that have the ability to construct pointers have a null pointer[3].

The alternative world looks like everyone inventing their own invalid values. Invalid, non-null, pointers are typically MUCH worse than null pointers for debuggability and security. If you unintentionally read/write/execute memory at 0x0 (by far the most common value for NULL), most operating systems will trap this, whereas they may not necessarily do so if 0x12345678 is your invalid value.

[1]: Stuff like IA64 had NaT bits which were effectively an extra bit for what I assume to be this sorta thing. The problem with this is that it costs an extra bit. I don't really know much about IA64, but presumably [NaT 1] + [don't care] would be your null pointers here. I think?

[2]: Really what the standard, in my opinion, should have done is probably not make use of the null pointer UB for many different functions. A lot of compilers took the UB surrounding that to make incredibly dubious "optimizations" that broke stuff with zero actual performance benefit whatsoever

[3]: Yes, even Rust. Although some (again in my opinion) unfortunate design decisions made it so that C-Rust FFI isn't zero cost because of how it treats spans/slices

atherton94027 5 hours ago | parent | prev | next [-]

Genuinely curious, how would you handle cases where a value is unset without NULL? This is a legitimate case that happens a lot in eg data modeling

clnhlzmn 4 hours ago | parent | next [-]

The way we do it in modern languages, with things like std::optional, and even that is not the best example.

MBCook 4 hours ago | parent [-]

And in higher-level languages that works. But what do you do when you get down to low-level C or assembly?

You basically end up with null/0 don’t you?

paavohtl 9 minutes ago | parent | next [-]

Rust is a significantly higher level language than C, but it can be used in almost all environments where C is used, provided there's a supported compiler target for it. In (safe) Rust, null is basically a guaranteed compiler optimization. Optional / nullable values are represented via Option<T>, which is a sum type of Some(T) and None. When a reference or other pointer-like value (e.g. Box<T>, an owned heap allocation) is wrapped in Option, the compiler can use the invalid bit patterns of T (such as null) to represent the None variant. This is called niche optimization.

So yes, it's nulls underneath, but the developer never has to think about them.

dietr1ch 23 minutes ago | parent | prev [-]

Eventually you end up with registers that probably allow for 2^N values. But the point is not thinking about the machine executing the instructions, but the construction on top of it that has a safer design.

Seeking performance we've been very prone to avoid abstractions and over and over again have shown why we need the safe abstractions.

pdimitar 4 hours ago | parent | prev | next [-]

Sum types, of course.

atherton94027 19 minutes ago | parent [-]

How are you going to build sum types in a way where you can interact with assembly or machine code? The CPU doesn't know about that stuff

jibal 4 hours ago | parent | prev [-]

They already said:

> use the type system to help us use special values safely

... but this is not the place to explain what a type system is or what sum types/maybe/optional/etc. are.

bellowsgulch 7 hours ago | parent | prev | next [-]

Compared to scripting languages with actual tagged types, C doesn't really have a type system, and that's readily apparent to anyone who has written C in the last 43 years and debugged a program written in it.

C pretends types exist with you, but once bytes hit the road, it's all real-life and segmentation faults.

DarkUranium an hour ago | parent | next [-]

By that logic, no natively-compiled language has a type system.

Though I should note that in a way, even some ISAs have one, what with e.g. separate float vs integer registers.

AlotOfReading 3 hours ago | parent | prev [-]

C actually does have a type system and it's one of the bigger issues with the language. If it didn't, unaligned pointers and signed overflow would be totally fine.

jkercher 8 hours ago | parent | prev [-]

Meh, I think NULL is fine in C. It's an extra, valid state to represent pointers at no cost. Unlike the more hand holdy languages, it's quite rare for a pointer in C to have the ability to be NULL since, more often than not, it's pointing at something known. It's actually quite rare to see NULL checks unless it's API code or something like that. I can see this being more of a problem in a managed language where anything can be NULL at any time.

bvrmn 6 hours ago | parent | next [-]

NULL as a concept is fine. Inability to declare something as non-null is not.

There is a huge gap between the developer expectation "it's pointing at something known" and the hard reality confirmed by zillions of CVEs. That's the reason optionality is prevalent in modern languages and type checkers (Python, TypeScript); nowadays even Java has sane non-nullable types.

kelnos 7 hours ago | parent | prev | next [-]

> to represent pointers at no cost

I wouldn't call "cause of bugs and security issues" "no cost".

> it's quite rare for a pointer in C to have the ability to be NULL

As a C programmer for more than 25 years, that is the exact opposite of my experience.

none_to_remain 6 hours ago | parent | prev | next [-]

Struct foo has various members, including a bar*. But a foo may or may not be associated with a bar. If there's no associated bar, the bar* pointer is NULL. Seen and done this all the time
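A minimal sketch of that pattern, assuming nothing beyond what the comment says. The field names and the helper `foo_bar_id_or` are hypothetical and only there to make the example complete.

```c
#include <assert.h>
#include <stddef.h>

struct bar {
    int id;
};

struct foo {
    int value;
    struct bar *bar;  /* NULL means "this foo has no associated bar" */
};

/* Returns the associated bar's id, or fallback when there is none.
   The compiler does not force callers to make this check; it is a convention. */
static int foo_bar_id_or(const struct foo *f, int fallback) {
    assert(f != NULL);
    return f->bar != NULL ? f->bar->id : fallback;
}
```

Nothing in the type of `bar` says whether NULL is allowed, which is the gap the non-null and option-type arguments elsewhere in this thread are about.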

UqWBcuFx6NV4r 6 hours ago | parent | prev | next [-]

This precise mindset is why the world has suffered for decades (wrt security/integrity/availability) at the hands of what can only be described as an industry led by completely unjustified male confidence. Why are there still people fighting the “it’s not that bad, guys! you’ve just got to be a good developer like ME!” fight?

IgorPartola 6 hours ago | parent [-]

Is None OK in Python?

NULL in C just doesn’t belong at the end of a string. But IMO having a “there is no value here” designation is not a bad thing.

none_to_remain 6 hours ago | parent | next [-]

I think you're mixing up the NULL pointer and the NULL (sometimes NUL) character.

jibal 3 hours ago | parent | prev [-]

Python is interpreted so None is always tested for and will throw an exception if used in the wrong context. This is quite different from a SIGSEGV.

> But IMO having a “there is no value here” designation is not a bad thing.

Sure ... if it's done via the type system so that errors are caught at compile time. There's a reason that modern languages all either do this or are moving towards doing it. (And a reason that C programmers have no idea what we're talking about when we refer to type systems.)

> NULL in C just doesn’t belong at the end of a string.

Different discussion. (And NUL, not NULL.)

XorNot 6 hours ago | parent | prev [-]

The problem with let's get rid of NULL is that it's a real, required state. The vast majority of computing is actually not binary: any real input generally has at least 3 possible states: not set, true and false.

In practice really 4 because "indeterminate" is a reasonable error condition you'd like to know about.

And it keeps increasing anyway: e.g. not set has subcategories: not set due to lack of user input, not set because we're loading state from the backend etc.

NULL is the first expression of that basic problem: it's definitely not enough to eliminate NULL because the first thing which happens is your non-pointer default value takes its place.

lambdaone 5 hours ago | parent | next [-]

What you are describing is option types, which are an entirely valid and very useful construct that helps make programs more rather than less reliable. But you need proper language type system support and compile-time enforcement to make it work, and C does neither of those.

bnolsen 4 hours ago | parent [-]

C++ and Rust make these optionals ugly. Zig does it right. Zig also forbids null pointers and requires use of optionals.

msla 8 hours ago | parent | prev | next [-]

In addition to having to pick a size for the length counter and then, later, having to differentiate between lengths in bytes, codepoints, and glyphs, you can't subdivide a Pascal string using pointer arithmetic. To pass just the end of a string into a function, you have to either copy the tail of one Pascal-style string to another with a smaller size value, or your string has to be a struct with an integer and a pointer to the actual data instead of just an integer stuck on the beginning of the string. The first is a lot of copying in some cases, the second raises the specter of structs with invalid pointers. That's not to mention the potential problems that would cause with caches.

cornholio 6 hours ago | parent | next [-]

You can have a universal variable length field, for example 2 bytes for strings < 32768, then four bytes, 8 bytes etc. On the critical short string path, it costs just a single bit test. The glyph vs byte issues need to be dealt with in both formats.

The subdivision issue is a good perspective, but I would argue the performance impact of cloning substrings is dwarfed by the redundant full string reads to find length.

estebank 6 hours ago | parent | prev [-]

The third option is to have a variable width length: the top most bit signals whether the next byte corresponds to the length or to the start of the string.
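One way to read the variable-width idea is as a base-128 varint, where the top bit of each length byte says whether another length byte follows. The sketch below assumes that reading (it is the LEB128-style layout, not something the comment specifies), and the function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode len as a little-endian base-128 varint: 7 payload bits per byte,
   top bit set when more length bytes follow. Returns bytes written (1..10). */
static size_t varlen_encode(uint64_t len, unsigned char *out) {
    assert(out != NULL);
    size_t n = 0;
    do {
        unsigned char byte = (unsigned char)(len & 0x7F);
        len >>= 7;
        if (len != 0) {
            byte |= 0x80;  /* continuation bit */
        }
        out[n++] = byte;
    } while (len != 0);
    return n;
}

/* Decode a length prefix from at most avail bytes. Stores the length in *len
   and returns the prefix size, or 0 if the prefix is truncated or too long. */
static size_t varlen_decode(const unsigned char *in, size_t avail, uint64_t *len) {
    assert(in != NULL && len != NULL);
    uint64_t value = 0;
    for (size_t i = 0; i < avail && i < 10; i++) {
        value |= (uint64_t)(in[i] & 0x7F) << (7 * i);
        if ((in[i] & 0x80) == 0) {
            *len = value;
            return i + 1;
        }
    }
    return 0;
}
```

Under this layout, strings shorter than 128 bytes pay a single prefix byte and a single bit test on the read path, while longer strings grow the prefix one byte at a time.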

fragmede 6 hours ago | parent | prev | next [-]

Compared to von Neumann versus Harvard architecture for LLMs? I think that's a far bigger mistake.

themafia 8 hours ago | parent | prev | next [-]

> Pascal style strings were much safer.

The limitations were brutal. Initially you could only have 255 bytes in a string. The length of a string and the size of the allocation are now separate and you may need to think about that unused memory in your design. The problem now doubles with the introduction of UTF-8. Your string size is in bytes and you need to track characters separately.

If you want to create an array of strings you either need to specify the length of all strings and accept the memory overhead or have an array of pointers to strings. If you use an array of pointers you may end up choosing to use the 'nil' value as a sentinel that means "end of list." So we're right back where we started.

--

Because someone decided to downvote this HN has limited the speed at which I can reply. This site is tragic and I'm fully done with it now. You can spread propaganda and poorly sourced zeitgeist and be among friends but if you try to have a genuine conversation about programming languages you are made to be unwelcome immediately. Screw this.

--

> No other data structure works like this.

The linked list.

> You can't mess this up in an array

C happily decomposes arrays into pointers. You can erase your length information from the type. This was an intentional decision.

> Strings are the only data structure that assume there will be a NULL at end.

Which is why almost every string API has a version that allows you to specify the maximum length. The fact that you can use a NUL doesn't mean you have to. Which is why the concept of "sentinel values" is broadly used in many types of applications you haven't considered here.

dare944 3 hours ago | parent | next [-]

> You can spread propaganda and poorly sourced zeitgeist and be among friends but if you try to have a genuine conversation about programming languages you are made to be unwelcome immediately.

Indeed. And the ignorance of computing history in this discussion is particularly disturbing.

The context of this particular thread is "zero terminated string is ... computing's biggest mistake". This completely ignores the situation on the ground when C was developed. At the time, people were striving for a system programming language that sat above the level of assembly but was compact enough to run within the limited resources of the then emerging mini-computer systems. The PDP-11 on which C was developed was certainly not the first mini-computer, but it was among the earliest to have a regular enough instruction set and addressing model to make a general purpose, high-level systems language possible. These systems were extremely limited in memory; the PDP-11's instruction set is limited to directly addressing at most 64KiB (code and data) and many systems of the era were hardware limited to less than that. (Indeed, I regularly run an early version of Unix, including an early C compiler, on my PDP-11/05 which is maxed out at 56KiB [of actual core]). There was no way that even a brilliant engineer like Dennis Ritchie was going to be able to shoe-horn in "optional" types, or the mechanics of length-value strings into a compiler that has to run in such limited space, and produce code (e.g. the Unix kernel) that has to run in even less. The fact that strings and arrays are thin abstractions on top of pointers is both a brilliant compromise in design as well as a nod to then-prevalent assembly practice. It was exactly the kind of pragmatic decision that was needed to move computing along at the time. Of course the designs from this era are antiquated now. But they were not mistakes.

BigTTYGothGF 6 hours ago | parent | prev | next [-]

> Your string size is in bytes and you need to track characters separately

No worse than C strings then.

AlienRobot 8 hours ago | parent | prev [-]

>The problem now doubles with the introduction of UTF-8. Your string size is in bytes and you need to track characters separately.

That isn't really a problem.

The problem with null-terminated strings is specifically what happens when you reach the end of the allocated array and there ISN'T a NULL character.

Every string function is designed to keep going until it finds the NULL character, so if a hacker gets rid of the NULL character, he can exploit pretty much any standard string manipulation function being used elsewhere in the program to manipulate whatever memory comes AFTER the string data structure.

No other data structure works like this. You can't mess this up in an array, because no function that manipulates arrays is just going to keep going until there is a null. That would be stupid because it would require users of the function to add a NULL to the end of their arrays before passing it to the function, so instead we just pass the size of the array to everything. Strings are the only data structure that assume there will be a NULL at end.

By the way, I read once that if you use UTF-32 every code point will be 4 bytes, constantly, but even then a single code point isn't necessarily a single character. Text is just complicated.
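The missing-terminator failure described above can be guarded against by never scanning past the known size of the buffer. A rough sketch using only standard `memchr` follows; the helper names `bounded_len` and `is_terminated` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the string stored in buf, or cap if no terminator is found
   within the first cap bytes. Never reads past buf + cap. */
static size_t bounded_len(const char *buf, size_t cap) {
    assert(buf != NULL);
    const char *nul = memchr(buf, '\0', cap);
    return nul != NULL ? (size_t)(nul - buf) : cap;
}

/* Returns 1 when buf holds a properly terminated string within cap bytes. */
static int is_terminated(const char *buf, size_t cap) {
    return bounded_len(buf, cap) < cap;
}
```

An unbounded `strlen` on a 4-byte buffer holding 'a', 'b', 'c', 'd' would keep reading whatever comes next in memory, whereas `bounded_len` stops at 4 and `is_terminated` reports the problem. The catch is that the caller has to know the capacity, which is the same information a length-carrying string would have stored in the first place.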

tredre3 7 hours ago | parent [-]

> No other data structure works like this.

In C most data structures work like this, you keep going until you find NUL (character) or NULL (pointer). E.g. Strings, array of pointers, linked lists, etc. Of course you can add length to most of those, but it isn't the canonical/traditional way of doing things.

AlienRobot 6 hours ago | parent [-]

That can't be true. If you have an array of pointers it can be terminated in NULL. But an array of integers can't have a NULL value, since NULL would probably be just 0 which is a normal integer.

The null in a linked list is the null in the .next field, right? That's the way you would implement linked lists independent of language. It's not the .value that is null.

A string is an array of characters (well, for characters representable in one byte at least) that has a specific value to represent the end of string.

It would be like if Int::MAX was reduced by 1 to make space for an Int::NUL constant that represented the end of an integer array. Or if you were creating your own ENUM, let's say for NORTH, SOUTH, EAST, WEST, and you added a fifth enumeration called Direction.NUL for use in arrays.

jkrejcha 5 hours ago | parent [-]

With a variable-length array of structs, you can set all the fields to 0 at the cost of an extra member at the end. In the cases where this is done, the structures are such that (either intentionally or by consequence) something with all fields zero is outside of the function's domain.

A little bit related: https://devblogs.microsoft.com/oldnewthing/20091008-00/?p=16...
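A small sketch of that convention, with a made-up `struct command` table: the last entry is all zeros, and a NULL name is never a valid command, so it can act as the end marker.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry; a NULL name is outside the valid domain,
   so an all-zero entry can mark the end of the table. */
struct command {
    const char *name;
    int code;
};

/* Counts entries before the all-zero sentinel. The table must end with
   an entry whose name is NULL, otherwise this walks off the end. */
static size_t command_count(const struct command *table) {
    assert(table != NULL);
    size_t n = 0;
    while (table[n].name != NULL) {
        n++;
    }
    return n;
}
```

A table declared as `{ { "start", 1 }, { "stop", 2 }, { 0 } }` has three array elements but `command_count` reports two, so the extra member at the end is the whole cost of not passing a length.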

mikewarot an hour ago | parent | prev [-]

[dead]