| ▲ | api 3 days ago |
| An open secret in our field is: the current market-leading OSes and (to some extent) system architectures are antiquated and sub-optimal at their foundation due to backward compatibility requirements. If we started greenfield today and managed to mitigate second system syndrome, we could design something faster, safer, overall simpler, and easier to program. Every decent engineer and CS person knows this. But it’s unlikely to happen, for two reasons. One is that doing it while avoiding second system syndrome takes teams with a huge amount of both expertise and discipline. That includes the discipline to be ruthless about exterminating complexity and saying no. That’s institutionally hard. The second is that there isn’t strong demand. What we have is good enough for what most of the market wants, and right now all the demand for new architecture work is in the GPU/NPU/TPU space for AI. Nobody is interested in messing with the foundation when all the action is there. The CPU in that world is just a job manager for the AI tensor math machine. Quantum computing will be similar: QC will be controlled by conventional machines, making the conventional side the boring part. We may be past the window where rethinking architectural choices is possible. If you told me we still had Unix in 2000 years, I would consider it plausible. |
|
| ▲ | nine_k 3 days ago | parent | next [-] |
| Aerospace, automotive, and medical devices represent strong demand. They sometimes use and run really interesting stuff, because backwards compatibility matters much less there and the cost of software malfunction is very high. Your onboard engine control system can run an OS based on seL4 with software written in Ada SPARK, or something like that. Nobody would bat an eye; nobody needs to run 20-year-old third-party software on it. |
| |
| ▲ | bri3d 3 days ago | parent [-] | | I don’t think these devices represent a demand in the same way at all. Secure boot firmware is another “demand” here that’s not really a demand. All of these things, generally speaking, run unified, trusted applications, so there is no need for dynamic address space protection mechanisms or “OS level” safety. These systems can easily ban dynamic allocation and statically precompute all input sizes, and given enough effort they can mostly be statically proven, thanks to the constrained input and output space. Or, to state the thesis more concisely: I believe that OS- and architecture-level memory safety (object model addressing, CHERI, pointer tagging, etc.) is only necessary when the application space is not constrained. Once the application space is fully constrained, you are better off fixing the application (SPARK is actually a great example in this direction). Mobile phones are where the demand is, and where we see the research and development happening. They’re walled off enough to be able to throw away some backwards compatibility and cross-compatibility, but they still demand the ability to run multiple applications which are not statically analyzed and are untrusted by default. And indeed, this is where we see object-store-style / address-space-unflattening mitigations like pointer tagging come into play. | | |
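A minimal sketch of the "ban dynamic allocation, statically precompute all input sizes" style described above, with invented names and limits rather than anything from a real control unit: no heap, fixed-size buffers, clamped inputs, bounded loops.

    /* Illustrative only: a control task with no heap, fixed-size buffers,
     * and a fully bounded input space. Names and limits are invented.      */
    #include <stdint.h>

    #define SAMPLE_COUNT 16               /* input size fixed at build time */

    static int32_t samples[SAMPLE_COUNT]; /* static storage, no malloc()    */

    /* Inputs are clamped to a known range, so downstream computations can
     * be checked (or statically proven) against that range.                */
    static int32_t clamp_rpm(int32_t rpm)
    {
        if (rpm < 0)     return 0;
        if (rpm > 12000) return 12000;
        return rpm;
    }

    /* Called from a fixed-rate scheduler tick: no allocation, no recursion,
     * bounded loop, so worst-case execution time is easy to reason about.  */
    int32_t control_step(const int32_t *raw, int32_t n)
    {
        int32_t sum = 0;
        if (n > SAMPLE_COUNT)
            n = SAMPLE_COUNT;             /* reject oversized input up front */
        for (int32_t i = 0; i < n; i++) {
            samples[i] = clamp_rpm(raw[i]);
            sum += samples[i];
        }
        return (n > 0) ? sum / n : 0;     /* trivial averaging filter        */
    }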
| ▲ | nine_k 2 days ago | parent [-] | | In a way, I agree. If you can verify the entire system throughout, you can remove certain runtime checks, such as the separation between the OS and tasks. If you have only one program to run, you can use a unikernel. I suspect that specifically car / aircraft / spacecraft computers receive regular updates, and these updates change the smallest part they can. So they have separate programs / services running on top of a more general OS. The principle of defense in depth requires that each component be hardened separately, to minimize the blast radius if a bug slips in. | | |
| ▲ | bri3d 2 days ago | parent [-] | | > I suspect that specifically car / aircraft / spacecraft computers receive regular updates, and these updates change the smallest part they can. In the space I am very familiar with, automotive, this is not true for code changes to most automotive control units; the "application software" code for each control unit is treated as a single entity and built, supplied, and modified at this level of granularity. Infotainment and digital cockpit are the only major exceptions, and even then, only for the "unsafe" part (Linux/QNX/Windows); the "safe" part is usually a single-image, single-application system running on a safety processor alongside. Sometimes personalization/vehicle-specific "data sets" or calibration _data_ (i.e. ECU tunes) can be updated without updating the application software, but the application software for each unit is generally treated as a large unified firmware blob. For example, in every ECU I am aware of, modifying the application software logic (which is usually modeled in something like Simulink/ASCET, not written as code directly) triggers a full code regeneration and recompilation and produces a complete new firmware image with an updated Application Software version. There isn't any notion of shipping a new "turbocharger control" code module, or a new "diagnostics" code module, or whatever, even if they are constructed at that granularity in the code generation suite or run at that task granularity in the RTOS. |
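For a sense of what that single-image "application software" looks like at the code level, here is a hedged sketch of the shape model-based code generators typically emit (identifiers and numbers invented): one Init/Step pair per model, called from a fixed RTOS task slot, with any change to the model regenerating code like this as part of one firmware image.

    /* Illustrative shape of model-generated ECU code; identifiers are made
     * up. A change to the model regenerates files like this wholesale, and
     * they are compiled into a single application-software image.          */
    #include <stdint.h>

    typedef struct {
        int32_t integrator;               /* hypothetical controller state  */
    } TurboCtrl_State;

    static TurboCtrl_State turbo_state;   /* static, lives in the one image */

    void TurboCtrl_Init(void)
    {
        turbo_state.integrator = 0;
    }

    /* Runs in a fixed 10 ms RTOS task slot; inputs and outputs are plain
     * signals. There is no dynamically loaded "module" that could be
     * shipped or swapped on its own.                                       */
    void TurboCtrl_Step10ms(int32_t boost_measured, int32_t boost_request,
                            int32_t *wastegate_duty)
    {
        int32_t error = boost_request - boost_measured;
        turbo_state.integrator += error;
        *wastegate_duty = (error * 3) + (turbo_state.integrator / 100);
    }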
|
|
|
|
| ▲ | matu3ba 3 days ago | parent | prev | next [-] |
| > we could design something faster, safer, overall simpler, and easier to program I remain doubtful about this for general-purpose computing: hardware built for low latency and high throughput is at odds with full security, i.e. the absence of observable side channels. Getting truly optimal latency/throughput means programming the hardware itself, with FPGAs or custom silicon (at high cost), usually on dedicated hardware/software or via things like system-bypass solutions. Simplicity is likewise at odds with generality, see the tradeoff between weak/strong formal systems and strong/weak semantics. Factor those compromises in and you end up with roughly the current state, plus historical mistakes: missing vertical integration of the software stack above kernel space as the TCB, bad APIs due to missing formalization, CHERI with its current shortcomings, and so on. I do expect things to change once a mandatory security processor becomes a common requirement, leading to multi-CPU systems where developers can combine complex and simple CPUs, the simple ones being roughly time-accurate, virtual and/or real.
> The second is that there isn’t strong demand. This is not true for virtualization and security use cases; it just isn’t obvious yet because widespread attacks haven’t materialized, see the side-channel leaks in cloud deployments. Take a look at the growth of hardware security module vendors. |
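A toy software-level illustration of that speed-versus-side-channel tension (the hardware version plays out in caches and speculation rather than in code like this, so treat it purely as an analogy): the fast comparison leaks, via its running time, how many leading bytes matched, while the constant-time version pays for secrecy by always doing the full amount of work.

    #include <stddef.h>
    #include <stdint.h>

    /* Fast: returns at the first mismatch, so the running time reveals how
     * many leading bytes matched -- a classic timing side channel.         */
    int fast_compare(const uint8_t *a, const uint8_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (a[i] != b[i])
                return 0;
        return 1;
    }

    /* Constant-time: touches every byte and has no data-dependent branch,
     * trading throughput for the absence of that timing signal.            */
    int ct_compare(const uint8_t *a, const uint8_t *b, size_t n)
    {
        uint8_t diff = 0;
        for (size_t i = 0; i < n; i++)
            diff |= a[i] ^ b[i];
        return diff == 0;
    }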
|
| ▲ | themafia 3 days ago | parent | prev | next [-] |
| > That includes the discipline to be ruthless about exterminating complexity and saying no. That’s institutionally hard. You need to make a product that outperforms your competitors. If their chip is faster, your work will be ignored regardless of how pure you managed to keep it. > We may be past the window where rethinking architectural choices is possible. I think your presumption that our architectures are extremely sub-optimal is wrong. They're exceptionally optimized. Just spend some time thinking about branch prediction and register renaming. It's a steep cliff for any new entrant. You not only have to produce something novel and worthwhile, but you also have to incorporate decades of deep knowledge into the core of your product, and you have to do all of that without introducing any hardware bugs. You stand on the shoulders of giants and complain about the style of their footwear. |
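A throwaway benchmark sketch of how much that machinery is doing: the same branchy loop over the same values runs far faster once the data is sorted, purely because the branch predictor starts guessing right. (Numbers vary by CPU, and at higher optimization levels the compiler may replace the branch with a conditional move and flatten the difference.)

    /* Rough sketch: the data-dependent branch below mispredicts roughly
     * half the time on random input and almost never on sorted input.      */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)

    static int cmp(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    static long long sum_big(const int *v)
    {
        long long s = 0;
        for (int i = 0; i < N; i++)
            if (v[i] >= 128)              /* data-dependent branch */
                s += v[i];
        return s;
    }

    int main(void)
    {
        static int v[N];
        for (int i = 0; i < N; i++)
            v[i] = rand() % 256;

        clock_t t0 = clock();
        long long a = sum_big(v);         /* unpredictable branch  */
        clock_t t1 = clock();

        qsort(v, N, sizeof v[0], cmp);

        clock_t t2 = clock();
        long long b = sum_big(v);         /* predictable branch    */
        clock_t t3 = clock();

        printf("random: %lld (%ld ticks), sorted: %lld (%ld ticks)\n",
               a, (long)(t1 - t0), b, (long)(t3 - t2));
        return 0;
    }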
| |
| ▲ | api 3 days ago | parent [-] | | That’s another reason current designs are probably locked in. It’s called being stuck at a local maximum. I’m not saying what we have is bad, just that the benefit of hindsight reveals some things. Computing is tougher than other areas of engineering when it comes to greenfielding due to the extreme interlocking lock-in effects that emerge from things like instruction set and API compatibility. It’s easier to greenfield, say, an engine or an aircraft design, since doing so does not break compatibility with everything. If aviation were like computing, coffee mugs from propeller aircraft would fail to hold coffee (or even be mugs) on a jet aircraft. | | |
| ▲ | mike_hearn 3 days ago | parent [-] | | Aviation does have a lot of backwards compatibility problems. It's one reason Boeing kept revving the 737 to make the Max version. The constraints come from things like training, certification, runway length, fuel mixes, radio protocols, regulations... | | |
| ▲ | api 8 hours ago | parent [-] | | I still think those are less "hard" and non-negotiable than API and architecture lock-in in computing. The physical world in general is more analog, more open to gradual and "soft" change. Digital things work or do not work, so any change to the foundation just hard-breaks the universe. |
|
|
|
|
| ▲ | foobarian 3 days ago | parent | prev | next [-] |
| > something faster How true is this, really? When does the OS kernel take up more than a percent or so of a machine's resources nowadays? I think the problem is that there is so little juice there to squeeze that it's not worth the huge effort. |
| |
| ▲ | api 3 days ago | parent | next [-] | | The problem isn’t direct overhead. The problem is shit APIs: blocking I/O that we constantly have to work around via heroic extensions like io_uring, an inefficient threading model that forces every app to roll its own scheduler (async etc.), a lack of OS-level support for advanced memory management that would be faster than doing it in user space, and so on. | | | |
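A rough sketch of that contrast using liburing (error handling omitted; illustrative rather than production code): the classic POSIX path parks the calling thread until the data shows up, while the io_uring path queues the read, lets the thread keep working, and collects the completion later.

    /* Sketch only: blocking read() vs the same read queued via liburing.
     * Error handling is omitted; link with -luring.                        */
    #include <fcntl.h>
    #include <unistd.h>
    #include <liburing.h>

    #define BUF_SZ 4096

    void blocking_way(const char *path, char *buf)
    {
        int fd = open(path, O_RDONLY);
        read(fd, buf, BUF_SZ);        /* thread sits here until data arrives */
        close(fd);
    }

    void uring_way(const char *path, char *buf)
    {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);

        int fd = open(path, O_RDONLY);
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BUF_SZ, 0);
        io_uring_submit(&ring);       /* returns immediately; kernel does the I/O */

        /* ... the thread is free to do other work here ... */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);   /* reap the completion when ready */
        io_uring_cqe_seen(&ring, cqe);

        close(fd);
        io_uring_queue_exit(&ring);
    }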
| ▲ | sph 3 days ago | parent | prev [-] | | Look behind the curtains, and the space for improvement over the UNIX model is enormous. Our only saving grace is that computers have gotten ridiculously fast. |
|
|
| ▲ | int_19h 3 days ago | parent | prev [-] |
| The thing about AI though is that it has indirect effects down the line. E.g. as prevalence of AI-generated code increases, I would argue that we'll need more guardrails both in development (to ground the model) and at runtime (to ensure that when it still fails, the outcome is not catastrophic). |