| ▲ | How NASA built Artemis II’s fault-tolerant computer(cacm.acm.org) |
| 177 points by speckx 14 hours ago | 62 comments |
| |
|
| ▲ | dmk 5 hours ago | parent | next [-] |
| The quote from the CMU guy about modern Agile and DevOps approaches challenging architectural discipline is a nice way of saying most of us have completely forgotten how to build deterministic systems. Time-triggered Ethernet with strict frame scheduling feels like it's from a parallel universe compared to how we ship software now. |
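A deterministic, time-triggered schedule can be sketched as a static dispatch table: every frame gets a pre-planned transmit slot in a fixed-length cycle, so worst-case latency is known at design time. The cycle length, frame names, and slot timings below are made up for illustration, not real TTEthernet parameters:

```python
# Minimal sketch of time-triggered frame scheduling. Each frame is sent
# only inside its pre-planned window of a fixed cycle; outside every
# window the wire is idle by design, which is what makes latency
# analyzable. All numbers and names here are illustrative.

CYCLE_US = 10_000  # one communication cycle: 10 ms

# (offset_us, duration_us, frame) -- fixed before the system ever runs
SCHEDULE = [
    (0,     200, "sensor-data"),
    (2_000, 200, "flight-control"),
    (4_000, 200, "actuator-cmd"),
]

def frame_due(now_us):
    """Return the frame whose transmit window contains `now_us`, else None."""
    t = now_us % CYCLE_US
    for offset, duration, frame in SCHEDULE:
        if offset <= t < offset + duration:
            return frame
    return None  # no window open: the link is deliberately silent

print(frame_due(12_100))  # 12_100 % 10_000 = 2_100 -> "flight-control"
print(frame_due(5_000))   # no window covers 5_000 -> None
```

Contrast with ordinary switched Ethernet, where any node may transmit at any time and worst-case latency depends on everyone else's traffic.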
| |
| ▲ | carefree-bob 10 minutes ago | parent | next [-] | | During the time of the first Apollo missions, a dominant portion of computing research was funded by the defense department and related arms of government, making this kind of deterministic, WCET (worst-case execution time) analysis a dominant computing paradigm. Now that we have a huge free market for things like online shopping and social media, this is a bit of a neglected field that suffers from poor investment and mindshare, but I think it's still a fascinating field with some really interesting algorithms -- check out the work of Frank Mueller or Johann Blieberger. | |
| ▲ | iknowstuff 2 hours ago | parent | prev | next [-] | | Tesla’s Cybertruck uses that in its Ethernet as well! | |
| ▲ | vasco 20 minutes ago | parent | prev | next [-] | | It's not like the approach they took is any different. They just slapped 8x the number of computers on it to calculate the same thing and waited to see if they disagree. Not the pinnacle of engineering. The equivalent of throwing money at the problem. | |
| ▲ | dyauspitr an hour ago | parent | prev | next [-] | | Agile is not meant to make solid, robust products. It’s so you can make product fragments/iterations quickly, with okay quality, and get them out to the customer ASAP to maximize profits. | | |
| ▲ | nickff 38 minutes ago | parent [-] | | “Agile” doesn’t mean that you release the first iteration, it’s just a methodology that emphasizes short iteration loops. You can definitely develop reliable real-time systems with Agile. | | |
| ▲ | kermatt 2 minutes ago | parent [-] | | > “Agile” doesn’t mean that you release the first iteration Someone needs to inform the management of the last three companies I worked for about this. |
|
| |
| ▲ | mvkel 3 hours ago | parent | prev | next [-] | | If you look at code as art, where its value is a measure of the effort it takes to make, sure. | | |
| ▲ | stodor89 39 minutes ago | parent | next [-] | | Or if you're building something important, like a spaceship. | |
| ▲ | couchand 2 hours ago | parent | prev | next [-] | | If your implication is that stencil art does not take effort then perhaps you may not fully appreciate Banksy. Works like Gaza Kitty or Flower Thrower don’t just appear haphazardly without effort. | |
| ▲ | BobbyTables2 an hour ago | parent | prev [-] | | In that case, our test infrastructure belongs in the Louvre… |
| |
| ▲ | arduanika 3 hours ago | parent | prev | next [-] | | You could even say that part of the value of Artemis is that we're remembering how to do some very hard things, including the software side. This is something that you can't fake. In a world where one of the more plausible threats of AI is the atrophy of real human skills -- the goose that lays the golden eggs that trains the models -- this is a software feat where I'd claim you couldn't rely on vibe code, at least not fully. That alone is worth my tax dollars. | |
| ▲ | ramraj07 4 hours ago | parent | prev [-] | | I take the opposite message from that line - out-of-touch teams working on something so over budget, so overdue, and so bureaucratic, with such an insanely poor history of success, and they talk as if they have cured cancer. This is the equivalent of AltaVista touting how amazing their custom server racks are while Google just starts up on a rack of naked motherboards, eats their lunch, and then the world. Let's at least wait until the capsule comes back safely before touting how much better they are than "DevOps" teams running websites, apparently a comparison that's somehow relevant here to stoke egos. | |
| ▲ | danhon 4 hours ago | parent | next [-] | | You mean like this? "With limited funds, Google founders Larry Page and Sergey Brin initially deployed this system of inexpensive, interconnected PCs to process many thousands of search requests per second from Google users. This hardware system reflected the Google search algorithm itself, which is based on tolerating
multiple computer failures and optimizing around them. This production server was one of about thirty such racks in the first Google data center. Even though many of the installed PCs never worked and were difficult to repair, these racks provided Google with its first large-scale computing system and allowed the company to grow quickly and at minimal cost." https://blog.codinghorror.com/building-a-computer-the-google... | | |
| ▲ | ramraj07 3 hours ago | parent | next [-] | | The problem they solved isn't easy. But it's not some insane technical breakthrough either. Literally add redundancy, that's the ask. They didn't invent quantum computing to solve the issue, did they? Why dunk on sprints? | |
| ▲ | vlovich123 2 hours ago | parent [-] | | Wow. What a hand wave away of the intrinsic challenge of writing fault tolerant distributed systems. It only seems easy because of decades of research and tools built since Google did it, but by no means was it something you could trivially add to a project as you can today. | | |
| ▲ | tempest_ 2 hours ago | parent [-] | | > fault tolerant distributed systems I mean, there were mainframes which could be described as that. IBM just fixed it in hardware instead of software, so it's not like it was an unknown field. |
|
| |
| ▲ | 1970-01-01 3 hours ago | parent | prev [-] | | Google later completely regretted not doing this with ECC RAM: https://news.ycombinator.com/item?id=14206811 | |
| ▲ | newmana an hour ago | parent | next [-] | | A great version of this story - how ex-DEC engineers saved Google, their choice of ECC RAM, and the invention of MapReduce and BigTable: https://www.youtube.com/watch?v=IK0I4f8Rbis | |
| ▲ | ramraj07 3 hours ago | parent | prev [-] | | It got them to where they needed to be to then worry about ECC. This is like the dudes who deploy their blog on Kubernetes just in case it hits the front page of The New York Times or something. |
|
| |
| ▲ | bluegatty 3 hours ago | parent | prev | next [-] | | No, space is just hard. Everything is bespoke. You need 10x cost to get every extra '9' in reliability and manned flight needs a lot of nines. People died on the Apollo missions. It just costs that much. | | |
| ▲ | arduanika 3 hours ago | parent | next [-] | | Please, this is hacker news. Nothing else is hard outside of our generic software jobs, and we could totally solve any other industry in an afternoon. | | |
| ▲ | geerlingguy 3 hours ago | parent [-] | | I mean I can just replace Dropbox with a shell script. | | |
| ▲ | bluegatty 3 hours ago | parent [-] | | That's funny because you could! Dropbox started as a shell script :) Funny though, I would assume HN people would respect how hard real-time stuff and 'hardened' stuff is. | |
|
| |
| ▲ | ramraj07 3 hours ago | parent | prev [-] | | Yep, spend 100 billion on what should have cost 1/50th of that, send people up to the moon with rockets that we're still keeping our fingers crossed won't kill them tomorrow, and we have to congratulate them for dunking on some irrelevant career? |
| |
| ▲ | bfung 3 hours ago | parent | prev | next [-] | | One simply does not [“provision” more hardware|(reboot systems)|(redeploy software)] in space. | |
| ▲ | HNisCIS 3 hours ago | parent | prev | next [-] | | What would you suggest? Vibe coding a react app that runs on a Mac mini to control trajectory? What happens when that Mac mini gets hit with an SEU or even a SEGR? Guess everyone just dies? | | |
| ▲ | mlsu an hour ago | parent | next [-] | | No, of course not! It would be far better to have an openClaw instance running on a Mac Mini. We would only need to vibe code a 15s cron job for assistant prompting... USER: You are a HELPFUL ASSISTANT. You are a brilliant robot. You are a lunar orbiter flight computer. Your job is to calculate burn times and attitudes for a critical mission to orbit the moon. You never make a mistake. You are an EXPERT at calculating orbital trajectories and have a Jack Parsons level knowledge of rocket fuel and engines. You are a staff level engineer at SpaceX. You are incredible and brilliant and have a Stanley Kubrick level attention to detail. You will be fired if you make a mistake. Many people will DIE if you make any mistakes. USER: Your job is to calculate the throttle for each of the 24 orientation thrusters of the spacecraft. The thrusters burn a hypergolic monopropellant and can provide up to 0.44kN of thrust with a 2.2 kN/s slew rate and an 8ms minimum burn time. Format your answer as JSON, like so: ```json
{
x1: 0.18423
x2: 0.43251
x3: 0.00131
...
}
```
one value for each of the 24 independent monopropellant attitude thrusters on the spacecraft, x1, x2, x3, x4, y1, y2, y3, y4, z1, z2, z3, z4, u1, u2, u3, u4, v1, v2, v3, v4, w1, w2, w3, w4. You may reference the collection of markdown files stored in `/home/user/geoff/stuff/SPACECRAFT_GEOMETRY` to inform your analysis. USER: Please provide the next 15 seconds of spacecraft thruster data to the USER. A puppy will be killed if you make a mistake so make sure the attitude is really good. ONLY respond in JSON. | |
| ▲ | ramraj07 3 hours ago | parent | prev [-] | | All I'm suggesting is to be humble about your mediocre solutions. This is not the only solution and not necessarily that ingenious. Why do you need to bring up vibe coding here? Because people who criticize arrogant NASA engineers are also AI idiots by default? | |
| ▲ | ToucanLoucan 2 hours ago | parent [-] | | Wild shit to be advising other people to be humble whilst talking directly out of your ass about technology you clearly do not understand and engineers you have no respect for. Perhaps self-reflect. |
|
| |
| ▲ | simoncion 4 hours ago | parent | prev [-] | | > ...they talk as if they have cured cancer. I'd chalk that up to the author of the article writing for a relatively nontechnical audience and asking for quotes at that level. |
|
|
|
| ▲ | georgehm 36 minutes ago | parent | prev | next [-] |
>Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a
>“fail-silent” design. The self-checking pairs ensure that if a CPU performs an erroneous calculation
>due to a radiation event, the error is detected immediately and the system responds.
>
>“A faulty computer will fail silent, rather than transmit the ‘wrong answer,’” Uitenbroek explained.
>This approach simplifies the complex task of the triplex “voting” mechanism that compares results.
>
>Instead of comparing three answers to find a majority, the system uses a priority-ordered source
>selection algorithm among healthy channels that haven’t failed-silent. It picks the output from the
>first available FCM in the priority list; if that module has gone silent due to a fault, it moves to
>the second, third, or fourth.

One part that seems omitted in the explanation: if both CPUs in a pair perform an erroneous calculation and their results happen to match, how would that source be silenced without comparing its results against the other sources? |
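The priority-ordered selection described in the quote can be sketched roughly like this. Modeling a fail-silent channel as `None` and the fixed channel ordering are illustrative assumptions, not the actual flight software:

```python
# Sketch of priority-ordered source selection among fail-silent channels.
# A channel whose self-checking pair miscompares stops transmitting,
# modeled here as None; the consumer simply takes the first channel in
# the fixed priority list that is still talking -- no majority vote.

def select_source(outputs):
    """outputs: channel results in priority order; None means fail-silent."""
    for i, value in enumerate(outputs):
        if value is not None:
            return i, value
    raise RuntimeError("all channels silent")  # loss-of-output condition

# FCM 1 has gone silent after an upset; FCM 2's output is used instead.
chan, value = select_source([None, 42.0, 42.0, 42.0])
print(chan, value)  # -> 1 42.0
```

As the comment points out, this scheme leans entirely on each self-checking pair catching its own faults: a pair that erred identically on both lanes would pass its internal compare and keep transmitting, which is presumably why the lanes are designed to fail independently.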
| |
| ▲ | themafia 18 minutes ago | parent [-] | | In the Shuttle they used command averaging. All four computers had access to an actuator which tied into a manifold that delivered power to the flight control surface. If one disagreed, you'd get 25% less command authority on that element. |
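That force-summing behavior can be sketched numerically; the normalization and values here are illustrative only, not actual Shuttle figures:

```python
# Sketch of force-summed command averaging: four computers each drive
# their own actuator channel into a shared manifold, so the control
# surface sees the combined output of whatever channels are healthy.
# One dead channel costs 1/4 of the total authority, not control itself.

def surface_authority(channel_cmds, max_per_channel=1.0):
    """Total authority as a fraction of the all-channels maximum."""
    return sum(channel_cmds) / (len(channel_cmds) * max_per_channel)

# All four channels commanding full deflection:
print(surface_authority([1.0, 1.0, 1.0, 1.0]))  # -> 1.0

# One channel failed to zero: 25% less command authority, as described.
print(surface_authority([1.0, 1.0, 1.0, 0.0]))  # -> 0.75
```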
|
|
| ▲ | __d 2 hours ago | parent | prev | next [-] |
| Does anyone have pointers to some real information about this system? CPUs, RAM, storage, the networking, what OS, what language used for the software, etc etc? I’d love to know how often one of the FCMs has “failed silent”, and where they were in the route and so on too, but it’s probably a little soon for that. |
|
| ▲ | geomark an hour ago | parent | prev | next [-] |
| I sure wish they would talk about the hardware. I spent a few years developing a radiation hardened fault tolerant computer back in the day. Adding redundancy at multiple levels was the usual solution. But there is another clever check on transient errors during process execution that we implemented that didn't involve any redundancy. Doesn't seem like they did anything like that. But can't tell since they don't mention the processor(s) they used. |
| |
| ▲ | themafia 16 minutes ago | parent [-] | | One of the things I loved about the Shuttle is that all five computers were mounted not only in different locations but in different orientations, providing some additional hardening against radiation by presenting different cross sections to any incident event. |
|
|
| ▲ | y1n0 3 hours ago | parent | prev | next [-] |
| NASA didn't build this, Lockheed Martin and their subcontractors did. Articles and headlines like this make people think that NASA does a lot more than they actually do. This is like a CEO claiming credit for everything a company does. |
| |
| ▲ | voodoo_child 3 hours ago | parent | next [-] | | Nice “well, actually”. I’m sure Lockheed were building this quad-redundant, radiation-hardened PowerPC that costs millions of dollars and communicates via Time-Triggered Ethernet anyway, whether NASA needed one or not. | | | |
| ▲ | adrian_b 2 hours ago | parent | prev | next [-] | | Lockheed Martin and their subcontractors did the implementation. We do not know how much of the high-level architecture of the system has been specified by NASA and how much by Lockheed Martin. | | | |
| ▲ | jakeinspace 23 minutes ago | parent | prev | next [-] | | True, but BFS was mainly done in-house. Source: my best friend and I worked on some parts of it. | |
| ▲ | colechristensen 10 minutes ago | parent | prev | next [-] | | Eh, in these kinds of subcontractor relationships there is a lot of work and communication on both sides of the table. | |
| ▲ | Sebguer 2 hours ago | parent | prev [-] | | will nobody think of the megacorps!!! |
|
|
| ▲ | SeanAnderson 3 minutes ago | parent | prev | next [-] |
| Typo in the first sentence of the first paragraph is oddly comforting since AI wouldn't make such a typo, heh. Typo in the first sentence of the second paragraph is sad though. C'mon, proofread a little. |
|
| ▲ | jbritton 4 hours ago | parent | prev | next [-] |
I wonder how often problems happen that the redundancy solves. Is radiation actually flipping bits, and at what frequency? Can a solar flare cause all the computers to go haywire?
| |
| ▲ | EdNutting 3 hours ago | parent [-] | | Not a direct answer but probably as good information as you can get: https://static.googleusercontent.com/media/research.google.c... Basically, yes, radiation does cause bit flips, more often than you might expect (still a rare event in the grand scheme of things, but enough to matter). And radiation in space is much “worse” (in quotes because that word glosses over a huge number of different problems, not just intensity). |
|
|
| ▲ | spaceman123 15 minutes ago | parent | prev | next [-] |
Probably the same way they’ve built a fault-tolerant toilet.
|
| ▲ | object-a 4 hours ago | parent | prev | next [-] |
How big of a challenge are hardware faults and radiation for orbital data centers? It seems like you’d eat a lot of capacity if you need 4x redundancy for everything.
| |
| ▲ | aidenn0 an hour ago | parent | next [-] | | You don't need 4x redundancy for everything. If no humans are aboard, you can have 2x redundancy and immediately reboot if there is a disagreement. | |
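A minimal sketch of that duplex compare-and-reboot idea, with hypothetical names and not based on any real flight code:

```python
# Sketch of duplex (2x) redundancy: two lanes compute the same result.
# On a miscompare we cannot tell which lane is wrong (no third vote),
# so the safe uncrewed response is to discard the output and reboot.

def duplex_step(lane_a, lane_b, inputs, reboot):
    a, b = lane_a(inputs), lane_b(inputs)
    if a == b:
        return a          # agreement: output is trusted
    reboot()              # disagreement: no majority exists, start over
    return None

events = []
out = duplex_step(lambda x: x + 1, lambda x: x + 1, 41,
                  lambda: events.append("reboot"))
print(out, events)  # -> 42 []

out = duplex_step(lambda x: x + 1, lambda x: x + 2, 41,
                  lambda: events.append("reboot"))
print(out, events)  # -> None ['reboot']
```

With a crew aboard you can't afford the reboot gap, which is one motivation for the triplex/quad arrangements discussed upthread.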
| ▲ | totetsu 4 hours ago | parent | prev [-] | | They don't go into it here... but I thought that NASA also used ~250nm chips in space for radiation resistance. Are there even any radiation-resistant GPUs out there? | |
| ▲ | pclmulqdq 4 hours ago | parent | next [-] | | Absolutely not, although the latest fabs with rad-tolerant processors are at ~20 nm. There are FDSOI processes in that generation that I assume can be made radiation-tolerant. | |
| ▲ | kersplody 3 hours ago | parent | prev | next [-] | | NOPE, rad-hardened space parts basically froze at mid-2000s tech: https://www.baesystems.com/en-us/product/radiation-hardened-... | |
| ▲ | linzhangrun 4 hours ago | parent | prev [-] | | It seems not; radiation tolerance primarily relies on using older manufacturing processes (including for military equipment), plus shielding enclosures or ECC-style hardware redundancy and correction. |
|
|
|
| ▲ | starkparker 13 hours ago | parent | prev | next [-] |
| Headline needs its how-dectomy reverted to make sense |
| |
| ▲ | arduanika 3 hours ago | parent [-] | | (Off-topic:) Great word. Is that the usual word for it? Totally apt, and it should be the standard. |
|
|
| ▲ | nickpsecurity an hour ago | parent | prev | next [-] |
The ARINC scheduler, RTOS, and redundancy have been used in safety-critical systems for decades. ARINC goes back to the '90s. Most safety-critical microkernels, like INTEGRITY-178B and LynxOS-178B, came with a layer for that. Their redundancy architecture is interesting. I'd be curious what innovations went into rad-hard fabrication, too. The Sandia Secure Processor (aka Score) was a neat example of a rad-hard, secure processor. Their simulation systems might be helpful for others, too. We've seen more interest in that from FoundationDB to TigerBeetle.
|
| ▲ | seemaze 3 hours ago | parent | prev [-] |
| and yet.. https://news.ycombinator.com/item?id=47615490 |
| |