lugao 3 hours ago

Only people who never interacted with data center reliability think it's doable to maintain servers with no human intervention.

keepamovin 2 hours ago | parent | next [-]

Whoa there, space-faring sysadmin. You really want that off-world contract tho?

lugao 2 hours ago | parent [-]

Haha, hard pass on the job. I prefer my oxygen at 1 atm.

I'm not a data center technician myself, but I have deep respect for those folks and the complexity they manage. It's quite surprising the market still buys Musk's claims day after day.

andrewinardeer 28 minutes ago | parent | prev | next [-]

This guy invented reusable rockets that land themselves. I'm sure xAI is not just one guy. Plenty of talented people work there.

jmyeet 2 hours ago | parent | prev | next [-]

There is a class of people who seem smart until they start talking about a subject you know about. Hank Green is a great example of this.

For many on HN, Elon buying Twitter was a wake-up call. He suddenly started talking about software, servers, data centers and reliability, and a ton of people with experience in those things were like "oh... this guy's an idiot".

Data centers in space are exactly like this. Your comment (correctly) alludes to this.

Companies like Google, Meta, Amazon and Microsoft all have so many servers that parts are failing constantly. At that scale, failures are frequent enough that something like a hard drive is expected to fail while a single job is running.

So all of these companies build systems to detect failures, stop scheduling work on the affected node until it's fixed, alert someone to what the problem is, and bring the node back online once the problem is addressed. Everything will fail: hard drives, RAM, CPUs, GPUs, SSDs, power supplies, fans, NICs, cables, etc.
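
The detect, drain, alert, repair, restore loop described above can be sketched as a small state machine. This is a minimal illustration, not any company's actual system; the `Node` class, the state names and the `alerts` list are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    DRAINED = "drained"      # no new work is scheduled here
    REPAIRED = "repaired"    # fixed, awaiting return to service


@dataclass
class Node:
    name: str
    state: State = State.HEALTHY
    alerts: list = field(default_factory=list)

    def report_failure(self, component: str) -> None:
        # Detect: take the node out of scheduling and tell a human what broke.
        self.state = State.DRAINED
        self.alerts.append(f"{self.name}: {component} failed")

    def mark_repaired(self) -> None:
        # A technician has swapped the part; only drained nodes can be repaired.
        if self.state is State.DRAINED:
            self.state = State.REPAIRED

    def return_to_service(self) -> None:
        # Bring the node back online once the repair is confirmed.
        if self.state is State.REPAIRED:
            self.state = State.HEALTHY

    def schedulable(self) -> bool:
        return self.state is State.HEALTHY


node = Node("rack12-node07")   # hypothetical node name
node.report_failure("NIC")
print(node.schedulable(), node.alerts)
node.mark_repaired()
node.return_to_service()
print(node.schedulable())
```

The step that matters for the orbital case is `mark_repaired`: with nobody there to perform it, a drained node stays drained.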

So all data centers will have a number of technicians who are constantly fixing problems. IIRC Google's ratio tended to be about 10,000 servers per technician. Good technicians could handle higher ratios. When a node goes offline it's not clear why. Techs would take known good parts, basically replace all of them, and figure out what the problem was later, disposing of any bad parts and putting tested good parts back into the pool of known good parts for a later incident.

Data centers in space lose all of this ability. So if you have a large number of orbital servers, they're going to be failing constantly with no ability to fix them. You can really only deorbit them and replace them and that gets real expensive.

Electronics and chips on satellites also aren't consumer grade. They're not even enterprise grade. They're orders of magnitude more reliable than that, because cosmic rays and the solar wind force them to handle error correction that terrestrial components don't need. That's why they have a fraction of the power of something you can buy from Amazon but cost 1000x as much: they need to last years without failing, something no home computer or data center server has to deal with.

Put it this way, a hardened satellite or probe CPU is like paying $1 million for a Raspberry Pi.

And anybody who has dealt with data centers knows this.

fblp 2 hours ago | parent | next [-]

Great comment on hardware and maintenance costs. In comparison, Elon wrote "My estimate is that within 2 to 3 years, the lowest cost way to generate AI compute will be in space." It's a pity this reads like the entire acquisition of xAI is based on "Elon's napkin math" (maybe he checked it with Grok).

breakyerself an hour ago | parent | next [-]

He's bailing out one of his failing ventures with one of his so far successful ones. The BS napkin math isn't the reason he's doing it. It's the excuse for doing it.

titzer an hour ago | parent | prev [-]

Can you provide a link for that quote? Because that quote is absolute stupidity.

spenczar5 an hour ago | parent [-]

It's in the article that you're commenting on, https://www.spacex.com/updates#xai-joins-spacex.

rkagerer an hour ago | parent | prev | next [-]

Thanks for putting words to that; the paragraph which most stuck out to me as outlandish is (emphasis mine):

    The basic math is that launching a million tons per year of satellites generating 100 kW of compute power per ton would add 100 gigawatts of AI compute capacity annually, *with no ongoing operational or maintenance needs*.
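
As a rough check, the quoted arithmetic does work out as stated; the maintenance clause is the part that deserves scrutiny. The sketch below reproduces the 100 GW figure and then applies an annual failure rate to each year's launches. The failure rates are made-up illustrative inputs, not numbers from the article, and `fleet_capacity_gw` is a hypothetical helper.

```python
TONS_PER_YEAR = 1_000_000   # from the quoted paragraph
KW_PER_TON = 100            # from the quoted paragraph
KW_PER_GW = 1_000_000

added_gw_per_year = TONS_PER_YEAR * KW_PER_TON / KW_PER_GW


def fleet_capacity_gw(years: int, annual_failure_rate: float) -> float:
    """Working capacity after `years` of launches with no repairs.

    Assumes (hypothetically) that a fixed fraction of each cohort's
    surviving hardware fails permanently every year.
    """
    survival = 1.0 - annual_failure_rate
    # The cohort launched `age` years ago has survival**age of its capacity left.
    return sum(added_gw_per_year * survival**age for age in range(years))


print(added_gw_per_year)
for rate in (0.0, 0.05, 0.20):   # illustrative rates only
    print(rate, round(fleet_capacity_gw(5, rate), 1))
```

Only a failure rate of exactly zero matches the "no maintenance needs" framing; any other rate means part of each year's launch mass just replaces dead capacity.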

I'm deeply disillusioned to arrive at this conclusion but the Occam's Razor in me feels this whole acquisition is more likely a play to increase the perceptual value of SpaceX before a planned IPO.

everfrustrated 2 hours ago | parent | prev [-]

Might be why he's also investing in building their own fabs - if he can keep the silicon costs low then that flips a lot of the math here.

elihu 3 hours ago | parent | prev | next [-]

Do they need to be maintained? If one compute node breaks, you just turn it off and don't worry about it. You just assume you'll have some amount of unrecoverable errors and build that into the cost/benefit analysis. As long as failures are in line with projections, it's baked in as a cost of doing business.

The idea itself may be sound, though that's unrelated to the question of whether Elon Musk can be relied on to be honest with investors about what their real failure projections and cost estimates are and whether it actually makes financial sense to do this now or in the near future.

lugao 3 hours ago | parent [-]

AI clusters are heavily interconnected, so the blast radius of a single component failure is much larger than when running independent nodes. Just switching off failed nodes would fragment the cluster to the point where you can't use it meaningfully.

I can't go into detail about real numbers, but it's not doable with current hardware, by a large margin.
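
To illustrate the blast-radius point with made-up numbers: if a job needs every node in a tightly coupled group to be healthy, the chance that a group is still usable shrinks quickly as the group grows. The 2% dead-node fraction and the group sizes below are hypothetical, and the model assumes independent failures with no spares or rerouting.

```python
def usable_group_fraction(dead_node_fraction: float, group_size: int) -> float:
    """Probability that every node in a group of `group_size` is alive,
    assuming independent failures (an illustrative simplification)."""
    return (1.0 - dead_node_fraction) ** group_size


for size in (1, 8, 64, 512):   # hypothetical group sizes
    print(size, f"{usable_group_fraction(0.02, size):.4f}")
```
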

angled 3 hours ago | parent | prev [-]

But … but what if we had solar-powered AI SREs to fix the solar-powered AI satellites… /in space/?

lugao 3 hours ago | parent [-]

Maintaining modern accelerators requires frequent hands-on intervention -- replacing hardware, reseating chips, and checking cable integrity.

Because these platforms are experimental and rapidly evolving, they aren't 'space-ready.' Space-grade hardware must be 'rad-hardened' and proven over years of testing.

By the time an accelerator is reliable enough for orbit, it’s several generations obsolete, making it nearly impossible to compete or turn a profit against ground-based clusters.

trothamel 2 hours ago | parent [-]

On the other hand, Tesla vehicles have similar hardware built into them, and don't require such hands-on intervention. (And that's the hardware that will be going up.)

lugao 2 hours ago | parent | next [-]

Car-grade inference hardware is fundamentally different from data center-grade inference hardware, let alone the specialized, interconnected hardware used for training (like NVLink or complex optical fabrics). These are different beasts in terms of power density, thermal stress, and signaling sensitivity.

Beyond that, we don't actually know the failure rate of the Tesla fleet. I’ve never had a personal computer fail from use in my life, but that’s just anecdotal and holds no weight against the law of large numbers. When you operate at the scale of a massive cluster, "one-in-a-million" failures become a daily statistical certainty.
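
A quick sketch of that law-of-large-numbers point, using hypothetical figures: a per-component daily failure probability of one in a million, applied to fleets of different sizes. Failures are assumed independent, and both helper names are made up for illustration.

```python
def expected_failures(p: float, n: int) -> float:
    """Expected number of failures per day across n components."""
    return p * n


def prob_at_least_one(p: float, n: int) -> float:
    """Chance that at least one of n components fails on a given day."""
    return 1.0 - (1.0 - p) ** n


P_DAILY = 1e-6   # hypothetical "one-in-a-million" daily failure probability
for components in (1, 1_000_000, 10_000_000):
    print(
        components,
        expected_failures(P_DAILY, components),
        f"{prob_at_least_one(P_DAILY, components):.6f}",
    )
```
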

Claiming that cars require zero intervention because you don't personally see them failing on the side of the road actually proves my original point: people who haven't managed data center reliability underestimate the sheer volume of "rare" failures that occur at scale.

trothamel 2 hours ago | parent [-]

https://x.com/elonmusk/status/2017792776415682639

For what it's worth, this project plans to use Tesla AI5/AI6 hardware for the first launches.

jonah 2 hours ago | parent | prev [-]

Besides the sibling comment's points, cars aren't exposed to the radiation of space...