ambrosio 3 hours ago

Memory and disk corruption definitely were a problem in the early days. See https://news.ycombinator.com/item?id=14206811 for example. I also recall an anecdote about how the search index basically became unbuildable beyond a certain size due to the probability of corruption, which was what inspired RecordIO. I think ECC RAM and transport checksums largely fixed those problems.

It's pretty challenging for software to defend against single-event upsets (SEUs) corrupting memory, especially when retrofitting an existing design like Linux. While operating Forge, we saw plenty of machines miscompute stuff, and we definitely worried about garbage getting into our caches. But my recollection is that the main cause was individual bad CPUs, not corrupted memory. We would reuse files in tmpfs for days without reverifying their checksums, and while we considered adding a scrubber, we never saw evidence that it would have caught much.
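
For anyone unfamiliar with the term, a scrubber here just means a background pass that rehashes cached files and compares them against the checksums recorded when they were written. A minimal sketch of the idea, purely illustrative and not Forge code: `sha256_of`, `scrub`, and the `expected` manifest (relative path to hex digest) are hypothetical names, and SHA-256 is an assumed choice of hash.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scrub(root: Path, expected: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or no longer match.

    `expected` is a hypothetical manifest mapping each relative path to the
    hex digest recorded when the file was first written.
    """
    bad = []
    for rel, want in sorted(expected.items()):
        path = root / rel
        if not path.is_file() or sha256_of(path) != want:
            bad.append(rel)
    return bad
```

Run periodically over the cache directory, anything `scrub` returns would be evicted and refetched. Note that this only catches corruption at rest; it does nothing about a bad CPU that computed the wrong bytes (and the matching checksum) in the first place.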

Maybe the CPU failures were actually due to radiation damage, but as they tended to be fairly sticky rather than transient, my guess is something more like electromigration.