dang 5 days ago
Just don't use genetically identical hardware: https://news.ycombinator.com/item?id=32031639 https://news.ycombinator.com/item?id=32032235

Edit: wow, I can't believe we hadn't put https://news.ycombinator.com/item?id=32031243 in https://news.ycombinator.com/highlights. Fixed now.
hinkley 5 days ago
I’ve seen this up close twice, and I’m surprised it’s only twice.

Between March and September one year, 6 people on one team had to get new hard drives in their ThinkPads and rebuild their systems. All from the same PO, but doled out over the course of a project ramp-up. That was the first project where the onboarding docs were really, really good, since we got a lot of practice in a short period of time.

Long before that, the first RAID array anyone set up for my team’s usage arrived from Sun with 2 dead drives out of 10. They RMA’d us 2 more drives, and one of those was also DOA. That was a couple of years after Sun stopped burning in hardware to cut costs, which maybe wasn’t that much of a savings, all things considered.
gogusrl 5 days ago
I got burnt by this bug on freakin' Christmas Eve 2020 (https://forum.hddguru.com/viewtopic.php?f=10&t=40766). There was some data loss and a lot of lessons learned.
praccu 5 days ago
Many years ago (13?), I was around when Amazon moved SABLE from RAM to SSDs. A whole rack came from a single batch, and something like 128 disks went out at once. I was an intern, but everyone seemed very stressed.
airstrike 5 days ago
I love that "Ask HN: What'd you do while HN was down?" was a thing
Cthulhu_ 5 days ago
Man, I hit something like that once: an SSD had a firmware bug where it would stop working at an exact number of power-on hours.
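Several widely reported SSD failures of this kind came down to a power-on-hours counter stored in an integer too small to hold it. A minimal sketch in C of that failure mode, assuming a hypothetical signed 16-bit counter (the field name and values are illustrative, not taken from any specific drive's firmware):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Hypothetical firmware counter: power-on time kept in a
         * signed 16-bit field, which tops out at 32,767 hours
         * (~3.7 years of continuous uptime). */
        int16_t power_on_hours = INT16_MAX;

        /* One more hour: on two's-complement hardware the value
         * wraps around to -32768. */
        power_on_hours = (int16_t)(power_on_hours + 1);

        /* Firmware that treats a negative counter as corrupt
         * metadata stops responding -- and every drive from the
         * same batch, powered on the same day, hits the same hour
         * count at the same time. */
        if (power_on_hours < 0) {
            printf("power_on_hours=%d: drive faults, refuses I/O\n",
                   power_on_hours);
        }
        return 0;
    }

Which is also why the advice at the top of the thread helps: staggering hardware purchases staggers the counters, so the drives don't all cross the fatal threshold at once.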