Agingcoder · 8 hours ago
We’re in agreement. I think we diverge on ‘making it go away’, in my book. When you’re the one having to debug all these bizarre things (there were real money numbers involved, so these things mattered) over millions of jobs every day, rare events with low probability don’t disappear - they still happen, and they take time to diagnose and fix. So in my book ECC improves the situation, but I still had to deal with bad DIMMs, and ECC wasn’t enough. We didn’t use to see these issues because we already had too many software bugs, but as the software got increasingly reliable, hardware issues slowly became a problem, just like compiler bugs or other elements of the chain usually considered reliable. I fully agree that there are lots of other cases where this doesn’t matter and ECC is good enough. Thanks for taking the time to reply!
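A back-of-envelope sketch of the "rare events at scale" point above. The per-job probability and the daily job count here are made-up illustrative numbers, not figures from this discussion:

```python
# Illustrative assumptions only: neither number comes from the thread.
p_per_job = 1e-7          # assumed chance a single job hits, e.g., an uncorrected memory error
jobs_per_day = 5_000_000  # assumed daily job volume

# Expected number of affected jobs per day (linearity of expectation).
expected_per_day = p_per_job * jobs_per_day

# Probability that at least one job is affected on a given day,
# treating jobs as independent trials.
p_at_least_one_daily = 1 - (1 - p_per_job) ** jobs_per_day

print(f"expected affected jobs/day: {expected_per_day:.2f}")
print(f"P(at least one per day):    {p_at_least_one_daily:.1%}")
```

With these assumed numbers you expect an event roughly every other day, and there is about a 40% chance of seeing one on any given day - "one in ten million" stops feeling rare very quickly at that volume.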
RealityVoid · 7 hours ago
Oh, I get this point. If you have a sufficiently large amount of data, and you monitor the errors, and your software gets better and better, even low-probability cases will happen and will stand out. But this is sort of the march of nines. My knee-jerk reaction to blaming ECC is "naaah", mostly because it's such a convenient scapegoat. It happens, I'm sure, but it would not be the first explanation I reach for. I once heard someone blame a bug that happened multiple times on "cosmic rays". You can imagine how irked I was at the dang cosmic rays hitting the same data with such consistency! Anyway, I'm sorry if my tone sounded abrasive; I, too, have appreciated the discussion.