▲ | consp 5 days ago | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Isn't it mostly an ease of mind thing? I've never seen a ECC error on my home server which has plenty of memory in use and runs longer than my desktop. Maybe it's more common with higher clocked, near the limit, desktop PC's. Also: DDR5 has some false ecc marketing due to the memory standard having an error correction scheme build in. Don't fall for it. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | adrian_b 4 days ago | parent | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Whether you will see ECC errors depends a lot on how much memory you have and how old it is. A computer with 64 GB of memory is 4 times more likely to encounter memory errors than one with 16 GB of memory. When DIMMs are new, at the usual amounts of memory for desktops, you will see at most a few errors per year, sometimes only an error after a few years. With old DIMMs, some of them will start to have frequent errors (such modules presumably had a borderline bad fabrication quality and now have become worn out, e.g. due to increased leakage leading to storing a lower amount of charge on the memory cell capacitors). For such bad DIMMs, the frequency of errors will increase, and it may become of several errors per day, or even per hour. For me, a very important advantage of ECC has been the ability to detect such bad memory modules (in computers that have been used for 5 years or more) and replace them before corrupting any precious data. I also had a case with a HP laptop with ECC, where memory errors had become frequent after being stored for a long time (more than a year) in a rather humid place, which might have caused some oxidation of the SODIMM socket contacts, because removing the SODIMMs, scrubbing the sockets and reinserting the SODIMMs made disappear the errors. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | c0l0 5 days ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I see a particular ECC error at least weekly on my home desktop system, because one of my DIMMs doesn't like the (out of spec) clock rate that I make it operate at. Looks like this:
(this is `sudo ras-mc-ctl --errors` output)It's always the same address, and always a Corrected Error (obviously, otherwise my kernel would panic). However, operating my system's memory at this clock and latency boosts x265 encoding performance (just one of the benchmarks I picked when trying to figure out how to handle this particular tradeoff) by about 12%. That is an improvement I am willing to stomach the extra risk of effectively overclocking the memory module beyond its comformt zone for, given that I can fully mitigate it by virtue of properly working ECC. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | wpm 4 days ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I had a somewhat dodgy stick of used RAM (DDR4 UDIMM) in a Supermicro X11 board. This board is running my NAS, all ZFS, so RAM corruption can equal data corruption. The OS alerted me to recoverable errors on DIMM B2. Swapped it and another DIMM, rebooted, saw DIMM error on slot B1. Swapped it for a spare stick. No more errors. This was running at like, 1866 or something. It's a pretty barebones 8th gen i3 with a beefier chipset, but ECC still came in clutch. I won't buy hardware for server purposes without it. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | immibis 5 days ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I saw a corrected memory error logged every few hours when my current machine was new. It seems to have gone away now, so either some burn-in effect, or ECC accidentally got switched off and all my data is now corrupted. Threadripper 7000 series, 4x64GB DDR5. Edit: it's probably because I switched it to "energy efficiency mode" instead of "performance mode" because it would occasionally lock up in performance mode. Presumably with the same root cause. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | Jach 4 days ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I have a slightly older system with 128 GB of UDIMM DDR4 over four sticks. Ran just fine for quite a while but then I started having mysterious system freezes. Later discovered I had somehow disabled ECC error reporting in my system log on linux... once that was turned back on, oh, I see notices of recoverable errors. I finally found a repeatable way to trigger a freeze with a memory stress testing tool and that was from an unrecoverable error. I couldn't narrow the problem down to a single stick or RAM channel, it seemed to only happen if all 4 slots were occupied, but I eventually figured out that if I just lowered the RAM speed from standard 3200 MHz to the next officially supported (by the sticks) step of 2933 MHz, everything was fine again and no more ECC errors, recoverable or not. Been running like that since. Last winter I was helping someone put together a new gaming machine... it was so frustrating running into the fake ecc marketing for DDR5 that you mention. The motherboard situation for whether they support it or not, or whether a bios update added support or then removed it or added it back or not, was also really sad. And even worse IMO is that you can't actually max out 4 slots on the top tier mobos unless you're willing to accept a huge drop in RAM speed. Leads to ugly 48 GB sized sticks and limiting to two of them... In the end we didn't go with ECC for that someone, but I was pretty disappointed about it. I'm hoping the next gen will be better, for my own setup running ZFS and such I'm not going to give up ECC. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | hedora 4 days ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
You have to go pretty far down the rabbit hole to make sure you’ve actually got ECC with [LP]DDR5 Some vendors use hamming codes with “holes” in them, and you need the CPU to also run ECC (or at least error detection) between ram and the cache hierarchy. Those things are optional in the spec, because we can’t have nice things. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | BikiniPrince 4 days ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I pick up old serves for my garage system. With edac it is a dream to isolate the fault and be instantly aware. It also lets you determine the severity of the issue. Dimms can run for years with just the one error or overnight explode into streams of corrections. I keep spares so it’s fairly easy to isolate any faults. It’s just how do you want to spend your time? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | Scramblejams 4 days ago | parent | prev [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I run a handful of servers and I have a couple that pop ECC errors every year or three, so YMMV. |