Animats a day ago

ECC should have become standard around the time memories passed 1GB.

It's seriously annoying that ECC memory is hard to get and expensive, but memory with useless LEDs attached is cheap.

loeg a day ago | parent | next [-]

It's not even ECC price/availability that bothers me so much, it's that getting CPUs and motherboards that support ECC is non-trivial outside of the server space. The whole consumer class ecosystem is kind of shitty. At least AMD allows consumer class CPUs to kinda sorta use ECC, unlike Intel's approach where only the prosumer/workstation stuff gets ECC.

rpcope1 a day ago | parent [-]

I've been honestly amazed people actually buy stuff that's not "workstation" gear given IME how much more reliably and consistently it works, but I guess even a generation or two used can be expensive.

throwaway85825 a day ago | parent | next [-]

Very few applications scale with cores. For the vast majority of people single core performance is all they care about, it's also cheaper. They don't need or want workstation gear.

rpcope1 7 hours ago | parent | next [-]

I have come to doubt that single-core (or CPU performance in general) has been all that noticeable for most computer users in the last decade, outside of specialty applications like CAD and some games. I can put relatively pedestrian users like my parents or my wife in front of a decade-old high-end Haswell system or a brand new mega-$$$ Threadripper/Epyc, and for almost all intents and purposes they don't notice a difference. What they do notice is when things die. I'm sure consumer hardware might be OK for 2-3 years (maybe), but my parents are happier to keep using the same computer: the Dell Precision system I gave them almost 10 years ago works great today, and I suspect the hardware, outside of maybe the SSD finally wearing out, will probably still work right a decade from now too.

rafaelmn 15 hours ago | parent | prev [-]

> Very few applications scale with cores

You mean like compilers and test suites? Very few professional workloads don't parallelize well these days.

VorpalWay 13 hours ago | parent | next [-]

Compilers and test suits do scale (at least for C/C++ and Rust, which is what I work with). But I think the parent comment referred to consumer applications: games, word processing, light browsing, ...

(Though games these days scale better than they used to, but only up to a point.)

I find that most tools I write for my own use can be made to scale with cores, or run so fast that the overhead of starting threads is longer than the program runtime. But I write that in Rust which makes parallelism easy. If I wrote that code in C++ I would probably not bother with trying to parallelize.

rafaelmn 10 hours ago | parent [-]

But those tools aren't really compute bound anyway - you're not buying a workstation to do them, you're getting a consumer laptop or a tablet.

loeg 8 hours ago | parent [-]

And that consumer device should have ECC! That's the whole discussion here.

zadikian 7 hours ago | parent [-]

It's confusing because a few comments up is "for the vast majority of people single core performance is all they care about, it's also cheaper" which is unrelated to ECC.

loeg 6 hours ago | parent [-]

I think it's coherent -- it's an argument for why most people don't want to buy Workstation class products just to get ECC. (Prices scale with core count. Not linearly, but still.)

loeg 8 hours ago | parent | prev | next [-]

Test suites often don't scale, actually. Unit tests usually run single-threaded by default, and also relatively often have side effects on the system that make them unsafe to run in parallel. (Sure, sure, you could definitely argue the latter is a skill issue.)

zadikian 7 hours ago | parent | prev | next [-]

In theory, do you need a single machine for any of that, or would it be cheaper to use a low-availability cloud cluster? Tests are totally independent, and builds are probably parallel enough.

throwaway85825 8 hours ago | parent | prev [-]

Only a small percentage of computer users are programmers.

zadikian 18 hours ago | parent | prev | next [-]

There were several years where used cheese-grater Mac Pros could be bought and upgraded for very cheap, and were still not too outdated. I only replaced my MacPro4,1 when the M1 mini came out, mainly because of wattage.

loeg a day ago | parent | prev | next [-]

I've had zero issues with AMD's consumer tier of non-WX Threadripper and Ryzen models, FWIW.

thousand_nights a day ago | parent | prev | next [-]

overblown? billions of users use consumer tier hardware just fine. i have servers at home with years of uptime without any ECC memory

conception 20 hours ago | parent [-]

But how much bit rot? You’ll never know.

Maxion 16 hours ago | parent | next [-]

If I don't know about it, then how does it affect me / why should I care? My home server does what it is supposed to do and has done so for a decade. If bit rot /bit flips in memory does not affect my day-to-day life I much prefer cheaper hardware.

I do hope the nuclear powerplant next door uses more fault tolerant hardware, though.

loeg 8 hours ago | parent [-]

Eventually you might notice the pictures or other documents you were saving on your home server have artifacts, or no longer open. This is undesirable for most people using computer storage.

> I much prefer cheaper hardware.

The cost savings are modest; on the order of 12% for the DIMMs (one extra chip per eight), and less elsewhere. Computers are already extremely cheap commodities.

zadikian 7 hours ago | parent [-]

12% for the DIMMs only, but with Intel you need Xeon and its accompanying motherboard for it. Someone said AMD "kinda" lets you do ECC on consumer hardware, not sure what the caveats are besides just being unbuffered.

Assuming that's more due to intentional market segmentation than actual cost, yeah I would pay 12% more for ECC. But I'm with the other guy on not valuing it a ton. I have backups which are needed regardless of bitrot, and even if those don't help, losing a photo isn't a huge deal for me.

loeg 6 hours ago | parent [-]

> Someone said AMD "kinda" lets you do ECC on consumer hardware, not sure what the caveats are besides just being unbuffered.

That was me. It isn't "officially" supported by AMD, but it should work. You can enable EDAC monitoring in Linux and observe detected correction events happening.

> Assuming that's more due to intentional market segmentation than actual cost

That's the argument, yeah.

zadikian 7 hours ago | parent | prev [-]

I'm more concerned that the Mac filesystems don't have payload checksums.

deepsun 17 hours ago | parent | prev [-]

I hate my workstation desktop I assembled 15 years ago. It just doesn't break! I have no excuse to buy a new one (except for the video card).

justin66 18 hours ago | parent | prev | next [-]

> ECC should have become standard around the time memories passed 1GB.

Ironically, that's around the time Intel started making it difficult to get ECC on desktop machines using their CPUs. The Pentium 3 and 440BX chipset, maxing out at 1GB, were probably the last combo where it pretty commonly worked with a normal desktop board and normal desktop processor.

WatchDog a day ago | parent | prev | next [-]

All DDR5 RAM has some amount of error correction built in; DDR5 is so much more prone to bit flips that it requires it.

I'm not really sure if this makes it overall more or less reliable than DDR2/3/4 without ECC though.

jml7c5 12 hours ago | parent | next [-]

As I understand it, DDR5's on-die ECC is mostly a cost-saving measure. Rather than fab perfect DRAM that never flips a bit in normal operation (expensive, lower yield), you can fab imperfect DRAM that is expected to sometimes flip, but then use internal ECC to silently correct it. The end result to the user is theoretically the same.

Because you can't track on-die ECC errors, you have no way of knowing how "faulty" a particular DRAM chip is. And if there's an uncorrected error, you can't detect it.

jcalvinowens 6 hours ago | parent | prev | next [-]

DDR5 on-die ECC detects and corrects one-bit errors. It cannot detect two-bit errors, so it will miscorrect some of them into three-bit errors. However, the on-die error correction scheme is specifically designed such that the resulting three-bit errors are mathematically guaranteed to be detected as uncorrectable errors by standard full system-level ECC running on top of the on-die ECC.

matheusmoreira 2 hours ago | parent | prev | next [-]

ECC also reports error-recovery statistics to the operating system. It lets you know if any unrecoverable errors happened, and lets you calculate the error rate, which means you can try to predict when your memory modules are going bad.

I think this sort of reporting is a pretty basic feature that should come standard on all hardware. No idea why it's an "enterprise" feature. This market segmentation is extremely annoying and shouldn't exist.

himata4113 21 hours ago | parent | prev [-]

that doesn't help when the bit is lost between the cpu and the memory, unfortunately. it only really helps pass off poor-quality dram, since single bit flips get corrected on die. it's not that reliable either; it's a yield/density enabler rather than a system reliability thing.

it's "ECC", but not the ecc you want. marketing garbage.

tombert 20 hours ago | parent | prev | next [-]

I am not sure I've ever seen a laptop that has ECC memory. I'm sure they exist but I don't think I've seen it.

I would definitely like to have a laptop with ECC, because obviously I don't want things to crash and I don't want corrupted data or anything like that, but I don't really use desktop computers anymore.

bpye 17 hours ago | parent [-]

There are 16" laptops with ECC, you can get a ThinkPad P16 with it for example. I've yet to find any 14" devices with ECC though.

tombert 17 hours ago | parent [-]

Interesting, I actually have a thinkpad p16s, surprised I didn’t notice ECC availability.

oybng a day ago | parent | prev | next [-]

For the unaware, Intel is to blame for this

johanyc 16 hours ago | parent [-]

Can you explain

samus 14 hours ago | parent [-]

It makes economic sense to keep selling non-ECC hardware to maintain market segmentation.

aforwardslash a day ago | parent | prev | next [-]

ECC memory is traditionally slower and quite a bit more complex, and it doesn't completely eliminate the problem (most implementations correct 1 bit per word and detect 2 bits per word). It makes the most sense where environmental factors such as flaky power, temperature, or RF interference can be ruled out - such as a server room. But yeah, I agree with you, as ECC solves like 99% of the cases.

indolering a day ago | parent | next [-]

Being able to detect these issues is just as important as preventing them.

aforwardslash a day ago | parent [-]

Thing is, every reported bug can be a bit flip. You can actually, in some cases, have successful execution but bit flips in the instrumentation, reporting errors that don't exist.

russdill 14 hours ago | parent | prev | next [-]

The amount of overhead a few bits of ECC has is basically a rounding error, and even then, the only time the hardware is really doing extra work is when bit errors occur and correction has to happen.

The main overhead is simply the extra RAM required to store the extra bits of ECC.

jeffbee a day ago | parent | prev [-]

ECC are "slower" because they are bought by smart people who expect their memory to load the stored value, rather than children who demand racing stripes on the DIMMs.

matja 14 hours ago | parent | next [-]

The actual RAM chips on an ECC DIMM are exactly the same as on a non-ECC DIMM; there are just an extra 1/2/4 chips to extend the word to 72 bits.

The main reason ECC RAM is slower is that it's not (by default) overclocked to the edge of stability - the JEDEC standard speeds are used.

The other much smaller factors are:

* The tREFI parameter (refresh interval) is usually double the frequency on ECC RAM, so that it handles high-temperature operation.
* The register chip buffers the command/address/control/clock signals, adding a clock of latency to every command (<1ns, much smaller than the typical memory latency you'd measure from the memory controller).
* ECC calculation (AMD states 2 UMC cycles, <1ns).

Dylan16807 16 hours ago | parent | prev | next [-]

ECC keeps your bits safe from random flips to a ridiculously large factor. You can run the memory at high consumer speeds, giving up some of that safety margin, while still being more reliable than everything else in your computer.

And there's non-random bit errors that can hit you at any speed, so it's not like going slow guarantees safety.

undersuit 21 hours ago | parent | prev | next [-]

ECC is actually slower. The hardware check that every transaction is correct does add a slight delay, but nothing compared to the delay of working on corrupted data.

throwaway85825 a day ago | parent | prev [-]

There's just no demand for high speed ECC aside from a few people making their own dimms.

hedora a day ago | parent | prev | next [-]

ECC is standard at this point (current RAM flips so many bits it's basically mandatory). Also, most CPUs have "machine checks" that are supposed to detect incorrect computations + alert the OS.

However, there are still gaps. For one thing, the OS has to be configured to listen for + act on machine check exceptions.

On the hardware level, there's an optional spec to checksum the link between the CPU and the memory. Since it's optional, many consumer machines don't implement it, so bits get flipped not in RAM but on the lines between the RAM and the CPU.

It's frustrating that they didn't mandate error detection / correction there, but I guess the industry runs on price discrimination, so most people can't have nice things.

ece 9 hours ago | parent | prev [-]

Looking back, I actually think the older the RAM, the more likely you were to notice bit-flips harming your workflow. EDO RAM was the worst in my experience (my first computer), SDRAM was a bit better, and random bit-flips, at least under load, got very rare after DDR2. I think Google even had a paper comparing DDR1 vs DDR2 (link: https://static.googleusercontent.com/media/research.google.c...).

That said, with DIMM capacities increasing, even a small chance of bit-flips means lots of people will still be affected.