c0l0 5 days ago

I see a particular ECC error at least weekly on my home desktop system, because one of my DIMMs doesn't like the (out of spec) clock rate that I make it operate at. Looks like this:

    94 2025-08-26 01:49:40 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68aea758, cpuid=0x00a50f00, bank=0x00000012
    95 2025-09-01 09:41:50 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68b80667, cpuid=0x00a50f00, bank=0x00000012
(this is `sudo ras-mc-ctl --errors` output)
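For readers wondering what the `status=0x9c2040000000011b` field encodes, here is a minimal sketch that picks a few flag bits out of the machine-check status word. Bits 57 to 63 are architectural on x86; the AMD-specific positions (SYNDV, CECC, UECC, DEFERRED) are taken from how the Linux kernel and rasdaemon appear to decode them, so treat those as assumptions. The names `MCA_STATUS_BITS` and `decode_mca_status` are made up for this example.

```python
# Selected bit positions in the MCA status register.
# Bits 57-63 are architectural x86; the rest are AMD-specific assumptions.
MCA_STATUS_BITS = {
    "VAL": 63,       # register holds a valid error
    "OVER": 62,      # an earlier error was overwritten
    "UC": 61,        # uncorrected error
    "EN": 60,        # error reporting was enabled
    "MISCV": 59,     # misc= field is valid
    "ADDRV": 58,     # addr= field is valid
    "PCC": 57,       # processor context corrupt
    "SYNDV": 53,     # syndrome valid (assumed AMD position)
    "CECC": 46,      # corrected ECC error (assumed AMD position)
    "UECC": 45,      # uncorrected ECC error (assumed AMD position)
    "DEFERRED": 44,  # deferred error (assumed AMD position)
}


def decode_mca_status(status: int) -> set[str]:
    """Return the names of all known flag bits that are set in `status`."""
    return {name for name, bit in MCA_STATUS_BITS.items() if (status >> bit) & 1}


flags = decode_mca_status(0x9C2040000000011B)
print(sorted(flags))
```

For the value logged above, this reports CECC and ADDRV as set and UC, PCC, and UECC as clear, which matches the "Corrected error, no action required" text in the log line.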

It's always the same address, and always a Corrected Error (obviously, otherwise my kernel would panic). However, operating my system's memory at this clock and latency boosts x265 encoding performance (just one of the benchmarks I picked when trying to figure out how to handle this particular tradeoff) by about 12%. For that improvement, I am willing to stomach the extra risk of effectively overclocking the memory module beyond its comfort zone, given that I can fully mitigate it by virtue of properly working ECC.

Hendrikto 5 days ago | parent | next [-]

Running your RAM so far out of spec that it breaks down regularly, where do you take the confidence that ECC will still work correctly?

Also: Could you not have just bought slightly faster RAM, given the premium for ECC?

c0l0 5 days ago | parent [-]

"Breaks down" is a strong choice of words for a single, corrected bit error. ECC works as designed, and demonstrates that it does by detecting this re-occurring error. I take the confidence mostly from experience ;)

And no, because ECC UDIMMs rated for the speed I run mine at (3600 MT/s) simply do not exist; that speed is outside of what JEDEC ratified for the DDR4 spec.

adithyassekhar 5 days ago | parent [-]

JEDEC rated DDR4 at only 2400 MHz, right? And anything higher is technically overclocking?

dijit 4 days ago | parent | next [-]

JEDEC has a few frequencies they support: https://www.jedec.org/standards-documents/docs/jesd79-4a

DDR4-1600 (PC4-12800)

DDR4-1866 (PC4-14900)

DDR4-2133 (PC4-17000)

DDR4-2400 (PC4-19200)

DDR4-2666 (PC4-21300)

DDR4-2933 (PC4-23466)

DDR4-3200 (PC4-25600) (the highest supported in the DDR4 generation)

What's *NOT* supported are some enthusiast speeds that typically require more than 1.2 V, for example 3600 MT/s, 4000 MT/s, and 4266 MT/s.

c0l0 5 days ago | parent | prev [-]

JEDEC specifies rates up to 3200MT/s, what's officially referred to as DDR4-3200 (PC4-25600).
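As a side note on the naming, the PC4 number is roughly the module's peak bandwidth in MB/s: the transfer rate in MT/s times the 8 bytes moved per transfer on a 64-bit data bus. A small sketch of that arithmetic follows; `peak_bandwidth_mb_s` is a name invented for this example, and the ECC check bits are assumed to carry no payload.

```python
BUS_WIDTH_BYTES = 8  # 64-bit data bus; the 8 extra ECC bits are not payload


def peak_bandwidth_mb_s(transfer_rate_mt_s: int) -> int:
    """Theoretical peak bandwidth of one DIMM channel in MB/s."""
    return transfer_rate_mt_s * BUS_WIDTH_BYTES


for rate in (2400, 3200, 3600):
    print(f"DDR4-{rate}: {peak_bandwidth_mb_s(rate)} MB/s")
```

DDR4-3200 comes out at 25600 MB/s, matching PC4-25600. Some marketing labels are rounded, so DDR4-1866 computes to 14928 MB/s but is sold as PC4-14900.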

kderbe 4 days ago | parent | prev | next [-]

I would loosen the memory timings a bit and see if that resolves the ECC errors. x265 performance shouldn't fall since it generally benefits more from memory clock rate than latency.

Also, could you share some relevant info about your processor, mainboard, and UEFI? I see many internet commenters question whether their ECC is working (or ask if a particular setup would work), and far fewer that report a successful ECC consumer desktop build. So it would be nice to know some specific product combinations that really work.

c0l0 4 days ago | parent [-]

I've been on AM4 for most of the past decade (and still am, in fact), and the mainboards I've personally had in use with working ECC support were:

  - ASRock B450 Pro4
  - ASRock B550M-ITX/ac
  - ASRock Fatal1ty B450 Gaming-ITX/ac
  - Gigabyte MC12-LE0

There are probably many others with proper ECC support. Vendor spec sheets usually hint at properly working ECC in their firmware if they mention "ECC UDIMM" support specifically.

As for CPUs, that is even easier for AM4: everything that's not based on an APU core can support ECC. Note that some SKUs marketed without an iGPU are still APUs with the iGPU part disabled, such as the Ryzen 5 5500, so those do not support it. The exception among APUs is the "PRO" series, such as the Ryzen 5 PRO 5650G et al., which have an iGPU but also support ECC. The main differences (apart from the integrated graphics) between CPU and APU SKUs are that the latter do not support PCIe 4.0 (APUs are limited to PCIe 3.0) and have a few watts lower idle power consumption.

When I originally built the desktop PC that I still use (after a number of in-place upgrades, such as swapping out the CPU/GPU combo for an APU), I blogged about it (in German) here: https://johannes.truschnigg.info/blog/2020-03-23#0033-2020-0...

If I were to build an AM5 system today, I would look into mainboards from ASUS for proper ECC support - they seem to have it pretty much universally supported on their gear. (Actual out-of-band ECC with EDAC support on Linux, not the DDR5 "on-die" stuff.)
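For anyone trying to confirm that EDAC is actually active on Linux, one quick check is whether the kernel exposes memory controllers with error counters in sysfs. The sketch below assumes the usual EDAC layout under `/sys/devices/system/edac/mc` with `ce_count` and `ue_count` files per controller; the function name `read_edac_counts` is made up for this example, and an empty result only suggests (it does not prove) that no EDAC driver is loaded.

```python
from pathlib import Path


def read_edac_counts(base: str = "/sys/devices/system/edac/mc") -> dict[str, tuple[int, int]]:
    """Map each memory controller (mc0, mc1, ...) to (corrected, uncorrected) counts."""
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        ce_file = mc / "ce_count"
        ue_file = mc / "ue_count"
        if ce_file.is_file() and ue_file.is_file():
            # int() tolerates the trailing newline that sysfs files contain
            counts[mc.name] = (int(ce_file.read_text()), int(ue_file.read_text()))
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```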

ainiriand 4 days ago | parent | prev [-]

I think you've found a particularly weak memory cell, and I would start thinking about replacing that module. The consistent memory_channel=1, csrow=0 pattern, together with the identical address, confirms it's the same physical location failing predictably.
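To check whether errors really cluster at one location over a longer log, the `ras-mc-ctl --errors` lines can be tallied by address, channel, and chip-select row. This is a minimal sketch that assumes the `key=value` field format shown in the output earlier in the thread; `error_locations` and `SAMPLE` are names made up here, and the sample lines are trimmed copies of the two entries posted above.

```python
import re
from collections import Counter

# Matches the three location fields, with either hex or decimal values.
FIELD_RE = re.compile(r"(addr|memory_channel|csrow)=(0x[0-9a-fA-F]+|\d+)")


def error_locations(lines):
    """Count errors per (addr, memory_channel, csrow) tuple."""
    tally = Counter()
    for line in lines:
        fields = dict(FIELD_RE.findall(line))
        if "addr" in fields:  # skip lines that carry no address
            key = (fields["addr"], fields.get("memory_channel"), fields.get("csrow"))
            tally[key] += 1
    return tally


# Trimmed versions of the two log entries from the thread.
SAMPLE = [
    "94 2025-08-26 01:49:40 +0200 error: Corrected error, mci CECC, "
    "memory_channel=1,csrow=0, status=0x9c2040000000011b, addr=0x36e701dc0",
    "95 2025-09-01 09:41:50 +0200 error: Corrected error, mci CECC, "
    "memory_channel=1,csrow=0, status=0x9c2040000000011b, addr=0x36e701dc0",
]

for location, count in error_locations(SAMPLE).most_common():
    print(location, count)
```

A single tuple accounting for every error, as in this sample, points at one cell or a small region; errors spread over many addresses on the same channel would point more toward signal integrity at that clock rate.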