dijit 5 days ago

> only a minority of them enables ECC support in their firmware, so always check for that before buying!

This is the annoying part.

That AMD permits ECC is truly fantastic, but motherboard support is often missing, and worse: it's not advertised even when it's available.

I have an ASUS PRIME TRX40 PRO, and the tech specs say that it can run ECC and non-ECC memory, but not whether ECC will be available to the operating system, merely that the DIMMs will work.

It's much more hit and miss in reality than it should be. This motherboard was a pricey one, so one can't use price as a proxy for features.

sunshowers 4 days ago | parent | next [-]

If you're on Linux, dmesg containing

  EDAC MC0: Giving out device to module amd64_edac

is a pretty reliable indication that ECC is working.

See my blog post about it (it was top of HN): https://sunshowers.io/posts/am5-ryzen-7000-ecc-ram/
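A minimal Python sketch of that check, assuming you pass it the text that `dmesg` printed. The function name `edac_controllers` is made up for illustration, and the pattern only covers the "Giving out device to module" wording quoted above:

```python
import re

# Matches kernel log lines such as
# "EDAC MC0: Giving out device to module amd64_edac"
EDAC_LINE = re.compile(r"EDAC MC(\d+): Giving out device to module (\w+)")


def edac_controllers(log_text):
    """Return {memory controller index: EDAC driver name} found in kernel log text."""
    return {int(m.group(1)): m.group(2) for m in EDAC_LINE.finditer(log_text)}


sample = "EDAC MC0: Giving out device to module amd64_edac"
print(edac_controllers(sample))  # prints {0: 'amd64_edac'}
```

An empty result means no such line was found in the text given. It does not by itself prove that ECC is off, since the message may already have rotated out of the kernel ring buffer.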

oneshtein 4 days ago | parent [-]

My `dmesg` shows:

    EDAC MC0: Giving out device to module igen6_edac controller Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
    EDAC MC1: Giving out device to module igen6_edac controller Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)

but `dmidecode --type 16` says:

    Error Correction Type: None
    Error Information Handle: Not Provided
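
To compare the firmware's view with the kernel's, the `dmidecode --type 16` text can be parsed the same way. This is a rough sketch: `dmi_error_correction` and `dmi_reports_ecc` are hypothetical helper names, and treating only "Single-bit ECC" and "Multi-bit ECC" as "ECC present" is my assumption about which values matter.

```python
# Assumption: these are the DMI "Error Correction Type" strings that indicate ECC.
ECC_TYPES = {"Single-bit ECC", "Multi-bit ECC"}


def dmi_error_correction(dmidecode_text):
    """Return every 'Error Correction Type' value in `dmidecode --type 16` output."""
    values = []
    for line in dmidecode_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Error Correction Type":
            values.append(value.strip())
    return values


def dmi_reports_ecc(dmidecode_text):
    """True if any physical memory array reports an ECC correction type."""
    return any(v in ECC_TYPES for v in dmi_error_correction(dmidecode_text))


sample = """\
    Error Correction Type: None
    Error Information Handle: Not Provided
"""
print(dmi_error_correction(sample))  # prints ['None']
print(dmi_reports_ecc(sample))       # prints False
```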

c0l0 4 days ago | parent | next [-]

Are you sure you have ECC-capable DIMM installed?

What does

    find /sys/devices/system/edac/mc/mc0/csrow* -maxdepth 1 -type f -exec grep --color . {} +

report?
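
If you would rather collect those values in a script, here is a small Python sketch that reads the same files. It assumes the `csrow*` directory layout from the command above; `read_csrows` is an illustrative name, not part of any EDAC tooling.

```python
from pathlib import Path


def read_csrows(mc_dir):
    """Return {csrow name: {attribute file name: contents}} for one memory controller."""
    rows = {}
    for csrow in sorted(Path(mc_dir).glob("csrow*")):
        if not csrow.is_dir():
            continue
        attrs = {}
        for entry in sorted(csrow.iterdir()):
            if not entry.is_file():
                continue
            try:
                attrs[entry.name] = entry.read_text().strip()
            except OSError:
                # Skip attributes that cannot be read by the current user.
                continue
        rows[csrow.name] = attrs
    return rows
```

Calling `read_csrows("/sys/devices/system/edac/mc/mc0")` should give one dictionary per csrow, with keys such as `edac_mode` and `ce_count` where the driver exposes them.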

oneshtein 2 days ago | parent [-]

AFAIK, I have 2x DDR5 non-ECC memory (`dmidecode --type 17` says Samsung M425R1GB4BB0-CQKOL). Your command reports SECDED (single-bit error correction, double-bit error detection):

    /sys/devices/system/edac/mc/mc0/csrow0/ce_count:0
    /sys/devices/system/edac/mc/mc0/csrow0/ch0_dimm_label:MC#0_Chan#0_DIMM#0
    /sys/devices/system/edac/mc/mc0/csrow0/size_mb:8192
    /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc0/csrow0/ue_count:0
    /sys/devices/system/edac/mc/mc0/csrow0/mem_type:Unbuffered-DDR3
    /sys/devices/system/edac/mc/mc0/csrow0/edac_mode:SECDED
    /sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
    /sys/devices/system/edac/mc/mc0/csrow0/ch1_dimm_label:MC#0_Chan#1_DIMM#0
    /sys/devices/system/edac/mc/mc0/csrow0/dev_type:x16

> find /sys/devices/system/edac/mc/mc0/csrow* -maxdepth 1 -type f -exec grep --color . {} +

It looks like DDR5 supports SECDED by default. :-/

sunshowers 4 days ago | parent | prev [-]

To be honest, I don't know how Intel works; my post is limited to AMD.

c0l0 5 days ago | parent | prev | next [-]

Usually, if a vendor's spec sheet for a (SOHO/consumer-grade) motherboard explicitly mentions ECC-UDIMM in its memory compatibility section and (a more recent development, afaict) DOES NOT also specify something like "operating in non-ECC mode only", then you will have proper ECC (and therefore EDAC and RAS) support in Linux, provided your kernel version can already deal with ECC on your platform in general.

I would assume your particular motherboard operates with proper SECDED+-level ECC if you have capable, compatible DIMMs, enable ECC mode in the firmware, and boot an OS kernel that can make sense of it all.

adrian_b 4 days ago | parent | prev [-]

This is weird. I have used many ASUS MBs specified as "can run ECC and non-ECC", and this has always meant that there was an ECC enabling option in the BIOS settings; then, if the OS had an appropriate EDAC driver for the installed CPU, ECC worked fine.

I am writing this message on such an ASUS MB with a Ryzen CPU and working ECC memory. You must check that your OS is recent enough to recognize your Threadripper CPU and that you have installed any software package required for this (e.g. on Linux, "edac-utils" or something with a similar name).
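
One quick way to see whether an EDAC driver is loaded at all is to look through `/proc/modules`. A hedged Python sketch follows: `loaded_edac_modules` is a made-up helper, the sample lines are illustrative rather than taken from a real machine, and a driver compiled into the kernel (not built as a module) will not show up this way.

```python
def loaded_edac_modules(proc_modules_text):
    """Return names of loaded kernel modules whose name contains 'edac'."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first column of /proc/modules is the module name.
        if fields and "edac" in fields[0]:
            names.append(fields[0])
    return names


# Illustrative sample in /proc/modules format (not from a real system).
sample = """\
amd64_edac 53248 0 - Live 0x0000000000000000
snd_hda_intel 57344 2 - Live 0x0000000000000000
"""
print(loaded_edac_modules(sample))  # prints ['amd64_edac']
```

On a live system the input would come from `open("/proc/modules").read()`.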