Remix.run Logo
c0l0 5 days ago

I realize this has not much to do with CPU choice per se, but I'm still gonna leave this recommendation here for people who like to build PCs to get stuff done with :) Since I've been able to afford it and the market has had them available, I've been buying desktop systems with proper ECC support.

I've been chasing flimsy but very annoying stability problems (some, of course, due to overclocking during my younger years, when it still had a tangible payoff) enough times on systems I had built that taking this one BIG potential cause out of the equation is worth the few dozens of extra bucks I have to spend on ECC-capable gear many times over.

Trying to validate an ECC-less platform's stability is surprisingly hard, because memtest and friends just aren't very reliably detecting more subtle problems. PRIME95, y-cruncher and linpack (in increasing order of effectiveness) are better than specialzied memory testing software in my experience, but they are not perfect, either.

Most AMD CPUs (but not their APUs with potent iGPUs - there, you will have to buy the "PRO" variants) these days have full support for ECC UDIMMs. If your mainboard vendor also plays ball - annoyingly, only a minority of them enables ECC support in their firmware, so always check for that before buying! - there's not much that can prevent you from having that stability enhancement and reassuring peace of mind.

Quoth DJB (around the very start of this millenium): https://cr.yp.to/hardware/ecc.html :)

dijit 5 days ago | parent | next [-]

> only a minority of them enables ECC support in their firmware, so always check for that before buying!

This is the annoying part.

That AMD permits ECC is a truly fantastic situation, but if it's supported by the motherboard is often unlikely and worse: it's not advertised even when it's available.

I have an ASUS PRIME TRX40 PRO and the tech specs say that it can run ECC and non-ECC but not if ECC will be available to the operating system, merely that the DIMMS will work.

It's much more hit and miss in reality than it should be, though this motherboard was a pricey one: one can't use price as a proxy for features.

sunshowers 4 days ago | parent | next [-]

If you're on Linux, dmesg containing

  EDAC MC0: Giving out device to module amd64_edac
is a pretty reliable indication that ECC is working.

See my blog post about it (it was top of HN): https://sunshowers.io/posts/am5-ryzen-7000-ecc-ram/

oneshtein 4 days ago | parent [-]

My `dmesg` tells:

    EDAC MC0: Giving out device to module igen6_edac controller Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
    EDAC MC1: Giving out device to module igen6_edac controller Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
but `dmidecode --type 16` says:

    Error Correction Type: None
    Error Information Handle: Not Provided
c0l0 4 days ago | parent | next [-]

Are you sure you have ECC-capable DIMM installed?

What does

    find /sys/devices/system/edac/mc/mc0/csrow* -maxdepth 1 -type f -exec grep --color . {} +
report?
oneshtein 2 days ago | parent [-]

AFAIK, I have 2x DDR5 non-ECC memory (`dmidecode --type 17` says Samsung M425R1GB4BB0-CQKOL). Your command tells about SECDEC (single bit error correction, double bit error detection).

    /sys/devices/system/edac/mc/mc0/csrow0/ce_count:0
    /sys/devices/system/edac/mc/mc0/csrow0/ch0_dimm_label:MC#0_Chan#0_DIMM#0
    /sys/devices/system/edac/mc/mc0/csrow0/size_mb:8192
    /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc0/csrow0/ue_count:0
    /sys/devices/system/edac/mc/mc0/csrow0/mem_type:Unbuffered-DDR3
    /sys/devices/system/edac/mc/mc0/csrow0/edac_mode:SECDED
    /sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
    /sys/devices/system/edac/mc/mc0/csrow0/ch1_dimm_label:MC#0_Chan#1_DIMM#0
    /sys/devices/system/edac/mc/mc0/csrow0/dev_type:x16
> find /sys/devices/system/edac/mc/mc0/csrow* -maxdepth 1 -type f -exec grep --color . {} +

It looks like DDR5 supports SECDEC by default. :-/

sunshowers 4 days ago | parent | prev [-]

To be honest I don't know how Intel works, my post is limited to AMD.

c0l0 5 days ago | parent | prev | next [-]

Usually, if a vendor's spec sheet for a (SOHO/consumer-grade) motherboard mentions ECC-UDIMM explicitly in its memory compatibility section, and (but this is a more recent development afaict) DOES NOT specify something like "operating in non-ECC mode only" at the same time, then you will have proper ECC (and therefore EDAC and RAS) support in Linux, if the kernel version you have can already deal with ECC on your platform in general.

I would assume your particular motherboard to operate with proper SECDED+-level ECC if you have capable, compatible DIMM, enable ECC mode in the firmware, and boot an OS kernel that can make sense of it all.

adrian_b 4 days ago | parent | prev [-]

This is weird. I have used many ASUS MBs specified as "can run ECC and non-ECC" and this has always meant that there was an ECC enabling option in the BIOS settings, and then if the OS had an appropriate EDAC driver for the installed CPU ECC worked fine.

I am writing this message on such an ASUS MB with a Ryzen CPU and working ECC memory. You must check that you actually have a recent enough OS to know your Threadripper CPU and that you have installed any software package required for this (e.g. on Linux "edac-utils" or something with a similar name).

jml7c5 4 days ago | parent | prev | next [-]

The big problem with ECC for me is that the sticks are so much more expensive. You'd expect ECC UDIMMs to have a price premium of just over 12.5% (because there are 9 chips instead of 8), but it's usually at least 100%. I don't mind paying reasonable premium for ECC, but paying double is too hard to swallow.

mr_toad 3 days ago | parent | next [-]

Trouble with enterprise is that the people buying care about the technology, but not the cost, while the people that do care about cost don’t understand the technology.

Some businesses (and governments) try and unify their purchasing, but this seems to make things worse, with the purchasing department both not understanding technology and being outwitted by vendors.

thewebguyd 3 days ago | parent | next [-]

> Trouble with enterprise is that the people buying care about the technology, but not the cost

Enterprise also ruins it for small/medium businesses as well, at least those with dedicated internal IT departments who do care about both the technology and the cost. We are left with unreliable consumer-grade hardware, or prohibitively expensive enterprise hardware.

There's very little in between. This market is also underserved with software/SaaS as well with the SSO Tax and whatnot. There's a huge gap between "I'm taking the owner's CC down to best buy" and "Enterprise" that gets screwed over.

wmf 3 days ago | parent | prev [-]

Enterprise IT is overpriced so you can negotiate a 50% discount. Unfortunately negotiating isn't worth it for something like a pair of DIMMs.

attila-lendvai 3 days ago | parent [-]

Enterprise IT is overpriced because it's mostly people spending other people's money... which usually has a lot of turbulence in its flow...

sippeangelo 4 days ago | parent | prev | next [-]

Yeah, with that kind of markup you might as well just buy new ones IF they break, or just spend the extra budget on better quality parts. Just having to pick a very specific motherboard that probably is very much not optimal for your build will blow the costs up even more, and for what gain?

I've been building my own gaming and productivity rigs for 20 years and I don't think memory has ever been a problem. Maybe survivorship bias, but surely even budget parts aren't THIS bad.

lmm 4 days ago | parent [-]

> with that kind of markup you might as well just buy new ones IF they break

Assuming you can tell, and assuming you don't end up silently corrupting your data before then.

coldtea 4 days ago | parent [-]

You take backups and do checksums for the valuable data. For the rest, who cares...

cma 3 days ago | parent [-]

Let's say you corrupted one bit in a blender asset 200 revisions ago and it was unnoticeable and still upgraded through five blender upgrades, but now on the sixth upgrade it fails with a corruption error and doesn't upgrade.

Without knowing how to fix that error you've lost 200 revisions of work. You can go back and find which revision had the problem, go before that, and upgrade it to the latest blender, but all your 200 revisions were made on other versions that you can't backport.

fluoridation 3 days ago | parent | next [-]

So don't upgrade it. Export it to an agnostic format and re-import it in the new version. Since it's failing to upgrade, it must be a metadata issue, not a data issue, so removing the Blender-specific bits will fix it.

What a silly hypothetical. There's a myriad freak occurrences that could make you have to redo work that you don't worry about. Now, I'm not saying single-bit errors don't happen. They just typically don't result in the sort of cascading failure you're describing.

cma 3 days ago | parent [-]

Doing a lossy export/reimport process probably isn't going to be viable on something like a big movie scene blender file with lots of constraints, scripted modifiers and stuff that doesn't automatically come through with an export to USD.

My point is that there are scenarios where corruption in the past puts you in a bind and can cause a lot of loss of work or expensive diagnostic and recovery process long after it first occurred, blender was just one example but it can be much worse with proprietary software binary formats where you don't have any chance of jumping into the debugger to figure out what's going wrong with an upgrade or export. And maybe the subscription version of it won't even let you go back to the old version.

> There's a myriad freak occurrences that could make you have to redo work that you don't worry about.

Yes other sources of corruption are more likely from things like software errors. It's not that you wouldn't worry about them if you had unlimited budget and could have people audit the code etc., but you do have a budget and ECC is much cheaper relative to that. That doesn't mean it always makes sense for everyone to pay more for ECC. But I can see why people working on gigantic CAD files for nuclear reactor design, etc. tend to have workstations with ECC.

fluoridation 3 days ago | parent [-]

>a big movie scene blender file with lots of constraints, scripted modifiers and stuff

Not really what I would call an "asset", but fine.

>It's not that you wouldn't worry about them if you had unlimited budget and could have people audit the code etc.

Hell, I was thinking something way simpler, like your cat climbing on the case and throwing up through the top vents, or you tripping and dropping your ass on your desk and sending everything flying.

>But I can see why people working on gigantic CAD files for nuclear reactor design, etc. tend to have workstations with ECC.

Yeah, because those people aren't buying their own machines. If the credit card is yours and you're not doing something super critical, you're probably better served by a faster processor than by worrying against freak accidents.

coldtea 3 days ago | parent | prev [-]

>Let's say you corrupted one bit in a blender asset 200 revisions ago and it was unnoticeable and still upgraded through five blender upgrades, but now on the sixth upgrade it fails with a corruption error and doesn't upgrade.

And let's say you have archived copies of it with checksums like I suggested, going back to all revisions ago.

What's the issue again now, that ECC would have solved? Not to mention that ECC wouldn't help at all with corruption at the disk level anyway.

fluoridation 3 days ago | parent [-]

>What's the issue again now, that ECC would have solved?

If the bit flip happened in RAM, the checksum would be of the corrupted data. ECC corrects single bit errors of data on RAM.

>Not to mention that ECC wouldn't help at all with corruption at the disk level anyway.

Yes, using ECC without ZFS, btrfs, ReFS, or checksummed file formats is pretty pointless (unless your application never touches storage).

varispeed 3 days ago | parent | prev [-]

You would think that competition would naturally regulate the price down, but it seems like we are dealing with some sort of a cartel that regulators have not caught up with yet.

consp 5 days ago | parent | prev | next [-]

Isn't it mostly an ease of mind thing? I've never seen a ECC error on my home server which has plenty of memory in use and runs longer than my desktop. Maybe it's more common with higher clocked, near the limit, desktop PC's.

Also: DDR5 has some false ecc marketing due to the memory standard having an error correction scheme build in. Don't fall for it.

adrian_b 4 days ago | parent | next [-]

Whether you will see ECC errors depends a lot on how much memory you have and how old it is.

A computer with 64 GB of memory is 4 times more likely to encounter memory errors than one with 16 GB of memory.

When DIMMs are new, at the usual amounts of memory for desktops, you will see at most a few errors per year, sometimes only an error after a few years. With old DIMMs, some of them will start to have frequent errors (such modules presumably had a borderline bad fabrication quality and now have become worn out, e.g. due to increased leakage leading to storing a lower amount of charge on the memory cell capacitors).

For such bad DIMMs, the frequency of errors will increase, and it may become of several errors per day, or even per hour.

For me, a very important advantage of ECC has been the ability to detect such bad memory modules (in computers that have been used for 5 years or more) and replace them before corrupting any precious data.

I also had a case with a HP laptop with ECC, where memory errors had become frequent after being stored for a long time (more than a year) in a rather humid place, which might have caused some oxidation of the SODIMM socket contacts, because removing the SODIMMs, scrubbing the sockets and reinserting the SODIMMs made disappear the errors.

fluoridation 3 days ago | parent [-]

>A computer with 64 GB of memory is 4 times more likely to encounter memory errors than one with 16 GB of memory.

No. Or well, not exactly. More bits will flip randomly, but if between the two systems only the total installed memory changed, both systems will see the same amount of memory errors, because bit flips on the additional 48 GB will not result in errors, because they will not be used. Memory errors scale with memory used not with memory installed.

c0l0 5 days ago | parent | prev | next [-]

I see a particular ECC error at least weekly on my home desktop system, because one of my DIMMs doesn't like the (out of spec) clock rate that I make it operate at. Looks like this:

    94 2025-08-26 01:49:40 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68aea758, cpuid=0x00a50f00, bank=0x00000012
    95 2025-09-01 09:41:50 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68b80667, cpuid=0x00a50f00, bank=0x00000012
(this is `sudo ras-mc-ctl --errors` output)

It's always the same address, and always a Corrected Error (obviously, otherwise my kernel would panic). However, operating my system's memory at this clock and latency boosts x265 encoding performance (just one of the benchmarks I picked when trying to figure out how to handle this particular tradeoff) by about 12%. That is an improvement I am willing to stomach the extra risk of effectively overclocking the memory module beyond its comformt zone for, given that I can fully mitigate it by virtue of properly working ECC.

Hendrikto 5 days ago | parent | next [-]

Running your RAM so far out of spec that it breaks down regularly, where do you take the confidence that ECC will still work correctly?

Also: Could you not have just bought slightly faste RAM, given the premium for ECC?

c0l0 5 days ago | parent [-]

"Breaks down" is a strong choice of words for a single, corrected bit error. ECC works as designed, and demonstrates that it does by detecting this re-occurring error. I take the confidence mostly from experience ;)

And no, as ECC UDIMM for the speed (3600MHz) I run mine at simply does not exist - it is outside of what JEDEC ratified for the DDR4 spec.

adithyassekhar 5 days ago | parent [-]

JEDEC rated DDR4 at only 2400mhz right? And anything higher is technically over clocking?

dijit 4 days ago | parent | next [-]

JEDEC has a few frequencies they support: https://www.jedec.org/standards-documents/docs/jesd79-4a

DDR4-1600 (PC4-12800)

DDR4-1866 (PC4-14900)

DDR4-2133 (PC4-17000)

DDR4-2400 (PC4-19200)

DDR4-2666 (PC4-21300)

DDR4-2933 (PC4-23466)

DDR4-3200 (PC4-25600) (the highest supported in the DDR4 generation)

What's *NOT* supported are some enthusiast ones that typically require more than 1.2v for example: 3600 MT/s, 4000 MT/s & 4266 MT/s

c0l0 5 days ago | parent | prev [-]

JEDEC specifies rates up to 3200MT/s, what's officially referred to as DDR4-3200 (PC4-25600).

kderbe 4 days ago | parent | prev | next [-]

I would loosen the memory timings a bit and see if that resolves the ECC errors. x265 performance shouldn't fall since it generally benefits more from memory clock rate than latency.

Also, could you share some relevant info about your processor, mainboard, and UEFI? I see many internet commenters question whether their ECC is working (or ask if a particular setup would work), and far fewer that report a successful ECC consumer desktop build. So it would be nice to know some specific product combinations that really work.

c0l0 4 days ago | parent [-]

I've been on AM4 for most of the past decade (and still am, in fact), and the mainboards I've personally had in use with working ECC support were:

  - ASRock B450 Pro4
  - ASRock B550M-ITX/ac
  - ASRock Fatal1ty B450 Gaming-ITX/ac
  - Gigabyte MC12-LE0
There's probably many others with proper ECC support. Vendor spec sheets usually hint at properly working ECC in their firmware if they mention "ECC UDIMM" support specifically.

As for CPUs, that is even easier for AM4: Everything that's not based on a APU core (there are some SKUs marketed without iGPU that just have the iGPU part of the APU disabled, such as the Ryzen 5 5500) cannot support ECC. An exception to that rule are "PRO"-series APUs, such as the Ryzen 5 PRO 5650G et al., which have an iGPU, but also support ECC. Main differences (apart from the integrated graphics) between CPU and APU SKUs is that the latter do not support PCIe 4.0 (APUs are limited to PCIe 3.0), and have a few Watts lower idle power consumption.

When I originally built the desktop PC that I still use (after a number of in-place upgrades, such as swapping out the CPU/GPU combo for an APU), I blogged about it (in German) here: https://johannes.truschnigg.info/blog/2020-03-23#0033-2020-0...

If I were to build an AM5 system today, I would look into mainboards from ASUS for proper ECC support - they seem to have it pretty much universally supported on their gear. (Actual out-of-band ECC with EDAC support on Linux, not the DDR5 "on-DIE" stuff.)

ainiriand 4 days ago | parent | prev [-]

I think you've found a particularly weak memory cell, I would start thinking about replacing that module. The consistent memory_channel=1, csrow=0 pattern confirms it's the same physical location failing predictably.

wpm 4 days ago | parent | prev | next [-]

I had a somewhat dodgy stick of used RAM (DDR4 UDIMM) in a Supermicro X11 board. This board is running my NAS, all ZFS, so RAM corruption can equal data corruption. The OS alerted me to recoverable errors on DIMM B2. Swapped it and another DIMM, rebooted, saw DIMM error on slot B1. Swapped it for a spare stick. No more errors.

This was running at like, 1866 or something. It's a pretty barebones 8th gen i3 with a beefier chipset, but ECC still came in clutch. I won't buy hardware for server purposes without it.

immibis 5 days ago | parent | prev | next [-]

I saw a corrected memory error logged every few hours when my current machine was new. It seems to have gone away now, so either some burn-in effect, or ECC accidentally got switched off and all my data is now corrupted. Threadripper 7000 series, 4x64GB DDR5.

Edit: it's probably because I switched it to "energy efficiency mode" instead of "performance mode" because it would occasionally lock up in performance mode. Presumably with the same root cause.

Jach 4 days ago | parent | prev | next [-]

I have a slightly older system with 128 GB of UDIMM DDR4 over four sticks. Ran just fine for quite a while but then I started having mysterious system freezes. Later discovered I had somehow disabled ECC error reporting in my system log on linux... once that was turned back on, oh, I see notices of recoverable errors. I finally found a repeatable way to trigger a freeze with a memory stress testing tool and that was from an unrecoverable error. I couldn't narrow the problem down to a single stick or RAM channel, it seemed to only happen if all 4 slots were occupied, but I eventually figured out that if I just lowered the RAM speed from standard 3200 MHz to the next officially supported (by the sticks) step of 2933 MHz, everything was fine again and no more ECC errors, recoverable or not. Been running like that since.

Last winter I was helping someone put together a new gaming machine... it was so frustrating running into the fake ecc marketing for DDR5 that you mention. The motherboard situation for whether they support it or not, or whether a bios update added support or then removed it or added it back or not, was also really sad. And even worse IMO is that you can't actually max out 4 slots on the top tier mobos unless you're willing to accept a huge drop in RAM speed. Leads to ugly 48 GB sized sticks and limiting to two of them... In the end we didn't go with ECC for that someone, but I was pretty disappointed about it. I'm hoping the next gen will be better, for my own setup running ZFS and such I'm not going to give up ECC.

hedora 4 days ago | parent | prev | next [-]

You have to go pretty far down the rabbit hole to make sure you’ve actually got ECC with [LP]DDR5

Some vendors use hamming codes with “holes” in them, and you need the CPU to also run ECC (or at least error detection) between ram and the cache hierarchy.

Those things are optional in the spec, because we can’t have nice things.

BikiniPrince 4 days ago | parent | prev | next [-]

I pick up old serves for my garage system. With edac it is a dream to isolate the fault and be instantly aware. It also lets you determine the severity of the issue. Dimms can run for years with just the one error or overnight explode into streams of corrections. I keep spares so it’s fairly easy to isolate any faults. It’s just how do you want to spend your time?

Scramblejams 4 days ago | parent | prev [-]

I run a handful of servers and I have a couple that pop ECC errors every year or three, so YMMV.

swinglock 5 days ago | parent | prev | next [-]

Excellent point. It's a shame and a travesty that data integrity is still mostly locked away inside servers, leaving most other computing devices effectively toys, the early prototype demo thing but then never finished and sold forever at inflated prices.

I wish AMD would make ECC a properly advertised feature with clear motherboard support. At least DDR5 has some level of ECC.

kevin_thibedeau 4 days ago | parent | next [-]

> At least DDR5 has some level of ECC.

That is mostly to assist manufacturers in selling marginal chips with a few bad bits scattered around. It's really a step backwards in reliability.

wpm 4 days ago | parent | prev [-]

I wish AMD wouldn't gate APU ECC support behind unobtainium "PRO" SKUs they only give out, seemingly, to your typical "business" OEMs and the rare Chinese miniPC company.

c0l0 4 days ago | parent [-]

It's not that dire as you make it out to be :)

Both the 8700G and the 8700G PRO are readily available in the EU, and the PRO SKU is about 50% more expensive (EUR 120 in absolute numbers): https://geizhals.eu/?cmp=3096260&cmp=3096300&cmp=3200470&act...

rendaw 4 days ago | parent | prev | next [-]

So I'm trying to learn more about this stuff, but aren't there multiple ECC flavors and the AMD consumer CPUs only support one of them (not the one you'd have on servers?)

Does anyone maintain a list with de-facto support of amd chips and mainboards? That partlist site only shows official support IIRC, so it won't give you any results.

adrian_b 4 days ago | parent | next [-]

The difference between the "unbuffered" ECC DIMMs (ECC UDIMMs), which you must use in desktop motherboards (and in some of those advertised as "workstation" MBs) and the "registered" ECC DIMMs (ECC RDIMMs), which you must use in server motherboards (and in some of the "workstation" MBs), has existed for decades.

However in the past there have existed very few CPU models and MBs that supported either kind of DIMMs, while today this has become completely impossible, as the mechanical and electrical differences between them have increased.

In any case, today, like also 20 years ago, when searching for ECC DIMMs you must always search only the correct type, e.g. unbuffered ECC DIMMs for desktop CPUs.

In general, registered ECC DIMMs are easier to find, because wherever "server memory" is advertised, that is what is meant. For desktop ECC memory, you must be careful to see both "ECC" and "unbuffered" mentioned in the module description.

hungmung 4 days ago | parent | prev [-]

Seconding this. I'm looking for a fanless industrial mini PC with out of band ECC and I'm having a hell of a time.

adrian_b 4 days ago | parent | next [-]

Had you been looking for "in-band ECC", the cheap ODROID H4 PLUS ($150) or the cheaper ODROID H4 ($110) would have been fine, or for something more expensive some of the variants of Asus NUC 13 Rugged support in-band ECC.

For out-of-band ECC, e.g. with standard ECC SODIMMs, all the embedded SBCs that I have seen used only CPUs that are very obsolete nowadays, i.e. ancient versions of Intel Xeon or old AMD industrial Ryzen CPUs (AMD's series of industrial Ryzen CPUs are typically at least one or two generations behind their laptop/desktop CPUs).

Moreover all such industrial SBCs with ECC SODIMMs were rather large, i.e. either in the 3.5" form factor or in the NanoITX form factor (120 mm x 120 mm), and it might have been necessary to replace their original coolers with bigger heatsinks for fanless operation.

In-band ECC causes a significant decrease of the performance, but for most applications of such mini-PCs the performance is completely acceptable.

nicman23 4 days ago | parent | prev [-]

https://www.asrockind.com/en-gb/iBOX-V2000V

something like that?

devnullbrain 4 days ago | parent | prev | next [-]

I like the warning not to buy a motherboard from a manufacturer that has been defunct for 17 years

storus 4 days ago | parent | prev | next [-]

Now where can I get 64GB ECC UDIMM DDR5 modules so that my X870E board can have 256GB RAM? The largest I found were just 48GB ECC UDIMMs or 64GB non-ECC UDIMMs.

c0l0 4 days ago | parent [-]

I don't think 64GB ECC UDIMM is commercially available yet. I use Geizhals to check for EU availability: https://geizhals.eu/?cat=ramddr3&xf=7500_DDR5~7501_DIMM~7761...

In my experience, it's generally unwise to push the platform you're on to the outermost of its spec'd limits. At work, we bought several 5950X-based Zen3 workstations with 128GB of 3200MT/s ECC UDIMM, and two of these boxes will only ever POST when you manually downclock memory to 3000MT/s. Past a certain point, it's silicon lottery deciding if you can make reality live up to the datasheets' promises.

storus 4 days ago | parent [-]

I am fine with downclocking the RAM; my X870E board (ProArt) should be fine running ECC, I only use 9800X3D to have a single CCD (maybe upgraded later to EPYC 4585PX) and together have RTX 6000 Pro and 2x NVLinked A6000 in PCIe slots, with two M.2 SSDs. Power supply follows the latest specs as well. This build was meant to be a light-weight Threadripper replacement and ECC is a must for my use cases (it's a build for my summer house so that I can do serious work while there).

unethical_ban 3 days ago | parent | prev | next [-]

Any specific recommendations? I am having random, OS agnostic lockups on my ryzen 1xxx build and thought DDR5 will be enough, but true ECC sounds good.

edit: Looks like a lot of Asus motherboards work, and the thing to look for is "unbuffered" ECC. Kingston has some, I see 32GB module for $190 on Newegg.

moffkalast 4 days ago | parent | prev | next [-]

Do you live at a very high altitude with a significant amount of solar radiation, or at an underfunded radiology lab or perhaps near a uranium deposit or a melted down nuclear reactor? Because the average machine should never see a memory bit flip error at all during its entire lifetime.

rkomorn 4 days ago | parent | next [-]

Then how do you explain all the bugs in the software I write?!

moffkalast 4 days ago | parent [-]

It truly is a cosmic mystery :)

yndoendo 3 days ago | parent | prev [-]

Bit flipping can be the byproduct of bow the system components harmonize. Role hammer RAM also has the same affect. [0]

[0] https://en.m.wikipedia.org/wiki/Row_hammer

moffkalast 2 days ago | parent [-]

> Furthermore, research shows that precisely targeted three-bit Rowhammer flips prevents ECC memory from noticing the modifications.

Doesn't exactly sound like a use case for ECC memory, given that it can't correct these attacks. Interesting though, I'd have thought that virtual addresses would've largely fixed this.

wer232essf 4 days ago | parent | prev [-]

[dead]

4 days ago | parent [-]
[deleted]