trod1234 3 days ago

It is a little bit more complicated than ROCm simply not having support, because ROCm has at points claimed support and has had to walk it back painfully (multiple times). It's not a driver issue, nor a hardware issue on their side.

There has been a long-standing issue between AMD and its mainboard manufacturers. The issue has to do with features required for ROCm, namely PCIe Atomics. AMD has been unable or unwilling to hold the mainboard manufacturers to account for advertising features the mainboard does not support.

The CPU itself must support this feature, and the mainboard must as well (in firmware).

One of the reasons ROCm hasn't worked in the past is that mainboard manufacturers claimed and advertised support for PCIe Atomics, that claimed support has been shown to be false, and the software fails in non-deterministic ways when tested. This is nightmare fuel for the few AMD engineers tasked with ROCm.

PCIe Atomics require non-translated, direct IO to operate correctly, but in order to support CPU models from multiple generations, mainboard manufacturers have translated these IO lines in firmware.

This has left most people who query their system seeing PCIe Atomics reported as supported, while actual tests that rely on that support fail in chaotic ways. There is no technical specification or advertising from the mainboard manufacturers showing whether this is supported. Even boards with multiple x16 slots and the many related technologies and brandings such as Crossfire/SLI/mGPU don't necessarily indicate whether PCIe Atomics is properly supported.
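For what it's worth, the capability bits a board *advertises* can be read from the PCIe DevCap2 register, e.g. via `lspci -vv` (as root). A minimal parsing sketch, assuming a typical lspci output format (the sample line below is hypothetical); per the above, the advertised bits say nothing about whether atomics actually work under load:

```python
import re

def atomic_ops_caps(lspci_vv: str) -> dict:
    """Parse advertised AtomicOpsCap flags from `lspci -vv` output.

    Note: these are only the *advertised* capability bits; as described
    above, boards have advertised atomics support that fails in practice.
    """
    m = re.search(r"AtomicOpsCap:((?:\s+\w+[+-])+)", lspci_vv)
    if not m:
        return {}
    return {name: sign == "+" for name, sign in re.findall(r"(\w+)([+-])", m.group(1))}

# Hypothetical sample of a root port's DevCap2 line (format varies by lspci version):
sample = "DevCap2: Completion Timeout: Not Supported, AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS-"
print(atomic_ops_caps(sample))
# → {'Routing': False, '32bit': True, '64bit': True, '128bitCAS': False}
```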

In other words, the CPU is supported but the firmware/mainboard fails, with no way to differentiate between the two at the upper layers of abstraction.

All in all, you shouldn't be blaming AMD for this. You should be blaming the three mainboard manufacturers who chose to do this. Some of these manufacturers have upper-end boards where they actually did do this right; they just chose not to for any current-gen mainboard costing less than ~$300-500.

fancyfredbot 3 days ago | parent | next [-]

Look, this sounds like a frustrating nightmare, but the way it seems to us consumers is that AMD chose to rely on poorly implemented and poorly supported technology, and Nvidia didn't. I can't blame AMD for the poor support by motherboard manufacturers, but I can and will blame AMD for relying on it.

trod1234 3 days ago | parent [-]

While we won't know for sure unless someone from AMD comments on this, in fairness there may not have been any other way.

Nvidia has a large number of GPU related patents.

The fact that AMD chose to design their system this way, in such a roundabout and brittle manner, contrary to how engineers approach things, may be a direct result of being unable to design such systems any other way because of broad patents tied to the interface/GPU.

fancyfredbot 3 days ago | parent [-]

I feel like this issue is to at least some extent a red herring. Even accepting that ROCm doesn't work on some motherboards, this can't explain why so few of AMD's GPUs have official ROCm support.

I notice that at one point there was a ROCm release which said it didn't require atomics for gfx9 GPUs, but the requirement was reintroduced in a later version of ROCm. Not sure what happened there but this seems to suggest AMD might have had a workaround at some point (though possibly it didn't work).

If this really is due to patent issues AMD can likely afford to licence or cross-license the patent given potential upside.

It would be in line with other decisions taken by AMD if they took this decision because it works well with their datacentre/high-end GPUs, and they don't (or didn't) really care about offering GPGPU to the mass/consumer GPU market.

zozbot234 3 days ago | parent | next [-]

> why so few of AMD's GPUs have official ROCm support

Because "official ROCm support" means "you can rely on AMD to make this work on your system for your critical needs". If you want "support" in the "you can goof around with this stuff on your own and don't care if there's any breakage" sense, ROCm "supports" a whole lot of AMD hardware. They should just introduce a new "experimental, unsupported" tier and make this official on their end.

trod1234 3 days ago | parent | prev | next [-]

> I feel like this issue is to at least some extent a red herring.

I don't see that; these two issues adequately explain why so few GPUs have official support. They don't want to get hit with a lawsuit as a result of issues outside their sphere of control.

> If this really is due to patent issues AMD can likely afford to license or cross-license the patent given potential upside.

Have you ever known a company willing to cede market dominance and license or cross-license a patent, letting competition into a market over which they hold an absolute monopoly, let alone in an environment where antitrust is non-existent and fangless?

There is no upside for NVIDIA to do that. If you want to do serious AI/ML work you currently need to use NVIDIA hardware, and they can charge whatever they want for that.

The moment you have a competitor, demand is halved at a bare minimum depending on how much the competitor undercuts you by. Any agreement on coordinating prices leads to price-fixing indictments.

fancyfredbot 3 days ago | parent [-]

> I don't see that, these two issues adequately explain why so few GPUs have official support.

I'm sorry I don't follow this. Surely if all AMD GPUs have the same problem with atomics then this can't explain why some GPUs are supported and others aren't?

> There is no upside for NVIDIA to do that.

If NVIDIA felt this patent was actually protecting them from competition then there would be no upside. But NVIDIA has competition from AMD, Intel, Google, and Amazon. Intel have managed to engineer OneAPI support for their GPUs without licensing this patent or relying on PCIe atomics.

AMD have patents NVIDIA would be interested in. For example multi-chiplet GPUs.

wkat4242 3 days ago | parent | prev [-]

And why the support is dropped so quickly too.

wongarsu 3 days ago | parent | prev | next [-]

There are so many hardware certification programs out there, why doesn't AMD run one to fix this?

Create a "ROCm compatible" logo and a list of criteria. Motherboard manufacturers can send a pre-production sample to AMD along with a check for some token amount (let's say $1000). AMD runs a comprehensive test suite to check actual compatibility; if it passes, the mainboard is allowed to be advertised and sold with the previously mentioned logo. Then just tell consumers to look for that logo if they want to use ROCm. If things go wrong on a mainboard without the certification, communicate that it's probably the mainboard's fault.

Maybe add some kind of versioning scheme to allow updating the requirements in the future.
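As a sketch of what such a versioned criteria list might look like - the check names and stub checks here are made up purely for illustration; real checks would exercise the hardware, e.g. stress-testing PCIe atomics:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    run: Callable[[], bool]  # returns True on pass

# Hypothetical "ROCm compatible" v1 criteria. These stubs always pass;
# real checks would drive the hardware and fail on flaky boards.
CRITERIA_V1 = [
    Check("pcie_atomics_advertised", lambda: True),
    Check("pcie_atomics_stress_test", lambda: True),
    Check("firmware_reports_consistent_caps", lambda: True),
]

def certify(criteria: list[Check]) -> bool:
    """Run every check in a criteria version; certify only if all pass."""
    results = {c.name: c.run() for c in criteria}
    for name, ok in results.items():
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
    return all(results.values())

print("certified:", certify(CRITERIA_V1))
```

Bumping the criteria version (a `CRITERIA_V2` list) is then just adding or tightening checks without invalidating boards certified against v1.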

spacebanana7 3 days ago | parent | prev | next [-]

How does NVIDIA manage this issue? I wonder whether they have a very different supply chain or just design software that puts less trust in the reliability of those advertised features.

bigyabai 3 days ago | parent | next [-]

I should point out here, if nobody has already: Nvidia's GPU designs are extremely complicated compared to what AMD and Apple ship. The "standard" is to ship a PCIe card with display-handling drivers and some streaming-multiprocessor hardware to process your framebuffers. Nvidia goes even further by adding additional accelerators (ALUs by way of CUDA cores and tensor cores), onboard RTOS management hardware (what Nvidia calls the GPU System Processor), and more complex userland drivers that very well might be able to manage atomics without any PCIe standards.

This is also one of the reasons AMD and Apple can't simply turn their ship around right now. They've both invested heavily in simplifying their GPU and removing a lot of the creature-comforts people pay Nvidia for. 10 years ago we could at least all standardize on OpenCL, but these days it's all about proprietary frameworks and throwing competitors under the bus.

kimixa 3 days ago | parent [-]

FYI AMD also has similar "accelerators", with the 9070 having separate ALU paths for WMMA ("tensor") operations, much like Nvidia's model - older RDNA2/3 architectures had accelerated instructions but used the "normal" shader ALUs, if a bit beefed up and tweaked to support multiple smaller data types. And CUDA cores are just what Nvidia call their normal shader cores. Pretty much every subunit on a GeForce has a direct equivalent on a Radeon - they might be faster/slower or more/less capable, but they're there, and often at an extremely similar level of the design.

AMD also have on-die microcontrollers (multiple, actually) that do things like scheduling and pipeline management, again just like Nvidia's GSP. AMD hardware has been able to schedule new work on-GPU with zero host-system involvement since the original GCN, something that Nvidia advertise as "new" with the introduction of their GSP (which just replaced a slightly older, slightly less capable controller rather than being /completely/ new, too).

The problem is that AMD are a software follower right now - after decades of under-investment they're behind on the treadmill just trying to keep up, so when the Next Big Thing inevitably pops up they're still busy polishing off the Last Big Thing.

I've always seen AMD as a hardware company, with the "build it and they will come" approach - which seems to have worked for the big supercomputers, who likely find it worth investing in their own modified stack to get those last few %, but clearly falls down when selling to "mere" professionals. Nvidia, however, support the same software APIs on even the lowest-end hardware; while nobody is likely running much on their laptop's 3050m in anger, it offers a super easy on-ramp for developers - and it's easy to mistake familiarity for superiority: you already know to avoid the warts, so you don't get burned by them. And believe me, CUDA has plenty of warts.

And marketing - remember, "Technical Marketing" is still marketing. To this day lots of people believe that the marketing name for something, or the branding of a feature, implies something about the underlying architecture design. Go to an "enthusiast" forum and you'll easily find people claiming that because Nvidia call their accelerator a "core", it's somehow superior/better/"more accelerated" than the direct equivalent on a competitor, or actually believing that the competitor doesn't support hardware video encoding because it "Doesn't Have NVENC" (again, GCN with video encoding was released before a GeForce with NVENC). Same with branding - AMD hardware can already read the display block's state and timestamp in-shader, but Everyone Knows Nvidia Introduced "Flip Metering" With Blackwell!

trod1234 3 days ago | parent | prev [-]

It's an open question they have never answered, afaik.

I would speculate that their design is self-contained in hardware.

zozbot234 3 days ago | parent | prev | next [-]

AIUI, AMD documentation claims that the requirement for PCIe Atomics is due to ROCm being based on Heterogeneous System Architecture, https://en.wikipedia.org/wiki/Heterogeneous_System_Architect... which allows for a sort of "unified memory" (strictly speaking, a unified address space) across CPU and GPU RAM. Other compute APIs such as CUDA, OpenCL, SYCL, and Vulkan Compute don't have HSA as a strict requirement, but ROCm apparently does.

pjc50 3 days ago | parent | prev [-]

So .. how's Nvidia dealing with this? Or do they benefit from motherboard manufacturers doing preferential integration testing?