| ▲ | Why CUDA translation wont unlock AMD (eliovp.com) |
| 75 points by JonChesterfield 7 days ago | 57 comments |
| |
|
| ▲ | lvl155 4 hours ago | parent | next [-] |
| Let’s just say what it is: devs are too constrained to jump ship right now. It’s a massive land grab and you are not going to spend time tinkering with CUDA alternatives when even a six-month delay can basically kill your company/organization. Google and Apple are two companies with enough resources to do it. Google isn’t because they’re keeping it proprietary to their cloud. Apple still has its head stuck in the sand, barely capable of fixing Siri. |
| |
| ▲ | stingraycharles 3 hours ago | parent [-] | | Google has their own TPUs so they don’t have any vendor lock-in issues at all. OpenAI OTOH is big enough that the vendor lock-in is actually hurting them, and their massive deal with AMD may finally move the needle for AMD and improve the ecosystem enough to make AMD a smooth experience. | | |
| ▲ | apfsx 3 hours ago | parent [-] | | Google’s TPUs are not powering Gemini or whatever X equivalent LLM you want to compare to. | | |
|
|
|
| ▲ | mandevil 4 hours ago | parent | prev | next [-] |
| Yeah, ROCm-focused code will always beat generic code compiled down. But this is a really difficult game to win. For example, Deepseek R-1 was released optimized for running on Nvidia HW and needed some adaptation to run as well on ROCm, for exactly the reasons that hand-written ROCm code beats generic code compiled into ROCm. Basically the Deepseek team, for their own purposes, made R-1 fit Nvidia's way of doing things (because Nvidia is market-dominant). Once they released, someone like Elio or AMD had to do the work of adapting the code to run best on ROCm. More established players who aren't out-of-left-field surprises like Deepseek, e.g. Meta with its Llama series, mostly coordinate with AMD ahead of release day, but I suspect AMD still has to pay for that engineering work itself while Meta does the Nvidia work on its own. This simple fact, that every researcher makes their stuff work on CUDA themselves but AMD or someone like Elio has to do the work to make it equally performant on ROCm, is what keeps people in the CUDA universe. |
| |
| ▲ | latchkey an hour ago | parent [-] | | Kimi is the latest model that isn't running correctly on AMD. Apparently close to Deepseek in design, but different enough that it just doesn't work. It isn't just the model, it is the engine to run it. From what I understand this model works with sglang, but not with vLLM. |
|
|
| ▲ | fulafel an hour ago | parent | prev | next [-] |
| Vulkan Compute is catching up quick with HIP (or whatever the compatibility stuff is called now), which seems like a welcome break from CUDA: https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan |
|
| ▲ | buggyworld 3 hours ago | parent | prev | next [-] |
| This reminds me of the database wire protocol debates. PostgreSQL-compatible databases (like Aurora, Neon, Supabase) achieve compatibility by speaking the Postgres wire protocol, but the truly successful ones don't just translate—they rebuild core components to leverage their own architecture (Aurora's storage layer, Neon's branching, etc.). The article frames this as "CUDA translation bad, AMD-native good" but misses the strategic value of compatibility layers: they lower switching costs and expand the addressable market. NVIDIA's moat isn't just technical—it's the ecosystem inertia. A translation layer that gets 80% of NVIDIA performance might be enough to get developers to try AMD, at which point AMD-native optimization becomes worth the investment. The article is essentially a product pitch for Paiton disguised as technical analysis. The real question isn't "should AMD hardware pretend to be CUDA?" but rather "what's the minimum viable compatibility needed to overcome ecosystem lock-in?" PostgreSQL didn't win by being incompatible—it won by being good AND having a clear migration path from proprietary databases. |
|
| ▲ | apitman an hour ago | parent | prev | next [-] |
| Our open source library is currently hard locked into CUDA due to nvCOMP for gzip decompression (bioinformatics files). What I wouldn't give for an open source implementation, especially if it targeted WebGPU. |
|
| ▲ | manjose2018 3 hours ago | parent | prev | next [-] |
| https://geohot.github.io//blog/jekyll/update/2025/03/08/AMD-... https://tinygrad.org/ is the only viable alternative to CUDA that I have seen pop up in the past few years. |
| |
| ▲ | erichocean 3 hours ago | parent | next [-] | | Both Mojo and ThunderKittens/HipKittens are viable on AMD. | | | |
| ▲ | bigyabai 3 hours ago | parent | prev [-] | | Viable how? "Feasible" might be a better word here; I haven't heard many (any?) war stories about a TinyBox in production, but maybe I'm OOTL. |
|
|
| ▲ | martinald 4 hours ago | parent | prev | next [-] |
| Perhaps I'm misunderstanding the market dynamics, but isn't AMD's real opportunity inference rather than research? Training etc. still happens on NVDA, but inference is fairly easy to do on vLLM et al. with a true ROCm backend and little effort? |
|
| ▲ | measurablefunc 5 hours ago | parent | prev | next [-] |
| Why can't it be done w/ AI? Why does it need to be maintained w/ manual programming? Take the ROCm specification, take your CUDA codebase, let one of the agentic AIs translate it all into ROCm or the AMD equivalent. |
| |
| ▲ | jsheard 5 hours ago | parent | next [-] | | The article is literally about how rote translation of CUDA code to AMD hardware will always give sub-par performance. Even if you wrangled an AI into doing the grunt work for you, porting heavily-NV-tuned code to not-NV-hardware would still be a losing strategy. | | |
| ▲ | measurablefunc 4 hours ago | parent [-] | | The point of AI is that it is not a rote translation & 1:1 mapping. | | |
| ▲ | jsheard 4 hours ago | parent [-] | | > Take the ROCm specification, take your CUDA codebase, let one of the agentic AIs translate it all into ROCm ...sounds like asking for a 1:1 mapping to me. If you meant asking the AI to transmute the code from NV-optimal to AMD-optimal as it goes along, you could certainly try doing that, but the idea is nothing more than AI fanfic until someone shows it actually working. | | |
| ▲ | measurablefunc 4 hours ago | parent [-] | | Now that I have clarified the point about AI optimizing the code from CUDA to fit AMD's runtime, what is your contention about the possibility of such a translation? | | |
| ▲ | bigyabai 3 hours ago | parent [-] | | There is an old programmer's joke about writing abstractions and expecting zero-cost. | | |
| ▲ | measurablefunc 3 hours ago | parent [-] | | How does that apply in this case? The whole point is that the agentic AI/AGI skips all the abstractions & writes optimized low-level code for each GPU vendor from a high-level specification. There are no abstractions other than whatever specifications GPU vendors provide for their hardware which are fed into the agentic AI/AGI to do the necessary work of creating low-level & optimized code for specific tasks. |
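To ground what "optimized low-level code for each GPU vendor" involves in this exchange, here is a minimal, hypothetical sketch (not taken from the article or from any product discussed here) of the kind of CUDA kernel the ecosystem is full of: a block-wide sum reduction with 32-lane warp assumptions hard-coded throughout. The kernel, names, and launch parameters are all illustrative.

```cuda
// warp_sum.cu -- illustrative only: a block-wide sum reduction written the way
// a lot of hand-tuned CUDA is written, with 32-lane warp assumptions baked in.
#include <cstdio>
#include <cuda_runtime.h>

__inline__ __device__ float warpReduceSum(float val) {
    // The full 0xffffffff mask and the 16/8/4/2/1 shuffle ladder both assume
    // 32 lanes per warp; AMD wavefronts are 64 lanes wide.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float partial[32];            // one slot per 32-wide warp
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;             // lane within a 32-wide warp
    int warp = threadIdx.x / 32;             // warp index within the block

    float v = (tid < n) ? in[tid] : 0.0f;
    v = warpReduceSum(v);                    // reduce within each warp
    if (lane == 0) partial[warp] = v;        // stage one partial per warp
    __syncthreads();

    if (warp == 0) {                         // first warp reduces the partials
        v = (lane < blockDim.x / 32) ? partial[lane] : 0.0f;
        v = warpReduceSum(v);
        if (lane == 0) atomicAdd(out, v);
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;
    blockSum<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %d)\n", *out, n);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

A mechanical translation keeps the `% 32` lane math, the 16→1 shuffle ladder, and the 32-slot shared-memory staging, all of which assume a 32-lane warp. On AMD's 64-lane wavefronts that at best wastes half of each wavefront's shuffle width and at worst changes the reduction's meaning, so the kernel has to be re-derived around `warpSize == 64` and different LDS/occupancy trade-offs rather than just recompiled. That re-derivation is the part neither a translator nor, so far, an agentic workflow gets for free.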
|
|
|
|
| |
| ▲ | colonCapitalDee 3 hours ago | parent | prev | next [-] | | No. This is far beyond the capabilities of current AI, and will remain so for the foreseeable future. You could let your model of choice churn on this for months, and you will not get anywhere. It will reach a somewhat working solution quickly, but it will soon hit a point where for every issue it fixes, it introduces one or more new issues or regressions. LLMs are simply not capable of scaffolding complexity like a human, and lack the clarity and rigor of thought required to execute an *extremely* ambitious project like performant CUDA to ROCm translation. | |
| ▲ | measurablefunc 3 hours ago | parent [-] | | I don't have a model of choice. I'm just going by what I hear on twitter from Sam Altman & the people who work for him. | | |
| |
| ▲ | cbarrick 5 hours ago | parent | prev | next [-] | | Has this been done successfully at scale? There's a lot of handwaving in this "just use AI" approach. You have to figure out a way to guarantee correctness. | | |
| ▲ | measurablefunc 4 hours ago | parent [-] | | There are tons of test suites, so if the tests pass then that provides a reasonable guarantee of correctness, although it would be nice if there were also a proof of correctness for the compilation from CUDA to AMD. |
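For what such a test typically looks like in practice: a translated or re-optimized kernel legitimately reorders floating-point math, so the usual check is a golden-output comparison against a reference within a tolerance rather than bit-exact equality. A minimal, hypothetical sketch follows (the saxpy kernel is just a stand-in for a ported kernel under test; none of this is from the projects discussed).

```cuda
// check_port.cu -- illustrative golden-output check for a ported kernel.
// Compares GPU results against a CPU reference within a relative tolerance,
// because reordered floating-point math makes bit-exact equality the wrong bar.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Stand-in for whatever kernel was translated/ported.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 16;
    const float a = 2.5f;
    std::vector<float> hx(n), hy(n), ref(n);
    for (int i = 0; i < n; ++i) { hx[i] = 0.001f * i; hy[i] = 1.0f; }
    for (int i = 0; i < n; ++i) ref[i] = a * hx[i] + hy[i];   // CPU reference

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    saxpy<<<(n + 255) / 256, 256>>>(n, a, dx, dy);
    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Count elements outside a relative tolerance instead of requiring exact bits.
    int failures = 0;
    for (int i = 0; i < n; ++i) {
        float denom = std::max(std::fabs(ref[i]), 1e-6f);
        if (std::fabs(hy[i] - ref[i]) / denom > 1e-5f) ++failures;
    }
    if (failures) printf("FAIL: %d mismatches\n", failures);
    else          printf("PASS\n");
    cudaFree(dx); cudaFree(dy);
    return failures ? 1 : 0;
}
```

Passing a suite of such checks is evidence, not proof: it only covers the inputs, shapes, and tolerances someone thought to encode, which is why a genuine proof of correctness for the translation would be a much stronger (and much harder) guarantee.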
| |
| ▲ | bee_rider 3 hours ago | parent | prev | next [-] | | The AI is too busy making Ghibli profile pictures or whatever the thing is now. We asked it to make a plan for how to fix the situation, but it got stuck. “Ok, I’m helping the people build an AI to translate NVIDIA codes to AMD” “I don’t have enough resources” “Simple, I’ll just use AMD chips to run an AI code translator, they are under-utilized. I’ll make a step by step process to do so” “Step 1: get code kernels for the AMD chips” And so on. | | |
| ▲ | measurablefunc 3 hours ago | parent [-] | | The real question is whether it will be as unprofitable to do this type of automated runtime translation from one GPU vendor to another as it is to generate Mario clips & Ghibli images. |
| |
| ▲ | j16sdiz 4 hours ago | parent | prev | next [-] | | Same as "why not just outsource it to <some country>?" AI ain't magic. You need more effort to manage, test, and validate it. | |
| ▲ | measurablefunc 4 hours ago | parent [-] | | Isn't AGI around the corner? If it is then this is a very simple problem that should be solvable w/ existing pre-AGI capabilities. | | |
| ▲ | j16sdiz 4 hours ago | parent | next [-] | | I am not saying this is impossible, but I am downvoting this because this is _not an interesting discussion_. The whole point of having an online discussion forum is to exchange and create new ideas. What you are advocating is essentially "maybe we can stop generating new ideas because we don't have to; we should just sit and wait"... Well, yes, no, maybe. But this is not what I expect to get from here. | |
| ▲ | measurablefunc 4 hours ago | parent [-] | | You can do whatever you want & I didn't ask you to participate in my thread, so unless you are going to address the actual points I'm making instead of telling me it is not interesting, we don't have anything to discuss further. |
| |
| ▲ | j16sdiz 4 hours ago | parent | prev | next [-] | | So, your strategy for solving this is: convert it to another, harder problem (AGI). Now it is somebody else's (an AI researcher's) problem. This is outsourcing the task to AI researchers. | |
| ▲ | measurablefunc 4 hours ago | parent [-] | | They keep promising that this kind of capability is right around the corner & they keep showing how awesome they are at passing math exams, so why is this a more difficult problem than solving problems in abstract algebra & scheme theory on Humanity's Last Exam or whatever is the latest & greatest benchmark for mathematical capabilities? |
| |
| ▲ | nutjob2 3 hours ago | parent | prev [-] | | > Isn't AGI around the corner? There isn't even a concrete definition of intelligence, let alone AGI, so no it's not. That's just mindless hype at this point. | | |
| ▲ | measurablefunc 3 hours ago | parent [-] | | Are you trying to say that Sam Altman & Elon Musk have been less than honest about their prognosis of luxurious & automated future utopias? |
|
|
| |
| ▲ | Blackthorn 4 hours ago | parent | prev | next [-] | | I don't know why you're being downvoted, because even if you're Not Even Wrong, that's exactly the sort of thing that people trying to sell AI have endlessly presented as something AI will absolutely do for us. | |
| ▲ | measurablefunc 4 hours ago | parent [-] | | Let's see who else manages to catch on to the real point I'm making. | | |
| ▲ | bigyabai an hour ago | parent [-] | | It's hard to catch on to a deliberately dishonest pretense. You could clone 10,000 John Carmacks to do the job for you, and Nvidia would still be a $5 trillion business next time you wake up. |
|
| |
| ▲ | bigyabai 5 hours ago | parent | prev [-] | | Because it doesn't work like that. TFA is an explanation of how GPU architecture dictates the featureset that is feasibly attainable at runtime. Throwing more software at the problem would not enable direct competition with CUDA. | | |
| ▲ | measurablefunc 4 hours ago | parent [-] | | I am assuming that is all part of the specification that the agentic AI is working with, & since AGI is right around the corner, I think this is a simple enough problem that it can be solved with AI. |
|
|
|
| ▲ | kj4ips 4 hours ago | parent | prev | next [-] |
| I agree pretty strongly. A translation layer like this is making an intentional trade: giving up performance and HW alignment in exchange for less lead time and effort than a proper port would take. |
|
| ▲ | latchkey 28 minutes ago | parent | prev | next [-] |
| A bit of background. This is directed towards Spectral Compute (Michael) and https://scale-lang.com/. I know both of these guys personally and consider them both good friends, so you have to understand a bit of the background in order to really dive into this. My take on it is fairly well summed up at the bottom of Elio's post. In essence, Elio is taking the view of "we would never use scale-lang for llms because we have a product that is native AMD" and Michael is taking the view of "there is a ton of CUDA code out there that isn't just AI and we can help move those people over to AMD... oh and by the way, we actually do know what we are doing, and we think we have a good chance at making this perform." At the end of the day, both companies (my friends) are trying to make AMD a viable solution in a world dominated by an ever growing monopoly. Stepping back a bit and looking at the larger picture, I feel this is fantastic and want to support both of them in their efforts. |
|
| ▲ | outside1234 5 hours ago | parent | prev | next [-] |
| Are the hyperscalers really using CUDA? This is what really matters. We know Google isn't. Are AWS and Azure for their hosting of OpenAI models et al? |
| |
| ▲ | wmf 4 hours ago | parent | next [-] | | All Nvidia GPUs, which are probably >70% of the market, use CUDA. | |
| ▲ | bigyabai 4 hours ago | parent | prev [-] | | > We know Google isn't. Google isn't internally, so far as we know. Google's hyperscaler products have long offered CUDA options, since the demand isn't limited to AI/tensor applications that cannibalize TPU's value prop: https://cloud.google.com/nvidia |
|
|
| ▲ | jmward01 3 hours ago | parent | prev | next [-] |
| Right now we need diversity in the ecosystem. AMD is finally getting mature, and hopefully that will lead to the ecosystem truly getting a second, strong opinion. The friction this article talks about is needed to push new ideas. |
|
| ▲ | doctorpangloss 4 hours ago | parent | prev | next [-] |
| All they have to do is release air-cooled 96GB GDDR7 PCIe5 boards with 4x Infinity Link, and charge $1,900 for them. |
|
| ▲ | pixelpoet 5 hours ago | parent | prev [-] |
| Actual article title says "won't"; wont is a word meaning habit or proclivity. |
| |
| ▲ | InvisGhost 5 hours ago | parent [-] | | In situations like this, I try to focus on whether the other person understood what was being communicated rather than splitting hairs. In this case, I don't think anyone would be confused. |
|