hackrmn 5 days ago

I find the piece, much like a lot of other documentation, "imprecise". Like most such efforts, it likely caters to a group of people expected to benefit from having a GPU explained to them, but it fumbles its terms, e.g. (the first image with burned-in text):

> The "Warp Scheduler" is a SIMD vector unit like the TPU VPU with 32 lanes, called "CUDA Cores"

It's not clear from the above what a "CUDA core" (singular) _is_ -- this is the archetypal "let me explain things to you" error most people make, usually in good faith -- if I don't know the material and I am out to understand it, then you have gotten me to read all of it without making clear the very objects of your explanation.

And so, because of these kinds of "compounding errors", the people the piece was likely targeted at are none the wiser, really, while those who already have a good grasp of the concepts being explained, like what a CUDA core actually is, already know most of what the piece is trying to say anyway.

My advice to everyone who starts out with a back-of-the-envelope cheatsheet and then decides to publish it "for the good of mankind", e.g. on GitHub: please be surgically precise with your terms -- the terms are your trading cards; then come the verbs, etc. I mean, this is all writing 101, but it's a rare thing, evidently. Don't mix and match terms, don't conflate them (the reader will do it for you many times over, for free, if you're sloppy), and be diligent with analogies.

Evidently, the piece may have been written to help those already familiar with TPU terminology -- it mentions "MXU" but there's no telling what that is.

I understand this is a tall order, but the piece is long, and all the effort that was put in could have been complemented with minimal extra hypertext, like annotating abbreviations such as "MXU".

I can always ask $AI to do the equivalent for me, which is a tragedy according to some.

jacobaustin123 5 days ago | parent | next [-]

Shamelessly responding as the author. I (mostly) agree with you here.

> please be surgically precise with your terms

There's always a tension between precision in every explanation and the "moral" truth. I can say "a SIMD (Single Instruction Multiple Data) vector unit like the TPU VPU with 32 ALUs (SIMD lanes) which NVIDIA calls CUDA Cores", which starts to get unwieldy and even then leaves terms like vector units undefined. I try to use footnotes liberally, but you have to believe the reader will click on them. Sidenotes are great, but hard to make work in HTML.
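
To make that concrete, here's roughly how those 32 lanes surface in CUDA code (an untested sketch; the kernel name is made up):

    // Each of the 32 threads in a warp is one "CUDA Core" lane; the warp
    // executes this instruction stream in lockstep on the SIMD unit.
    __global__ void scale(float *x, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] = a * x[i];  // one vector op, applied 32 lanes at a time
    }
    // e.g. scale<<<n / 128, 128>>>(x, 2.0f);  // 128 threads = 4 warps per block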

For terms like MXU, I was intending this to be a continuation of the previous several chapters which do define the term, but I agree it's maybe not reasonable to assume people will read each chapter.

There are other imprecisions here, like the term "Warp Scheduler" is itself overloaded to mean the scheduler, dispatch unit, and SIMD ALUs, which is kind of wrong but also morally true, since NVIDIA doesn't have a name for the combined unit. :shrug:

I agree with your points and will try to improve this more. It's just a hard set of compromises.

hackrmn 5 days ago | parent | next [-]

I appreciate your response. I made a point of not revising my comment after posting it, and then found the following in a subsequent paragraph, quoting:

> Each SM is broken up into 4 identical quadrants, which NVIDIA calls SM subpartitions, each containing a Tensor Core, 16k 32-bit registers, and a SIMD/SIMT vector arithmetic unit called a Warp Scheduler, whose lanes (ALUs) NVIDIA calls CUDA Cores.

And right after:

> CUDA Cores: each subpartition contains a set of ALUs called CUDA Cores that do SIMD/SIMT vector arithmetic.

So, in your defense and to my shame -- you *did* do better than I was able to infer at first glance. And I can take absolutely no issue with a piece elaborating on an originally "vague" sentence later on -- we need to read top to bottom, after all.

Much of the difficulty with laying out knowledge in the written word comes from inherent constraints, like choosing to defer detail to "further down" at the expense of the "bird's eye view". I mean, there is a reason writing is hard, technical writing perhaps more so, in a way. You're doing much better than a lot of other stuff I've had to learn with, so I can only thank you for having done as much as you already have.

To be more constructive still, I agree the border between clarity and utility isn't always clearly drawn. But I think you can think of it as a service to your readers -- go with precision I say -- if you really presuppose the reader should know SIMD, chances are they are able to grok a new definition like "SIMD lane" if you define it _once_ and _well_. You don't need to be "unwieldy" in repetition -- the first time may be hard but you only need to do it once.

I am rambling. I do believe there are worse and better ways to impart this kind of knowledge in writing, but I too obviously don't have the answers, so my criticism was in part unconstructive, just an outcry of mild frustration from when I started conflating things from the get-go, before I decided to give it a more thorough read.

One last thing, though: I always like it when a follow-up article starts with a preamble along the lines of "In the previous part of the series..." so new visitors can simultaneously become aware there's prior knowledge that may be assumed, _and_ navigate their way to the desired point in the series, all the way to the start perhaps. That frees you from e.g. wanting to annotate abbreviations in every part, if you want to avoid doing that.

jacobaustin123 3 days ago | parent [-]

Thank you for taking the time to write this reply. Agree with "in the previous part of this series" comment. I'll try to find a way to highlight this more.

What I'd like to add to this page is some sort of highly clear glossary that defines all the terms at the top (but in some kind of collapsible fashion) so I can define everything with full clarity without disrupting the flow. I'll play with the HTML and see what I can do.

abirch 5 days ago | parent | prev | next [-]

1) Thank you for writing this.

2) What are your thoughts on linking to the Wikipedia articles for terms such as "SIMD" or "ALU", for the precise meaning, while using the metaphors in your prose?

Most novices tend to Google and end up on Wikipedia for the trees. It's harder to find the forest.

lotyrin 5 days ago | parent | prev | next [-]

I feel you handle this balance quite gracefully, to the point where I was impressed by your handling of the issue while reading, before I ever checked the comments section. Something can be called one thing by marketing or documentation (names one must, strategically, accept and internalize) while fundamentally and functionally being better described in other language (which is more useful, so also needed); I don't know why that distinction isn't clearer to the grandparent poster. You want people to be aware of both, and you explain both without dwelling or getting caught on it. It struck me as an artful choice.

socalgal2 4 days ago | parent | prev [-]

I often put requirements at the top of an article:

> This article assumes you've read [this] and [this] and understand [this topic] and [this topic too]

I'm not sure that's helpful, and I don't list everything. Those links might also have further links saying you need X, Y, and Z. But at least there is a trail showing where to start.

einpoklum 5 days ago | parent | prev | next [-]

> It's not clear from the above what a "CUDA core" (singular) _is_

A CUDA core is basically a SIMD lane on an actual core on an NVIDIA GPU.

For a longer version of this answer: https://stackoverflow.com/a/48130362/1593077

pklausler 5 days ago | parent [-]

So it's a "SIMD lane" that can itself perform actual SIMD instructions?

I think you want a metaphor that doesn't also depend on its literal meaning.

corysama 5 days ago | parent | next [-]

Nvidia’s marketing team uses confusing terminology to make their product sound cooler than it is.

An Intel “core” can perform AVX-512 SIMD instructions that operate on 16 lanes of 32-bit data. Intel cores are packaged in groups of up to 16. And they use hyperthreading, speculative execution, and shadow registers to cover latency.

An Nvidia “Streaming Multiprocessor” can perform SIMD instructions on 32 lanes of 32-bits each. Nvidia calls these lanes “cores” to make it feel like one GPU can compete with thousands of Intel CPUs.

Simpler terminology would be: an Nvidia H100 has 114 SM cores, each with four 32-wide SIMD execution units (where basic instructions have a latency of 4 cycles) and four Tensor Cores. That’s a lot more capability than a high-end Intel CPU, but not 14,592 times more.
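
(The marketing number is just that lane count multiplied out: 114 SMs × 4 units × 32 lanes = 14,592 “CUDA Cores”.)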

The CUDA API presents a “CUDA Core” (single SIMD lane) as if it was a thread. But, for most purposes it is actually a single SIMD lane in the 32-wide “Warp”. Lots of caveats apply in the details though.
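
You can see the SIMD nature leak through in warp-level primitives, where the 32 “threads” swap registers with each other like lanes of one vector register. An untested sketch (names made up):

    // Warp-wide sum: each lane adds in a value shuffled down from a
    // higher lane, halving the stride each step (butterfly reduction).
    __global__ void warp_sum(float *out, const float *in) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = in[i];
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (threadIdx.x % 32 == 0)  // lane 0 now holds the warp's total
            out[i / 32] = v;
    }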

bee_rider 5 days ago | parent | next [-]

I guess “GPUs for people who are already CPU experts” is a blog post that already exists out there. But if it doesn’t, you should go write it, haha.

shaklee3 4 days ago | parent | prev [-]

This is not true. GPUs are SIMT, but any given thread of the 32 in a warp can also issue SIMD instructions; see vector loads.
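
For example, a float4 load compiles to a single 128-bit load per thread (untested sketch; made-up kernel name):

    // One thread issues one 128-bit vector load and one 128-bit store
    // (LDG.E.128 / STG.E.128 in SASS) instead of four scalar 32-bit ones.
    __global__ void copy4(float4 *dst, const float4 *src) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        dst[i] = src[i];
    }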

saltcured 5 days ago | parent | prev | next [-]

It's all very circular if you try to avoid the architecture-specific details of individual hardware designs. A SIMD "lane" is roughly equivalent to an ALU (arithmetic logic unit) in a conventional CPU design. Conceptually, it processes one primitive operation at a time on scalar values, such as an add, a multiply, or an FMA (fused multiply-add).

Each such scalar operation is on a fixed-width primitive number, which is where we get into the questions of what numeric types the hardware supports. E.g. we used to worry about 32- vs 64-bit support in GPUs, and now everyone is worrying about smaller widths. Some image processing tasks benefit from 8- or 16-bit values. Lately, people are dipping into heavily quantized models that can benefit from even narrower values. Narrower values mean a smaller memory footprint, but also generally mean that you can do more parallel operations with "similar" amounts of logic, since each ALU processes fewer bits.

Where this lane==ALU analogy stumbles is when you get into all the details about how these ALUs are ganged together or in fact repartitioned on the fly. E.g. a SIMD group of lanes share some control signals and are not truly independent computation streams. Different memory architectures and superscalar designs also blur the ability to count computational throughput, as the number of operations that can retire per cycle becomes very task-dependent due to memory or port contention inside these beasts.

And if a system can reconfigure the lane width, it may effectively change a wide ALU into N logically smaller ALUs that reuse most of the same gates. Or, it might redirect some tasks to a completely different set of narrower hardware lanes that are otherwise idle. The dynamic ALU splitting was the conventional story around desktop SIMD, but I think is less true in modern designs. AFAICT, modern designs seem more likely to have some dedicated chip regions that go idle when they are not processing specific widths.

einpoklum 5 days ago | parent | prev | next [-]

> that can itself perform actual SIMD instructions?

Mostly, no; it can't really perform actual SIMD instructions itself. If you look at the SASS (the assembly language used on NVIDIA GPUs) I don't believe you'll see anything like that.

In high-level code, you do have expressions involving "vectorized types", which look like they would translate into SIMD instructions, but they 'serialize' at the single-thread level.

There are exceptions to this, though, like FP16 operations, which can work on 32-bit registers holding 2xFP16 values, and other cases. But that is not the rule.
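
The FP16 case looks like this (untested sketch; made-up names), using the __half2 type from cuda_fp16.h:

    #include <cuda_fp16.h>

    // Two FP16 values packed in one 32-bit register, added with a
    // single instruction per thread (HADD2-style).
    __global__ void add_h2(__half2 *out, const __half2 *a, const __half2 *b) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = __hadd2(a[i], b[i]);  // 2xFP16 add in one op
    }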

pklausler 5 days ago | parent | next [-]

Please see https://docs.nvidia.com/cuda/parallel-thread-execution/index....

einpoklum 4 days ago | parent [-]

The "video instructions" are indeed another exception: Operations on sub-lanes of 32-bit values: 2x16 or 4x8. This is relevant for graphics/video work, where you often have Red, Green, Blue, Alpha channels of 8 bits each. Their use is uncommon (AFAICT) in CUDA compute work.

shaklee3 4 days ago | parent | prev [-]

Not true; there are a lot of SIMD instructions on GPUs.

einpoklum 4 days ago | parent [-]

Such as?

shaklee3 4 days ago | parent [-]

dp4a, ldg. Just Google it; there's a whole page of them.
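
For instance, dp4a does a 4-way 8-bit dot product in one instruction (untested sketch; requires sm_61+):

    // __dp4a treats each 32-bit int as four packed 8-bit values and
    // computes a.0*b.0 + a.1*b.1 + a.2*b.2 + a.3*b.3 + c.
    __global__ void dot8(int *out, const int *a, const int *b, const int *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = __dp4a(a[i], b[i], c[i]);
    }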

pavlov 5 days ago | parent | prev [-]

Nvidia calls their SIMD lanes “CUDA cores” for marketing reasons.

hyghjiyhu 5 days ago | parent | prev | next [-]

Interestingly, I find LLMs are really good for this problem: when looking up one term just leads to more unknown terms and you struggle to find a starting point from which to understand the rest, they can tell you where to start.

evertedsphere 5 days ago | parent | prev | next [-]

https://cloud.google.com/tpu/docs/system-architecture-tpu-vm

should have most of it

robbies 5 days ago | parent | prev | next [-]

I’m being earnest: what is an appropriate level of computer architecture knowledge? SIMD is 50 years old.

From the resource intro:

> Expected background: We’re going to assume you have a basic understanding of LLMs and the Transformer architecture but not necessarily how they operate at scale.

I suppose this doesn’t require any knowledge about how computers work, but core CPU functionality seems…reasonable?

Symmetry 5 days ago | parent [-]

SIMD is quite old, but the changes Nvidia made when rebranding it as SIMT, which they used as an excuse to call their vector lanes "cores", are quite a bit newer.

pseudosavant 5 days ago | parent | prev | next [-]

My recursive brain got a chuckle out of wondering about "imprecise" being in quotes. I found the quotes made the meaning a touch...imprecise.

While I can understand the imprecise point, I found myself very impressed by the quality of the writing. I don't envy making digestible prose about the differences between GPUs and TPUs.

uberduper 5 days ago | parent | prev [-]

This is a chapter in a book targeting people working in the machine learning domain.