| ▲ | socketcluster a day ago |
| I feel like there's no point in getting a graphics card nowadays. Clearly, graphics cards are optimized for graphics; they just happened to be good for AI. But given the increased significance of AI, I'd be surprised if we don't get more specialized chips and specialized machines just for LLMs: one for LLMs, a different one for Stable Diffusion. With graphics processing, you need a lot of bandwidth to get stuff in and out of the graphics card for rendering on a high-resolution screen: lots of pixels, lots of refreshes, lots of bandwidth... With LLMs, a relatively small amount of text goes in and a relatively small amount of text comes out over a reasonably long amount of time. The amount of internal processing is huge relative to the size of the input and output. I think NVIDIA and a few other companies have already started going down that route. But graphics cards will probably still be useful for Stable Diffusion, and especially for AI-generated video, where the input and output bandwidth is much higher. |
|
| ▲ | ACCount37 10 hours ago | parent | next [-] |
| Nah, that's just plain wrong. First, GPGPU is powerful and flexible. You could build an "AI-specific accelerator", but it wouldn't be much simpler or much more power-efficient, while being a lot less flexible. And since consumer hardware needs to run both traditional graphics and AI workloads, it makes sense to run both on the same hardware. As for bandwidth? GPUs are notorious for not being bandwidth starved on the host side. 4K@60FPS sounds like a lot of data to push in or out, but it's nothing compared to how fast a modern PCIe 5.0 x16 link goes. AI accelerators are more of the same. |
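A back-of-envelope check on that comparison, assuming uncompressed RGBA8 frames and the nominal PCIe 5.0 x16 rate (rough figures, not vendor specs):

    # Rough comparison of a 4K@60 frame stream vs. a PCIe 5.0 x16 link (approximate figures).

    bytes_per_pixel = 4                                    # uncompressed RGBA8 (assumed)
    display_stream = 3840 * 2160 * bytes_per_pixel * 60    # 4K at 60 Hz

    # ~32 GT/s per lane on PCIe 5.0, 16 lanes; encoding/protocol overhead ignored
    pcie5_x16 = 32e9 / 8 * 16

    print(f"4K@60 uncompressed: ~{display_stream / 1e9:.1f} GB/s")           # ~2 GB/s
    print(f"PCIe 5.0 x16:       ~{pcie5_x16 / 1e9:.0f} GB/s per direction")  # ~64 GB/s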
| |
| ▲ | djsjajah 8 hours ago | parent [-] | | GPUs might not be bandwidth starved most of the time, but they absolutely are when generating text from an LLM. It's the whole reason NVIDIA is pushing low-precision floating-point formats. |
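A rough sketch of that effect for a bandwidth-bound decode step, assuming a hypothetical 70B-parameter dense model and roughly H100-class HBM bandwidth (both figures are illustrative, not from the thread):

    # Rough sketch of why lower-precision weights raise the bandwidth-bound decode ceiling.
    # The model size and bandwidth below are illustrative assumptions.

    PARAMS = 70e9             # hypothetical 70B-parameter dense model
    HBM_BANDWIDTH = 3.35e12   # ~3.35 TB/s, roughly an H100-class part

    for fmt, bytes_per_param in [("FP16", 2), ("FP8", 1), ("FP4", 0.5)]:
        bytes_per_token = PARAMS * bytes_per_param    # every weight read once per token
        tokens_per_s = HBM_BANDWIDTH / bytes_per_token
        print(f"{fmt}: ~{tokens_per_s:.0f} tokens/s upper bound (single stream, weights only)")

Halving the bytes per parameter roughly doubles the best-case single-stream token rate, which is exactly what a memory-bound workload cares about.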
|
|
| ▲ | Legend2440 a day ago | parent | prev | next [-] |
| LLMs are enormously bandwidth hungry. You have to shuffle your 800GB neural network in and out of memory for every token, which can take more time/energy than actually doing the matrix multiplies. Even GPUs are barely high-bandwidth enough. |
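A rough sense of scale, taking the 800 GB figure at face value and assuming ballpark H100-class specs for bandwidth and compute (both assumed, not from the thread):

    # Time to stream the weights vs. time to do the matmuls for one token (illustrative).

    weights_bytes = 800e9      # ~800 GB of weights read per token (figure from the comment)
    hbm_bw = 3.35e12           # ~3.35 TB/s HBM bandwidth (assumed)
    peak_flops = 1.0e15        # ~1 PFLOP/s dense half-precision compute (assumed)

    params = weights_bytes / 2                   # assuming 2 bytes per parameter
    t_memory = weights_bytes / hbm_bw            # time just to stream the weights once
    t_compute = 2 * params / peak_flops          # ~2 FLOPs per parameter per token

    print(f"memory time per token:  ~{t_memory * 1e3:.0f} ms")   # ~240 ms
    print(f"compute time per token: ~{t_compute * 1e3:.1f} ms")  # ~0.8 ms

Under these assumptions, moving the weights takes hundreds of times longer than the arithmetic itself.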
| |
| ▲ | socketcluster 20 hours ago | parent | next [-] | | But even so, for a single user, the output rate of a very fast LLM would be something like 100 tokens per second. With graphics, we're talking about 2 million pixels, 60 times a second: 120 million pixels per second for a standard high-res screen. That's a big difference, 100 tokens vs 120 million pixels. A 24-bit pixel gives 16 million possible colors; a 24-bit token ID would probably be enough to represent every word in the combined vocabularies of every major national language on earth. (Rough numbers below.)
> You have to shuffle your 800GB neural network in and out of memory
Do you really though? That seems more like a constraint imposed by graphics cards. A specialized AI chip would just keep the weights and all parameters in memory/hardware right where they are and operate on them in situ, which seems a lot more efficient. I think people adopted this approach because graphics cards happen to have such high bandwidth, but it seems suboptimal. If we want to be optimal, then ideally only the inputs and outputs would need to move in and out of the chip. The shuffling should be seen as an inefficiency, a tradeoff for a certain kind of flexibility in the software stack... But you waste a huge amount of cycles moving data between RAM, CPU cache, and graphics card memory. | | |
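A rough sketch of those output rates next to the hidden per-token weight traffic, using the thread's own numbers; the bytes-of-text-per-token figure is an assumption:

    # Visible output rates vs. the internal weight traffic needed to produce them.

    display_bytes_per_s = 1920 * 1080 * 3 * 60     # 1080p, 24-bit color, 60 Hz -> ~373 MB/s

    tokens_per_s = 100                             # "very fast LLM" figure from the comment
    llm_output_bytes_per_s = tokens_per_s * 4      # ~4 bytes of text per token (assumed)

    weights_bytes = 800e9                          # the 800 GB dense-model figure from upthread
    internal_bytes_per_s = tokens_per_s * weights_bytes   # weight reads to sustain 100 tok/s

    print(f"display output:   ~{display_bytes_per_s / 1e6:.0f} MB/s")
    print(f"LLM text output:  ~{llm_output_bytes_per_s:.0f} B/s")
    print(f"LLM weight reads: ~{internal_bytes_per_s / 1e12:.0f} TB/s to sustain 100 tok/s")

The visible text stream is indeed tiny, but the weights still have to reach the compute units for every token, and that internal traffic dwarfs anything a display pipeline moves.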
| ▲ | djsjajah 8 hours ago | parent | next [-] | | > Do you really though?
Yes. The weights stay in HBM, but they need to get shuffled to the place where the computation actually happens. It's a lot like a normal CPU: the CPU can't do anything with data sitting in system memory; it has to be loaded into a register first.
For every token that is generated, a dense LLM has to read every parameter in the model. | |
| ▲ | visarga 8 hours ago | parent | prev [-] | | If we did that, it would be much more expensive; keeping all the weights in SRAM is what Groq does, for example. |
| |
| ▲ | Zambyte a day ago | parent | prev [-] | | This doesn't seem right. Where is it shuffling to and from? My drives aren't fast enough to reload the model for every token, and I don't have enough system memory to unload models to. | | |
| ▲ | Legend2440 a day ago | parent | next [-] | | From VRAM to the tensor cores and back. On a modern GPU you can have 1-2 TB moving around inside the chip every second; that's why they use high-bandwidth memory (HBM) for VRAM. | | | |
| ▲ | zamadatix a day ago | parent | prev | next [-] | | If you're using a MoE model like DeepSeek V3, the full model is 671 GB but only 37 GB are active per token, so from the memory bandwidth perspective it's more like running a 37 GB model. A quantized version of that could be more like 18 GB. | |
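A rough sketch of the resulting bandwidth-bound decode ceilings, assuming ~3.35 TB/s of HBM bandwidth (an H100-class figure) and ignoring activations and KV cache:

    # Bandwidth-bound decode ceilings for the dense vs. MoE vs. quantized cases above.

    total_gb  = 671     # full DeepSeek V3 weights (from the comment)
    active_gb = 37      # weights actually touched per token (from the comment)
    quant_gb  = 18      # rough quantized size of the active weights (from the comment)
    hbm_gb_s  = 3350    # ~3.35 TB/s of HBM bandwidth (assumed)

    for label, gb in [("dense-style", total_gb), ("MoE active", active_gb), ("MoE + quant", quant_gb)]:
        print(f"{label:12s}: ~{hbm_gb_s / gb:.0f} tokens/s upper bound")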
| ▲ | smallerize a day ago | parent | prev | next [-] | | You're probably not using an 800GB model. | |
| ▲ | p1esk a day ago | parent | prev [-] | | It is right. The shuffling is from CPU memory to GPU memory, and from GPU memory into the GPU itself. If you don't have enough memory, you can't run the model. | | |
| ▲ | Zambyte 12 hours ago | parent [-] | | How can I observe it being loaded into CPU memory? When I run a 20 GB model with Ollama, htop reports 3 GB of total RAM usage. | | |
| ▲ | zamadatix 12 hours ago | parent | next [-] | | Think of it like loading a moving truck where:
- The house is the disk
- You are the RAM
- The truck is the VRAM
There won't be a single moment when you can observe yourself carrying the weight of everything being moved out of the house, because that's not what's happening. Instead you take many tiny loads until everything is moved, at which point you yourself are no longer loaded from carrying things out of the house (though you may be loaded for whatever else you're doing). Viewing active memory bandwidth is more complicated to set up than it sounds, so the easier approach is to just watch your VRAM usage as the model loads onto the card. The "nvtop" utility can do this for most any GPU on Linux, along with other stats you might care about while watching LLMs run. | |
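If you'd rather script that check than watch nvtop, a minimal sketch using the pynvml bindings (assumes an NVIDIA GPU and the nvidia-ml-py package; run it while the model loads in another terminal):

    # Poll VRAM usage once a second to watch it fill up as a model loads.

    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU

    try:
        while True:
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"VRAM used: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB")
            time.sleep(1)
    except KeyboardInterrupt:
        pynvml.nvmlShutdown()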
| ▲ | Zambyte 7 hours ago | parent [-] | | My confusion was about the shuffling supposedly happening per token. If the model were shuffled in from outside the GPU for every token, that would be effectively the same as loading the model from disk for every token. | | |
| ▲ | p1esk 5 hours ago | parent [-] | | The model may well get loaded on every token, from GPU memory into the GPU's compute units; it depends on how much of it is cached on-chip. The inputs to every layer must be loaded as well. Also, if your model doesn't fit in GPU memory but does fit in CPU memory and you're doing GPU offloading, then you're also shuffling between CPU and GPU memory. |
|
| |
| ▲ | p1esk 11 hours ago | parent | prev [-] | | That depends on the map_location arg in torch.load: the weights might be loaded straight into GPU memory. |
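A minimal illustration of that argument; the checkpoint path here is just a placeholder:

    # Sketch of the map_location behavior mentioned above.

    import torch

    # Deserialize the checkpoint's tensors into CPU (system) RAM:
    state_dict_cpu = torch.load("model.pt", map_location="cpu")

    # Materialize the tensors directly on the first GPU instead:
    state_dict_gpu = torch.load("model.pt", map_location="cuda:0")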
|
|
|
|
|
| ▲ | zamadatix a day ago | parent | prev | next [-] |
| > Clearly, graphics cards are optimized for graphics; they just happened to be good for AI
I feel like the reverse has been true since after the Pascal era. |
|
| ▲ | autoexec a day ago | parent | prev [-] |
| I don't doubt that there will be specialized chips that make AI easier, but they'll be more expensive than the graphics cards sold to consumers, which means a lot of companies will just go with graphics cards: either the extra speed of specialized chips won't be worth the cost, or they'll be flat-out too expensive, priced for the small number of massive spenders who'll shell out insane amounts of money for any and every advantage (whatever they think that means) they can get over everyone else. |