Scaevolus 12 hours ago:
It's not that it's reserving power; rather, you hit some other bottleneck on a 3070 Ti before running into thermal limits. It's likely limited by either tensor core saturation or RAM throughput. Running the workload under Nvidia's profiling tools should make the bottleneck obvious.
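Nsight Systems or Nsight Compute will give the definitive breakdown. As a rough first check, the coarse utilization counters NVML exposes can hint at which it is. A minimal sketch, assuming the nvidia-ml-py (pynvml) bindings are installed; sustained high memory-controller utilization alongside lower SM utilization is consistent with a bandwidth-bound decode:

    import time
    import pynvml  # pip install nvidia-ml-py

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    # Sample coarse utilization while inference runs in another process.
    # util.memory is the fraction of time the memory controller was busy;
    # high memory% with lower SM% suggests a bandwidth-bound workload.
    for _ in range(10):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"SM: {util.gpu}%  mem controller: {util.memory}%")
        time.sleep(1)

    pynvml.nvmlShutdown()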
lambda 11 hours ago:
Generally the bottleneck is RAM throughput. Inference, and in particular token generation on a single-user instance, is not all that computationally complex: you do some fairly simple calculations for each parameter, and the time is dominated by just transferring each parameter from RAM to the cores. A 27B dense model like Gemma 3 has to transfer 27B parameters (at 16 bits per parameter for the full model, though on consumer hardware people generally run 4-8 bit quantizations) from RAM to the cores; that's a lot of memory traffic. Prompt processing or parallel token generation can do a bit more work per memory transfer, since you can reuse the same weights for several different calculations in parallel. But even so, memory bandwidth is a huge factor.
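To make that concrete, a back-of-envelope sketch of the resulting decode-speed ceiling: each generated token requires streaming every weight from VRAM once, so tokens per second is bounded by memory bandwidth divided by model size in bytes. The figures below are illustrative assumptions (roughly 3070 Ti-class bandwidth and a 27B model at 4-bit), not measurements, and the estimate ignores KV-cache traffic and assumes the weights fit in VRAM:

    # Roofline estimate for single-stream token generation:
    # every weight is read from VRAM once per generated token.
    PARAMS = 27e9           # dense parameter count (illustrative)
    BYTES_PER_PARAM = 0.5   # 4-bit quantization ~= 0.5 bytes/param
    BANDWIDTH = 608e9       # bytes/s, roughly a 3070 Ti (assumed)

    weight_bytes = PARAMS * BYTES_PER_PARAM      # 13.5 GB
    tokens_per_sec = BANDWIDTH / weight_bytes    # ~45 tokens/s
    print(f"upper bound: ~{tokens_per_sec:.0f} tokens/s")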
|
|
ycui7 2 hours ago:
The B70 idles at 30W, while the RTX PRO 4500 idles at 9W (measured as 5W at the wall). The B70 runs at 1/3 the token output rate of the RTX PRO 4500 and consumes 3x the idle power while doing nothing.
|
culopatin 4 hours ago:
My 4070 Super and 5070 Super both max out their TDP when I use them with Ollama. Is your usage different?
|
gambiting 10 hours ago:
My 5090 runs at full TDP (pretty much exactly 575W) when running inference through LM Studio.
rao-v 7 hours ago:
Cap the power to 400W and you won't see much impact.
gardnr 6 hours ago:
Same throughput with much less heat. Not sure what that extra 175W is going towards, but it's diminishing returns.
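For reference, the cap can be applied with nvidia-smi -pl 400 (requires admin privileges), or programmatically through NVML. A minimal sketch using the pynvml bindings, with the 400W figure taken from the suggestion above; the supported range varies per card:

    import pynvml  # pip install nvidia-ml-py

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    # NVML works in milliwatts; check the card's allowed range first.
    lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
    target_mw = 400_000  # 400 W
    assert lo <= target_mw <= hi, "requested limit outside supported range"

    # Needs root/admin, same as nvidia-smi -pl.
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
    pynvml.nvmlShutdown()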
|
|