| ▲ | muyuu 5 hours ago |
There are nonlinearities to exploit in that calculus. Given enough VRAM to host the larger model you're targeting, size alone can push you past the usability threshold at a much better price.
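Rough sketch of that threshold effect (illustrative quantization math, not measured figures for any specific card): whether a model fits in VRAM at all is close to binary, so crossing the fit line matters more than linear price/performance.

    def vram_needed_gb(params_billions, bits_per_weight=4, overhead=1.2):
        # weights-only footprint plus ~20% headroom for KV cache / activations
        return params_billions * bits_per_weight / 8 * overhead

    for params in (8, 32, 70):
        print(f"{params}B @ 4-bit: roughly {vram_needed_gb(params):.0f} GB")
    # 8B -> ~5 GB, 32B -> ~19 GB, 70B -> ~42 GB: the model you actually want
    # either fits on the card or it doesn't, there's no partial credit.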
| ▲ | ycui7 2 hours ago | parent | next [-] |
When you get 4 of these, the idle power alone is 120W. That is a lot of electricity if left on 24/7. At that power consumption you also end up more expensive than API calls, and many times slower. It starts to feel very stupid to run local inference. If the client is very keen on privacy, they can pay for the NVIDIA. I ended up returning my B70s and bought an RTX PRO 6000.
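Back-of-the-envelope on that idle cost (the electricity rate is an assumption, just to make the point concrete):

    idle_watts = 120                      # 4 cards idling, per the figure above
    hours_per_year = 24 * 365
    kwh_per_year = idle_watts * hours_per_year / 1000    # ~1051 kWh
    price_per_kwh = 0.15                  # assumed rate; varies a lot by region
    print(f"~{kwh_per_year:.0f} kWh/yr -> ~${kwh_per_year * price_per_kwh:.0f}/yr just idling")
    # ~1051 kWh/yr -> ~$158/yr at idle, before doing any actual inference.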
| ▲ | ycui7 2 hours ago | parent | prev [-] |
Problem is, the more B70s you have, the slower inference gets (due to terrible software at the moment). A single B70 is barely faster than CPU inference. If you have 4 B70s, you might as well run inference on CPU with cheaper DDR5 instead of GDDR6 and be faster.
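The DDR5-vs-GDDR6 point follows from single-stream decode being memory-bandwidth bound: tokens/s is roughly bandwidth divided by bytes read per token. A sketch with assumed bandwidth numbers (not measurements of a B70 or any particular server):

    def decode_tokens_per_sec(bandwidth_gb_s, model_size_gb):
        # upper bound: every weight read once per generated token, batch size 1
        return bandwidth_gb_s / model_size_gb

    model_gb = 40  # e.g. a ~70B model at 4-bit
    for name, bw_gb_s in [("12-channel DDR5 server", 460), ("one mid-range GDDR6 card", 450)]:
        print(f"{name}: ~{decode_tokens_per_sec(bw_gb_s, model_gb):.0f} tok/s ceiling")
    # If per-card bandwidth is close to a good DDR5 box and splitting the model
    # across 4 cards adds interconnect/software overhead, the CPU box can win.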