fooblaster 7 hours ago
FPGAs will never rival GPUs or TPUs for inference. The main reason is that GPUs aren't really GPUs anymore: 50% or more of the die area is fixed-function matrix-multiplication units and their associated dedicated storage. That just isn't general purpose anymore, and FPGAs can't rival it with their configurable DSP slices. They would need dedicated systolic blocks, which they aren't getting. The closest thing is the Versal ML tiles, and those are entire processors, not FPGA blocks. Those have failed by being impossible to program.
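For readers unfamiliar with what those fixed-function blocks actually do: a systolic array computes a matrix multiply by streaming operands through a grid of multiply-accumulate cells. A minimal software sketch (array size and timing are illustrative, not any vendor's design):

```python
# Toy model of a systolic-array matmul: PE (i, j) accumulates C[i][j] as the
# skewed A-row and B-column streams pass through it, one operand pair per cycle.
def systolic_matmul(A, B):
    n = len(A)                        # assume square n x n matrices
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):        # enough cycles to drain the skewed inputs
        for i in range(n):
            for j in range(n):
                k = t - i - j         # pair k reaches PE (i, j) at cycle i + j + k
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

The point of the fixed-function version is that all n*n multiply-accumulates happen in parallel every cycle with operands handed cell-to-cell, which configurable DSP slices plus routing fabric can't match in density or clock speed.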
fpgaminer 6 hours ago
> FPGAs will never rival gpus or TPUs for inference. The main reason is that GPUs aren't really gpus anymore.

Yeah. Even for Bitcoin mining, GPUs dominated FPGAs. I created the Bitcoin mining FPGA project(s), and they were only interesting for two reasons: 1) they were far more power efficient, which in the case of mining changes the equation significantly; 2) GPUs at the time had poor binary math support, which hampered their performance, whereas an FPGA is just one giant binary math machine.
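Concretely, the "binary math" in Bitcoin mining is SHA-256, whose round functions are nothing but 32-bit rotates, shifts, XORs, and ANDs (as defined in FIPS 180-4) — exactly the logic an FPGA's LUTs implement directly:

```python
# The SHA-256 round primitives: pure bitwise 32-bit operations, no arithmetic
# beyond what maps straight onto FPGA lookup tables.
MASK = 0xFFFFFFFF

def rotr(x, n):                       # 32-bit rotate right
    return ((x >> n) | (x << (32 - n))) & MASK

def ch(x, y, z):                      # "choose": x acts as a per-bit mux
    return ((x & y) ^ (~x & z)) & MASK

def maj(x, y, z):                     # per-bit majority vote
    return (x & y) ^ (x & z) ^ (y & z)

def big_sigma0(x):                    # Σ0: three rotates XORed together
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)
```

GPUs of that era often lacked fast rotate or bitfield instructions, so each rotate cost multiple shifts and ORs, while an FPGA rotate is free (it's just wiring).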
| ||||||||
Lerc 6 hours ago
I think quantisation will get to a point where the GPUs that run these models are more FPGA-like than graphics renderers. If you quantize far enough, things begin to look more like gates than floating-point units. At that level an FPGA wouldn't run your model; it would be your model.
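The extreme case of this is 1-bit (binarized) networks, where weights and activations are +1/-1 and a dot product collapses to XNOR plus popcount — pure gate logic. A minimal sketch of that trick (names are illustrative):

```python
# Dot product of two +/-1 vectors, each packed into an n-bit integer
# (bit = 1 encodes +1, bit = 0 encodes -1). This is the XNOR-popcount
# identity used by binarized neural networks.
def binary_dot(a_bits, b_bits, n):
    agree = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # XNOR: 1 where signs match
    pop = bin(agree).count("1")                  # popcount of agreements
    return 2 * pop - n                           # matches minus mismatches
```

In hardware that whole function is one XNOR gate per bit plus an adder tree, which is why a sufficiently quantized model really does start to look like an FPGA bitstream rather than a program.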
ithkuil 7 hours ago
Turns out that a lot of interesting computation can be expressed as a matrix multiplication.
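One concrete instance: a 1-D convolution (correlation-style, no kernel flip) rewritten as a matrix-vector product with a Toeplitz matrix — the same rewriting (im2col) that lets GPUs run convolutional layers on their matmul units. A sketch, with illustrative names:

```python
# Express a "valid" 1-D correlation as T @ signal, where T is a Toeplitz
# matrix whose rows are shifted copies of the kernel.
def conv_as_matmul(signal, kernel):
    n, k = len(signal), len(kernel)
    out_len = n - k + 1
    T = [[kernel[j - i] if 0 <= j - i < k else 0 for j in range(n)]
         for i in range(out_len)]
    return [sum(T[i][j] * signal[j] for j in range(n)) for i in range(out_len)]
```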
| ||||||||
dnautics 3 hours ago
I don't think this is correct. For inference, the bottleneck is memory bandwidth, so if you can hook an FPGA up to better memory, it has an outside shot at beating GPUs, at least in the short term. I mean, I have worked with FPGAs that outperformed H200s on Llama3-class models a while back.
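The bandwidth-bound claim follows from simple roofline arithmetic: during single-stream decode, every generated token must stream the full set of weights from memory once, so tokens/sec is capped by bandwidth divided by model size regardless of how much compute sits behind it. A back-of-envelope sketch (the numbers in the example are illustrative assumptions, not measured figures):

```python
# Upper bound on decode throughput when weight streaming dominates:
# tokens/sec <= memory bandwidth / bytes of weights read per token.
def decode_tokens_per_sec(params_billions, bytes_per_param, bandwidth_gb_s):
    model_gb = params_billions * bytes_per_param  # weights read once per token
    return bandwidth_gb_s / model_gb

# e.g. an 8B-parameter model at 2 bytes/param (16 GB) on ~4800 GB/s of HBM
# caps out around 300 tokens/sec per stream, no matter the FLOPs available.
```

This is why attaching an FPGA to sufficiently fast memory can, in principle, close the gap for decode even without matching GPU compute density.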
| ||||||||
alanma 6 hours ago
yup, GBs are so much tensor core nowadays :)