zenoprax 6 hours ago

I have read in the past that ASICs for LLMs are not as simple a solution as they are for cryptocurrency. To design and build an ASIC you need to commit to a specific architecture: the hashing algorithm for a cryptocurrency is fixed, but LLM architectures are always changing.

Am I misunderstanding "TPU" in the context of the article?

HarHarVeryFunny 6 hours ago | parent | next [-]

Regardless of architecture (which is anyway basically the same for all LLMs), the computational needs of modern neural networks are pretty generic, centered on things like matrix multiplication, which is exactly what the TPU provides. There is even TPU support for PyTorch (via the torch_xla backend) - it is not just a proprietary interface that Google uses internally.
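
For instance, here's a minimal sketch (assuming the torch_xla package is installed and a TPU is attached) of pointing ordinary PyTorch matmul code at a TPU:

    # Minimal sketch: assumes torch_xla is installed and a TPU is available.
    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()  # resolves to the attached TPU

    # The same generic matmul workload every NN is built from:
    a = torch.randn(1024, 1024, device=device)
    b = torch.randn(1024, 1024, device=device)
    c = a @ b          # compiled via XLA onto the TPU's matrix units
    xm.mark_step()     # flush the lazily-recorded graph for execution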

kcb 3 hours ago | parent | prev | next [-]

LLMs require memory and interconnect bandwidth, so they need a whole package capable of feeding data to the compute. Crypto is 100% compute bound: a trivially parallelized application that runs the same calculation over N inputs.
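
To make "trivially parallelized" concrete, a toy sketch: mining-style work is one fixed function over N independent inputs, with nothing to stream in from memory except the nonce itself:

    # Toy sketch: one fixed hash applied to N independent inputs.
    import hashlib

    def mine(header: bytes, nonces: range) -> list[bytes]:
        # Every iteration is independent: no shared state, no weights.
        return [hashlib.sha256(header + n.to_bytes(8, "big")).digest()
                for n in nonces]

    digests = mine(b"block-header", range(1_000_000))

An LLM forward pass, by contrast, has to pull billions of weights through the memory system for every token it produces.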

olalonde 4 hours ago | parent | prev | next [-]

"Application-specific" doesn't necessarily mean unprogrammable. Bitcoin miners aren't programmable because they don't need to be. TPUs are ASICs for ML and need to be programmable so they can run different models. In theory, you could make an ASIC hardcoded for a specific model, but given how fast models evolve, it probably wouldn't make much economic sense.

immibis 3 hours ago | parent | prev | next [-]

Cryptocurrency architectures also change - Bitcoin is just about the lone holdout that never evolves. The hashing algorithm for Monero is designed so that a Monero hashing ASIC is literally just a CPU, and it doesn't even matter what the instruction set is.

p-e-w 6 hours ago | parent | prev [-]

It’s true that architectures change, but they are built from common components. The most important of those is matrix multiplication, using a relatively small set of floating point data types. A device that accelerates those operations is, effectively, an ASIC for LLMs.
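
As a rough illustration (shapes and names are made up, not any particular model), even the MLP half of a transformer layer is nothing but matmuls and an activation:

    # Illustrative only: a transformer-style MLP block reduces to matmuls.
    import numpy as np

    d_model, d_ff = 512, 2048
    x  = np.random.randn(16, d_model).astype(np.float32)   # 16 tokens
    W1 = np.random.randn(d_model, d_ff).astype(np.float32)
    W2 = np.random.randn(d_ff, d_model).astype(np.float32)

    h = np.maximum(x @ W1, 0.0)   # matmul + ReLU
    y = h @ W2                    # matmul again

Attention is the same story: the projections and score computations are all matrix multiplies too.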

bfrog 6 hours ago | parent [-]

We used to call these things DSPs

tuhgdetzhh 5 hours ago | parent [-]

What is the difference between a DSP and an ASIC? Is a GPU a DSP?

bfrog 5 hours ago | parent | next [-]

DSP is simply a compute architecture that focuses on multiply-and-accumulate operations on particular numerical formats, often either fixed-point q15/q31 values or f16/f32 floats.

The basic operation that a NN needs accelerated is... go figure, multiply-and-accumulate, plus an activation function.
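
In toy Python (not real DSP code), that core operation is just:

    # Toy: one "neuron" = multiply-accumulate plus an activation.
    def neuron(weights, inputs, bias=0.0):
        acc = bias
        for w, x in zip(weights, inputs):
            acc += w * x          # the MAC a DSP accelerates in hardware
        return max(acc, 0.0)      # activation (ReLU here)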

See for example how the Intel NPU is structured here: https://intel.github.io/intel-npu-acceleration-library/npu.h...

imtringued 5 hours ago | parent | prev | next [-]

A DSP contains analog-to-digital and digital-to-analog converters, plus DMA for fast transfers to main memory and fixed-function blocks for finite impulse response and infinite impulse response filters.

The fact that they also support vector operations or matrix multiplication is kind of irrelevant and not a defining characteristic of DSPs. If you want to go that far, then everything is a DSP, because all signals are analog.
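
For reference, the FIR case those fixed-function blocks handle is just a sliding dot product (toy Python, not how a DSP actually implements it):

    # Toy FIR filter: each output is a dot product of the taps with
    # the most recent len(taps) input samples.
    def fir(x, taps):
        n = len(taps)
        return [sum(t * s for t, s in zip(taps, reversed(x[i - n + 1:i + 1])))
                for i in range(n - 1, len(x))]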

bfrog 5 hours ago | parent | next [-]

See here https://intel.github.io/intel-npu-acceleration-library/npu.h...

Maybe also note that Qualcomm has renamed their Hexagon DSP to Hexagon NN. Likely the change was adding activation functions, but otherwise it's a VLIW architecture with accelerated MAC operations, aka a DSP architecture.

bryanlarsen 4 hours ago | parent | prev [-]

I've worked on DSPs with none of those things. Well, they did have DMA.

duped 5 hours ago | parent | prev [-]

ASICs bake one algorithm into the chip. DSPs are programmable, like GPUs or CPUs. The things that historically set them apart were MAC/FMA instructions and zero-overhead loops. Then there are all the nice-to-haves, like built-in tables of FFT twiddle factors, helpers for 1D convolution, vector instructions, fixed-point arithmetic, etc.

What makes a DSP different from a GPU is that the algorithms typically do not scale nicely to large matrices and vectors - recursive filters, for example. DSPs are also usually much cheaper and lower power, and the reason they lost popularity is that Arm MCUs got good enough and economies of scale kicked in.
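
A one-pole recursive filter in toy Python shows why: each output depends on the previous output, so the loop is inherently sequential and can't be reshaped into one big matmul:

    # Toy one-pole IIR filter: y[n] = x[n] + a * y[n-1].
    # The loop-carried dependency is what resists large-matrix hardware.
    def one_pole(x, a):
        y, prev = [], 0.0
        for sample in x:
            prev = sample + a * prev
            y.append(prev)
        return y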

I've written code for DSPs both in college and professionally. It's much like writing code for CPUs or MCUs (it's all C or C++ at the end of the day). But it's very different from writing compute shaders or designing an ASIC.