I wanna see an inference chip where the weights are part of the rom of the chip.

There would be 1 multiplier per weight (and since they're constant, the whole thing turns into a bunch of simple adders), and the total pipelined system throughput would be one token per clock cycle.
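A minimal sketch of why a constant multiplier collapses into adders, assuming integer (quantized), non-negative weights. The function names are made up for illustration; a real design could probably cut the adder count further with signed-digit encodings.

```python
def shift_add_terms(weight: int) -> list[int]:
    """Return the bit positions that are set in a non-negative constant weight."""
    if weight < 0:
        raise ValueError("this sketch only handles non-negative weights")
    return [bit for bit in range(weight.bit_length()) if (weight >> bit) & 1]


def multiply_by_constant(x: int, weight: int) -> int:
    """Multiply x by a fixed weight using only shifts and adds."""
    total = 0
    for bit in shift_add_terms(weight):
        # In hardware a fixed shift is just wiring, so only the add costs gates.
        total += x << bit
    return total


# weight 10 is binary 1010, so x * 10 == (x << 1) + (x << 3): one adder.
print(shift_add_terms(10), multiply_by_constant(7, 10))
```

The number of set bits in each weight determines how many adders that weight needs, which is one reason heavily quantized weights would be attractive here.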

That means you can probably have millions of users simultaneously using a single bit of silicon, with perhaps 500 million tokens per second coming out the output bus.
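A toy model of how one token per clock could coexist with each user's next token depending on their previous one: interleave many independent users through the pipeline. This is only a sketch under assumptions (the `simulate_pipeline` name, the round-robin policy, and ignoring any per-user attention state are all mine).

```python
from collections import deque


def simulate_pipeline(depth: int, users: int, cycles: int) -> dict[int, list[int]]:
    """Toy model: a `depth`-stage pipeline shared by `users` independent sequences.

    Each user has at most one token in flight, because its next token depends
    on the previous one. Returns, per user, the cycles at which tokens completed.
    """
    pipeline = deque([None] * depth)  # one slot per stage; None is a bubble
    ready = deque(range(users))       # users allowed to issue a token
    finished = {u: [] for u in range(users)}
    for cycle in range(cycles):
        done = pipeline.pop()         # token leaving the last stage this cycle
        if done is not None:
            finished[done].append(cycle)
            ready.append(done)        # that user may now issue its next token
        pipeline.appendleft(ready.popleft() if ready else None)
    return finished


print(simulate_pipeline(depth=4, users=4, cycles=12))  # full: one token out per cycle
print(simulate_pipeline(depth=4, users=1, cycles=12))  # mostly bubbles
```

With at least as many active users as pipeline stages, the output bus is busy every cycle once the pipeline fills, while any single user only gets one token per `depth` cycles.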

Downside is this chip would be huuuuge - a whole wafer.

Wafer level faults probably won't matter though - neural nets are resistant to a few missing or wrong weights.

Due to the speed the industry moves, you'd want to race from model weights to production super fast, make 50 wafers, use them for a year, then bin them when that model is obsolete.

▲

sometimelurker 6 hours ago | parent | next [-]

this appeared some time ago, https://taalas.com/, but I'm sure there's others thinking these same thoughts. this would be best for small models imo, nothing frontier because that changes too fast

▲

1e1a 5 hours ago | parent [-]

you can try it out here: https://chatjimmy.ai/

▲

Meetvelde an hour ago | parent [-]

that's so fast it feels fake

▲

the_sleaze_ an hour ago | parent [-]

13,789 tok/s Well I've gotten one of those "holy fuck this is the future" deeply unsettled anxious feelings in my gut again. It's been a week or 2, it was time.

▲

Smaug123 6 hours ago | parent | prev | next [-]

By the way, you've seen Cerebras? It's not gone as far as what you described - loads of cores and RAM but you still load up the weights onto it as software and they need to be streamed into the chip for large models - but it is a whole wafer.

▲

trouve_search 6 hours ago | parent | next [-]

Cerebras is a whole lot of SRAM, basically a ton more L1/L2 cache, hence increasing throughput. They're pretty supply constrained right now though and their production costs seem prohibitive. The interesting players at the moment are from Toronto: taalas (print the model onto the silicon) and tenstorrent (dataflow programming based hardware)

▲

londons_explore 5 hours ago | parent | prev [-]

There is a huge downside to weights being modifiable - it means you need to have multipliers (not simply adders), and SRAM to store those weights. I suspect for equal performance, that's probably a 5x increase in silicon area (and therefore cost).

▲

phkahler 6 hours ago | parent | prev | next [-]

>> I wanna see an inference chip where the weights are part of the rom of the chip.

I've been wondering about that for a while now. For a lot of tasks putting weights in ROM is probably OK. OTOH:

>> There would be 1 multiplier per weight...

I'm not sure that is a good idea. Maybe if it's quantized down to 2 bits... Otherwise maybe a small ROM near each multiplier (or row of them or whatever) so the multipliers could handle N distinct matrix operations without having to move the data from far away.

Another fun thought is to have a row of MAC units on DRAM so a DRAM row would be a vector. Row size might be 64Kbit or 8K weights if they're 8bit. This also keeps the weights and calcs on the same chip. I'm not sure this would put enough multipliers on one chip though. Systolic arrays can have tens or hundreds of thousands each doing one op per clock cycle.
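Rough arithmetic for the DRAM-row idea, plus what one row of MAC units would compute. The constants just restate the numbers above, and `row_mac` is a hypothetical name, not any real part's interface.

```python
ROW_BITS = 64 * 1024      # one DRAM row, as assumed above
WEIGHT_BITS = 8           # 8-bit weights
WEIGHTS_PER_ROW = ROW_BITS // WEIGHT_BITS


def row_mac(row_weights: list[int], activations: list[int]) -> int:
    """Dot product of one row of weights with an activation vector.

    In the imagined hardware every element would have its own MAC unit,
    so the whole loop would happen in parallel next to the row.
    """
    if len(row_weights) != len(activations):
        raise ValueError("row and activation vector must be the same length")
    acc = 0
    for w, a in zip(row_weights, activations):
        acc += w * a      # one multiply-accumulate per weight
    return acc


print(WEIGHTS_PER_ROW)                    # weights that fit in one row
print(row_mac([1, 2, 3], [4, 5, 6]))      # tiny stand-in for a full row
```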

▲

cyptus 6 hours ago | parent [-]

analog chips could also be very interesting, instead of using digital signals and processing them against the weights in the ROM. I have no idea if that scales with such big models though.

▲

mdp2021 2 hours ago | parent [-]

The drawback is in keeping signal fidelity (e.g. dissipation, temperature etc.) and in the conversion between analogue and digital. Nonetheless, yes, there are already implemented solutions for small NNs (I understand mostly acting as triggers).

▲

mdp2021 2 hours ago | parent | prev | next [-]

> weights [as] part of the rom of the chip

Not really that: you are pointing to Compute-In-Memory (CIM) - techniques where the data (here, a multiplier value) is part of the processor (here, the multiplying circuit).

The problem of "fetch and process" is bypassed completely architecturally: the data is there where the processing happens - it's not moved, there is no latency.

▲

Salgat 3 hours ago | parent | prev | next [-]

Supposedly memristors would be ideal for this (and it would be reprogrammable), but then again, memristors seem to be the carbon nanotubes of the computing world.

▲

yuriyguts 6 hours ago | parent | prev | next [-]

I've also been thinking about this. One catch, though: the forward pass of a transformer model also involves some heavier operations like normalization, reciprocals, exponentiation, and other non-linearities (GeLU, SiLU), which may (though typically don't) involve learned weights as operands.
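For reference, a plain-Python sketch of the operations I mean. These are the usual textbook forms (the tanh approximation for GeLU, and an RMS-style normalization as one example of a norm that does carry a learned gain); they are not taken from any particular model.

```python
import math


def silu(x: float) -> float:
    """SiLU: x * sigmoid(x). Naive form; overflows for very negative x."""
    return x / (1.0 + math.exp(-x))


def gelu_tanh(x: float) -> float:
    """Common tanh approximation of GeLU. No learned weights involved."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def rms_norm(xs: list[float], gain: list[float], eps: float = 1e-6) -> list[float]:
    """RMS normalization. `gain` is the learned per-channel weight vector."""
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [g * x / rms for x, g in zip(xs, gain)]


print(silu(1.0), gelu_tanh(1.0), rms_norm([3.0, 4.0], [1.0, 1.0]))
```

The exp, tanh, sqrt and divide all act on activations rather than on constants, so they don't reduce to fixed adders the way the weight multiplies would.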

▲

HDThoreaun an hour ago | parent | prev | next [-]

How would the pipelining work when the next token depends on the last token?

▲

zkmon 6 hours ago | parent | prev | next [-]

firmware upgrade would mean flashing a huge BIN file.

▲

cruffle_duffle 6 hours ago | parent | prev [-]

“Wafer level faults probably won't matter though - neural nets are resistant to a few missing or wrong weights.”

Brain science people “love” traumatic brain injury cases because it can help explore what happens when bits of the “brain wafer” get damaged. We’ve learned a lot from such things.

I wonder if people are intentionally “destroying” parts of the model weights to learn more about what happens? Like could you strategically wipe a gig of the model so it’s “all zeros” and see what happens?

I have to wonder
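A toy version of that experiment, assuming a single random dense layer rather than a trained model (so it says nothing about how robust a real network is). All names here are made up for illustration.

```python
import random


def forward(weights: list[list[float]], x: list[float]) -> list[float]:
    """One dense layer: each row of `weights` dotted with the input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def ablate(weights: list[list[float]], fraction: float, rng: random.Random) -> list[list[float]]:
    """Return a copy of `weights` with roughly `fraction` of the entries zeroed."""
    return [[0.0 if rng.random() < fraction else w for w in row] for row in weights]


def relative_error(reference: list[float], damaged: list[float]) -> float:
    """Euclidean distance between the outputs, relative to the reference norm."""
    num = sum((r - d) ** 2 for r, d in zip(reference, damaged)) ** 0.5
    den = sum(r * r for r in reference) ** 0.5
    return num / den


rng = random.Random(0)  # fixed seed so runs are repeatable
weights = [[rng.gauss(0, 1) for _ in range(256)] for _ in range(16)]
x = [rng.gauss(0, 1) for _ in range(256)]
baseline = forward(weights, x)

for fraction in (0.0, 0.01, 0.1, 0.5):
    damaged = forward(ablate(weights, fraction, rng), x)
    print(f"{fraction:>5.0%} zeroed -> relative output error {relative_error(baseline, damaged):.3f}")
```

On a real model you would compare task accuracy or loss before and after zeroing, rather than raw output distance.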

▲

mdp2021 2 hours ago | parent | next [-]

Of course tampering with chunks or nodes in the NNs is a way to study the "spawned" (through gradient descent etc.) configuration and "reverse-engineer the black box" to get "AI transparency".

Anthropic published important work on this about a year and a half ago.

▲

zurfer 6 hours ago | parent | prev | next [-]

This is called mechanistic interpretability. There are lots of fascinating insights already, since you can do basically everything down to the neuron or weight level, thousands of times. The human brain is many orders of magnitude harder to make sense of.

▲

sometimelurker 6 hours ago | parent [-]

well it's actually called ablation, and it's one way to do mech interp. Anthropic's got a bunch of work on mech interp here: https://transformer-circuits.pub/, like SAEs and NLAs

▲

Computer0 6 hours ago | parent | prev [-]

Reminds me of Golden Gate Claude (https://www.anthropic.com/news/golden-gate-claude)