londons_explore 7 hours ago
I wanna see an inference chip where the weights are part of the ROM of the chip. There would be 1 multiplier per weight (and since the weights are constant, each multiplier reduces to a handful of simple adders), and the total pipelined system throughput would be one token per clock cycle. That means you could probably have millions of users simultaneously using a single bit of silicon, with perhaps 500 million tokens per second coming out of the output bus. Downside is this chip would be huuuuge - a whole wafer. Wafer-level faults probably won't matter though - neural nets are resistant to a few missing or wrong weights. Given how fast the industry moves, you'd want to race from model weights to production super fast, make 50 wafers, use them for a year, then bin them when that model is obsolete.
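A minimal sketch of why a constant weight turns a multiplier into adders, assuming non-negative integer (quantized) weights. The function names `shift_add_terms` and `const_multiply` are made up for illustration; real hardware would more likely use a signed-digit recoding to cut the adder count further.

```python
def shift_add_terms(weight: int) -> list[int]:
    """Bit positions set in a non-negative integer weight.

    In hardware each position is a fixed wire shift, so a constant
    weight with k set bits needs k - 1 adders and no multiplier.
    """
    if weight < 0:
        raise ValueError("this sketch assumes a non-negative weight")
    return [i for i in range(weight.bit_length()) if (weight >> i) & 1]


def const_multiply(x: int, weight: int) -> int:
    """Multiply x by a fixed weight using only shifts and adds."""
    return sum(x << shift for shift in shift_add_terms(weight))


# weight 10 is binary 1010, so x * 10 == (x << 1) + (x << 3)
print(shift_add_terms(10))
print(const_multiply(7, 10))
```

A zero weight yields no terms at all, which is the software analogue of the circuit simply not being there.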
sometimelurker 6 hours ago
This appeared some time ago, https://taalas.com/, but I'm sure there are others thinking these same thoughts. This would be best for small models imo, nothing frontier because that changes too fast.
Smaug123 6 hours ago
By the way, you've seen Cerebras? It's not gone as far as what you described - loads of cores and RAM but you still load up the weights onto it as software and they need to be streamed into the chip for large models - but it is a whole wafer.
phkahler 6 hours ago
>> I wanna see an inference chip where the weights are part of the ROM of the chip.
I've been wondering about that for a while now. For a lot of tasks putting weights in ROM is probably OK. OTOH:
>> There would be 1 multiplier per weight...
I'm not sure that is a good idea. Maybe if it's quantized down to 2 bits... Otherwise maybe a small ROM near each multiplier (or row of them or whatever) so the multipliers could handle N distinct matrix operations without having to move the data from far away.
Another fun thought is to have a row of MAC units on DRAM so a DRAM row would be a vector. Row size might be 64 Kbit, or 8K weights if they're 8-bit. This also keeps the weights and calcs on the same chip. I'm not sure this would put enough multipliers on one chip though. Systolic arrays can have tens or hundreds of thousands, each doing one op per clock cycle.
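The row-size arithmetic in that comment, plus what a row of MAC units would compute, can be sketched like this. It assumes "64Kbit" means 64 * 1024 bits; `ROW_BITS`, `WEIGHT_BITS` and `row_mac` are illustrative names, not anything from a real DRAM part.

```python
ROW_BITS = 64 * 1024      # assumed: one DRAM row of 64 Kbit
WEIGHT_BITS = 8           # 8-bit quantized weights

# How many weights fit in one row, i.e. the vector length per row access.
weights_per_row = ROW_BITS // WEIGHT_BITS


def row_mac(row_weights, activations):
    """Dot product of one DRAM row of weights with an activation vector.

    In the proposed hardware every product would be formed by its own
    MAC unit sitting next to the row, so the weights never leave the chip.
    """
    if len(row_weights) != len(activations):
        raise ValueError("row and activation vector must be the same length")
    return sum(w * a for w, a in zip(row_weights, activations))


print(weights_per_row)                      # 8192 weights, matching "8K"
print(row_mac([1, 2, 3], [4, 5, 6]))        # 1*4 + 2*5 + 3*6
```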
mdp2021 2 hours ago
> weights [as] part of the rom of the chip
Not really that: you are pointing to Compute-In-Memory (CIM) - techniques where the data (here, a multiplier value) is part of the processor (here, the multiplying circuit). The problem of "fetch and process" is bypassed completely architecturally: the data is there where the processing happens - it's not moved, there is no latency.
Salgat 3 hours ago
Supposedly memristors would be ideal for this (and they would be reprogrammable), but then again, memristors seem to be the carbon nanotubes of the computing world.
yuriyguts 6 hours ago
I've also been thinking about this. The forward pass of a transformer model also involves some heavier operations, though: normalization, reciprocals, exponentiations and other non-linearities (GeLU, SiLU), which may (though typically don't) involve learned weights as operands.
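For reference, a sketch of the operations named there using their standard textbook definitions. None of these three takes a learned weight as an operand, so a weights-in-ROM design would still need general-purpose exp, erf and divide units (or approximations of them). The function names are just for illustration.

```python
import math


def silu(x: float) -> float:
    """SiLU: x * sigmoid(x). Needs an exponential and a reciprocal."""
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(scores: list[float]) -> list[float]:
    """Softmax with the usual max-subtraction for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(silu(1.0), gelu(1.0), softmax([1.0, 2.0, 3.0]))
```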
HDThoreaun an hour ago
How would the pipelining work when the next token depends on the last token?
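One possible reading, which fits the "millions of users" remark in the top comment, is that the pipeline is kept full by interleaving many independent sequences: each user has at most one token in flight, but a different user's token enters every clock. A toy simulation under that assumption (the function and parameter names are hypothetical, and a real design would also have to carry per-sequence state such as the KV cache):

```python
def tokens_emitted(num_stages: int, num_sequences: int, cycles: int) -> int:
    """Count tokens that leave the pipeline within `cycles` clock ticks.

    Each sequence may only start its next token once its previous token
    has left the last stage, which models the autoregressive dependency.
    """
    ready_at = [0] * num_sequences  # first cycle each sequence may enter
    emitted = 0
    for cycle in range(cycles):
        # One pipeline slot opens per cycle; give it to the first ready sequence.
        for seq in range(num_sequences):
            if ready_at[seq] <= cycle:
                finish = cycle + num_stages
                ready_at[seq] = finish
                if finish <= cycles:
                    emitted += 1
                break
    return emitted


# A single user only gets one token per pipeline traversal...
print(tokens_emitted(num_stages=4, num_sequences=1, cycles=12))
# ...but with at least as many users as stages, one token exits per cycle
# once the pipeline has filled.
print(tokens_emitted(num_stages=4, num_sequences=4, cycles=12))
```

So the one-token-per-clock figure would be aggregate throughput; any individual user still waits a full pipeline depth per token.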
zkmon 6 hours ago
A firmware upgrade would mean flashing a huge BIN file.
cruffle_duffle 6 hours ago
> Wafer level faults probably won't matter though - neural nets are resistant to a few missing or wrong weights.
Brain science people "love" traumatic brain injury cases because they can help explore what happens when bits of the "brain wafer" get damaged. We've learned a lot from such things. I wonder if people are intentionally "destroying" parts of the model weights to learn more about what happens? Like, could you strategically wipe a gig of the model so it's "all zeros" and see what happens? I have to wonder.
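The "wipe it to zeros and see what happens" experiment is easy to sketch on a toy scale. This is only a single random linear layer, not a real model, so it says nothing about how an actual network would cope; `zero_ablate` and `matvec` are illustrative names.

```python
import random


def matvec(weights, x):
    """Plain matrix-vector product for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def zero_ablate(weights, fraction, rng):
    """Return a copy of `weights` with a random `fraction` of entries zeroed."""
    cells = [(r, c) for r in range(len(weights)) for c in range(len(weights[0]))]
    chosen = set(rng.sample(cells, int(fraction * len(cells))))
    return [
        [0.0 if (r, c) in chosen else w for c, w in enumerate(row)]
        for r, row in enumerate(weights)
    ]


rng = random.Random(0)
layer = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]

before = matvec(layer, x)
after = matvec(zero_ablate(layer, 0.25, rng), x)
# Per-output change caused by zeroing a quarter of the weights.
print([round(abs(a - b), 3) for a, b in zip(after, before)])
```

To target a region rather than random weights, `chosen` could be replaced by a fixed block of rows or columns.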