bob1029 3 days ago
> You simply cannot compute the gates for the entire sequence in one shot, because each step requires the output from the one before it. This forces a sequential loop, which is notoriously inefficient on parallel hardware like GPUs.

> The crux of the paper is to remove this direct dependency. The simplified models, minGRU and minLSTM, redefine the gates to depend only on the current input.

The entire hypothesis of my machine learning experiments has been that we should embrace the time domain and causal dependencies. I really think biology got these elements correct.

Now the question remains: which kind of computer system is most ideal to run a very branchy and recursive workload? Constantly adapting our experiments to satisfy the constraints of a single compute vendor is probably not healthy for science over the long term.
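To make the quoted point concrete, here is a minimal PyTorch-style sketch of the idea (my own illustration, not the paper's code; the layer names and single-layer setup are made up). Because the update gate and the candidate state depend only on x_t, they can be computed for every timestep in one batched matmul; what remains is a cheap elementwise recurrence, which the paper evaluates with a parallel scan rather than the explicit loop shown here:

    import torch
    import torch.nn as nn

    class MinGRUSketch(nn.Module):
        """Illustrative minGRU-style layer: gates depend only on the input."""
        def __init__(self, dim):
            super().__init__()
            self.to_z = nn.Linear(dim, dim)  # update gate, from x_t only
            self.to_h = nn.Linear(dim, dim)  # candidate state, from x_t only

        def forward(self, x, h0):
            # x: (batch, seq_len, dim), h0: (batch, dim)
            z = torch.sigmoid(self.to_z(x))  # gates for all timesteps at once
            h_tilde = self.to_h(x)           # candidates for all timesteps at once
            # Remaining recurrence: h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t.
            # Written as a loop for clarity; it is a first-order linear
            # recurrence, so it can also be evaluated with a parallel scan.
            h, outs = h0, []
            for t in range(x.shape[1]):
                h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
                outs.append(h)
            return torch.stack(outs, dim=1)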
fennecbutt 2 days ago
Absolutely. I remember an article from ages ago about a self-learning algorithm implemented on an FPGA (I think) that could modify its own makeup at the hardware level. It ended up optimizing in a way that wasn't obvious at first, but turned out to exploit the noise of one part of the chip interacting with another.

Aha, here's the paper: https://osmarks.net/assets/misc/evolved-circuit.pdf

And a fluff article: https://www.damninteresting.com/on-the-origin-of-circuits

As per usual, Google was hopeless at finding the article from a rough description. No chance at all. ChatGPT thought for 10 seconds and delivered the correct result on the first try.
jstanley 3 days ago
> Which kind of computer system is most ideal to run a very branchy and recursive workload?

An analogue one, possibly?
tripplyons 2 days ago
The output of the recurrence is still dependent on previous tokens, but it is usually less expressive within the recurrence in order to make parallelism possible. In minGRU, the main operation used to share information between tokens is addition (with a simple weighting). You can imagine that after one layer of the recurrence the tokens already carry some information about each other, so the input to the following layers depends on previous tokens, although the dependence is indirect compared to a traditional RNN.
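To illustrate the "addition with a simple weighting" point, here is a toy check (arbitrary values, my own example, not from the paper): unrolling the minGRU-style recurrence shows that each hidden state is just a gate-weighted sum of the initial state and the earlier candidate states.

    import torch

    torch.manual_seed(0)
    z1, z2 = torch.rand(4), torch.rand(4)  # gates in (0, 1)
    h0, h1_tilde, h2_tilde = torch.randn(4), torch.randn(4), torch.randn(4)

    # Loop form: h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t
    h1 = (1 - z1) * h0 + z1 * h1_tilde
    h2 = (1 - z2) * h1 + z2 * h2_tilde

    # Unrolled form: products of gates weight each term, then everything is added
    h2_unrolled = ((1 - z2) * (1 - z1) * h0
                   + (1 - z2) * z1 * h1_tilde
                   + z2 * h2_tilde)

    assert torch.allclose(h2, h2_unrolled)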
inciampati 3 days ago
It turns out you can use a fused Triton kernel for a true, sequential GRU and train just as fast as the minGRU model. Yeah, it doesn't work for very long contexts, but neither does minGRU (activation memory...).
nickpsecurity 3 days ago
Analog or FPGA. Cerebras' wafer-scale technology could help. |