bob1029 2 days ago

> You are training the neural network (a big block of matrices) using gradient descent (a trick that modifies the big block of matrices) to do something.

Things get even more mysterious (and efficient) once you introduce non-linearities, recurrence, and a time domain. By comparison, these mainstream ANN architectures are fairly pedestrian: you can visualize and walk through the statistical distributions, and everything is effectively instantaneous within each next-token generation. It's hard to follow an action potential around in time because it is constantly being integrated with other potentials, and its influence is only relevant for a brief moment.
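That fleeting influence can be sketched with a toy leaky integrate-and-fire neuron (a minimal illustrative model, not anything from the comment itself; the function name and parameters here are made up): the membrane potential integrates incoming inputs, leaks toward zero over time, and only produces a spike if enough input arrives before the earlier contributions have decayed away.

```python
# Toy leaky integrate-and-fire neuron (hypothetical sketch).
# Each input is integrated into the membrane potential v, which
# leaks toward 0 with time constant tau; a spike fires when v
# crosses v_thresh, then v resets. An input's influence fades
# quickly unless reinforced by other inputs arriving soon after.
def simulate_lif(inputs, tau=10.0, v_thresh=1.0, v_reset=0.0, dt=1.0):
    v = 0.0
    spikes = []
    for i in inputs:
        v += dt * (-v / tau) + i   # leak plus input current
        if v >= v_thresh:
            spikes.append(1)       # action potential
            v = v_reset            # fire and reset
        else:
            spikes.append(0)
    return spikes

# Two closely spaced inputs sum to a spike; the same inputs far
# apart in time would each decay before the other arrived.
print(simulate_lif([0.0, 0.6, 0.6, 0.0]))  # [0, 0, 1, 0]
```

The point of the sketch is the time domain: the same total input can produce a spike or nothing depending on when the pieces arrive, which is exactly what makes the dynamics harder to follow than a feedforward pass.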