
Pasting a comment I posted elsewhere:

Resources I’ve liked:

Sebastian Raschka book on building them from scratch

Deep Learning a Visual Approach

These videos / playlists:

https://youtube.com/playlist?list=PLoROMvodv4rOY23Y0BoGoBGgQ...

https://youtube.com/playlist?list=PLoROMvodv4rOwvldxftJTmoR3...

https://youtube.com/playlist?list=PL7m7hLIqA0hoIUPhC26ASCVs_...

https://www.youtube.com/live/uIsej_SIIQU?si=RHBetDNa7JXKjziD

here's a basic impl that i trained on tinystories to decent effect: https://gist.github.com/nikki93/f7eae83095f30374d7a3006fd5af... (i used claude code a lot to help with the above because this is a new field for me) (i did this with C and mlx before but ultimately gave in to the python lol)

but overall it boils down to (rough numpy sketches of each step come after the list):

- tokenize the text

- embed tokens (map each to a vector) with a simple NN

- apply positional info so each token also encodes where it is

- do the attention. this bit is key and also very interesting to me. there are three small neural networks – Q, K, V – applied to each token. you then generate a new sequence of embeddings where each position gets the Vs of all tokens summed up, weighted by that position's Q dotted with each other position's K (the weights go through a softmax so they sum to 1). the new embeddings are /added/ to the previous layer (adding like this is called a 'residual' connection)

- also do another NN pass without attention (a small feedforward/MLP), again adding the output back in (residual). there are actually multiple 'heads', each with its own Q, K, V – their outputs are combined (concatenated and projected) before that second NN pass
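to make those steps concrete, here are some rough numpy sketches (mine, not pulled from the gist – random weights, just showing the shapes and the math). tokenization first: real models learn a BPE / byte-level vocab, this is character-level just to show the idea:

    text = "once upon a time"
    vocab = sorted(set(text))                        # tiny character-level vocab
    stoi = {ch: i for i, ch in enumerate(vocab)}
    itos = {i: ch for ch, i in stoi.items()}
    tokens = [stoi[ch] for ch in text]               # text -> list of token ids
    assert "".join(itos[t] for t in tokens) == text  # ids -> text round-trips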
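embedding + position: each token id looks up a row in a learned table, and a learned position vector is added in (GPT-2 style; lots of newer models use rotary embeddings instead):

    import numpy as np
    vocab_size, d_model, max_len = 64, 32, 128
    tok_emb = np.random.randn(vocab_size, d_model) * 0.02   # learned token embedding table
    pos_emb = np.random.randn(max_len, d_model) * 0.02      # learned position embeddings
    ids = np.array([5, 12, 3, 7])
    x = tok_emb[ids] + pos_emb[:len(ids)]                    # (seq, d_model): what + where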
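one (causal) attention head: each position's Q gets dotted with every position's K, that goes through a softmax into weights, the weights sum up the Vs, and the result gets projected back and added onto the stream – the residual:

    import numpy as np

    def softmax(a):
        a = a - a.max(axis=-1, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=-1, keepdims=True)

    seq, d_model, d_head = 4, 32, 16
    x = np.random.randn(seq, d_model)
    Wq, Wk, Wv = [np.random.randn(d_model, d_head) * 0.02 for _ in range(3)]
    Wo = np.random.randn(d_head, d_model) * 0.02

    Q, K, V = x @ Wq, x @ Wk, x @ Wv                  # project every token three ways
    scores = Q @ K.T / np.sqrt(d_head)                # how much position i attends to position j
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores[mask] = -np.inf                            # causal: no peeking at future tokens
    weights = softmax(scores)                         # each row sums to 1
    head_out = weights @ V                            # weighted sum of the Vs
    x = x + head_out @ Wo                             # residual: add back onto the previous layer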
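and the full block: a few heads run in parallel (each with its own Wq/Wk/Wv), their outputs get concatenated and projected, then the per-position feedforward/MLP runs, each piece added back in as a residual. still just random weights to show the structure:

    import numpy as np

    def softmax(a):
        a = a - a.max(axis=-1, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=-1, keepdims=True)

    def attention_head(x, Wq, Wk, Wv):
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        scores = Q @ K.T / np.sqrt(Wq.shape[1])
        scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf  # causal mask
        return softmax(scores) @ V

    rng = np.random.default_rng(0)
    seq, d_model, n_heads = 4, 32, 4
    d_head = d_model // n_heads
    x = rng.normal(size=(seq, d_model))

    # multi-head attention: run each head, concatenate, project back, residual add
    heads = [attention_head(x, *(rng.normal(size=(d_model, d_head)) * 0.02 for _ in range(3)))
             for _ in range(n_heads)]
    Wo = rng.normal(size=(d_model, d_model)) * 0.02
    x = x + np.concatenate(heads, axis=-1) @ Wo

    # feedforward (the "other NN pass"): expand, nonlinearity, project back, residual add
    W1 = rng.normal(size=(d_model, 4 * d_model)) * 0.02
    W2 = rng.normal(size=(4 * d_model, d_model)) * 0.02
    x = x + np.maximum(0, x @ W1) @ W2                # ReLU here for simplicity (GPT-2 uses GELU)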

there's some normalization (layer norm) at each stage to keep the numbers reasonable and stop them from blowing up
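concretely that's layer norm: each token's vector gets rescaled to zero mean / unit variance (plus a learned scale and shift i'm leaving out here). GPT-2-style models apply it before the attention and MLP sub-blocks ('pre-norm'):

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)   # per token: zero mean, unit variance

    x = np.random.randn(4, 32) * 50            # pretend the activations have drifted large
    print(layer_norm(x).std(axis=-1))          # back to ~1.0 for every position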

you repeat the attention + feedforward blocks many times, then the last embedding in the final layer's output gets projected back to vocabulary logits, and that's what you sample the next token from
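so generation ends up looking roughly like this – names here (`blocks`, `W_unembed`) are made up for the sketch, not from my gist:

    import numpy as np

    def sample_next_token(x, blocks, W_unembed, temperature=1.0):
        for block in blocks:
            x = block(x)                        # each block = attention + MLP, both residual
        logits = x[-1] @ W_unembed              # only the last position predicts the next token
        logits = (logits - logits.max()) / temperature
        probs = np.exp(logits)
        probs /= probs.sum()
        return np.random.choice(len(probs), p=probs)

    # toy usage with identity "blocks" and a random unembedding, just to show the shapes
    d_model, vocab_size = 32, 64
    x = np.random.randn(5, d_model)
    W_unembed = np.random.randn(d_model, vocab_size) * 0.02
    next_id = sample_next_token(x, blocks=[lambda h: h], W_unembed=W_unembed)

then you append that sampled id to the sequence and run the whole thing again for the next token.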

i was surprised by how quickly this just starts to generate coherent grammar etc. having the training loop also do a generation step to show example output at each stage of training was helpful to see how the output qualitatively improves over time, and it’s kind of cool to “watch” it learn.

this doesn't cover MoE, sparse vs dense attention, or the whole thing about RL on top of these (whether for human feedback or for doing "search with backtracking and sparse reward") – i haven't coded those up yet, just kinda read about them…

now the thing is – this is a setup for it to learn some processes, spread among the weights, that do what it does – but what those processes actually are still seems very unknown. "mechanistic interpretability" is the space that's meant to work on that – been looking into it lately.