▲ | Razengan 5 days ago |
Could someone kindly point to a convenient all-in-one ELI5 of all these words? :')
▲ | porridgeraisin 5 days ago | parent | next [-]
Background: LLMs take your input, upscale it into a very high-dimensional space, and then downscale it back to 1D at the end. This 1D list is interpreted as a list of probabilities, one for each word in your vocabulary. I.e., f(x) = downscale(upscale(x)). Each of downscale() and upscale() is parameterized (billions of params).

I see you have a gamedev background, so as an example: Bezier curves are parameterized functions where the Bezier handles are the parameters. During training, these parameters are continuously adjusted so that the output of the overall function gets closer to the expected result. Neural networks are just really flexible functions for which you can choose parameters to get any expected result, provided you have enough of them (similar to Bezier curves in this regard).

---

When training, you make an LLM learn that

  I use arch = downscale(upscale(I use))

If you want to predict the next word after that, you do the following, next in sequence:

  I use arch btw = downscale(upscale(I use arch))

Now, multi-token prediction means having two downscale functions, one for each of the next two words, and learning it that way. Basically, you have a second downscale2() that learns how to predict the next-to-next word. I.e., in parallel:

  I use arch = downscale1(upscale(I use))
  I use ____ btw = downscale2(upscale(I use))

However, this way you'll need twice the number of parameters downscale needs. And if you want to predict more tokens ahead, you'll need even more parameters.

What Qwen has done is, instead of downscale1 and downscale2 being completely separately parameterized functions, they set downscale1(.) = lightweight1(downscale_common(.)) and downscale2(.) = lightweight2(downscale_common(.)). This is essentially betting that a lot of the logic is common, and that the difference between predicting the next and the next-to-next token can be captured in one lightweight function each. Lightweight here means fewer parameters. The bet paid off. So overall, you save params. Concretely,

  Before: downscale1.params + downscale2.params
  After: downscale_common.params + lightweight1.params + lightweight2.params

Edit: it's actually downscale_common(lightweight(.)) and not the other way around as I have written above. Doesn't change the crux of the answer, but just including this for clarity.
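A minimal PyTorch sketch of the shared-trunk idea, using the composition from the edit (downscale_common(lightweight(.))). The dimensions, the TwoTokenPredictor module, and names like lightweight1/downscale_common are my illustrative assumptions, not Qwen's actual architecture:

  import torch
  import torch.nn as nn

  class TwoTokenPredictor(nn.Module):
      # Toy multi-token prediction head: one heavy shared projection
      # ("downscale_common") plus one cheap adapter per predicted position.
      def __init__(self, d_model=512, vocab_size=32000):
          super().__init__()
          self.downscale_common = nn.Linear(d_model, vocab_size)  # shared, expensive
          self.lightweight1 = nn.Linear(d_model, d_model)  # next token, cheap
          self.lightweight2 = nn.Linear(d_model, d_model)  # next-to-next token, cheap

      def forward(self, h):
          # h: hidden state from upscale(), shape (batch, d_model)
          logits1 = self.downscale_common(self.lightweight1(h))  # predicts t+1
          logits2 = self.downscale_common(self.lightweight2(h))  # predicts t+2
          return logits1, logits2

  h = torch.randn(4, 512)  # stand-in for upscale("I use")
  l1, l2 = TwoTokenPredictor()(h)
  print(l1.shape, l2.shape)  # torch.Size([4, 32000]) twice

With these toy numbers you can see where the savings come from: the shared vocabulary projection dominates at 512 x 32000 ≈ 16.4M params, paid once, while each extra adapter adds only 512 x 512 ≈ 0.26M, versus another ~16.4M per position if every head were a full downscale.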
▲ | lcnPylGDnU4H9OF 5 days ago | parent | prev | next [-]
The best primer I've seen is Andrej Karpathy's first video in his "zero to hero" series. It's worth following along and practicing on your own.
▲ | vessenes 5 days ago | parent | prev | next [-]
Unfortunately, no. The industry is moving super quickly, spinning up new ideas on the backs of old ones at a fast rate. If you want to understand what's going on, I think the best thing to do is take some intro courses, train and design some smaller models directly, and get a list of core papers and concepts from Claude/Chat/Gemini. Then, as you read something like this, if you don't know an acronym (in this case, MTP = Multi Token Prediction), search it up and see if you have the basis for understanding what it's about. If not, read up on the precursors.

Unlike many disciplines, AI is an arena that doesn't have a lot of intuitive simplified models that are accurate -- most of the simplified models available don't describe what's going on accurately enough to reason about and understand. So, you just have to start reading!
▲ | wickedsight 5 days ago | parent | prev | next [-]
For me, ChatGPT or any of the other current thinking models are very useful for this type of stuff. I just ask it to explain things at my level, and then I can ask follow-up questions for clarification.
▲ | pmarreck 4 days ago | parent | prev [-]
The following was generated by chatG5: