Remix.run Logo
vessenes 3 days ago

The reason you're getting slightly downvoted, I think, is that you need to answer this question first: which of the 15T tokens are you going to evaluate for forgetting? And, please explain how doing that is different than doing another full epoch type pass over the weights.

Some of the appeal here is that this architecture (handcrafted) allows ongoing gradient descent learning as you go on a much smaller set of weights.