FromTheFirstIn 2 hours ago

My understanding is that language has an irreducible entropy floor: regardless of context length, there will be “unpredictable” tokens where some cross-entropy loss is unavoidable. Even with infinite parameters and data, you’ll still fail to predict the next token correctly a decent chunk of the time.
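
A toy sketch of what that floor means, assuming a made-up three-token next-token distribution (the numbers are purely illustrative, not measured from any real language or model). Cross-entropy against the true distribution can never go below that distribution's own entropy, and even a perfect model's top guess is only right as often as the most likely token occurs:

```python
import math

def cross_entropy(p, q):
    """Expected loss in nats when the truth is p and the model predicts q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical true next-token distribution for some context (made-up numbers).
true_dist = [0.5, 0.3, 0.2]

# A "perfect" model predicts exactly the true distribution,
# so its loss equals the entropy of true_dist: this is the floor.
floor = cross_entropy(true_dist, true_dist)

# Any other prediction pays an extra KL-divergence penalty on top of the floor.
worse = cross_entropy(true_dist, [0.8, 0.1, 0.1])

# Even the perfect model's single best guess is right only max(p) of the time.
best_top1_accuracy = max(true_dist)

print(f"floor={floor:.4f} nats, worse={worse:.4f} nats, top-1={best_top1_accuracy}")
```

More parameters or data can shrink the gap between `worse` and `floor`, but nothing in the model can push the loss below `floor` for this context.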

Also, compute scales quadratically with context length because of attention, so linear gains in context cost quadratically more compute. Depending on context growth means taking a bath on GPUs for as long as you can, right?
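
Back-of-the-envelope version of the quadratic cost, assuming plain dense self-attention with no sparse or linear-attention tricks. `attention_flops` and the `d_model` value are hypothetical names and numbers picked for illustration, and the count covers only the score and weighted-sum matmuls (the projections and MLP grow roughly linearly with tokens):

```python
def attention_flops(n_tokens, d_model):
    """Rough FLOPs for QK^T plus the attention-weighted sum over V in one layer."""
    scores = 2 * n_tokens * n_tokens * d_model           # (n x d) @ (d x n)
    weighted_values = 2 * n_tokens * n_tokens * d_model  # (n x n) @ (n x d)
    return scores + weighted_values

d_model = 4096  # hypothetical model width

base = attention_flops(8192, d_model)
doubled = attention_flops(16384, d_model)

# Doubling the context quadruples this part of the compute.
ratio = doubled / base
print(ratio)
```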