| ▲ | storywatch 2 hours ago | |
Haven't read the full paper, but the local generation window seems a little small, especially since image inputs are token-heavy. Depending on where the local attention layer is located, it would be nicer if the window were bigger, e.g. at least 4096 tokens. | ||