▲ | yorwba 2 days ago | |
You could downscale text the same way you downscale images, by averaging token embeddings instead of pixel values. But you don't have to. AFAIK vision transformers don't suffer from sparse gradients that need a resolution hierarchy to overcome. Downscaling is just a performance optimization, because processing an image at full resolution is expensive. | ||
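To make the "averaging token embeddings" idea concrete, here is a rough sketch in plain Python. It treats a text as a list of embedding vectors and mean-pools each run of `factor` consecutive tokens, much like box-filter downscaling of an image row. The function name `downscale_embeddings`, the `factor` parameter, and the toy 2-dimensional vectors are all made up for illustration; real models may pool differently or not at all.

```python
def downscale_embeddings(embeddings, factor=2):
    """Average each run of `factor` consecutive token embeddings into one vector.

    `embeddings` is a list of equal-length lists of floats (one per token).
    A trailing run shorter than `factor` is averaged over its actual length.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled = []
    for start in range(0, len(embeddings), factor):
        window = embeddings[start:start + factor]
        dim = len(window[0])
        # Component-wise mean over the tokens in this window.
        pooled.append([sum(vec[i] for vec in window) / len(window) for i in range(dim)])
    return pooled


# Toy example: five "tokens" with hypothetical 2-dimensional embeddings.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]
halved = downscale_embeddings(tokens, factor=2)
print(halved)  # [[2.0, 1.0], [6.0, 5.0], [9.0, 8.0]]
```

The sequence gets shorter (5 vectors become 3), which is where the compute saving would come from, at the cost of blurring neighbouring tokens together.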
▲ | sroussey a day ago | |
So downscaling will summarize? |