yorwba a month ago
You could downscale text the same way you downscale images, by averaging token embeddings instead of pixel values. But you don't have to. AFAIK vision transformers don't suffer from sparse gradients that need a resolution hierarchy to overcome. Downscaling is just a performance optimization, because processing an image at full resolution is expensive.
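To make the analogy concrete, here's a rough sketch of what "averaging token embeddings" could look like: mean-pooling consecutive embedding vectors, the same way a 2x image downscale averages blocks of pixels. The function name `downscale_embeddings` and the `factor` parameter are made up for illustration, and the embeddings are plain lists of floats rather than anything from a real model.

```python
def downscale_embeddings(embeddings, factor):
    """Mean-pool a sequence of embedding vectors in windows of `factor` tokens.

    `embeddings` is a list of equal-length lists of floats (one per token).
    Hypothetical helper for illustration only, not any library's API.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")

    pooled = []
    for start in range(0, len(embeddings), factor):
        # The last window may be shorter than `factor`; average what is there.
        window = embeddings[start:start + factor]
        dim = len(window[0])
        pooled.append(
            [sum(vec[i] for vec in window) / len(window) for i in range(dim)]
        )
    return pooled


# Five toy 2-dimensional "token embeddings", downscaled by a factor of 2.
tokens = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]
print(downscale_embeddings(tokens, 2))
```

The output sequence is shorter by roughly `factor`, so anything downstream has fewer positions to process, which is the performance win. Whether the averaged vectors are still meaningful to a given model is a separate question this sketch doesn't answer.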
sroussey a month ago
So downscaling will summarize?