| ▲ | michaelt 2 hours ago | |
Not necessarily. See the paper See "DeepSeek-OCR: Contexts Optical Compression" [1] One option, when an image is fed into an LLM, is to divide it into tiles, then those tiles pass through a 'vision encoder' neural network to make 'vision tokens' which are then input into the LLM much like text tokens are. Obviously you train the vision encoder and LLM to understand one another. This is known as an 'end-to-end OCR model'. And it turns out, once you've trained a model to do this, you can vary the number of 'vision tokens' used to represent a given text document by scaling an image of a document up or down, and see what happens. You also get a load of other parameters like patch size and vision encoder complexity and so on. Turns out it works really well; in some tests they used 90% fewer input tokens, but still got 97% output performance. | ||