| ▲ | stingraycharles 2 hours ago | |
“I mean, under the covers it's got to be converting to text tokens at some point” Multi-modal models do actually natively tokenize images, though. So it doesn’t have to be converted to text for it to work. They may do it anyway for accuracy, but it’s not at all required. Effectively an image is scaled to a standard size, rasterized / cut up, and each cut is assigned a separate token, much in the same way text is tokenized. Train the model on this as well and you’ll end up having a model that can understand images. | ||