jvalencia 11 hours ago

Multimodal models are trained to understand encoded images. It really is magic. Base64 is a binary-to-text encoding scheme that represents image bytes as a printable ASCII string, and a data URI wraps that string as data:[&lt;mediatype&gt;][;base64],&lt;data&gt;. We think of LLMs as only good at text, but any structured data is predictable. As long as the input can be mapped to a high-dimensional vector that represents a complex idea in the model's hidden layers, the model treats it essentially like text. With sufficient training data, it finds the structure in what looks like noise to us.
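For the curious, here is a minimal sketch of the encoding itself: base64 maps every 3 raw bytes to 4 printable ASCII characters, and the data URI just prefixes a media type. The function names and the placeholder bytes below are my own illustration, not any particular API.

```python
import base64

def image_to_data_uri(image_bytes: bytes, media_type: str = "image/png") -> str:
    # Base64 turns arbitrary binary into printable ASCII, so the image
    # survives inside a plain-text prompt.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{media_type};base64,{encoded}"

def data_uri_to_bytes(uri: str) -> bytes:
    # Strip the "data:<mediatype>;base64," prefix and decode the payload.
    _, payload = uri.split(",", 1)
    return base64.b64decode(payload)

# Stand-in bytes (a PNG signature plus padding), not a real renderable image.
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
uri = image_to_data_uri(fake_png)
assert uri.startswith("data:image/png;base64,")
assert data_uri_to_bytes(uri) == fake_png  # the round trip is lossless
```

The point is that the "noise" the model sees is fully deterministic: the same pixels always produce the same ASCII string, which is exactly the kind of regularity models can learn from.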