▲ | koakuma-chan 5 days ago
What is VLM?
▲ | pwatsonwailes 5 days ago | parent | next [-]
Vision language models. Basically an LLM plus a vision encoder, so the LLM can look at stuff.
▲ | echelon 5 days ago | parent | prev | next [-]
Vision language model. You feed it an image, it determines what's in the image, and it gives you text. The output can be a list of objects, or something much richer like a full description of everything happening in the image.

VLMs are hugely significant. Not only are they great for product use cases, giving users the ability to ask questions with images, but they're how we gather the synthetic training data to build image and video animation models. We couldn't do that at scale without VLMs: no team of human annotators would be up to the task of labeling billions of images and videos consistently.

Since they're a combination of an LLM and an image encoder, you can ask them questions and get smart feedback. You can ask, "Does this image contain a fire truck?" or, "You are labeling scenes from movies, please describe what you see."
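The "LLM plus vision encoder" combination described above can be sketched in a few lines. This is a toy illustration of the wiring, not any real model: all the shapes, layer names, and random weights are made-up stand-ins. The idea is that a vision encoder turns image patches into features, a small projection layer maps those features into the LLM's token-embedding space, and the LLM then attends over image tokens and text tokens as one sequence.

```python
# Toy sketch of the VLM pattern: vision encoder -> projection -> LLM sequence.
# All dimensions and weights here are arbitrary stand-ins for illustration.
import numpy as np

rng = np.random.default_rng(0)

IMG_PATCHES = 16   # e.g. a 4x4 grid of image patches
VIT_DIM = 32       # vision encoder output width (toy)
LLM_DIM = 64       # LLM embedding width (toy)

def vision_encoder(image_patches: np.ndarray) -> np.ndarray:
    """Stand-in for a ViT-style encoder: one linear map per patch."""
    w = rng.normal(size=(image_patches.shape[1], VIT_DIM))
    return image_patches @ w                      # (patches, VIT_DIM)

def project_to_llm(vision_feats: np.ndarray) -> np.ndarray:
    """The 'glue' layer: maps vision features into the LLM's token space."""
    w = rng.normal(size=(VIT_DIM, LLM_DIM))
    return vision_feats @ w                       # (patches, LLM_DIM)

# Fake inputs: 16 flattened 8x8 patches, plus 5 already-embedded text tokens
# for a prompt like "Does this image contain a fire truck?"
patches = rng.normal(size=(IMG_PATCHES, 64))
text_tokens = rng.normal(size=(5, LLM_DIM))

image_tokens = project_to_llm(vision_encoder(patches))
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)

print(llm_input.shape)   # (21, 64): the LLM sees image + text as one sequence
```

From the LLM's point of view, the projected image patches are just more tokens in the context window, which is why you can mix a question and an image in a single prompt.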
▲ | dmos62 5 days ago | parent | prev [-]
LLM is a large language model, VLM is a vision language model of unknown size. Hehe.