I have difficulty trusting this. There are plenty of videos online of LLMs making stuff up, like "I just ate a hot dog, is there mustard around my mouth?" "No, everything is clean" while there is a big yellow stain on the guy's face.

WarmWash 8 hours ago | parent | next [-]

The problem is using a language model to assess images.

Probably 80% of "LLMs are below expectations" complaints (from the general population) involve some form of image analysis.

Image tokenization is hard because, unlike language tokenization, where every token is extremely dense with meaning, image tokens tend to be meaningless or irrelevant but are processed all the same.
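To put rough numbers on that, here is a minimal sketch assuming a ViT-style tokenizer that cuts the image into fixed square patches, one token per patch. The 16-pixel patch size and the `patch_token_count` name are assumptions for illustration; real multimodal models resize, tile, or compress images in their own ways.

```python
def patch_token_count(width: int, height: int, patch: int = 16) -> int:
    """Count patch tokens for a hypothetical ViT-style tokenizer.

    Each side is divided into `patch`-pixel cells, rounding up so that
    partial patches at the edges still count as one token.
    """
    cols = -(-width // patch)   # ceiling division
    rows = -(-height // patch)  # ceiling division
    return cols * rows


print(patch_token_count(224, 224))    # 196
print(patch_token_count(1024, 1024))  # 4096
```

Under these assumptions a single photo costs hundreds to thousands of tokens, and most of them cover background rather than the one detail the question is about.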

Give a SOTA LLM a picture of toothpicks and ask it to move one to make a square, and it will probably struggle and fumble it. But give a mid-size LLM from 2 years ago the same problem in verbal form, and it will nail it almost every time.

The takeaway is: do everything you can to avoid making the LLM rely on images for the answer.

	gruez 6 hours ago | parent [-]
		I thought all the recent models were "multimodal"? Is the image part just an image recognizer stuck in front of the text model?

RobMurray 4 hours ago | parent | prev | next [-]

Most of those videos are ChatGPT voice mode, which still used GPT-4o last time I checked. It is far from SOTA.

postalrat 8 hours ago | parent | prev [-]

As with coding or creating images or text, maybe the alternative of doing it yourself is easy or enjoyable enough for you. Don't expect that to be true for everyone.

	Almondsetat 6 hours ago | parent [-]
		Did you reply to the wrong person? What are you even trying to say here?