sireat 4 days ago

Like Simon, I've started using the camera for random ChatGPT research. For one, ChatGPT works fantastically at random bird identification (along with pretty much all the other details, and the likely location) - https://xkcd.com/1425/

There is one big failure mode though - ChatGPT hallucinates the middle of simple textual OCR tasks!

I will feed ChatGPT a simple computer hardware invoice with 10 items - out come the first few items perfectly, then plausible but fake middle items (like MSI 4060 16GB instead of Asus 5060 Ti 16GB), and the last few items are again correct.

If you start prompting with hints, the model will keep making up other models and manufacturers; it will apologize and then come up with an equally incorrect Gigabyte 5070.

I can forgive mistaking a 5060 for a 5080 - see https://www.theguardian.com/books/booksblog/2014/may/01/scan... . However, how can the model completely misread the manufacturers?

This would be trivially fixed by reverting to Tesseract-based OCR, as ChatGPT used to do.

PS: Just tried it again, and for the 3rd item it gave Kingston instead of the correct GSKILL as the RAM manufacturer.

Basically, ChatGPT OCRs the way a human skims: it reads the header carefully, confabulates the middle, and then gets the footer correct.
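One cheap sanity check along these lines: run a classical OCR pass (e.g. Tesseract) alongside the LLM and fuzzy-compare the two transcripts line by line, flagging items that disagree too much - exactly the confabulated-middle case. A minimal stdlib sketch (the function name, threshold, and invoice lines are mine, not from any real pipeline):

```python
from difflib import SequenceMatcher

def flag_suspect_lines(llm_lines, ocr_lines, threshold=0.7):
    """Pair up line items from two transcripts and flag pairs whose
    similarity ratio falls below the threshold -- these are the likely
    middle-of-document confabulations worth re-checking by eye."""
    flagged = []
    for i, (a, b) in enumerate(zip(llm_lines, ocr_lines)):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio < threshold:
            flagged.append((i, a, b, round(ratio, 2)))
    return flagged

# Hypothetical 3-item invoice: the LLM swapped the middle item's
# manufacturer and model, as in the anecdote above.
llm_out  = ["Asus ROG Strix B650-A", "MSI 4060 16GB",      "Corsair RM850x"]
tess_out = ["Asus ROG Strix B650-A", "Asus 5060 Ti 16GB",  "Corsair RM850x"]
print(flag_suspect_lines(llm_out, tess_out))
```

The threshold is a judgment call: too high and you flag every OCR smudge, too low and plausible-but-wrong substitutions slip through.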

simonw 4 days ago | parent [-]

Yeah, I've been disappointed in GPT-5 for OCR - Gemini 2.5 is much better on that front: https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...

IanCal 4 days ago | parent [-]

For images in general, nothing comes close to Gemini 2.5 for understanding scene composition. It can do segmentation, so you can even ask it for things like masks of arbitrary objects, or bounding boxes.
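If anyone tries this: per Google's object-detection docs, Gemini returns bounding boxes as [ymin, xmin, ymax, xmax] normalized to a 0-1000 grid, so you need a small conversion to pixel space before drawing them. A sketch (function name is mine; the coordinate convention is my reading of the docs, so double-check against the current API reference):

```python
def to_pixels(box, width, height):
    """Convert a Gemini-style [ymin, xmin, ymax, xmax] box, normalized
    to a 0-1000 grid, into (left, top, right, bottom) pixel coordinates.
    Assumption: this is the convention described in Google's docs."""
    ymin, xmin, ymax, xmax = box
    return (int(xmin / 1000 * width), int(ymin / 1000 * height),
            int(xmax / 1000 * width), int(ymax / 1000 * height))

# e.g. a box covering the center quarter of a 1920x1080 frame:
print(to_pixels([250, 250, 750, 750], 1920, 1080))  # (480, 270, 1440, 810)
```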