| ▲ | thecr0w a day ago | |
Yeah, still trying to build my intuition. Experiments/investigations like this help me. Any other blogs or experiments you'd suggest? | ||
| ▲ | fnordpiglet a day ago | |
Asking your favorite LLM actually helps a lot. Unsurprisingly, they’re generally well trained on LLM papers. In this case, though, it’s important to realize that the LLM can’t see, hear, or read anything directly. Everything has to be transformed into a vector space. Images are generally cut into patches (16x16 pixels, say), which are then passed through several neural network layers that map them into a semantic space defined by the model’s parameters. That isn’t hugely different from your own vision: you don’t see the pixel grid either, and you have to use tools to measure things. The difference is that you can interact with the image iteratively over time, perhaps by counting grid lines, and the LLM can’t. It gets a one-shot inference against this highly transformed image. Models have gotten better at complex visual tasks, including some kinds of counting, but the model can’t examine the image analytically or in its original representation. It’s just not possible. It can, however, make tools that can. It’s very good at working with PIL and other image processing libraries, or even writing image processing code de novo, and then using those to ground itself. Likewise, it cannot do math itself, but it can write a calculator that does highly complex mathematics on its behalf. | ||
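To make the patch idea and the "write a tool instead" idea a bit more concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the toy image is a nested list rather than a real file, the helper names (`make_grid_image`, `to_patches`, `count_vertical_lines`) are made up, and the patch step only mimics the first stage of a ViT-style encoder, not any particular model's pipeline.

```python
# Toy illustration only: a grayscale image as nested lists, no real model or PIL.

def make_grid_image(height, width, spacing):
    """White image (255) with black (0) grid lines every `spacing` pixels."""
    return [
        [0 if (x % spacing == 0 or y % spacing == 0) else 255 for x in range(width)]
        for y in range(height)
    ]

def to_patches(image, patch=16):
    """Split the image into non-overlapping patch x patch tiles and flatten
    each tile into one vector (hypothetical stand-in for patch embedding input)."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            flat = [
                image[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ]
            patches.append(flat)
    return patches

def count_vertical_lines(image, threshold=128):
    """The 'tool' approach: count runs of columns that are dark in every row."""
    width = len(image[0])
    dark = [all(row[x] < threshold for row in image) for x in range(width)]
    # A line starts wherever a dark column follows a non-dark one (or the edge).
    return sum(1 for x in range(width) if dark[x] and (x == 0 or not dark[x - 1]))

image = make_grid_image(64, 64, spacing=10)
patches = to_patches(image)
print(len(patches), len(patches[0]))   # number of patches, values per patch
print(count_vertical_lines(image))     # exact count from the original pixels
```

The point of the contrast: the grid lines fall every 10 pixels while the patches are 16 pixels wide, so each line lands at a different offset inside each patch vector, and the count has to be inferred from those vectors. The small counting function works directly on the original pixels and gets an exact answer, which is the kind of grounding a model-written tool can provide.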