fnordpiglet a day ago
Especially since it decomposes the image into a semantic vector space rather than the actual grid of pixels. Once the image is transformed into patch embeddings, all sense of individual pixels is destroyed. The author demonstrates a profound lack of understanding of how multimodal LLMs function, one that a simple query to such a model would dispel immediately.

The right way to handle this is not to feed it grids and whatnot, which all get blown away by the embedding encoding, but to instruct it to build image-processing tools of its own and to mandate their use for constructing the required coordinates, computing the eccentricity of the pattern, etc., in code and language space. Done this way, you can even have it write assertion tests comparing the original layout to the final one across various image-processing metrics. This would assuredly work better, take far less time, be more stable across iterations, and fit neatly into how a multimodal agentic programming tool actually functions.
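To make that concrete, here is a minimal sketch of the kind of tool I mean, computing a pattern's centroid coordinates and eccentricity from a binary mask using second-order moments. The function name and the numpy-only approach are my own assumptions for illustration, not anything a particular agent would emit:

```python
import numpy as np

def pattern_metrics(mask: np.ndarray):
    """Return (centroid, eccentricity) of a binary mask.

    Eccentricity is derived from the eigenvalues of the pixel
    covariance matrix: sqrt(1 - lambda_min / lambda_max), which is
    ~0 for a symmetric blob and ~1 for a line-like pattern.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    cov = np.cov(np.stack([ys - cy, xs - cx]))
    lam = np.sort(np.linalg.eigvalsh(cov))  # ascending eigenvalues
    if lam[1] <= 0:
        return (cy, cx), 0.0
    ecc = float(np.sqrt(max(0.0, 1.0 - lam[0] / lam[1])))
    return (cy, cx), ecc
```

An agent can then assert properties like "the redrawn pattern's eccentricity matches the original's within some tolerance" instead of eyeballing pixels it can't actually see.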
mcbuilder a day ago
Yeah, this is exactly what I was thinking. LLMs don't have precise geometric reasoning over images. Having an intuition for how the models work is actually a defining skill in "prompt engineering".
thecr0w a day ago
Great, thanks for that suggestion!