ACCount37 3 days ago
I agree with the general idea, but "sucks at reading plumbing diagrams" is the one specific example where Claude is actually choked by its unfortunate architecture. The "naive" vision implementation for LLMs is: break the input image down into N tokens and cram those tokens into the context window. The "break the input image down" step is completely unaware of the LLM's context, so it has no idea which details would be useful to the LLM. Often, the vision frontend just tries to convey the general "vibes" of the image to the LLM backend and hopes the LLM can pick out something useful from that. That is "good enough" for a lot of tasks, but far from all of them.
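To make the "naive" pipeline concrete, here is a toy sketch. It is not how Claude or any real model is implemented: real frontends use learned encoders, while this one replaces the encoder with a plain average over each patch. All function names are made up for illustration. The point is only that the image-to-token step never sees the prompt, and that a thin line (say, a pipe in a diagram) gets smeared into a faint number per patch.

```python
def patchify(image, patch):
    """Split a 2D grayscale image (list of rows) into patch x patch tiles, row-major."""
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


def encode_tile(tile):
    """Toy stand-in for a vision encoder (assumption): mean brightness of the tile."""
    values = [v for row in tile for v in row]
    return sum(values) / len(values)


def image_to_tokens(image, patch=4):
    # No prompt argument: the frontend cannot know what the question will be.
    return [encode_tile(tile) for tile in patchify(image, patch)]


def build_context(image, prompt, patch=4):
    """Cram the N image tokens into the context ahead of the (whitespace-split) prompt."""
    return image_to_tokens(image, patch) + prompt.split()


# An 8x8 blank image with a one-pixel-wide vertical "pipe" in column 2.
image = [[1.0 if x == 2 else 0.0 for x in range(8)] for y in range(8)]

print(image_to_tokens(image, patch=4))  # [0.25, 0.0, 0.25, 0.0]
```

Whatever you ask about the image, the backend gets the same four numbers. The exact column of the pipe is already gone before the LLM reads the question.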