light_hue_1 15 hours ago
This isn't the reason. Models are pretty good at understanding relative positions; we put that in them and reward it a lot. The issue is the same as why we don't use LLMs for image generation, even though they can nominally do that. Image generation seems to need some ability to revise the output in place, and it needs a big-picture view to make local decisions. It doesn't lend itself to being output pixel by pixel or character by character.