Hopefully in the next decade we’ll get there.
Vision-language multimodal models seem to solve some of the hard problems.