I suppose the gold standard would be a multimodal model that also looks at the screen (maybe only if the captions aren't making much sense).