Why wouldn’t predicting the next frame in a video stream be as effective as predicting the next word?