arXiv has made the LaTeX source available for download for a while now. I'm sure all of that data has long since been used for training.