Well, SO is probably the highest-quality data source for a language model, and the rest of the internet is just diluting a final latent space whose upper bound is Jon Skeet.