This might be an interesting research question: can you train a model on many languages, and then extract a much smaller model that knows only one language without much loss of quality?