mitthrowaway2 | 4 hours ago
My interpretation is that it's the other way around. The trainer's job is to find the network weights that make the model best at compressing the training data. So this result suggests that, say, professional work-speak text and hacker l33t-speak text are different enough that they end up being predicted by different sparse sub-networks; apparently it was too hard to find a smaller solution in which the same sub-network weights predict both kinds of output. A toy illustration of the compression framing is sketched below.
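
A minimal sketch of the "training = compression" idea, using a toy character bigram model as a stand-in for a neural network (the bigram counts, add-one smoothing, and the two sample strings are my own illustrative assumptions, not anything from the thread). The ideal compressed size of a text under a model is the sum of -log2 p(token | context) bits, so a model fit to one register compresses that register well and the other poorly:

    import math
    from collections import defaultdict

    def train_bigram(text):
        """Count character bigram frequencies (a stand-in for learned weights)."""
        counts = defaultdict(lambda: defaultdict(int))
        for prev, cur in zip(text, text[1:]):
            counts[prev][cur] += 1
        return counts

    def code_length_bits(model, text, alphabet_size=128):
        """Ideal compressed size of text under model, with add-one smoothing."""
        total_bits = 0.0
        for prev, cur in zip(text, text[1:]):
            ctx = model[prev]
            total = sum(ctx.values()) + alphabet_size  # add-one smoothing
            p = (ctx[cur] + 1) / total
            total_bits += -math.log2(p)  # bits an entropy coder would need
        return total_bits

    work_speak = "per my last email, please find the attached report " * 20
    l33t_speak = "h4x0r pwns n00bs w1th m4d sk1llz " * 20

    # A model trained on one register compresses it well but compresses
    # the other poorly -- loosely analogous to distinct sub-networks
    # specializing in distinct text distributions.
    model = train_bigram(work_speak)
    print("work-speak bits/char:", code_length_bits(model, work_speak) / len(work_speak))
    print("l33t-speak bits/char:", code_length_bits(model, l33t_speak) / len(l33t_speak))

In this loose analogy, a single set of shared weights that predicted both registers equally well would be the "smaller solution"; when the distributions are different enough, separate specialized predictors win instead.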