wizardforhire · 2 days ago
Obligatory [1]. My apologies for not being able to find the original tale; I'm sure the original website is still around, but this is a decent synopsis regardless. It doesn't look like the article covers it, but if I remember correctly they pruned the model down to fit on a 56K EPROM so it could originally be sold for $10 (also dating myself; this article claims $15). And of course the jargon has changed with time: I guess we're saying "distilled" now, but originally we said "pruned," because that's what you did: once you had your weights, you would prune the rest of the network to get the core model. I guess "distilled" works too, just less literally imho. And if we want to get really pedantic, networks exist in liquids, but I digress.

[1] (apologies for the ad crap; best I could find) https://www.mentalfloss.com/article/22269/how-electronic-20-...
DoctorOetker · 2 days ago
Pruning and distilling are two totally different things.

Pruning: discarding low-weight connections after training. This makes the network sparser but also less regular (complicating the memory layout and the compute kernels that access the sparse weights).

Distilling: taking a large pretrained model and training a smaller one from it. For example, consider a cloze task (fill in the blanked token in a sentence): compute the probabilities using the large model, then train the smaller model to reproduce those probabilities.

Distilling fits the knowledge into a smaller regular network, of potentially totally different architecture, while pruning discards low-weight coefficients, resulting in a sparser network.
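To make the contrast concrete, here's a minimal PyTorch sketch of both ideas. The layer sizes, data, threshold, and training loop are toy stand-ins for illustration, not any particular model:

```python
import torch
import torch.nn.functional as F

# --- Pruning: zero out low-magnitude weights in an already-trained layer. ---
layer = torch.nn.Linear(64, 64)  # stand-in for a trained layer
with torch.no_grad():
    threshold = layer.weight.abs().quantile(0.9)      # keep top 10% by magnitude
    mask = layer.weight.abs() >= threshold
    layer.weight.mul_(mask.to(layer.weight.dtype))    # same architecture, now sparse

# --- Distilling: train a smaller model to match the teacher's probabilities. ---
teacher = torch.nn.Linear(32, 10)  # stand-in for a large pretrained model
student = torch.nn.Linear(32, 10)  # in practice smaller / different architecture
opt = torch.optim.SGD(student.parameters(), lr=0.1)

x = torch.randn(256, 32)           # unlabeled inputs; no gold labels needed
with torch.no_grad():
    target = F.softmax(teacher(x), dim=-1)            # teacher's soft probabilities

for _ in range(100):
    # KL divergence between student and teacher output distributions
    loss = F.kl_div(F.log_softmax(student(x), dim=-1), target,
                    reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note how pruning never changes the network's shape (it only zeroes entries), while distillation is free to use any student architecture, since only the output probabilities are matched.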
meatmanek · 2 days ago
I'm surprised those things used neural networks. With a matrix of answer probabilities (trivially calculated from people's answers), you can choose the question that maximizes your expected information gain. | ||||||||
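For reference, a minimal sketch of that greedy strategy, assuming a matrix of yes/no answer probabilities; the names `prior`, `p_yes`, and the toy numbers are illustrative, not any real 20Q implementation:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def best_question(prior, p_yes):
    """Pick the yes/no question with the highest expected information gain.

    prior: shape (n_objects,), current belief over which object it is.
    p_yes: shape (n_questions, n_objects), probability each object gets a
           'yes' answer to each question (estimated from past games).
    """
    h_prior = entropy(prior)
    gains = []
    for q in range(p_yes.shape[0]):
        py = float(np.dot(p_yes[q], prior))                  # P(answer = yes)
        post_yes = p_yes[q] * prior / py if py > 0 else prior        # Bayes
        post_no = (1 - p_yes[q]) * prior / (1 - py) if py < 1 else prior
        expected_h = py * entropy(post_yes) + (1 - py) * entropy(post_no)
        gains.append(h_prior - expected_h)
    return int(np.argmax(gains))

# Toy example: 3 equally likely objects, 2 candidate questions.
prior = np.array([1/3, 1/3, 1/3])
p_yes = np.array([
    [0.9, 0.9, 0.1],   # splits object 3 from objects 1 and 2
    [0.5, 0.5, 0.5],   # uninformative: posterior equals prior
])
print(best_question(prior, p_yes))  # -> 0
```

Each turn you ask the winning question, update `prior` with Bayes' rule given the actual answer, and repeat until one object dominates the belief.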