wizardforhire 2 days ago

Obligatory [1]

My apologies for not being able to find the original tale. I’m sure the original website is around but this is a decent synopsis regardless.

It doesn’t look like they cover it in the article, but if I remember correctly they pruned the model down to fit on a 56K EPROM so it could originally be sold for $10 (also dating myself; this article claims $15).

And of course the jargon has changed with time. I guess we’re saying distilled now; originally we said pruned, because that’s what you did: once you had your weights, you would prune the rest of the network to get the core model. I guess distilled works too, just less literal imho. I guess if we want to get really pedantic, networks exist in liquids, but I digress.

[1] (apologies for the ad crap, best I could find) https://www.mentalfloss.com/article/22269/how-electronic-20-...

DoctorOetker 2 days ago | parent | next [-]

Pruning and distilling are two totally different things.

pruning: discarding low-weight connections after training. This makes the network sparser but also less regular (complications for memory layout, and for compute kernels that have to access the sparse network weights).
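A minimal sketch of magnitude pruning, purely my own illustration (the keep_fraction and matrix size are arbitrary placeholders):

    import numpy as np

    def prune_weights(w, keep_fraction=0.1):
        # keep only the largest-magnitude entries; zero out the rest
        threshold = np.quantile(np.abs(w), 1.0 - keep_fraction)
        mask = np.abs(w) >= threshold
        return w * mask

    w = np.random.randn(256, 256)
    w_pruned = prune_weights(w)
    print((w_pruned != 0).mean())  # ~0.1, and the nonzeros land in an irregular pattern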

distilling: take a large pretrained model and train a smaller one from it. For example, consider a cloze task (fill in the blanked token in a sentence): compute the probabilities using the large model, then train the smaller model to reproduce the same probabilities.
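Roughly what one training step looks like, as a sketch with soft targets; the teacher, student, optimizer, batch and temperature here are hypothetical placeholders, not anything from 20Q:

    import torch
    import torch.nn.functional as F

    def distill_step(teacher, student, optimizer, batch, temperature=2.0):
        # teacher: large pretrained model (frozen); student: smaller model being trained
        with torch.no_grad():
            teacher_logits = teacher(batch)
        student_logits = student(batch)
        # train the student to reproduce the teacher's probability distribution
        loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()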

Distilling is a form of fitting into a smaller, regular network of potentially totally different architecture, while pruning is a form of discarding low-weight coefficients, resulting in a sparser network.

wizardforhire a day ago | parent [-]

Thanks for taking the time to clarify for me.

meatmanek 2 days ago | parent | prev [-]

I'm surprised those things used neural networks. With a matrix of answer probabilities (trivially calculated from people's answers), you can choose the question that maximizes your expected information gain.
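Roughly the idea, as a sketch (the prior and answer_probs arrays are made-up placeholders): pick the question whose expected reduction in entropy over the remaining objects is largest.

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def best_question(prior, answer_probs):
        # answer_probs[q, o, a] = P(answer a | object o, question q)
        n_q, _, n_a = answer_probs.shape
        h_prior = entropy(prior)
        gains = np.zeros(n_q)
        for q in range(n_q):
            for a in range(n_a):
                p_a = np.sum(prior * answer_probs[q, :, a])       # P(answer a | ask q)
                if p_a == 0:
                    continue
                posterior = prior * answer_probs[q, :, a] / p_a   # Bayes update
                gains[q] += p_a * (h_prior - entropy(posterior))  # expected info gain
        return int(np.argmax(gains))

    # toy example: 3 candidate objects, 2 yes/no questions
    prior = np.array([0.5, 0.3, 0.2])
    answer_probs = np.random.dirichlet(np.ones(2), size=(2, 3))   # shape (2, 3, 2)
    print(best_question(prior, answer_probs))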

wizardforhire 2 days ago | parent [-]

As I remember it, it was the breakout moment for NNs that made them mainstream to the masses. Prior to that they were an academic/hacker oddity, relegated to works of fiction and just one of the many competing theories toward functioning AI. After 20Q you could buy a handheld NN at Walmart. The delay to LLMs was such that 20Q made it apparent to the scene that the limiting factor for more practical AI development was purely a scaling problem of complexity, limited by compute power. A lot of conversations on /. and the like centered around when the threshold would be crossed. Most at the time could not have predicted, nor accepted, that Moore’s law would fail, putting development back a decade.

To the credit of the naysayers, at the time Hotmail was still the primary free email service and Gmail had yet to come out. Google was buying up the dark fiber and had yet to open up their excess compute, starting the arms race for the cloud. Most still thought of GPUs only for graphics, even though the architecture and intent had been there since their inception at Thinking Machines…