abeppu 8 hours ago

In their little algorithm box on Chain Distillation, they have at step 2b some expression that involves multiplying and dividing by `T`, and then they say "where α = 0.5, T = 1.0".

I think someone during the copy-editing process told them this needed to look more complicated?

arjie 5 hours ago | parent | next

tl;dr it makes sense once you see there are hidden softmaxes in there; it's just the explicit formula written out and then applied with the common param value

Bloody hell, I am so unfamiliar with ML notation:

    L = (1 - α) · CE(M_k(x), y) + α · T² · KL(M_k(x)/T ‖ M_{k-1}(x)/T)
So CE is cross-entropy and KL is Kullback-Leibler divergence, but written like that the division by T looks silly, since it would just fall out of the KL formula. Given the subject, this is probably the conversion from logits to probabilities as in Hinton's paper https://arxiv.org/pdf/1503.02531

But that means there's a hidden softmax there not specified. Very terse, if so. And then the multiplication makes sense because he says:

> Since the magnitudes of the gradients produced by the soft targets scale as 1/T² it is important to multiply them by T² when using both hard and soft targets.

I guess someone familiar with the field would obviously insert the softmax there, with the division by T moving inside it, but boy is it confusing if you're not familiar (and I am not familiar). It's particularly confusing because they're being so explicit about writing out the full loss formula just to set T to 1 in the end. That's all consistent, though. Writing out the formula for probabilities q_i from logits M_k(x)_i:

    q_i = exp(M_k(x)_i / T) / sum_j exp(M_k(x)_j / T)
Hinton says

> where T is a temperature that is normally set to 1. Using a higher value for T produces a softer probability distribution over classes.
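The softening effect is easy to see numerically. Here's a minimal sketch of that q_i formula in plain Python (the logit values are made up for illustration):

```python
import math

def softmax(logits, T=1.0):
    # q_i = exp(z_i / T) / sum_j exp(z_j / T); higher T flattens the distribution.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.2]
print(softmax(logits, T=1.0))  # sharp: most of the mass on the first class
print(softmax(logits, T=4.0))  # softer: mass spread much more evenly
```

At T=1 the first class gets ~84% of the mass; at T=4 it drops to under 50%, exactly the "softer probability distribution" Hinton describes.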

So the real formula is

    L = (1 - α) · CE(softmax(M_k(x)), y) + α · T² · KL(softmax(M_k(x)/T) ‖ softmax(M_{k-1}(x)/T))
And then they're using the usual form of setting T to 1. The reason they specify the full thing is just that it's the standard loss function, and people in this field must routinely assume a softmax wherever logits need to become probabilities. It must be such a common operation that writing it out would just hurt readability. I'd guess one of them reading this would go "yeah, obviously you softmax, you can't KL a vector of logits".
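To make it concrete, here's my reading of that expanded loss as plain Python (the names `current_logits`/`prev_logits` for M_k(x) and M_{k-1}(x) are mine, and this is just a sketch of the formula as I understand it, not their code):

```python
import math

def softmax(logits, T=1.0):
    # Convert logits to probabilities, with temperature T softening.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, y):
    # y is the index of the true class.
    return -math.log(probs[y])

def kl(p, q):
    # KL(p ‖ q) for two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def chain_distill_loss(current_logits, prev_logits, y, alpha=0.5, T=1.0):
    # Hard-label term: CE of the current model's ordinary (T=1) softmax vs the label.
    hard = cross_entropy(softmax(current_logits), y)
    # Soft-label term: KL between the temperature-softened distributions,
    # scaled by T^2 to compensate for the 1/T^2 gradient magnitude.
    soft = kl(softmax(current_logits, T), softmax(prev_logits, T))
    return (1 - alpha) * hard + alpha * T**2 * soft
```

With α = 0.5 and T = 1.0 as in the paper, this collapses to a plain average of a cross-entropy term and a KL term, which is why the T machinery looks superfluous there.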

Good question. I just sort of skipped over that when reading but what you said made me think about it.

sdpmas 8 hours ago | parent | prev

the T stands for tea :)

naruhodo 8 hours ago | parent

Ah, so it's a source of randomness! Presumably 1.0 corresponds to a really hot cup of fresh tea.