dkislyuk 4 days ago

Presenting information theory as a series of independent equations like this does a disservice to the learning process. Cross-entropy and KL-divergence are directly derived from information entropy: InformationEntropy(P) is the baseline number of bits needed to encode events from the true distribution P, CrossEntropy(P, Q) is the (average) number of bits needed to encode events from P using a code optimized for a suboptimal distribution Q, and KL-divergence (better referred to as relative entropy) is the difference between these two values, i.e. how many more bits are needed to encode P with Q, quantifying the inefficiency:

relative_entropy(p, q) = cross_entropy(p, q) - entropy(p)

Information theory is some of the most accessible and approachable math for ML practitioners, and it shows up everywhere. In my experience, it's worthwhile to dig into the foundations as opposed to just memorizing the formulas.

(bits here assume base-2 logarithms)
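
A minimal Python sketch of the relationship, just to make it concrete. The distributions p and q and the function names are my own illustration, not from the article:

    import math

    def entropy(p):
        # Average bits to encode events from p with a code optimal for p.
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def cross_entropy(p, q):
        # Average bits to encode events from p with a code optimal for q.
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    def relative_entropy(p, q):
        # KL divergence: the extra bits paid for using q's code instead of p's.
        return cross_entropy(p, q) - entropy(p)

    p = [0.5, 0.25, 0.25]  # "true" distribution (made up for illustration)
    q = [0.4, 0.4, 0.2]    # mismatched model distribution

    print(entropy(p))              # ~1.50 bits
    print(cross_entropy(p, q))     # ~1.57 bits
    print(relative_entropy(p, q))  # ~0.07 bits of overhead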

morleytj 4 days ago

I 100% agree.

I think Shannon's "A Mathematical Theory of Communication" is so incredibly well written and accessible that anyone interested in information theory should just start with the real foundational work rather than with lists of equations; it really is worth the time to dig into it.

golddust-gecko 4 days ago

Agree 100% with this. It gives the illusion of understanding, like when a precocious 6-year-old learns the word "precocious" and feels smart because they can say it. Or any movie where the tech or science is reduced to <technical speak>.

bbminner 3 days ago

While I share the sentiment, my modest experience teaching (and studying the same area for over a decade) suggests that giving students a simple formula to play with "as is" helps motivate its future use. It is difficult to teach everything important about X in one go; knowledge is accumulated in layers.