# Information Theory - Entropy

| Name | Equation | Interpretation |
|------|----------|----------------|
| Entropy | $H(P) = \mathbb{E}_{x\sim P} [- \ln P(x)]$ | Average of minus log probability. |
| Cross-Entropy | $H(P, Q) = \mathbb{E}_{x \sim P} [- \ln Q(x)]$ | The average number of nats (bits, if using $\log_2$) needed to represent $x$ using the imperfect code defined by $Q$ (data-compression view). |
| KL Divergence | $KL(P, Q) = H(P, Q) - H(P) = \mathbb{E}_{x\sim P} \left[-\ln \frac{Q(x)}{P(x)}\right]$ | How many more nats (bits) are required to represent $x$ with the code $Q$ compared with the optimal code ($P$). |
| Conditional Entropy | $H(P(x \mid y)) = \mathbb{E}_{x, y \sim P(x,y)}\left[-\ln \frac{P(x, y)}{P(y)}\right] = H(P(x, y)) - H(P(y))$ | The remaining uncertainty about $x$ once $y$ is known. |
| Mutual Information | $I(x, y) = KL(P(x, y), P(x)P(y)) = \mathbb{E}_{x, y \sim P(x, y)}\left[-\ln\frac{P(x)P(y)}{P(x, y)}\right] = H(P(y)) - H(P(y \mid x)) = H(P(x)) - H(P(x \mid y))$ | How much knowing $x$ helps in knowing $y$. |
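As a quick sanity check, all of these quantities and the identities linking them can be computed for a small discrete distribution. The sketch below uses NumPy and a hypothetical 2×2 joint table (both are assumptions, not part of the notes); any normalized table would do.

```python
import numpy as np

# Hypothetical joint distribution P(x, y) over two binary variables.
P_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
P_x = P_xy.sum(axis=1)  # marginal P(x)
P_y = P_xy.sum(axis=0)  # marginal P(y)

def entropy(p):
    """H(P) = E_{x~P}[-ln P(x)], in nats."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(P, Q) = E_{x~P}[-ln Q(x)]."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl(p, q):
    """KL(P, Q) = H(P, Q) - H(P)."""
    return cross_entropy(p, q) - entropy(p)

# Conditional entropy via the chain rule: H(x|y) = H(x, y) - H(y).
H_x_given_y = entropy(P_xy.ravel()) - entropy(P_y)

# Mutual information as a KL divergence: I(x, y) = KL(P(x, y), P(x)P(y)).
I_xy = kl(P_xy.ravel(), np.outer(P_x, P_y).ravel())

# Identity check: I(x, y) = H(P(x)) - H(P(x|y)).
assert np.isclose(I_xy, entropy(P_x) - H_x_given_y)
```

With this table the two routes to $I(x, y)$ agree to floating-point precision, illustrating that mutual information is just the KL divergence between the joint and the product of marginals.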

KL Divergence is always non-negative, and since mutual information is itself a KL divergence, it is non-negative as well:

\begin{align*} H(P, Q) \geq H(P) &\Leftrightarrow KL(P, Q) \geq 0 \\ &\Rightarrow I(x, y) \geq 0 \end{align*}
Proof: Non-negativity of KL divergence
\begin{align*} KL(P, Q) &= {\color{blue}\mathbb{E}_{x \sim P}} [{\color{brown}- \ln} \frac{Q(x)}{P(x)}] \\ &\geq {\color{brown} -\ln} {\color{blue}\mathbb{E}_{x \sim P}}[\frac{Q(x)}{P(x)}] ~~~~~~(\because \text{Jensen's inequality; } -\ln \text{ is convex})\\ &= -\ln \sum_x P(x) \frac{Q(x)}{P(x)} \\ &= -\ln \sum_x Q(x) = -\ln 1 = 0 \end{align*}
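The inequality can also be checked empirically: for any pair of distributions with full support, $KL(P, Q) \geq 0$, with equality when $Q = P$. The sketch below samples random distributions from a Dirichlet prior; NumPy and the sampling scheme are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    # KL(P, Q) = E_{x~P}[ln P(x)/Q(x)], in nats; assumes full support.
    return np.sum(p * np.log(p / q))

# Random strictly positive distributions over 10 outcomes.
for _ in range(1000):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    assert kl(p, q) >= 0          # non-negativity, as proved above

assert np.isclose(kl(p, p), 0.0)  # equality when Q = P
```

Note that Jensen's inequality is strict unless $Q(x)/P(x)$ is constant, which is why $KL(P, Q) = 0$ only when $P = Q$.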