Information Theory - Entropy
| Name | Equation | Interpretation |
| --- | --- | --- |
| Entropy | $H(P) = \mathbb{E}_{x \sim P}[-\ln P(x)]$ | Average negative log-probability. |
| Cross-Entropy | $H(P, Q) = \mathbb{E}_{x \sim P}[-\ln Q(x)]$ | The average number of nats (bits, if using $\log_2$) needed to represent $x$ using the imperfect code defined by $Q$ (data-compression view). |
| KL Divergence | $KL(P, Q) = H(P, Q) - H(P) = \mathbb{E}_{x \sim P}\left[-\ln \frac{Q(x)}{P(x)}\right]$ | How many more nats (bits) are required to represent $x$ with the code defined by $Q$ compared with the optimal code defined by $P$. |
| *Conditional Entropy | $H(P(x \mid y)) = \mathbb{E}_{x, y \sim P(x, y)}\left[-\ln \frac{P(x, y)}{P(y)}\right] = H(P(x, y)) - H(P(y))$ | The uncertainty about $x$ that remains after observing $y$. |
| *Mutual Information | $I(x, y) = KL(P(x, y), P(x)P(y)) = \mathbb{E}_{x, y \sim P(x, y)}\left[-\ln \frac{P(x)P(y)}{P(x, y)}\right] = H(P(y)) - H(P(y \mid x)) = H(P(x)) - H(P(x \mid y))$ | How much knowing $x$ helps in knowing $y$ (and vice versa). |
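
A quick numerical sanity check of the first three rows (a minimal NumPy sketch; the distributions `p` and `q` are made-up illustrative values, not from these notes): it computes entropy, cross-entropy, and KL divergence in nats and confirms $KL(P, Q) = H(P, Q) - H(P)$.

```python
import numpy as np

def entropy(p):
    """H(P) = E_{x~P}[-ln P(x)], in nats."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(P, Q) = E_{x~P}[-ln Q(x)], in nats."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """KL(P, Q) = E_{x~P}[-ln Q(x)/P(x)]."""
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])   # illustrative data distribution P
q = np.array([0.4, 0.4, 0.2])   # illustrative (imperfect) code distribution Q

print(entropy(p))                         # ~1.0297 nats
print(cross_entropy(p, q))                # ~1.0549 nats (>= H(P))
print(kl_divergence(p, q))                # ~0.0253 nats
print(cross_entropy(p, q) - entropy(p))   # same value: KL(P, Q) = H(P, Q) - H(P)
```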
|
|
|
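Conditional entropy and mutual information can be checked the same way from a joint table. Below is a minimal sketch, assuming a made-up $2 \times 2$ joint distribution `p_xy`; it verifies that $I(x, y) = KL(P(x, y), P(x)P(y)) = H(P(x)) - H(P(x \mid y))$.

```python
import numpy as np

# Made-up joint distribution P(x, y) over a 2x2 alphabet.
p_xy = np.array([[0.4, 0.1],
                 [0.2, 0.3]])
p_x = p_xy.sum(axis=1)   # marginal P(x)
p_y = p_xy.sum(axis=0)   # marginal P(y)

def H(p):
    """Entropy of a discrete distribution, in nats."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_joint = H(p_xy.ravel())        # joint entropy H(P(x, y))
H_x_given_y = H_joint - H(p_y)   # conditional entropy H(P(x|y)) = H(P(x, y)) - H(P(y))

# Mutual information, computed two ways.
mi_as_kl = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))  # KL(P(x, y), P(x)P(y))
mi_as_entropy_gap = H(p_x) - H_x_given_y                     # H(P(x)) - H(P(x|y))

print(H_x_given_y)                    # ~0.6069 nats of uncertainty about x left after seeing y
print(mi_as_kl, mi_as_entropy_gap)    # both ~0.0863 nats
```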
KL Divergence is always non-negative, and hence mutual information is too, since $I(x, y) = KL(P(x, y), P(x)P(y))$:
\begin{align*}
KL(P, Q) \geq 0 &\Leftrightarrow H(P, Q) \geq H(P) \\
&\Rightarrow I(x, y) \geq 0
\end{align*}
Proof: Non-negativity of KL divergence (using that $-\ln$ is convex)
\begin{align*}
KL(P, Q) &= {\color{blue}\mathbb{E}_{x \sim P}} \left[{\color{brown}- \ln} \frac{Q(x)}{P(x)}\right] \\
&\geq {\color{brown} -\ln} {\color{blue}\mathbb{E}_{x \sim P}}\left[\frac{Q(x)}{P(x)}\right] ~~~~~~(\because \text{Jensen's inequality})\\
&= -\ln \sum_x P(x) \frac{Q(x)}{P(x)} \\
&= -\ln \sum_x Q(x) = -\ln 1 = 0
\end{align*}
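
As an illustrative check (random distributions, not a proof), the sketch below draws many random pairs $(P, Q)$ and confirms that $KL(P, Q) \geq 0$, equivalently $H(P, Q) \geq H(P)$, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(k):
    """Draw a random distribution over k outcomes (positive weights, normalized)."""
    w = rng.random(k) + 1e-12
    return w / w.sum()

for _ in range(10_000):
    p, q = random_dist(5), random_dist(5)
    kl = np.sum(p * np.log(p / q))
    # KL(P, Q) >= 0 up to floating-point error ...
    assert kl >= -1e-12
    # ... equivalently, cross-entropy dominates entropy: H(P, Q) >= H(P).
    assert -np.sum(p * np.log(q)) + 1e-12 >= -np.sum(p * np.log(p))
```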