Information Theory - Entropy

| Name | Equation | Interpretation |
| --- | --- | --- |
| Entropy | $H(P) = \mathbb{E}_{x\sim P} [- \ln P(x)]$ | Average of the negative log probability. |
| Cross-Entropy | $H(P, Q) = \mathbb{E}_{x \sim P} [- \ln Q(x)]$ | The average number of nats (cf. bits) needed to represent $x$ using the imperfect code defined by $Q$ (data-compression view). |
| KL Divergence | $KL(P, Q) = H(P, Q) - H(P) = \mathbb{E}_{x\sim P} \left[-\ln \frac{Q(x)}{P(x)}\right]$ | How many more nats (cf. bits) are required to represent $x$ with the code $Q$ compared with the optimal code ($P$). $P$ is called the target distribution. |
| *Conditional Entropy | $H(P(x \mid y)) = \mathbb{E}_{x, y \sim P(x,y)}\left[-\ln \frac{P(x, y)}{P(y)}\right] = H(P(x, y)) - H(P(y))$ | Remaining uncertainty about $x$ after observing $y$. |
| *Mutual Information | $I(x, y) = KL(P(x, y), P(x)P(y)) = \mathbb{E}_{x, y \sim P(x, y)}\left[-\ln\frac{P(x)P(y)}{P(x, y)}\right] = H(P(y)) - H(P(y \mid x)) = H(P(x)) - H(P(x \mid y))$ | How much knowing $x$ helps to know $y$. |

KL Divergence is always non-negative (hence MI is as well):

$$\begin{aligned} H(P, Q) \geq H(P) &\Leftrightarrow KL(P, Q) \geq 0 \\ &\Rightarrow I(x, y) \geq 0 \end{aligned}$$
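As a quick numerical sanity check (a sketch added here, not part of the original notes; the helper names `entropy`, `cross_entropy`, and `kl` are just illustrative), we can compute these quantities for two small discrete distributions and confirm $H(P, Q) \geq H(P)$:

```python
import numpy as np

# Two discrete distributions over the same three outcomes.
P = np.array([0.5, 0.3, 0.2])  # target distribution
Q = np.array([0.4, 0.4, 0.2])  # model distribution

def entropy(p):
    """H(P) = E_{x~P}[-ln P(x)], in nats."""
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """H(P, Q) = E_{x~P}[-ln Q(x)], in nats."""
    return float(-np.sum(p * np.log(q)))

def kl(p, q):
    """KL(P, Q) = H(P, Q) - H(P)."""
    return cross_entropy(p, q) - entropy(p)

print(entropy(P), cross_entropy(P, Q), kl(P, Q))
assert cross_entropy(P, Q) >= entropy(P)  # H(P,Q) >= H(P), i.e. KL(P,Q) >= 0
```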

Example: Binary classification

Let's say we toss a biased coin. Each toss comes up either Head or Tail. We want to obtain a model that predicts whether the coin gives Head or Tail.

Using a random variable $x$, we can write each event as $x = H$ or $x = T$. Let's define our model:

$$Q(x) \coloneqq \begin{cases} p & (x = H) \\ 1 - p & (x = T) \end{cases},$$

where $0 \leq p \leq 1$ is a parameter that the model learns.
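For concreteness, here is a minimal Python sketch of this model (the function name `Q` and the string labels `"H"`/`"T"` are illustrative choices, not from the original notes):

```python
def Q(x, p):
    """Model probability of a coin-toss outcome x in {"H", "T"},
    parameterized by p = Q(x = H)."""
    return p if x == "H" else 1.0 - p

# Example: a model that believes the coin lands Head 70% of the time.
print(Q("H", 0.7), Q("T", 0.7))  # 0.7 0.3
```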

👉 Entropy
This is not a standard path to take, but let's compute the entropy of $Q$.

$$\begin{aligned} H(Q) &= \mathbb{E}_{x \sim Q} [-\ln Q(x)] \\ &= \sum_{x \in \{H, T\}} - Q(x) \ln Q(x) \\ &= -Q(x = H) \ln Q(x = H) - Q(x = T) \ln Q(x = T) \\ &= -p \ln p - (1 - p) \ln (1 - p) \end{aligned}$$

Now, what are the max and min of this entropy? To find out, let's think about the shape of $-p \ln p - (1 - p) \ln (1 - p)$.

Shape of the entropy w.r.t. $p$
What does $-p \ln p - (1-p) \ln (1-p)$ look like? We can make a few observations about the form of the equation and take a guess.

- There is a symmetry (both terms have the form $-\blacktriangle \ln \blacktriangle$)
- $-p \ln p$ is a concave function (imagine its shape)
- Adding two concave functions gives a new concave function
- Given that the symmetry is around $p = 1/2$ ($p$ vs. $1 - p$ on $0 \leq p \leq 1$), the maximum should occur at $p = 1/2$ (the derivative check below confirms this)
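As a sanity check (a derivation added here, not part of the original notes), setting the derivative to zero gives

$$\frac{d}{dp}\Big(-p \ln p - (1-p)\ln(1-p)\Big) = -\ln p - 1 + \ln(1-p) + 1 = \ln\frac{1-p}{p} = 0 \;\Rightarrow\; p = \frac{1}{2},$$

and since the function is concave, this stationary point is indeed the maximum.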

As a result:

- Min: $H(Q) = 0$ when $p = 0$ or $p = 1$.
- Max: $H(Q) = \ln 2$ when $p = 1/2$.

What does this mean? (How to interpret this?)
Can we see it as perplexity? $\exp(\ln 2) = 2$ → two (equally likely) possible outcomes...
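A quick numerical check of these values and the perplexity interpretation (a small sketch I added, not part of the original notes; `binary_entropy` is just an illustrative name):

```python
import numpy as np

def binary_entropy(p):
    """H(Q) = -p ln p - (1-p) ln(1-p), in nats, with the convention 0 ln 0 = 0."""
    terms = [q * np.log(q) for q in (p, 1.0 - p) if q > 0.0]
    return -float(sum(terms))

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    h = binary_entropy(p)
    # Perplexity exp(H) ranges from 1 (certain outcome) to 2 (fair coin).
    print(f"p={p:.1f}  H(Q)={h:.4f}  perplexity={np.exp(h):.4f}")

assert np.isclose(binary_entropy(0.5), np.log(2))  # max is ln 2 at p = 1/2
assert binary_entropy(0.0) == binary_entropy(1.0) == 0.0  # min is 0 at p = 0 or 1
```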

👉 Cross-entropy: The first thing to try
Now, let's try to define the cross-entropy for the target distribution $P(x)$. Cross-entropy is:

$$\begin{aligned} H(P, Q) &\coloneqq \mathbb{E}_{x \sim P} \big[-\ln Q(x) \big] \\ &= \sum_{x \in \{H, T\}} - P(x) \ln Q(x) \\ &= -P(x = H) \ln Q(x = H) - P(x = T) \ln Q(x = T) \\ &= -P(x = H) \ln p - P(x = T) \ln (1 - p) \end{aligned}$$

Now, what should the target distribution $P(x)$ look like? Whoa, this is such a non-standard question, but let's do it anyway.

Let's say this is the target distribution that is unknown to the model:

$$P(x) \coloneqq \begin{cases} p^\textrm{tgt} & (x = H) \\ 1 - p^\textrm{tgt} & (x = T) \end{cases}.$$

Then, the cross entropy would be

$$\begin{aligned} H(P, Q) &= -P(x = H) \ln p - P(x = T) \ln (1 - p) \\ &= -{\color{magenta}p^\textrm{tgt}} \ln p - (1 - {\color{magenta}p^\textrm{tgt}}) \ln (1 - p) \end{aligned}$$

I guess we can try to minimize this w.r.t. $p$. That should give us $p = p^\textrm{tgt}$ in the end... But again, this is quite a non-standard way to go.
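Indeed (a derivation added here for completeness, not in the original notes), setting the derivative of this cross-entropy w.r.t. $p$ to zero gives

$$\frac{\partial H(P, Q)}{\partial p} = -\frac{p^\textrm{tgt}}{p} + \frac{1 - p^\textrm{tgt}}{1 - p} = 0 \;\Rightarrow\; p^\textrm{tgt}(1 - p) = (1 - p^\textrm{tgt})\, p \;\Rightarrow\; p = p^\textrm{tgt},$$

and since the cross-entropy is convex in $p$ (a non-negatively weighted sum of $-\ln$ terms), this stationary point is the minimum.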

👉 Cross-entropy: a rather standard thing to do
Okay, finally something that is a very standard question to ask. Given a dataset of actually observed events $\mathcal{D} = \{(x'=T),~(x'=H),~(x'=H),~\cdots\}$, what does the empirical cross-entropy look like? In this case, let's define an empirical target distribution for each sample (why does that make sense?).

- When the sample is $x' = H$: $P(x) \coloneqq 1 \text{ if } x = H, \text{ else } 0$.
- When the sample is $x' = T$: $P(x) \coloneqq 1 \text{ if } x = T, \text{ else } 0$.
- Note that this is a mapping from a single sample $x'$ to a distribution $P(x)$.
  - But why can we change the (target) distribution $P(x)$ for each sample? How is that reasonable...?

With these,

$$\begin{aligned} H(P, Q) &\coloneqq \mathbb{E}_{x \sim P} \big[-\ln Q(x) \big] \\ &= \sum_{x' \in \mathcal{D}} - P(x') \ln Q(x') \\ &= \sum_{x' \in \mathcal{D}} - P(x=x') \ln Q(x=x') \\ &= \sum_{x' \in \mathcal{D}} \begin{cases} - P(x=H) \ln Q(x=H) & \text{if } x' = H \\ - P(x=T) \ln Q(x=T) & \text{if } x' = T \end{cases} \\ &= \sum_{x' \in \mathcal{D}} \begin{cases} - \ln p & \text{if } x' = H \\ - \ln (1-p) & \text{if } x' = T \end{cases}. \end{aligned}$$

We can end this here. But a weird trick that people always do is to simplify this using a binary label $y$. For each sample $x'$, let's define $y'$ as

$$y' \coloneqq \begin{cases} 1 & \text{if } x' = H \\ 0 & \text{if } x' = T \end{cases}.$$

Then, we can write the above as

$$H(P, Q) = \sum_{x' \in \mathcal{D}} \big[ - y' \ln p - (1-y') \ln (1 - p) \big].$$

Notice that only one of these two terms will be non-zero for each $x'$. Personally, I hate this form, as it makes something simple look unnecessarily complicated, but it is a standard formulation that people commonly use.

Notice that we are summing over the data samples, not over all possible values of $x$.
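To tie this together, here is a minimal Python sketch (my own addition, with illustrative names such as `empirical_cross_entropy`) that evaluates the empirical cross-entropy both in the case-by-case form and in the $y'$-label form, and shows that minimizing it over $p$ recovers the fraction of Heads in the data:

```python
import numpy as np

data = ["T", "H", "H", "H", "T"]                          # observed tosses, D
labels = np.array([1 if x == "H" else 0 for x in data])   # y' = 1 for H, 0 for T

def empirical_cross_entropy(p, labels):
    """Sum over samples of -y' ln p - (1 - y') ln(1 - p)."""
    return float(np.sum(-labels * np.log(p) - (1 - labels) * np.log(1 - p)))

def case_form(p, data):
    """Same sum, written case by case: -ln p per Head, -ln(1-p) per Tail."""
    return float(sum(-np.log(p) if x == "H" else -np.log(1 - p) for x in data))

# Both forms agree.
print(np.isclose(empirical_cross_entropy(0.6, labels), case_form(0.6, data)))  # True

# Minimizing over a grid of p recovers (approximately) the fraction of Heads.
p_grid = np.linspace(0.01, 0.99, 99)
losses = [empirical_cross_entropy(p, labels) for p in p_grid]
best_p = p_grid[int(np.argmin(losses))]
print(best_p, labels.mean())  # ~0.6 vs. 0.6
```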
