Information Theory - Entropy

Oct 31, 2022

| Name | Equation | Interpretation |
| --- | --- | --- |
| Entropy | $H(P) = \mathbb{E}_{x\sim P} [- \ln P(x)]$ | Average of negative log probability. |
| Cross-Entropy | $H(P, Q) = \mathbb{E}_{x \sim P} [- \ln Q(x)]$ | The average number of nats (cf. bits) needed to represent $x$ using the imperfect code defined by $Q$ (data compression view). |
| KL Divergence | $KL(P, Q) = H(P, Q) - H(P) = \mathbb{E}_{x\sim P} \left[-\ln \frac{Q(x)}{P(x)}\right]$ | How many more nats (cf. bits) are required to represent $x$ with code $Q$ compared with the optimal code ($P$). $P$ is called the target distribution. |
| *Conditional Entropy | $H(P(x \mid y)) = \mathbb{E}_{x, y \sim P(x,y)}\left[-\ln \frac{P(x, y)}{P(y)}\right] = H(P(x, y)) - H(P(y))$ |  |
| *Mutual Information | $I(x, y) = KL(P(x, y), P(x)P(y)) = \mathbb{E}_{x, y \sim P(x, y)}\left[-\ln\frac{P(x)P(y)}{P(x, y)}\right] = H(P(y)) - H(P(y \mid x)) = H(P(x)) - H(P(x \mid y))$ | How much knowing $x$ helps to know $y$. |
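These definitions are easy to sanity-check numerically. Below is a minimal sketch (my addition, not part of the original table) for discrete distributions in NumPy; the helper names `entropy`, `cross_entropy`, and `kl` are just illustrative:

```python
import numpy as np

def entropy(p):
    """H(P) = E_{x~P}[-ln P(x)], with the convention 0 * ln 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def cross_entropy(p, q):
    """H(P, Q) = E_{x~P}[-ln Q(x)]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(q[nz])))

def kl(p, q):
    """KL(P, Q) = H(P, Q) - H(P); always >= 0, and 0 iff P == Q."""
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.7, 0.3])
q = np.array([0.5, 0.5])
print(entropy(p), cross_entropy(p, q), kl(p, q))  # KL(P, Q) >= 0
```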
KL Divergence is always non-negative (hence MI also is):

$$
\begin{align*}
H(P, Q) \geq H(P) &\Leftrightarrow KL(P, Q) \geq 0 \\
&\Leftrightarrow I(P(x, y)) \geq 0
\end{align*}
$$
► Proof: Non-negativity of KL divergence
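For reference, a minimal sketch of the standard argument via Jensen's inequality (using that $\ln$ is concave):

$$
-KL(P, Q) = \mathbb{E}_{x \sim P}\!\left[\ln \tfrac{Q(x)}{P(x)}\right]
\leq \ln \mathbb{E}_{x \sim P}\!\left[\tfrac{Q(x)}{P(x)}\right]
= \ln \sum_x P(x)\,\tfrac{Q(x)}{P(x)} = \ln \sum_x Q(x) \leq \ln 1 = 0,
$$

so $KL(P, Q) \geq 0$, with equality iff $P = Q$.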
Example: Binary classification
Let's say we toss a (biased) coin. The coin can land on either Head or Tail.
We want to obtain a model that predicts whether the coin gives Head or Tail.
Using a random variable $x$, we can write each event as $x = H$ or $x = T$.
Let's define our model:
$$
Q(x) \coloneqq
\begin{cases}
p & (x = H) \\
1 - p & (x = T),
\end{cases}
$$

where $0 \leq p \leq 1$ is a parameter that the model learns.
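In code, this model is just a Bernoulli lookup. A tiny sketch (the string labels `"H"`/`"T"` and the function name `Q` are my own choices):

```python
def Q(x, p):
    """Model probability: Q(H) = p, Q(T) = 1 - p, for 0 <= p <= 1."""
    return p if x == "H" else 1 - p

print(Q("H", 0.7), Q("T", 0.7))  # 0.7 0.3
```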
👉 Entropy
Although this is not a standard path to take, let's compute the entropy of $Q$.
$$
\begin{align}
H(Q) &= \mathbb{E}_{x \sim Q} [-\ln Q(x)] \\
&= \sum_{x \in \{H, T\}} - Q(x) \ln Q(x) \\
&= -Q(x = H) \ln Q(x = H) - Q(x = T) \ln Q(x = T) \\
&= -p \ln p - (1 - p) \ln (1 - p)
\end{align}
$$
Now, what are the max and min of this entropy?
To find that out, let's think about the shape of $-p \ln p - (1 - p) \ln (1 - p)$.
Shape of the entropy w.r.t. $p$
What does $-p \ln p - (1-p) \ln (1-p)$ look like? We can make a few observations about the form of the equation and take a guess.

- There is a symmetry (two terms of the form $-\blacktriangle \ln \blacktriangle$, with $p$ and $1 - p$ swapping roles).
- $-p \ln p$ is a concave function (imagine its shape).
- Adding two concave functions gives a new concave function.
- Given that the symmetry is around $p = 1/2$ ($p$ vs. $1 - p$ on $0 \leq p \leq 1$), the maximum should occur at $p = 1/2$.
► Think about xlogx
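As a quick sanity check (my addition), setting the derivative to zero confirms where the maximum sits:

$$
\frac{d}{dp}\Big[-p \ln p - (1-p)\ln(1-p)\Big] = -\ln p - 1 + \ln(1-p) + 1 = \ln\frac{1-p}{p},
$$

which vanishes exactly at $p = 1/2$.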
As a result:
- Min: $H(Q) = 0$ when $p = 0$ or $p = 1$.
- Max: $H(Q) = \ln 2$ when $p = 1/2$.
What does this mean? (How to interpret this?)
Can we see it as perplexity? $\exp(\ln 2) = 2$, i.e., two (equally likely) possible outcomes...
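A quick numerical check of these values, sketched in NumPy (`binary_entropy` is just an illustrative name):

```python
import numpy as np

def binary_entropy(p):
    """H(Q) = -p ln p - (1 - p) ln(1 - p), with the convention 0 ln 0 = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    q = p[mask]
    out[mask] = -q * np.log(q) - (1 - q) * np.log(1 - q)
    return out

ps = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
H = binary_entropy(ps)
print(H)             # 0 at p = 0 or 1, ln 2 ≈ 0.693 at p = 0.5
print(np.exp(H[2]))  # perplexity: exp(ln 2) = 2.0
```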
👉 Cross-entropy: The first thing to try
Now, let's try to define cross-entropy for the target distribution $P(x)$.
Cross entropy is:
$$
\begin{align}
H(P, Q) &\coloneqq \mathbb{E}_{x \sim P} \big[-\ln Q(x) \big] \\
&= \sum_{x \in \{H, T\}} - P(x) \ln Q(x) \\
&= -P(x = H) \ln Q(x = H) - P(x = T) \ln Q(x = T) \\
&= -P(x = H) \ln p - P(x = T) \ln (1 - p)
\end{align}
$$
Now, what should the target distribution $P(x)$ look like?
Woah, this is such a non-standard question, but let's do it anyway.
Let's say this is the target distribution that is unknown to the model:
$$
P(x) \coloneqq
\begin{cases}
p^\textrm{tgt} & (x = H) \\
1 - p^\textrm{tgt} & (x = T),
\end{cases}
$$
Then, the cross entropy would be
$$
\begin{align}
H(P, Q) &= -P(x = H) \ln p - P(x = T) \ln (1 - p) \\
&= -{\color{magenta}p^\textrm{tgt}} \ln p - (1 - {\color{magenta}p^\textrm{tgt}}) \ln (1 - p)
\end{align}
$$
I guess we can try to minimize this w.r.t. $p$. That should give us $p = p^\textrm{tgt}$ in the end...
But again, this is quite a non-standard way to go.
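Still, a short check of the minimization claim above (my own sketch): setting the derivative with respect to $p$ to zero,

$$
\frac{\partial}{\partial p} H(P, Q) = -\frac{p^\textrm{tgt}}{p} + \frac{1 - p^\textrm{tgt}}{1 - p} = 0
\;\Leftrightarrow\; p^\textrm{tgt}(1 - p) = (1 - p^\textrm{tgt})\,p
\;\Leftrightarrow\; p = p^\textrm{tgt},
$$

and since each term of $H(P, Q)$ is convex in $p$, this stationary point is indeed the minimum.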
👉 Cross-entropy: a rather standard thing to do
Okay, finally something that is a very standard question to ask.
Given a dataset that stores the actually observed events, $\mathcal{D} = \{(x'=T),~(x'=H),~(x'=H),~\cdots\}$,
what does the empirical cross entropy look like?
In this case, let's define an empirical target distribution for each sample (why does that make sense?).

- When the sample is $x' = H$: $P(x) \coloneqq 1 \text{ if } x = H, \text{ else } 0$.
- When the sample is $x' = T$: $P(x) \coloneqq 1 \text{ if } x = T, \text{ else } 0$.
Note that this is a mapping from a single sample $x'$ to a distribution $P(x)$.
But why can we change the (target) distribution $P(x)$ for each sample? How is that reasonable...?
With these,
$$
\begin{align}
H(P, Q) &\coloneqq \mathbb{E}_{x \sim P} \big[-\ln Q(x) \big] \\
&= \sum_{x' \in \mathcal{D}} - P(x') \ln Q(x') \\
&= \sum_{x' \in \mathcal{D}} - P(x=x') \ln Q(x=x') \\
&= \sum_{x' \in \mathcal{D}}
\begin{cases}
- P(x=H) \ln Q(x=H) & \text{if } x' = H \\
- P(x=T) \ln Q(x=T) & \text{if } x' = T
\end{cases} \\
&= \sum_{x' \in \mathcal{D}}
\begin{cases}
- \ln p & \text{if } x' = H \\
- \ln (1-p) & \text{if } x' = T
\end{cases}.
\end{align}
$$
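As a tiny concrete illustration (mine; pretending the dataset contained only the three samples written out above, $\mathcal{D} = \{T, H, H\}$):

$$
H(P, Q) = -\ln(1 - p) - \ln p - \ln p = -2 \ln p - \ln(1 - p).
$$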
We can end this here. But a weird trick that people always do is to simplify this using a binary label $y$.
Let's define $y'$ for each sample $x'$ as
$$
y' \coloneqq
\begin{cases}
1 & \text{if } x' = H \\
0 & \text{if } x' = T
\end{cases}.
$$
Then, we can write the above as
$$
H(P, Q) = \sum_{x' \in \mathcal{D}} \Big[ - y' \ln p - (1-y') \ln (1 - p) \Big].
$$
Notice that only one of these two terms will be non-zero for each x ′ x' x ′ .
Personally, I hate this compact form: it makes a simple thing look unnecessarily complicated.
But it is the standard formulation that people commonly use.
Notice that we are summing over the data samples, not over all possible values of $x$.
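To make the equivalence of the two forms concrete, here is a small sketch (my own; the toy `samples` list and `p` are arbitrary) that computes the case-based sum and the $y'$ form and checks that they agree:

```python
import numpy as np

# Toy dataset of observed coin tosses and an arbitrary model parameter p = Q(x = H).
samples = ["T", "H", "H", "T", "H"]
p = 0.7

# Case-based form: contribute -ln p for each H and -ln(1 - p) for each T.
case_based = sum(-np.log(p) if x == "H" else -np.log(1 - p) for x in samples)

# Binary-label form: y' = 1 for H, 0 for T, then sum of -y' ln p - (1 - y') ln(1 - p).
y = np.array([1.0 if x == "H" else 0.0 for x in samples])
label_based = float(np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p)))

print(case_based, label_based)  # the two forms give the same value
assert np.isclose(case_based, label_based)
```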