Information Theory - Entropy
Oct 31, 2022

| Name | Equation | Interpretation |
| --- | --- | --- |
| Entropy | $H(P) = \mathbb{E}_{x\sim P} [- \ln P(x)]$ | Average of negative log probability. |
| Cross-Entropy | $H(P, Q) = \mathbb{E}_{x \sim P} [- \ln Q(x)]$ | The average number of nats (cf. bits) required to represent $x \sim P$ with a code based on $Q$. |
| KL Divergence | $KL(P, Q) = H(P, Q) - H(P) = \mathbb{E}_{x\sim P} \big[-\ln \frac{Q(x)}{P(x)}\big]$ | How many more nats (cf. bits) are required to represent $x$ with a code based on $Q$ rather than on the target distribution $P$. |
| *Conditional Entropy | $H(P(x \mid y)) = \mathbb{E}_{x, y \sim P(x,y)}\big[-\ln \frac{P(x, y)}{P(y)}\big] = H(P(x, y)) - H(P(y))$ | The remaining uncertainty about $x$ after observing $y$. |
| *Mutual Information | $I(x, y) = KL(P(x, y), P(x)P(y)) = \mathbb{E}_{x, y \sim P(x, y)}\big[-\ln\frac{P(x)P(y)}{P(x, y)}\big] = H(P(y)) - H(P(y \mid x)) = H(P(x)) - H(P(x \mid y))$ | How much knowing $x$ tells us about $y$ (and vice versa). |
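These definitions can be sanity-checked numerically. A minimal Python sketch (helper names are mine; distributions are lists of probabilities, logs in nats):

```python
import math

def entropy(P):
    """H(P) = E_{x~P}[-ln P(x)], in nats; 0 ln 0 is treated as 0."""
    return -sum(p * math.log(p) for p in P if p > 0)

def cross_entropy(P, Q):
    """H(P, Q) = E_{x~P}[-ln Q(x)], in nats."""
    return -sum(p * math.log(q) for p, q in zip(P, Q) if p > 0)

def kl_divergence(P, Q):
    """KL(P, Q) = H(P, Q) - H(P); always non-negative."""
    return cross_entropy(P, Q) - entropy(P)

P = [0.7, 0.3]  # target distribution
Q = [0.5, 0.5]  # model distribution
print(entropy(P))           # ≈ 0.6109
print(cross_entropy(P, Q))  # = ln 2 ≈ 0.6931
print(kl_divergence(P, Q))  # ≈ 0.0823, non-negative
```

Note that the cross-entropy against a uniform $Q$ is exactly $\ln 2$, regardless of $P$ (every outcome costs $-\ln \tfrac{1}{2}$ nats).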
KL Divergence is always non-negative (hence MI also is)
\begin{align*}
H(P, Q) \geq H(P) &\Leftrightarrow KL(P, Q) \geq 0 \\
&\Leftrightarrow I(P(x, y)) \geq 0
\end{align*}
►   Proof: Non-negativity of KL divergence
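A quick numerical sanity check of this inequality (the `kl` helper is mine): draw random distribution pairs and confirm Gibbs' inequality holds for each.

```python
import math
import random

def kl(P, Q):
    """KL(P, Q) = sum_x P(x) ln(P(x)/Q(x)), in nats."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

random.seed(0)
for _ in range(1000):
    # sample two random distributions over 4 outcomes
    a = [random.random() + 1e-12 for _ in range(4)]
    b = [random.random() + 1e-12 for _ in range(4)]
    P = [v / sum(a) for v in a]
    Q = [v / sum(b) for v in b]
    assert kl(P, Q) >= -1e-12  # non-negative up to float noise
print("KL(P, Q) >= 0 held for 1000 random pairs")
```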
Example: Binary classification 
Let's say we toss a biased coin. The coin can come up either Head or Tail.
We want to obtain a model that predicts if the coin gives Head or Tail.
Using a random variable $x$, we write $x = H$ for Head and $x = T$ for Tail.
Q(x) \coloneqq
\begin{cases}
p & (x = H) \\
1 - p & (x = T)
\end{cases},
where $0 \leq p \leq 1$.
👉 Entropy of $Q$
\begin{align}
H(Q)    &= \mathbb{E}_{x \sim Q} [-\ln Q(x)] \\
        &= \sum_{x \in \{H, T\}} - Q(x) \ln Q(x) \\
        &= -Q(x = H) \ln Q(x = H) - Q(x = T) \ln Q(x = T) \\
        &= -p \ln p - (1 - p) \ln (1 - p)
\end{align}
Now, what are the max and min of this entropy?
To find that out, let's think about the shape of $-p \ln p - (1 - p) \ln (1 - p)$ as a function of $p$.
[Figure: shape of the entropy $-p \ln p - (1-p) \ln (1-p)$ w.r.t. $p$]
There is a symmetry: both terms have the form $-\blacktriangle \ln \blacktriangle$, with $\blacktriangle = p$ in one and $\blacktriangle = 1 - p$ in the other, so swapping $p \leftrightarrow 1 - p$ leaves the sum unchanged.
Each term (e.g. $-p \ln p$) is concave, and adding two concave functions gives a new concave function.
Given that the symmetry is around $p = 1/2$ (where $p$ and $1 - p$ coincide) and the function is concave on $0 \leq p \leq 1$, the maximum must be at $p = 1/2$.
 
►   Think about $x \log x$
As a result:
Min: $H(Q) = 0$, at $p = 0$ or $p = 1$.
Max: $H(Q) = \ln 2$, at $p = 1/2$.
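A small Python check of these extremes (the function name is mine):

```python
import math

def bernoulli_entropy(p):
    """H(Q) = -p ln p - (1 - p) ln(1 - p), with 0 ln 0 := 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# min: a deterministic coin carries no uncertainty
print(bernoulli_entropy(0.0), bernoulli_entropy(1.0))  # 0.0 0.0
# max: a fair coin, H = ln 2 nats
print(bernoulli_entropy(0.5))  # ≈ 0.6931 = ln 2
# symmetry around p = 1/2
print(abs(bernoulli_entropy(0.3) - bernoulli_entropy(0.7)) < 1e-9)  # True
```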
 
What does this mean? (How to interpret this?) Note that $\exp(\ln 2) = 2$: the maximum entropy $\ln 2$ corresponds to $2$ equally likely outcomes, i.e. a fair coin is the hardest to predict.
👉 Cross-entropy: the first thing to try, given a target distribution $P(x)$
\begin{align}
H(P, Q) &\coloneqq \mathbb{E}_{x \sim P} \big[-\ln Q(x) \big] \\
        &= \sum_{x \in \{H, T\}} - P(x) \ln Q(x) \\
        &= -P(x = H) \ln Q(x = H) - P(x = T) \ln Q(x = T) \\
        &= -P(x = H) \ln p - P(x = T) \ln (1 - p)
\end{align}
Now, what should the target distribution $P(x)$ be?
Let's say this is the target distribution that is unknown to the model:
P(x) \coloneqq
\begin{cases}
p^\textrm{tgt} & (x = H) \\
1 - p^\textrm{tgt} & (x = T)
\end{cases},
Then, the cross-entropy would be
\begin{align}
H(P, Q) &= -P(x = H) \ln p - P(x = T) \ln (1 - p) \\
        &= -{\color{magenta}p^\textrm{tgt}} \ln p - (1 - {\color{magenta}p^\textrm{tgt}}) \ln (1 - p)
\end{align}
We can try to minimize this w.r.t. $p$; the minimum is attained at $p = p^\textrm{tgt}$ (check by setting the derivative to zero).
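Minimizing the cross-entropy w.r.t. $p$ indeed recovers $p^\textrm{tgt}$. A quick grid-search sketch in Python (names are mine):

```python
import math

def bernoulli_cross_entropy(p_tgt, p):
    """H(P, Q) = -p_tgt ln p - (1 - p_tgt) ln(1 - p)."""
    return -p_tgt * math.log(p) - (1 - p_tgt) * math.log(1 - p)

p_tgt = 0.8
grid = [i / 1000 for i in range(1, 1000)]  # avoid p = 0, 1
best = min(grid, key=lambda p: bernoulli_cross_entropy(p_tgt, p))
print(best)  # 0.8: the minimizer coincides with the target
```

Setting the derivative $-p^\textrm{tgt}/p + (1 - p^\textrm{tgt})/(1 - p)$ to zero gives the same answer analytically.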
👉 Cross-entropy: a rather standard thing to do. In practice, we only observe samples $\mathcal{D} = \{(x'=T),~(x'=H),~(x'=H),~\cdots\}$, so we define a target distribution for each sample (why does that make sense?).
When the sample is $x' = H$, set $P(x) \coloneqq 1 \text{ if } x = H, \text{ else } 0$.
When the sample is $x' = T$, set $P(x) \coloneqq 1 \text{ if } x = T, \text{ else } 0$.
Note that this is a mapping from a single sample $x'$ to a distribution $P(x)$.
But why can we change the (target) distribution $P(x)$ for each sample?
 
 
 
With these,
\begin{align}
H(P, Q) &\coloneqq \mathbb{E}_{x \sim P} \big[-\ln Q(x) \big] \\
        &= \sum_{x' \in \mathcal{D}} - P(x') \ln Q(x') \\
        &= \sum_{x' \in \mathcal{D}} - P(x=x') \ln Q(x=x') \\
        &= \sum_{x' \in \mathcal{D}}
        \begin{cases}
        - P(x=H) \ln Q(x=H) & \text{if } x' = H \\
        - P(x=T) \ln Q(x=T) & \text{if } x' = T
        \end{cases} \\
        &= \sum_{x' \in \mathcal{D}}
        \begin{cases}
        - \ln p & \text{if } x' = H \\
        - \ln (1-p) & \text{if } x' = T
        \end{cases}.
\end{align}
We can end this here. But a weird trick that people always do is to simplify this using a binary label $y'$ defined for each sample $x'$:
y ′ ≔ { 1 if  x ′ = H 0 if  x ′ = T . y' \coloneqq
\begin{cases}
1 & \text{if } x' = H \\
0 & \text{if } x' = T
\end{cases}.
Then, we can write the above as
H(P, Q) = \sum_{x' \in \mathcal{D}} - y' \ln p - (1-y') \ln (1 - p).
Notice that only one of these two terms will be non-zero for each $x'$.
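This per-sample form can be written directly in code. A minimal sketch (data and helper names are mine), showing that minimizing this sum over $p$ recovers the empirical frequency of heads:

```python
import math

def bce(samples, p):
    """Sum over the data of -y' ln p - (1 - y') ln(1 - p)."""
    total = 0.0
    for x in samples:
        y = 1 if x == "H" else 0  # binary label for the sample
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total

data = ["T", "H", "H", "H"]  # 3 heads out of 4 tosses
grid = [i / 1000 for i in range(1, 1000)]  # avoid p = 0, 1
best = min(grid, key=lambda p: bce(data, p))
print(best)  # 0.75, the empirical head frequency
```

This is exactly the maximum-likelihood estimate: minimizing the binary cross-entropy over the data is the same as maximizing the log-likelihood of the samples.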
Notice that we are summing over the data samples, not over all possible values of $x$.