References
David Blei has many good write-ups / talks on this topic
Intro
Assumptions:
- $x = x_{1:n}$: observations
- $z = z_{1:m}$: hidden variables
We want to model the posterior distribution (notice that this is a form of inference: estimating a hidden variable from observations):
$$
p(z \mid x) = \frac{p(z, x)}{\int_z p(z, x)\, dz}
$$
The posterior links the data and a model.
In most interesting problems, computing the denominator (the evidence $p(x)$) is intractable.
x is the evidence about z.
The main idea: pick a family of distributions over the latent variables, parameterized by its own variational parameters:
$$
q(z_{1:m} \mid \nu)
$$
then find the setting of $\nu$ that makes $q$ closest to the posterior of interest.
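For concreteness, one common choice of family (an illustrative example, not something forced by the setup above) is the mean-field family, where the latent variables are independent under $q$ and each gets its own variational parameter:
$$
q(z_{1:m} \mid \nu) = \prod_{j=1}^{m} q_j(z_j \mid \nu_j), \qquad \text{e.g. } q_j(z_j \mid \nu_j) = \mathcal{N}(z_j \mid \mu_j, \sigma_j^2), \quad \nu_j = (\mu_j, \sigma_j^2).
$$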

The closeness can be measured by the Kullback-Leibler (KL) divergence:
$$
\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{Z \sim q}\!\left[\log \frac{q(Z)}{p(Z \mid x)}\right]
$$
We write the arguments ($q$ first, then $p$) in this order specifically so that the expectation is taken over $q$. If you flip the order (i.e., $\mathrm{KL}(p \,\|\, q)$), you get expectation propagation, a different kind of variational inference that is generally more computationally expensive.
We cannot minimize this KL divergence directly, because evaluating it requires the posterior $p(z \mid x)$ itself (equivalently, the intractable evidence $p(x)$).
But we can maximize a quantity that equals the negative KL divergence up to a constant: the evidence lower bound (ELBO).
The Evidence Lower Bound (ELBO)
$$
\begin{aligned}
\mathrm{KL}(q(z) \,\|\, p(z \mid x)) &= \mathbb{E}_q\!\left[\log \frac{q(Z)}{p(Z \mid x)}\right] \\
&= \mathbb{E}_q[\log q(Z)] - \mathbb{E}_q[\log p(Z \mid x)] \\
&= \mathbb{E}_q[\log q(Z)] - \mathbb{E}_q[\log p(Z, x) - \log p(x)] \\
&= \mathbb{E}_q[\log q(Z)] - \mathbb{E}_q[\log p(Z, x)] + \log p(x) \\
&= -\bigl(\mathbb{E}_q[\log p(Z, x)] - \mathbb{E}_q[\log q(Z)]\bigr) + \log p(x)
\end{aligned}
$$
Notes:
- The last term $\log p(x)$ does not depend on $q$, thus:
- Minimizing the KL divergence is equivalent to maximizing the ELBO, $\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(Z, x)] - \mathbb{E}_q[\log q(Z)]$
:::message
ELBO derivation
Using Jensen's inequality:
$$
\begin{aligned}
\log p(x) &= \log \int_z p(x, z)\, dz = \log \int_z q(z)\, \frac{p(x, z)}{q(z)}\, dz \\
&= \log \mathbb{E}_q\!\left[\frac{p(x, Z)}{q(Z)}\right] \\
&\geq \mathbb{E}_q\!\left[\log \frac{p(x, Z)}{q(Z)}\right] = \mathbb{E}_q[\log p(x, Z)] - \mathbb{E}_q[\log q(Z)] \quad (\because \text{Jensen's inequality})
\end{aligned}
$$
Another derivation: forcibly extract KL divergence
$$
\begin{aligned}
\log p(x) &= \mathbb{E}_q[\log p(x)] \\
&= \mathbb{E}_q\!\left[\log \left\{ p(x) \cdot \frac{p(z \mid x)\, q(z)}{q(z)\, p(z \mid x)} \right\}\right] \quad (\text{stupid technique to make the KL term appear}) \\
&= \mathbb{E}_q\!\left[\log \frac{p(x, z)}{q(z)} + \log \frac{q(z)}{p(z \mid x)}\right] \\
&= \mathbb{E}_q[\log p(x, z) - \log q(z)] + \mathbb{E}_q\!\left[\log \frac{q(z)}{p(z \mid x)}\right] \\
&= \mathbb{E}_q[\log p(x, z) - \log q(z)] + \mathrm{KL}(q(z) \,\|\, p(z \mid x)) \\
&\geq \mathbb{E}_q[\log p(x, Z)] - \mathbb{E}_q[\log q(Z)] \quad (\because \text{KL divergence is non-negative})
\end{aligned}
$$
The left-hand side, $\log p(x)$, is the log of the evidence probability; hence the name Evidence Lower BOund (ELBO).
:::
The ELBO and the KL divergence differ only by the log normalizer: $\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}(q(z) \,\|\, p(z \mid x))$. Since the KL divergence is non-negative, the ELBO lower-bounds the log normalizer $\log p(x)$.
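As a quick sanity check on this identity, here is a minimal Python sketch (the tiny discrete joint $p(z, x)$ and the candidate $q$'s are made-up illustrative numbers) verifying that $\mathrm{ELBO}(q) + \mathrm{KL}(q \,\|\, p(\cdot \mid x)) = \log p(x)$ for any $q$, so a higher ELBO means a smaller KL:

```python
import numpy as np

# Toy joint p(z, x) for one fixed observed x and a discrete latent z in {0, 1, 2}.
p_joint = np.array([0.20, 0.15, 0.05])       # p(z, x) for z = 0, 1, 2
log_px = np.log(p_joint.sum())               # log evidence: log sum_z p(z, x)
p_post = p_joint / p_joint.sum()             # exact posterior p(z | x)

def elbo(q):
    # E_q[log p(Z, x)] - E_q[log q(Z)]
    return np.sum(q * (np.log(p_joint) - np.log(q)))

def kl_to_posterior(q):
    # KL(q(z) || p(z | x))
    return np.sum(q * (np.log(q) - np.log(p_post)))

# A few candidate variational distributions, including the exact posterior.
for q in [np.array([1/3, 1/3, 1/3]), np.array([0.6, 0.3, 0.1]), p_post]:
    print(f"ELBO={elbo(q):+.4f}  KL={kl_to_posterior(q):.4f}  "
          f"ELBO+KL={elbo(q) + kl_to_posterior(q):.4f}  log p(x)={log_px:.4f}")
```

The last candidate (the exact posterior) attains the maximum possible ELBO, equal to $\log p(x)$, with zero KL.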
Variational Autoencoder (pretty much the same thing)
Latent variable models:
$$
\begin{aligned}
P_{\Phi, \Theta}(z, x) &= P_\Phi(z)\, P_\Theta(x \mid z) \\
P_{\Phi, \Theta}(z \mid x) &= \frac{P_{\Phi, \Theta}(z, x)}{\int_z P_{\Phi, \Theta}(z, x)\, dz}
\end{aligned}
$$
We have a data population, so we want to estimate $\Phi$ and $\Theta$ from it:
$$
\Phi^*, \Theta^* = \operatorname*{argmin}_{\Phi, \Theta} \; \mathbb{E}_{x \sim \mathrm{Pop}}\!\left[-\log P_{\Phi, \Theta}(x)\right]
$$
The problem is that we typically can't compute $P_{\Phi, \Theta}(x)$.
- Computing $P_{\Phi, \Theta}(x) = \int_z P_\Phi(z)\, P_\Theta(x \mid z)\, dz$ directly doesn't work, as the sum/integral over $z$ is too large.
- Estimating the same sum by importance sampling with $P_{\Phi, \Theta}(z \mid x)$ as the proposal would be a better idea, but it doesn't work either: sampling from (or even evaluating) the true posterior $P_{\Phi, \Theta}(z \mid x)$ is itself intractable, since it requires the very normalizer $P_{\Phi, \Theta}(x)$ we are trying to compute.
Variational Bayes sidesteps this with a model $P_\Psi(z \mid x)$ that approximates $P_{\Phi, \Theta}(z \mid x)$.
The ELBO:
$$
\begin{aligned}
\log P_{\Phi, \Theta}(x) &\geq \mathbb{E}_{z \sim P_\Psi}\!\left[\log P_{\Phi, \Theta}(z, x)\right] - \mathbb{E}_{z \sim P_\Psi}\!\left[\log P_\Psi(z \mid x)\right] \\
&= \mathbb{E}_{z \sim P_\Psi}\!\left[\log P_\Theta(x \mid z)\, P_\Phi(z)\right] - \mathbb{E}_{z \sim P_\Psi}\!\left[\log P_\Psi(z \mid x)\right] \\
&= \mathbb{E}_{z \sim P_\Psi}\!\left[-\left(\log \frac{P_\Psi(z \mid x)}{P_\Phi(z)} - \log P_\Theta(x \mid z)\right)\right]
\end{aligned}
$$
Thus,
$$
\Phi^*, \Theta^*, \Psi^* = \operatorname*{argmin}_{\Phi, \Theta, \Psi} \; \mathbb{E}_{x \sim \mathrm{Pop},\, z \sim P_\Psi}\!\left[\log \frac{P_\Psi(z \mid x)}{P_\Phi(z)} - \log P_\Theta(x \mid z)\right]
$$
Minor but important: we can't do gradient descent w.r.t. $\Psi$ directly, because $z$ is produced by a sampling procedure $z \sim P_\Psi(z \mid x)$. We use the reparameterization trick to circumvent this (see the sketch after the list below).
- $P_\Phi(z)$: the prior
- $P_\Psi(z \mid x)$: the encoder
- $P_\Theta(x \mid z)$: the decoder
- $\mathbb{E}\!\left[\log \frac{P_\Psi(z \mid x)}{P_\Phi(z)}\right]$: the rate term, a KL divergence
- $\mathbb{E}\!\left[-\log P_\Theta(x \mid z)\right]$: the distortion term, a conditional entropy
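To make these pieces concrete, here is a minimal PyTorch sketch of the objective above (the layer sizes, the Gaussian encoder, the fixed standard-normal prior, and the Bernoulli decoder are all illustrative assumptions, not anything prescribed by the derivation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Gaussian encoder P_psi(z|x), standard-normal prior P_phi(z), Bernoulli decoder P_theta(x|z)."""

    def __init__(self, x_dim=784, z_dim=16, h_dim=128):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)

        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so the randomness is independent of Psi and gradients flow through mu, logvar.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps

        logits = self.dec(z)  # parameters of the Bernoulli decoder P_theta(x|z)

        # Rate term: KL(P_psi(z|x) || P_phi(z)), closed form for a Gaussian encoder
        # and a standard-normal prior.
        rate = 0.5 * torch.sum(mu**2 + logvar.exp() - 1.0 - logvar, dim=-1)

        # Distortion term: E[-log P_theta(x|z)], estimated with the single sample z.
        distortion = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=-1)

        return (rate + distortion).mean()  # negative ELBO, averaged over the batch

# Usage sketch: minimize rate + distortion over Theta and Psi
# (the prior Phi is fixed to N(0, I) here, so it has no learnable parameters).
model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)          # stand-in batch of data in [0, 1]
opt.zero_grad()
loss = model(x)
loss.backward()
opt.step()
```

With this choice of Gaussian encoder and standard-normal prior, the rate term (the KL) has a closed form, so only the distortion term needs the sampled $z$.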
:::message
Some additional points covered in TTIC31230:
- The EM (Expectation-Maximization) algorithm is in fact a specific instantiation of the VAE objective!
EM corresponds to minimizing the VAE objective in alternation (restated as equations right after this box):
- First w.r.t. the encoder ($\Psi$): inference step -- the E step
- Then w.r.t. $\Phi$ and $\Theta$, while keeping $\Psi$ fixed: update step -- the M step
- The VAE is exactly the same as the Rate-Distortion Autoencoder (RDA)
:::
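For reference, here is how the two EM steps look in the notation above (a standard restatement; the claim that this is coordinate minimization of the VAE objective is from the box above):
$$
\begin{aligned}
\text{E step:}\quad & P_\Psi(z \mid x) \leftarrow P_{\Phi, \Theta}(z \mid x) && \text{(makes the bound tight: the KL term becomes } 0\text{)} \\
\text{M step:}\quad & \Phi, \Theta \leftarrow \operatorname*{argmin}_{\Phi, \Theta} \; \mathbb{E}_{x \sim \mathrm{Pop},\, z \sim P_\Psi}\!\left[-\log P_{\Phi, \Theta}(z, x)\right] && (\Psi \text{ fixed})
\end{aligned}
$$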
Rate-Distortion Autoencoders (mathematically the same as VAE)
Setting: image compression, where an image $x$ is compressed to $z$.
We assume a stochastic compression algorithm (encoder): $P_{\mathrm{enc}}(z \mid x)$
- $H(z)$: the number of bits needed for the compressed file. This is the rate (bits/image) for transmitting compressed images.
  - This is modeled with a prior model $P_{\mathrm{pri}}(z)$
- $H(x \mid z)$: the number of additional bits needed to exactly recover $x$. This is a measure of the distortion of $x$.
  - This is modeled with a decoder model $P_{\mathrm{dec}}(x \mid z)$
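Putting the two terms together gives the rate-distortion objective, which is just the VAE objective from above with the distributions relabeled as $P_{\mathrm{enc}}, P_{\mathrm{pri}}, P_{\mathrm{dec}}$:
$$
\operatorname*{argmin}_{\mathrm{enc},\, \mathrm{pri},\, \mathrm{dec}} \; \mathbb{E}_{x \sim \mathrm{Pop},\, z \sim P_{\mathrm{enc}}}\!\left[\, \underbrace{\log \frac{P_{\mathrm{enc}}(z \mid x)}{P_{\mathrm{pri}}(z)}}_{\text{rate}} \; \underbrace{-\, \log P_{\mathrm{dec}}(x \mid z)}_{\text{distortion}} \,\right]
$$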