Variational Inference
References
David Blei has many good write-ups / talks on this topic
- https://youtu.be/Dv86zdWjJKQ
- https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
Others
- https://ermongroup.github.io/cs228-notes/inference/variational/
- https://turing.ml/dev/docs/for-developers/variational_inference
- TTIC-31230 - Fundamentals of Deep Learning (ELBO)
- CS 285 (Sergey Levine)
Intro
Assumptions:
- observations $x = x_{1:n}$
- hidden variables $z = z_{1:m}$
We want to model the posterior distribution (notice that this is a form of inference: estimating a hidden variable from observations):
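Presumably the equation that belongs here is Bayes' rule for the posterior (the denominator is the marginal likelihood, or evidence):

$$
p(z \mid x) \;=\; \frac{p(x, z)}{p(x)} \;=\; \frac{p(x \mid z)\, p(z)}{\int_z p(x, z)\, dz}
$$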
The posterior links the data and a model.
In most interesting problems, calculating the denominator $p(x)$ is not tractable.
x is the evidence about z.
The main idea: pick a family of distributions over the latent variables, parameterized by its own variational parameters, $q(z_{1:m} \mid \nu)$, and then find the setting of $\nu$ that makes $q$ close to the posterior of interest.
The closeness can be measured by Kullback-Leibler (KL) divergence:
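Spelled out (this is the standard definition, using the variational family $q(z \mid \nu)$ from above):

$$
\mathrm{KL}\big(q(z \mid \nu)\,\|\,p(z \mid x)\big) = \mathbb{E}_{q(z \mid \nu)}\!\left[\log \frac{q(z \mid \nu)}{p(z \mid x)}\right],
\qquad
\nu^* = \arg\min_\nu \mathrm{KL}\big(q(z \mid \nu)\,\|\,p(z \mid x)\big)
$$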
We use the arguments in this order, $\mathrm{KL}(q\,\|\,p)$ rather than $\mathrm{KL}(p\,\|\,q)$: the expectation is then taken with respect to $q$, which we can evaluate and sample from, whereas the reverse direction would require expectations under the intractable posterior $p(z \mid x)$.
We cannot minimize this KL divergence directly, because expanding it produces the intractable evidence term $\log p(x)$ (equivalently, it requires the very posterior $p(z \mid x)$ we are trying to approximate).
But we can instead maximize a function that equals the negative KL divergence up to a constant: the Evidence Lower Bound (ELBO).
The Evidence Lower Bound (ELBO)
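Presumably the derivation that belongs here is the standard one (it appears in the cited Blei notes): expand the KL divergence and separate out the terms that depend on $q$.

$$
\begin{aligned}
\mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)
&= \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z \mid x)] \\
&= \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z, x)] + \log p(x), \\[4pt]
\mathrm{ELBO}(q) &:= \mathbb{E}_q[\log p(z, x)] - \mathbb{E}_q[\log q(z)],
\qquad\text{so}\quad
\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big).
\end{aligned}
$$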
Notes:
- The last term, $\log p(x)$, is independent of $q$, thus:
  - Minimizing the KL divergence is equivalent to maximizing the ELBO.
The ELBO and the (negative) KL divergence differ by the log normalizer: since $\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}(q\,\|\,p)$ and the KL divergence is non-negative, $\log p(x) \ge \mathrm{ELBO}(q)$. The log normalizer (the evidence) is exactly what the ELBO bounds from below, hence the name.
Variational Auto Encoder (Pretty much the same thing)
Latent variable models: $P_{\Phi, \Theta}(x) = \int_z P_\Phi(z)\, P_\Theta(x \mid z)\, dz$
We have a data population, so we want to estimate the parameters $\Phi$ and $\Theta$ (e.g., by maximizing the likelihood of the observed data).
The problem is that we can't typically compute $P_{\Phi, \Theta}(x)$:
- Evaluating the integral $\int_z P_\Phi(z)\, P_\Theta(x \mid z)\, dz$ directly doesn't work, as the sum over $z$ is too large.
- The same sum but with importance sampling with $P_{\Phi, \Theta}(z \mid x)$ is a better idea, but it doesn't work: the posterior $P_{\Phi, \Theta}(z \mid x) = P_\Phi(z)\, P_\Theta(x \mid z) / P_{\Phi, \Theta}(x)$ is itself intractable, since evaluating it requires the very marginal $P_{\Phi, \Theta}(x)$ we are trying to compute.
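For reference, the importance-sampling rewrite alluded to above (a standard identity; $q$ denotes whatever proposal distribution is used, ideally close to $P_{\Phi, \Theta}(z \mid x)$):

$$
P_{\Phi, \Theta}(x)
= \int_z q(z \mid x)\, \frac{P_\Phi(z)\, P_\Theta(x \mid z)}{q(z \mid x)}\, dz
= \mathbb{E}_{z \sim q(\cdot \mid x)}\!\left[\frac{P_\Phi(z)\, P_\Theta(x \mid z)}{q(z \mid x)}\right]
$$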
Variational Bayes sidesteps this with a model $P_\Psi(z \mid x)$ (the encoder) that approximates the true posterior $P_{\Phi, \Theta}(z \mid x)$.
The ELBO:
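The bound itself is not written out in the notes; presumably it is the standard one in this section's notation, obtained by applying Jensen's inequality to the importance-sampling form above with $P_\Psi(z \mid x)$ as the proposal:

$$
\log P_{\Phi, \Theta}(x)
\;\ge\;
\mathbb{E}_{z \sim P_\Psi(z \mid x)}\!\left[\log \frac{P_\Phi(z)\, P_\Theta(x \mid z)}{P_\Psi(z \mid x)}\right]
\;=\; \mathrm{ELBO}(\Phi, \Theta, \Psi; x)
$$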
Thus, we maximize the ELBO over $\Phi$, $\Theta$, and $\Psi$ in place of the intractable log-likelihood.
Minor but important: we can't do gradient descent w.r.t. $\Psi$ naively, because $z$ is sampled from $P_\Psi(z \mid x)$ and the sampling step is not differentiable; this is what the reparameterization trick addresses (see the sketch after the list below).
- $P_\Phi(z)$: the prior
- $P_\Psi(z \mid x)$: the encoder
- $P_\Theta(x \mid z)$: the decoder
- $\mathbb{E}[\log P_\Psi(z \mid x) / P_\Phi(z)]$: rate term, KL divergence
- $\mathbb{E}[-\log P_\Theta(x \mid z)]$: distortion, conditional entropy
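A minimal sketch of how these pieces fit together, assuming a standard-normal prior, a Gaussian encoder, and a Bernoulli decoder, written in PyTorch (the class and variable names are illustrative, not from any of the referenced courses). It shows the reparameterization trick mentioned above and the rate/distortion split of the negative ELBO:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Gaussian encoder P_psi(z|x), standard-normal prior P_phi(z), Bernoulli decoder P_theta(x|z)."""

    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, z_dim)       # mean of P_psi(z|x)
        self.enc_logvar = nn.Linear(hidden, z_dim)   # log-variance of P_psi(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)

        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
        # Sampling the noise separately keeps z differentiable w.r.t. the encoder parameters.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps

        x_logits = self.dec(z)

        # Rate term: KL(P_psi(z|x) || P_phi(z)) in closed form for a standard-normal prior.
        rate = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1)

        # Distortion term: E[-log P_theta(x|z)], here a Bernoulli log-likelihood per pixel.
        distortion = F.binary_cross_entropy_with_logits(x_logits, x, reduction="none").sum(dim=1)

        # Negative ELBO = rate + distortion, averaged over the batch.
        return (rate + distortion).mean()

# Usage sketch: one gradient step on random data standing in for a batch of images.
if __name__ == "__main__":
    model = TinyVAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.rand(32, 784)  # stand-in for images with pixel values in [0, 1]
    loss = model(x)
    loss.backward()
    opt.step()
    print(f"negative ELBO: {loss.item():.3f}")
```

The closed-form Gaussian KL is used for the rate term here; with other priors or encoders one would fall back to a Monte Carlo estimate of $\log P_\Psi(z \mid x) / P_\Phi(z)$.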
Rate-Distortion Autoencoders (mathematically the same as VAE)
Setting: image compression, where an image $x$ must be transmitted using as few bits as possible.
We assume a stochastic compression algorithm (encoder) that maps the image to a latent code, $z \sim P_\text{enc}(z \mid x)$. Two quantities then matter:
- $H(z)$: the number of bits needed for the compressed file. This is the rate (bits/image) for transmitting compressed images.
  - This is modeled with a prior model $P_\text{pri}(z)$.
- $H(x \mid z)$: the number of additional bits needed to exactly recover $x$. This is a measure of the distortion of $x$.
  - This is modeled with a decoder model $P_\text{dec}(x \mid z)$.
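To make the "mathematically the same as VAE" claim concrete: the training objective can be written as rate plus distortion, which is exactly the negative ELBO of the previous section with $P_\text{pri}$ in the role of the prior $P_\Phi$ and $P_\text{dec}$ in the role of the decoder $P_\Theta$ (this spelled-out form is an addition, not from the notes):

$$
\min_{\text{enc},\, \text{pri},\, \text{dec}}\;
\mathbb{E}_{x,\; z \sim P_\text{enc}(z \mid x)}
\Big[\underbrace{\log \tfrac{P_\text{enc}(z \mid x)}{P_\text{pri}(z)}}_{\text{rate}}
\;+\;
\underbrace{-\log P_\text{dec}(x \mid z)}_{\text{distortion}}\Big]
\;=\; \mathbb{E}_x\big[-\mathrm{ELBO}\big]
$$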