# Variational Inference

## References

David Blei has many good write-ups / talks on this topic

- https://youtu.be/Dv86zdWjJKQ
- https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf

Others

- https://ermongroup.github.io/cs228-notes/inference/variational/
- https://turing.ml/dev/docs/for-developers/variational_inference
- TTIC-31230 - Fundamentals of Deep Learning (ELBO)
- CS 285 (Sergey Levine)

# Intro

Assumption:

- observations $x = x_{1:n}$
- hidden variables $z = z_{1:m}$

We want to model the posterior distribution (notice that this is a form of *inference*: estimating a hidden variable from observations):

$$p(z \mid x) = \frac{p(z, x)}{p(x)} = \frac{p(x \mid z)\, p(z)}{\int_z p(x \mid z)\, p(z)\, dz}$$

The posterior links the data and a model.

In most interesting problems, calculating the denominator, the evidence $p(x) = \int_z p(x \mid z)\, p(z)\, dz$, is not tractable.

$x$ is the *evidence* about $z$.

The main idea: pick a family of distributions over the latent variables, parameterized by its own **variational parameters**:

$$q(z_{1:m} \mid \nu)$$

then find a setting of $\nu$ that makes $q$ close to the posterior $p(z \mid x)$ of interest.

The *closeness* can be measured by the Kullback-Leibler (KL) divergence:

$$\mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_q\!\left[\log \frac{q(z)}{p(z \mid x)}\right]$$

We use the arguments in this order, $\mathrm{KL}(q \,\|\, p)$. Reversing them, $\mathrm{KL}(p \,\|\, q)$, leads to *expectation propagation*, a different kind of variational inference that is in general more computationally expensive.
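
A quick numerical sketch of this definition and of the asymmetry between the two argument orders; the two Gaussians here are arbitrary choices of mine, purely for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two arbitrary 1-D Gaussians standing in for q and p (illustrative assumption).
q = norm(0.0, 1.0)
p = norm(1.0, 0.5)

def kl_mc(a, b, n=500_000):
    """Monte Carlo estimate of KL(a || b) = E_{z ~ a}[log a(z) - log b(z)]."""
    z = a.rvs(size=n, random_state=rng)
    return np.mean(a.logpdf(z) - b.logpdf(z))

print("KL(q || p) ~", kl_mc(q, p))  # direction used in variational inference
print("KL(p || q) ~", kl_mc(p, q))  # direction used by expectation propagation
```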

We cannot minimize this KL divergence directly: it depends on the very posterior $p(z \mid x)$ (and hence on the intractable evidence $p(x)$) that we are trying to approximate.

But we can maximize a quantity that equals the negative KL divergence up to a constant: the ELBO.

**The Evidence Lower Bound (ELBO)**

$$\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(z, x)] - \mathbb{E}_q[\log q(z)]$$

Expanding $p(z \mid x) = p(z, x)/p(x)$ inside the KL divergence ties the two together:

$$\mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z, x)] + \log p(x) = -\,\mathrm{ELBO}(q) + \log p(x)$$

Notes:

- The last term, $\log p(x)$, is independent of $q$, thus: **minimizing the KL divergence** is equivalent to **maximizing the ELBO**.

The difference between the ELBO and the (negative) KL divergence is the log normalizer $\log p(x)$. Since the KL divergence is non-negative, $\mathrm{ELBO}(q) \le \log p(x)$: the ELBO is a lower bound on the log *evidence*, which is what the name says.
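
A minimal numerical check of the bound, assuming a toy conjugate Gaussian model of my own choosing in which $\log p(x)$ and the exact posterior are available in closed form:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model (illustrative assumption): z ~ N(0, 1), x | z ~ N(z, 1).
# Then p(x) = N(x; 0, 2) and p(z | x) = N(x / 2, 1 / 2) in closed form.
x = 1.5

def elbo(q_mean, q_std, n=200_000):
    """Monte Carlo estimate of ELBO(q) = E_q[log p(z, x) - log q(z)] for q = N(q_mean, q_std^2)."""
    z = rng.normal(q_mean, q_std, size=n)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)  # log p(z) + log p(x | z)
    return np.mean(log_joint - norm.logpdf(z, q_mean, q_std))

log_evidence = norm.logpdf(x, 0.0, np.sqrt(2.0))              # exact log p(x)
print("log p(x)               :", log_evidence)
print("ELBO with q = N(0, 1)  :", elbo(0.0, 1.0))             # strictly below log p(x)
print("ELBO with q = posterior:", elbo(x / 2, np.sqrt(0.5)))  # matches log p(x), KL = 0
```

The gap between each printed ELBO value and $\log p(x)$ is exactly the KL divergence between that $q$ and the true posterior.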

# Variational Auto Encoder (Pretty much the same thing)

Latent variable models:

$$P_{\Phi, \Theta}(x, z) = P_\Phi(z)\, P_\Theta(x \mid z)$$

We have a data population, so we want to estimate the model parameters by maximum likelihood:

$$\Phi^*, \Theta^* = \operatorname*{argmax}_{\Phi, \Theta} \; \mathbb{E}_{x \sim \mathrm{Pop}}\big[\log P_{\Phi, \Theta}(x)\big]$$

The problem is that we can't typically compute $P_{\Phi, \Theta}(x)$:

- Computing $P_{\Phi, \Theta}(x) = \int_z P_\Phi(z)\, P_\Theta(x \mid z)\, dz$ directly doesn't work, as the sum over $z$ is too large (a small numerical illustration follows the list).
- The same sum, but with importance sampling using $P_{\Phi, \Theta}(z \mid x)$ as the proposal, is a better idea but doesn't work: the posterior $P_{\Phi, \Theta}(z \mid x)$ is itself intractable, since computing it requires the very normalizer $P_{\Phi, \Theta}(x)$ we are after.
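
A small numerical illustration of the first point (the model, dimension, and noise scale are assumptions of mine): even in 20 dimensions, prior samples essentially never land where the likelihood has mass, so the naive Monte Carlo estimate of $\log P_{\Phi, \Theta}(x)$ comes out far too low:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed toy latent-variable model: z ~ N(0, I_d), x | z ~ N(z, 0.1^2 I_d),
# so the exact marginal is p(x) = N(x; 0, (1 + 0.01) I_d).
d = 20
x = rng.normal(0.0, np.sqrt(1.01), size=d)
exact = norm.logpdf(x, 0.0, np.sqrt(1.01)).sum()

# Naive Monte Carlo: p(x) ~ mean over prior samples z of p(x | z).
z = rng.normal(0.0, 1.0, size=(100_000, d))
log_lik = norm.logpdf(x, z, 0.1).sum(axis=1)
naive = log_lik.max() + np.log(np.mean(np.exp(log_lik - log_lik.max())))  # log-sum-exp

print("exact log p(x):", exact)
print("naive estimate:", naive)  # dramatically lower: almost all samples contribute ~0
```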

Variational Bayes sidesteps this with a model $P_\Psi(z \mid x)$ that approximates the intractable posterior $P_{\Phi, \Theta}(z \mid x)$.

The ELBO:

$$\log P_{\Phi, \Theta}(x) \;\ge\; \mathbb{E}_{z \sim P_\Psi(z \mid x)}\!\left[\log \frac{P_\Phi(z)\, P_\Theta(x \mid z)}{P_\Psi(z \mid x)}\right]$$

Thus, training maximizes the ELBO (equivalently, minimizes its negation) over all three sets of parameters:

$$\Phi^*, \Theta^*, \Psi^* = \operatorname*{argmin}_{\Phi, \Theta, \Psi} \; \mathbb{E}_{x \sim \mathrm{Pop},\; z \sim P_\Psi(z \mid x)}\!\left[\log \frac{P_\Psi(z \mid x)}{P_\Phi(z)} - \log P_\Theta(x \mid z)\right]$$

Minor but important: we can't do gradient descent w.r.t. $\Psi$ naively, because $z$ is sampled from $P_\Psi(z \mid x)$ inside the expectation; the reparameterization trick handles this (see the sketch after the list below).

- $P_\Phi(z)$: the prior
- $P_\Psi(z \mid x)$: the encoder
- $P_\Theta(x \mid z)$: the decoder
- $\mathbb{E}\big[\log P_\Psi(z \mid x) / P_\Phi(z)\big]$: the *rate term*, a KL divergence
- $\mathbb{E}\big[-\log P_\Theta(x \mid z)\big]$: the *distortion term*, a conditional entropy
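
Below is a minimal code sketch of this objective, assuming a Gaussian encoder, a Bernoulli (sigmoid) decoder, a fixed standard-normal prior (so $\Phi$ has no free parameters here), and arbitrary layer sizes; none of these specifics come from the notes above, they are just common concrete choices. It shows the reparameterization trick and the rate + distortion split of the negative ELBO:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Sketch of a VAE: Gaussian encoder P_psi(z|x), Bernoulli decoder P_theta(x|z),
    standard-normal prior P_phi(z). Sizes are arbitrary illustrative choices."""

    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.enc_mean = nn.Linear(hidden, z_dim)
        self.enc_logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def loss(self, x):
        h = self.enc(x)
        mean, logvar = self.enc_mean(h), self.enc_logvar(h)

        # Reparameterization trick: z = mean + sigma * eps, so gradients flow
        # through the sampling step back into the encoder parameters Psi.
        eps = torch.randn_like(mean)
        z = mean + torch.exp(0.5 * logvar) * eps

        # Rate term: KL( P_psi(z|x) || P_phi(z) ), closed form for two Gaussians.
        rate = 0.5 * torch.sum(mean ** 2 + logvar.exp() - logvar - 1.0, dim=-1)

        # Distortion term: E[-log P_theta(x|z)], here a Bernoulli reconstruction loss.
        logits = self.dec(z)
        distortion = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)

        return (rate + distortion).mean()  # negative ELBO, averaged over the batch

# Usage: one optimization step on a random batch of fake "images" in [0, 1].
vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
x = torch.rand(32, 784)
vae.loss(x).backward()
opt.step()
```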

## Rate-Distortion Autoencoders (mathematically the same as VAE)

Setting: image compression, where an image $x$ is compressed into a code $z$.

We assume a stochastic compression algorithm (*encoder*) $P_\text{enc}(z \mid x)$.

- $H(z)$: the number of bits needed for the compressed file. This is the *rate* (bits / image) for transmitting compressed images.
  - This is modeled with a prior model $P_\text{pri}(z)$.
- $H(x \mid z)$: the number of additional bits needed to exactly recover $x$. This is a measure of the *distortion* of $x$.
  - This is modeled with a decoder model $P_\text{dec}(x \mid z)$.
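
One way to make the "mathematically the same as VAE" claim in the heading explicit is to write the rate-plus-distortion objective with the models above (the renaming is mine, matching the previous section's notation):

$$\mathbb{E}_{x,\; z \sim P_\text{enc}(z \mid x)}\!\left[\log \frac{P_\text{enc}(z \mid x)}{P_\text{pri}(z)} \;-\; \log P_\text{dec}(x \mid z)\right]$$

Term for term this is the negative ELBO from the VAE section, with $P_\text{pri} = P_\Phi$, $P_\text{enc} = P_\Psi$, and $P_\text{dec} = P_\Theta$.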