Regularization

BatchNorm

Normalizes each feature based on the batch statistics

Given a batch of $bs = 32$ examples with feature dim $d = 50$, compute the mean and stddev over the $32$ examples, per feature.

BatchNorm stats: $\mu, \sigma \in \mathbb{R}^{50}$.

Using this, the BatchNorm layer computes the output $y$:

$$
\begin{align}
z_i &= \frac{x_i - \mu_i}{\sigma_i} \\
y_i &= \gamma_i z_i + \beta_i
\end{align}
$$

$\gamma, \beta \in \mathbb{R}^{d}$ are the parameters to learn.
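As a sketch, a minimal NumPy version of this forward pass in training mode (the function name and the small `eps` added for numerical stability are illustrative, not from any particular library):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Minimal BatchNorm forward pass (training mode).

    x: (bs, d) batch; gamma, beta: (d,) learnable scale and shift.
    Stats are computed over the batch axis: one mean/stddev per feature.
    """
    mu = x.mean(axis=0)           # (d,) per-feature mean over the batch
    sigma = x.std(axis=0)         # (d,) per-feature stddev over the batch
    z = (x - mu) / (sigma + eps)  # normalize each feature
    return gamma * z + beta       # learnable scale and shift

# bs=32 examples with feature dim d=50, as in the text above
x = np.random.randn(32, 50) * 3.0 + 1.0
y = batchnorm_forward(x, gamma=np.ones(50), beta=np.zeros(50))
```

With $\gamma = 1$ and $\beta = 0$, each of the 50 output features has mean $\approx 0$ and stddev $\approx 1$ across the batch.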

Batch stats during inference

BatchNorm's output depends on the other examples in the batch, which is inconvenient for flexible inference (e.g., predicting on a single example). To resolve this, BatchNorm maintains running stats of the batch mean and stddev throughout training and uses them at inference time, instead of computing the batch stats on the fly.
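A sketch of how those running stats might be maintained and used (the exponential-moving-average update with `momentum` follows the common PyTorch-style convention; function names are illustrative):

```python
import numpy as np

def batchnorm_update_running(x, running_mu, running_var, momentum=0.1):
    """One training step's update of the running statistics.

    EMA update: new = (1 - momentum) * old + momentum * batch_stat.
    """
    batch_mu = x.mean(axis=0)
    batch_var = x.var(axis=0)
    running_mu = (1 - momentum) * running_mu + momentum * batch_mu
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return running_mu, running_var

def batchnorm_inference(x, running_mu, running_var, gamma, beta, eps=1e-5):
    """Inference: use stored running stats, so even a single example works."""
    z = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * z + beta

# simulate some training batches, then run inference on a single example
rng = np.random.default_rng(0)
mu_run, var_run = np.zeros(50), np.ones(50)
for _ in range(100):
    batch = rng.normal(loc=2.0, scale=1.5, size=(32, 50))
    mu_run, var_run = batchnorm_update_running(batch, mu_run, var_run)

single = rng.normal(loc=2.0, scale=1.5, size=(1, 50))
out = batchnorm_inference(single, mu_run, var_run, np.ones(50), np.zeros(50))
```

After enough batches, the running mean converges toward the true feature mean, and inference no longer requires a full batch.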

LayerNorm

Normalizes each feature based on the statistics of the feature itself.

Examples within a batch no longer interact with each other. Given a batch of $bs = 32$ examples with feature dim $d = 50$, compute the mean and stddev within each example, over the feature dim $d = 50$.

LayerNorm stats for the batch: $\mu, \sigma \in \mathbb{R}^{32}$.

Using this, LayerNorm computes the output $y$:

$$
\begin{align}
z_i &= \frac{x_i - \mu^{(n)}}{\sigma^{(n)}} \\
y_i &= \gamma_i z_i + \beta_i
\end{align}
$$

$\mu^{(n)}$ and $\sigma^{(n)}$ simply refer to the scalar values corresponding to the $n$-th example. $\gamma, \beta \in \mathbb{R}^{d}$ are the parameters to learn.
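The contrast with BatchNorm shows up directly in the reduction axis: normalize over features (axis 1) instead of the batch (axis 0). A minimal sketch (function name and `eps` are illustrative):

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """Minimal LayerNorm forward pass.

    x: (bs, d) batch; gamma, beta: (d,) learnable scale and shift.
    Stats are scalars per example, computed over the feature axis,
    so examples in the batch never interact.
    """
    mu = x.mean(axis=1, keepdims=True)    # (bs, 1) per-example mean
    sigma = x.std(axis=1, keepdims=True)  # (bs, 1) per-example stddev
    z = (x - mu) / (sigma + eps)          # normalize within each example
    return gamma * z + beta

x = np.random.randn(32, 50)
y = layernorm_forward(x, gamma=np.ones(50), beta=np.zeros(50))
```

Because each example is normalized using only its own statistics, the output for one example is identical whether it is processed alone or inside a batch, so no running stats are needed at inference time.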