Regularization

BatchNorm

Normalizes each feature based on the batch statistics

Given a batch of $bs = 32$ examples with feature dim $d = 50$, compute the mean and stddev over the $32$ examples, per feature.

BatchNorm stats: $\mu, \sigma \in \mathbb{R}^{50}$.

Using this, the BatchNorm layer computes the output $y$:

$$
\begin{align}
z_i &= \frac{x_i - \mu_i}{\sigma_i} \\
y_i &= \gamma_i z_i + \beta_i
\end{align}
$$

$\gamma, \beta \in \mathbb{R}^{d}$ are the parameters to learn.
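As a sketch, a minimal NumPy version of this forward pass in training mode (the function name and the small `eps` added for numerical stability are illustrative, not from any particular library):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Minimal BatchNorm forward pass (training mode).

    x: (bs, d) batch; gamma, beta: (d,) learnable scale and shift.
    Stats are computed over the batch axis: one mean/stddev per feature.
    """
    mu = x.mean(axis=0)           # (d,) per-feature mean over the batch
    sigma = x.std(axis=0)         # (d,) per-feature stddev over the batch
    z = (x - mu) / (sigma + eps)  # normalize each feature
    return gamma * z + beta       # learnable scale and shift

# bs=32 examples with feature dim d=50, as in the text above
x = np.random.randn(32, 50) * 3.0 + 1.0
y = batchnorm_forward(x, gamma=np.ones(50), beta=np.zeros(50))
```

With $\gamma = 1$ and $\beta = 0$, each of the 50 output features has mean $\approx 0$ and stddev $\approx 1$ across the batch.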

Batch stats during inference

BatchNorm's output depends on the other examples in the batch, which is inconvenient for flexible inference (e.g., predicting on a single example). To resolve this, BatchNorm maintains running stats of the batch mean and stddev throughout training and uses them at inference time, instead of computing the batch stats on the fly.
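A sketch of how those running stats might be maintained and used (the exponential-moving-average update with `momentum` follows the common PyTorch-style convention; function names are illustrative):

```python
import numpy as np

def batchnorm_update_running(x, running_mu, running_var, momentum=0.1):
    """One training step's update of the running statistics.

    EMA update: new = (1 - momentum) * old + momentum * batch_stat.
    """
    batch_mu = x.mean(axis=0)
    batch_var = x.var(axis=0)
    running_mu = (1 - momentum) * running_mu + momentum * batch_mu
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return running_mu, running_var

def batchnorm_inference(x, running_mu, running_var, gamma, beta, eps=1e-5):
    """Inference: use stored running stats, so even a single example works."""
    z = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * z + beta

# simulate some training batches, then run inference on a single example
rng = np.random.default_rng(0)
mu_run, var_run = np.zeros(50), np.ones(50)
for _ in range(100):
    batch = rng.normal(loc=2.0, scale=1.5, size=(32, 50))
    mu_run, var_run = batchnorm_update_running(batch, mu_run, var_run)

single = rng.normal(loc=2.0, scale=1.5, size=(1, 50))
out = batchnorm_inference(single, mu_run, var_run, np.ones(50), np.zeros(50))
```

After enough batches, the running mean converges toward the true feature mean, and inference no longer requires a full batch.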

LayerNorm

Normalizes each feature based on the statistics of the feature itself.

Examples within a batch no longer interact with each other. Given a batch of $bs = 32$ examples with feature dim $d = 50$, compute the mean and stddev within each example, over the feature dim $d = 50$.

LayerNorm stats for the batch: $\mu, \sigma \in \mathbb{R}^{32}$.

Using this, LayerNorm computes the output $y$:

$$
\begin{align}
z_i &= \frac{x_i - \mu^{(n)}}{\sigma^{(n)}} \\
y_i &= \gamma_i z_i + \beta_i
\end{align}
$$

$\mu^{(n)}$ and $\sigma^{(n)}$ simply refer to the scalar values corresponding to the $n$-th example. $\gamma, \beta \in \mathbb{R}^{d}$ are the parameters to learn.
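The contrast with BatchNorm shows up directly in the reduction axis: normalize over features (axis 1) instead of the batch (axis 0). A minimal sketch (function name and `eps` are illustrative):

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """Minimal LayerNorm forward pass.

    x: (bs, d) batch; gamma, beta: (d,) learnable scale and shift.
    Stats are scalars per example, computed over the feature axis,
    so examples in the batch never interact.
    """
    mu = x.mean(axis=1, keepdims=True)    # (bs, 1) per-example mean
    sigma = x.std(axis=1, keepdims=True)  # (bs, 1) per-example stddev
    z = (x - mu) / (sigma + eps)          # normalize within each example
    return gamma * z + beta

x = np.random.randn(32, 50)
y = layernorm_forward(x, gamma=np.ones(50), beta=np.zeros(50))
```

Because each example is normalized using only its own statistics, the output for one example is identical whether it is processed alone or inside a batch, so no running stats are needed at inference time.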