Consistent Motivation:

We want to avoid vanishing / exploding gradients.

Initialization

Xavier initialization
The goal: the variance of the activations is the same in every layer: Var(a^{[l-1]}) = Var(a^{[l]})
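How to achieve this: sample each weight from a zero-mean Gaussian whose variance is the inverse of the previous layer's width:

W_{i,j}^{[l]} \sim \mathcal{N}(0, \frac{1}{n^{[l-1]}})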
• l: layer index
• n^{[l]}: width of the layer (dimension)
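
As a quick empirical check of the rule above, here is a minimal sketch in plain Python. It makes simplifying assumptions that are not in the notes: the layers are purely linear (no tanh/sigmoid, no bias), the inputs have unit variance, and the helper name `xavier_linear_stack` is made up for illustration.

```python
import math
import random
import statistics

def xavier_linear_stack(widths, batch=200, seed=0):
    """Push unit-variance inputs through linear layers with
    W_ij ~ N(0, 1 / n_in) and record the activation variance after each layer."""
    rng = random.Random(seed)
    # a[b][j]: a batch of inputs drawn from N(0, 1)
    a = [[rng.gauss(0.0, 1.0) for _ in range(widths[0])] for _ in range(batch)]
    variances = [statistics.pvariance([v for row in a for v in row])]
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        # gauss() takes a standard deviation; the formula specifies a variance
        std = math.sqrt(1.0 / n_in)
        W = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
        # a <- a W^T, computed row by row
        a = [[sum(w * x for w, x in zip(w_row, row)) for w_row in W] for row in a]
        variances.append(statistics.pvariance([v for row in a for v in row]))
    return variances

variances = xavier_linear_stack([64, 64, 64, 64])
print([round(v, 2) for v in variances])
```

The printed variances should all stay close to 1. Replacing `1.0 / n_in` with a larger or smaller value makes them grow or shrink from layer to layer.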

It works well with tanh or sigmoid activation functions.
For ReLU, we should check out Kaiming (He) initialization, which uses variance \frac{2}{n^{[l-1]}} instead.

Intuition:
We can derive Xavier initialization by enforcing Var(a^{[l-1]}) = Var(a^{[l]}).
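
A sketch of the derivation, under assumptions that these notes do not state explicitly: the pre-activation is z_i^{[l]} = \sum_j W_{i,j}^{[l]} a_j^{[l-1]} (no bias; the symbol z is introduced here), the weights and activations are independent with zero mean, and the activation function is roughly the identity near zero (as for tanh), so that Var(a^{[l]}) \approx Var(z^{[l]}).

\begin{align*}
\mathrm{Var}(z_i^{[l]}) &= \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}(W_{i,j}^{[l]} a_j^{[l-1]}) \\
&= \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}(W_{i,j}^{[l]}) \, \mathrm{Var}(a_j^{[l-1]}) \\
&= n^{[l-1]} \, \mathrm{Var}(W^{[l]}) \, \mathrm{Var}(a^{[l-1]})
\end{align*}

Requiring \mathrm{Var}(z^{[l]}) = \mathrm{Var}(a^{[l-1]}) then gives \mathrm{Var}(W^{[l]}) = \frac{1}{n^{[l-1]}}.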

Norm layers

Batch Normalization

\begin{align*} \hat{\mu}[j] &= \frac{1}{B}\sum_b x[b, j] \\ \hat{\sigma}[j] &= \sqrt{\frac{1}{B-1} \sum_b (x[b, j] - \hat{\mu}[j])^2} \\ \tilde{x}[b, j] &= \frac{x[b, j] -\hat{\mu}[j]}{\hat{\sigma}[j]} \end{align*}
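
The three formulas above can be sketched in plain Python as follows. This is only an illustration: the function name `batch_norm_normalize` is made up, `x` is a list of rows `x[b][j]`, and no small epsilon is added to the denominator (practical implementations usually add one for numerical stability). `statistics.stdev` uses the same 1/(B-1) normalization as the formula, so it needs B >= 2.

```python
import statistics

def batch_norm_normalize(x):
    """Map x[b][j] to x_tilde[b][j] using per-feature batch statistics."""
    columns = list(zip(*x))                              # columns[j] holds x[:, j]
    mu = [statistics.mean(col) for col in columns]       # mu_hat[j]
    sigma = [statistics.stdev(col) for col in columns]   # sigma_hat[j], 1/(B-1)
    x_tilde = [[(v - m) / s for v, m, s in zip(row, mu, sigma)] for row in x]
    return x_tilde, mu, sigma

x = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]   # B = 3 samples, 2 features
x_tilde, mu, sigma = batch_norm_normalize(x)
print(mu, sigma)
print(x_tilde)
```

After normalization, every feature column of `x_tilde` has mean 0 and sample standard deviation 1, regardless of the original scale of that feature.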

At test time, a fixed estimate of \mu[j] and \sigma[j] can be used.
(Typically, the BN layer keeps a running average of \hat{\mu}[j] and \hat{\sigma}[j] during training and uses those values at inference. This is consistent with the "Inference" part of the Wikipedia page.)
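
A minimal sketch of such running estimates, assuming an exponential moving average with a `momentum` of 0.1. The class `RunningBatchStats`, its method names, and the momentum value are hypothetical and chosen only for illustration; libraries often track the running variance rather than \sigma itself.

```python
import statistics

class RunningBatchStats:
    """Hypothetical helper: exponential moving averages of the batch statistics."""

    def __init__(self, num_features, momentum=0.1):
        self.momentum = momentum              # assumed update rate, not from the notes
        self.mu = [0.0] * num_features
        self.sigma = [1.0] * num_features

    def update(self, x):
        """Training step: fold the statistics of batch x[b][j] into the averages."""
        m = self.momentum
        for j, col in enumerate(zip(*x)):
            self.mu[j] = (1 - m) * self.mu[j] + m * statistics.mean(col)
            self.sigma[j] = (1 - m) * self.sigma[j] + m * statistics.stdev(col)

    def normalize(self, x):
        """Inference: use the stored estimates, so any batch size works."""
        return [[(v - mu) / s for v, mu, s in zip(row, self.mu, self.sigma)]
                for row in x]

stats = RunningBatchStats(num_features=1)
for _ in range(200):
    stats.update([[4.0], [6.0], [8.0]])       # batch mean 6, sample std 2
single = stats.normalize([[8.0]])             # a batch of size 1 at inference
print(stats.mu, stats.sigma, single)
```

Because `normalize` never looks at the statistics of the batch it receives, its output for a given sample does not depend on what else is in the batch.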
Affine transformation

x'[b, j] = \gamma[j]\tilde{x}[b, j] + \beta[j]

The BN layer (normalization followed by this affine transformation) is typically placed right before a nonlinear function.

Hmmm, I still don't see why changing the batch size at test time completely changes the performance. For example, when I trained a U-Net with ResNet blocks (which include BN layers) with batch size 256 and tested it on streaming data (batch size = 1), the performance plummeted, whereas with the same batch size as in training it worked almost perfectly.

Also, funnily, the behavior differed depending on the head: the segmentation heads worked somewhat consistently, whereas a scalar prediction head was completely broken (always predicting the middle value 0.5).
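
One possible explanation, sketched below, applies only if the BN layers were still normalizing with the statistics of the current batch at test time; the notes do not say whether that was the case. With B = 1, the batch mean of each feature equals the single sample, so x[b, j] - \hat{\mu}[j] = 0 and the normalized output no longer depends on the input (the 1/(B-1) formula is even undefined). The sketch therefore uses the biased 1/B variance plus a small `eps`, and a hypothetical scalar head that applies a sigmoid to the sum of the normalized features; both are assumptions made for illustration.

```python
import math

def bn_with_batch_stats(x, eps=1e-5):
    """Normalize x[b][j] with statistics of the current batch.
    Uses the biased variance (1/B) plus eps so that B = 1 is defined."""
    B = len(x)
    out_cols = []
    for col in zip(*x):
        mu = sum(col) / B
        var = sum((v - mu) ** 2 for v in col) / B
        out_cols.append([(v - mu) / math.sqrt(var + eps) for v in col])
    return [list(row) for row in zip(*out_cols)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two very different single-sample "batches" are mapped to the same output.
out_a = bn_with_batch_stats([[-3.0, 0.5]])
out_b = bn_with_batch_stats([[100.0, 7.0]])
pred_a = sigmoid(sum(out_a[0]))               # hypothetical scalar head
pred_b = sigmoid(sum(out_b[0]))
print(out_a, out_b, pred_a, pred_b)
```

In this toy setting the scalar head always outputs 0.5, which at least resembles the observation above. In convolutional BN layers the statistics are also taken over spatial positions, so a batch of size 1 still has some spread to normalize, which might be why the segmentation heads degraded less.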