Normalization Layers
Consistent Motivation:
We want to avoid vanishing / exploding gradients.
Initialization
Links
- DeepLearning.AI interactive note
- CS230 - Stanford
- TTIC-31230 Foundations of Deep Learning -- Trainability (Slide 7)
- Xavier Initialization original paper
Xavier initialization
The goal: the variance of the activations stays the same across every layer:
Var(a^{[l]}) = Var(a^{[l-1]})
How to achieve this: initialize the weights of layer l with
Var(W^{[l]}) = 1 / n^{[l-1]}   (or 2 / (n^{[l-1]} + n^{[l]}) in the original Glorot & Bengio formulation)
l: layer index, n^{[l]}: width of the layer (dimension)
It works well with tanh or sigmoid activation functions.
For ReLU, we should check out Kaiming Initialization instead.
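As a quick sanity check, here is a minimal NumPy sketch (my own illustration, not from the linked notes) that stacks tanh layers and compares the activation variance under a naive small-constant initialization versus Var(W) = 1/n^{[l-1]}; the width, depth, and standard deviations are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20                    # layer width n^[l] and number of layers (arbitrary)
x = rng.normal(size=(1000, n))        # a batch of unit-variance inputs

def final_activation_variance(weight_std):
    """Push the batch through `depth` tanh layers and return the last layer's activation variance."""
    a = x
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(n, n))
        a = np.tanh(a @ W)
    return a.var()

print("naive  std=0.01      :", final_activation_variance(0.01))             # collapses toward 0
print("Xavier std=sqrt(1/n) :", final_activation_variance(np.sqrt(1.0 / n)))  # stays many orders of magnitude larger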
Intuition:
We can derive Xavier initialization by enforcing Var(a^{[l]}) = Var(a^{[l-1]}) at every layer, assuming the activation is roughly linear around 0.
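Derivation sketch (the standard argument from the Glorot & Bengio paper linked above, assuming i.i.d. zero-mean weights and zero-mean inputs):

Var(z^{[l]}) = n^{[l-1]} · Var(W^{[l]}) · Var(a^{[l-1]})

Setting Var(a^{[l]}) ≈ Var(z^{[l]}) = Var(a^{[l-1]}) forces n^{[l-1]} · Var(W^{[l]}) = 1, i.e. Var(W^{[l]}) = 1 / n^{[l-1]}. Repeating the argument for the backward pass gives Var(W^{[l]}) = 1 / n^{[l]}, and averaging the two conditions yields the Glorot form 2 / (n^{[l-1]} + n^{[l]}).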
Norm layers
Batch Normalization
At test time, a fixed estimate of the mean μ and variance σ² is used to compute x̂ = (x − μ) / sqrt(σ² + ε), instead of the statistics of the current batch.
(Typically, the BN layer keeps a running average of the per-batch μ and σ² during training and uses those fixed values at test time.)
Affine transformation: the normalized x̂ is then scaled and shifted by learnable parameters, y = γ·x̂ + β.
The BN layer is typically applied right before a nonlinear function.
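A minimal from-scratch sketch of the forward pass (my own illustration; names like `momentum` and `eps` follow common conventions rather than any specific library):

```python
import numpy as np

class BatchNorm1D:
    """Batch normalization over a (batch, features) array."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)     # learnable scale of the affine transformation
        self.beta = np.zeros(num_features)     # learnable shift of the affine transformation
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training):
        if training:
            # Use the statistics of the current mini-batch and update the running estimates.
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # At test time, use the fixed running estimates instead of the batch statistics.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta   # affine transformation, usually followed by a nonlinearity
```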
Hmmm, I still don't see why changing the batch size at test time completely changes the performance. For example, when I trained a U-Net with ResNet blocks (which include BN layers) with batch size 256 and tested it on streaming data (batch size = 1), the performance completely plummeted, whereas if I set the test batch size to the same value as during training, it worked almost perfectly.
Also, funnily, the behavior differed between heads: the segmentation head worked somewhat consistently, whereas a scalar prediction head was completely broken (always predicting the middle value, 0.5).
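One mechanism that can produce this kind of symptom (a guess, not a diagnosis of that particular U-Net run): if the BN layers still use per-batch statistics at test time, e.g. because the model is left in training mode in PyTorch, then a batch of size 1 is normalized against itself, which throws away information a scalar prediction head may depend on. A small PyTorch sketch of the difference:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(num_features=3)

# Let the running statistics converge on "training" data with mean ~5, std ~2.
bn.train()
for _ in range(200):
    bn(torch.randn(256, 3, 8, 8) * 2.0 + 5.0)

sample = torch.randn(1, 3, 8, 8) * 2.0 + 9.0  # a single streaming sample, shifted away from the training mean

bn.eval()                                     # running statistics: the shift relative to training data survives
out_running_stats = bn(sample)
bn.train()                                    # per-batch statistics: the sample is normalized against itself
out_batch_stats = bn(sample)

print(out_running_stats.mean().item())        # roughly (9 - 5) / 2 = 2
print(out_batch_stats.mean().item())          # ~0, regardless of the shift
```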