Regularization
BatchNorm
Normalizes each feature based on the batch statistics
Given a batch of $N$ examples $x \in \mathbb{R}^{N \times D}$ with feature dim $D$, compute the mean and stddev over the $N$ examples, per feature.
BatchNorm stats: $\mu_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}$, $\sigma_j = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_{ij} - \mu_j)^2}$, so $\mu, \sigma \in \mathbb{R}^{D}$.
Using this, the BatchNorm layer computes the output $y \in \mathbb{R}^{N \times D}$:
$y_{ij} = \gamma_j \frac{x_{ij} - \mu_j}{\sigma_j} + \beta_j$
$\gamma, \beta \in \mathbb{R}^{D}$ are the parameters to learn.
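A minimal numpy sketch of the training-time forward pass described above. The function name is my own; the small $\epsilon$ added inside the square root is a standard numerical-stability trick, not part of the formula as written:

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """x: (N, D). Normalize each of the D features over the N examples."""
    mu = x.mean(axis=0)                    # (D,) per-feature mean
    var = x.var(axis=0)                    # (D,) per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardized features
    return gamma * x_hat + beta            # learned scale and shift
```

With $\gamma = 1$ and $\beta = 0$, each output feature has (approximately) zero mean and unit stddev across the batch.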
Batch stats during inference
BatchNorm's output depends on the batch statistics, which is inconvenient for flexible inference (e.g., serving one example at a time). To resolve this, it maintains running stats of the batch mean and stddev throughout training and uses them at inference time, instead of computing the batch stats on the fly.
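A sketch of the running-stats mechanism, assuming an exponential moving average with a momentum hyperparameter (the class name and momentum value are illustrative; frameworks differ in the exact update rule):

```python
import numpy as np

class BatchNorm:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.eps, self.momentum = eps, momentum

    def __call__(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # exponential moving average of the batch stats
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # inference: use the running stats, not the current batch
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

At inference time the output for an example no longer depends on what else is in the batch, so even a batch of size 1 is handled deterministically.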
LayerNorm
Normalizes each feature based on the statistics of the feature itself.
Examples within a batch don't interact with each other anymore. Given a batch of $N$ examples $x \in \mathbb{R}^{N \times D}$ with feature dim $D$, compute the mean and stddev within each example, over the feature dim $D$.
LayerNorm stats for the batch: $\mu_i = \frac{1}{D} \sum_{j=1}^{D} x_{ij}$, $\sigma_i = \sqrt{\frac{1}{D} \sum_{j=1}^{D} (x_{ij} - \mu_i)^2}$, so $\mu, \sigma \in \mathbb{R}^{N}$.
Using this, LayerNorm computes the output $y \in \mathbb{R}^{N \times D}$:
$y_{ij} = \gamma_j \frac{x_{ij} - \mu_i}{\sigma_i} + \beta_j$
$\mu_i$ or $\sigma_i$ refers to the scalar value corresponding to the $i$-th example. $\gamma, \beta \in \mathbb{R}^{D}$ are the parameters to learn.
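A numpy sketch of LayerNorm under the same assumptions as before (illustrative function name, $\epsilon$ added for numerical stability). Note the only change from BatchNorm is the axis the stats are computed over:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """x: (N, D). Normalize each example over its own D features."""
    mu = x.mean(axis=1, keepdims=True)     # (N, 1) per-example mean
    var = x.var(axis=1, keepdims=True)     # (N, 1) per-example variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardized per example
    return gamma * x_hat + beta            # learned per-feature scale/shift
```

Because the stats are per-example, changing one row of the batch leaves every other row's output unchanged, and no running stats are needed at inference.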