Overview
Direct Answer
Batch normalisation is a technique that rescales and recentres the inputs to each layer during neural network training by normalising activations across a mini-batch. This approach reduces internal covariate shift—the phenomenon where the distribution of layer inputs changes during training—thereby enabling faster convergence and improved stability.
How It Works
For each mini-batch, the technique computes the mean and variance of activations across training samples, then standardises the activations using z-score normalisation. Learnable scale and shift parameters (gamma and beta) are then applied per feature, allowing the network to recover expressivity. During inference, a running estimate of population statistics computed from training batches replaces the mini-batch statistics.
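The forward pass described above can be sketched as follows. This is a minimal NumPy illustration, not a reference implementation; the function name, the exponential-moving-average momentum convention, and the epsilon value are assumptions for the sketch.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       momentum=0.1, eps=1e-5, training=True):
    """Batch normalisation for activations of shape (batch, features)."""
    if training:
        # Per-feature statistics computed across the mini-batch.
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Running estimates of the population statistics, used at inference.
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # Inference: mini-batch statistics are replaced by running estimates.
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)   # z-score normalisation
    # Learnable per-feature scale (gamma) and shift (beta) restore expressivity.
    return gamma * x_hat + beta, running_mean, running_var

# Usage: a batch whose features are far from zero mean / unit variance.
x = np.random.randn(32, 4) * 3.0 + 5.0
gamma, beta = np.ones(4), np.zeros(4)
out, rm, rv = batch_norm_forward(x, gamma, beta, np.zeros(4), np.ones(4))
# With gamma = 1 and beta = 0, each feature of `out` has (near) zero mean
# and unit variance over the batch.
```

Note that with gamma and beta learnable, the network can undo the normalisation entirely if that is optimal, which is why expressivity is not lost.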
Why It Matters
Normalisation dramatically accelerates training convergence, reduces sensitivity to weight initialisation, and enables use of higher learning rates, directly reducing time-to-deployment and computational cost. Organisations deploying deep learning systems benefit from improved model stability and generalisation performance, particularly when training on large datasets.
Common Applications
The technique is standard in convolutional neural networks for image classification, object detection, and broader computer vision pipelines. In recurrent architectures and transformer-based language models, by contrast, the related technique of layer normalisation is typically used instead, since it stabilises training of very deep networks without depending on batch statistics.
Key Considerations
Batch normalisation introduces a dependency on batch size; very small batches produce unreliable statistics whilst very large batches reduce computational efficiency. The distinction between training and inference behaviour requires careful implementation, and layer normalisation or group normalisation may be preferable in certain contexts such as recurrent networks or variable-batch settings.
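The contrast with layer normalisation mentioned above comes down to the axis over which statistics are computed. A minimal NumPy sketch (the array shapes and epsilon are illustrative assumptions):

```python
import numpy as np

x = np.random.randn(8, 16)  # (batch, features)
eps = 1e-5

# Batch norm: per-feature statistics across the batch (batch-size dependent).
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer norm: per-sample statistics across features (batch-size independent).
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

# A single sample normalised in isolation matches its row of `ln`,
# which is why layer norm suits variable-batch and recurrent settings.
s = x[:1]
single = (s - s.mean(axis=1, keepdims=True)) / np.sqrt(s.var(axis=1, keepdims=True) + eps)
assert np.allclose(ln[:1], single)
```

The same independence argument applies to group normalisation, which computes statistics over subsets of channels rather than over the batch.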
Cross-References
More in Deep Learning
Representation Learning
Architectures: The automatic discovery of data representations needed for feature detection or classification from raw data.
Vision Transformer
Architectures: A transformer architecture adapted for image recognition that divides images into patches and processes them as sequences, rivalling convolutional networks in visual tasks.
Fine-Tuning
Architectures: The process of taking a pretrained model and further training it on a smaller, task-specific dataset.
Residual Network
Training & Optimisation: A deep neural network architecture using skip connections that allow gradients to flow directly through layers, enabling very deep networks.
Data Parallelism
Architectures: A distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.
Model Parallelism
Architectures: A distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.
Rotary Positional Encoding
Training & Optimisation: A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.
State Space Model
Architectures: A sequence modelling architecture based on continuous-time dynamical systems that processes long sequences with linear complexity, offering an alternative to attention-based transformers.