Overview
Direct Answer
Layer normalisation is a technique that normalises the activations of a neural network by computing statistics (mean and variance) across the feature dimension for each individual sample, independent of batch composition. This differs from batch normalisation, which computes statistics across the batch dimension for each feature and therefore depends on which samples happen to share a batch.
How It Works
For each sample, the algorithm computes the mean and standard deviation across all features in a given layer, standardises the activations by subtracting the mean and dividing by the standard deviation, then rescales the result using learnable affine parameters (gain and bias). Because the normalisation is applied independently to each sample, it is invariant to batch size and batch composition. This approach is particularly effective in recurrent and transformer architectures, where temporal or sequential dependencies exist within samples.
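The computation above can be sketched in a few lines of numpy; the function name, `eps` value, and shapes are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalise each sample across its feature dimension.

    x: activations of shape (batch, features).
    gain, bias: learnable affine parameters of shape (features,).
    eps: small constant for numerical stability (assumed value).
    """
    mean = x.mean(axis=-1, keepdims=True)      # per-sample mean
    var = x.var(axis=-1, keepdims=True)        # per-sample variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # standardised activations
    return gain * x_hat + bias                 # learnable rescale and shift

x = np.random.randn(4, 8)
y = layer_norm(x, np.ones(8), np.zeros(8))
```

With unit gain and zero bias, each row of `y` has (approximately) zero mean and unit standard deviation, regardless of the other rows in the batch.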
Why It Matters
Layer normalisation stabilises training in models where batch statistics are unreliable or unavailable, notably recurrent neural networks, sequence-to-sequence models, and transformer-based architectures. It improves convergence speed, reduces sensitivity to initialisation, and behaves consistently across variable batch sizes, which makes it well suited to production systems where batch composition cannot be controlled.
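The batch-size invariance can be verified directly: a sample's normalised output is identical whether it is processed alone or inside a larger batch. This is a minimal self-contained check (the compact `ln` helper assumes unit gain and zero bias):

```python
import numpy as np

def ln(x, eps=1e-5):
    # Plain layer norm (unit gain, zero bias) for the comparison below.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
a = rng.normal(size=(1, 16))                            # a single sample
batch = np.concatenate([a, rng.normal(size=(7, 16))])   # same sample in a batch of 8

# The sample's output does not depend on the rest of the batch.
print(np.allclose(ln(a)[0], ln(batch)[0]))  # True
```

Batch normalisation fails this test: its per-feature statistics change whenever the batch contents change, which is why it is awkward for variable-length sequences and small or single-sample batches.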
Common Applications
This technique is foundational in transformer models used for natural language processing, machine translation, and large language models. It is also employed in recurrent architectures for time-series forecasting, speech recognition systems, and reinforcement learning agents where batch normalisation is impractical or ineffective.
Key Considerations
Layer normalisation introduces additional computational overhead per sample and may be less effective than batch normalisation in fully-connected feedforward networks where batch statistics are stable. Performance characteristics vary significantly depending on architecture choice and problem domain, requiring empirical validation during model development.