Overview
Direct Answer
Gradient clipping is a training-stabilisation technique that constrains the magnitude of gradients during backpropagation, preventing them from growing without bound. By capping gradients at a specified threshold, it stabilises training in deep networks prone to exploding gradients.
How It Works
During each backpropagation pass, gradients are computed through the network layers. If the gradient's norm or its individual values exceed a predefined threshold, the gradient is adjusted to remain within bounds. Common approaches include L2-norm clipping, which rescales the whole gradient vector proportionally while preserving its direction, and element-wise (value) clipping, which bounds each component independently and may change the direction.
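The two approaches above can be sketched in plain Python. The function names `clip_by_norm` and `clip_by_value` are illustrative, not from any particular library; frameworks such as PyTorch and TensorFlow provide equivalent utilities.

```python
import math

def clip_by_norm(grad, max_norm):
    """L2-norm clipping: rescale the whole gradient vector if its norm
    exceeds max_norm, preserving its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)

def clip_by_value(grad, bound):
    """Element-wise clipping: cap each component to [-bound, bound]
    independently, which can change the vector's direction."""
    return [max(-bound, min(bound, g)) for g in grad]

g = [3.0, 4.0]                 # L2 norm = 5
print(clip_by_norm(g, 1.0))    # [0.6, 0.8] -- norm scaled to 1, direction kept
print(clip_by_value(g, 1.0))   # [1.0, 1.0] -- each component capped separately
```

Note how norm clipping shrinks both components in proportion, whereas value clipping distorts the update direction; this is why norm clipping is the more common default.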
Why It Matters
Exploding gradients destabilise training, cause numerical overflow, and degrade convergence—particularly in recurrent neural networks and very deep architectures. Clipping enables reliable training in these scenarios, reduces the need for workarounds such as overly conservative learning rates, and improves robustness across diverse initialisation schemes.
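A minimal toy demonstration of the effect: gradient descent on f(x) = x⁴, whose gradient 4x³ explodes far from the minimum. This is a deliberately simplified one-parameter sketch, not a neural network, but it shows the same failure mode and its remedy.

```python
def descend(x0, lr, steps, max_norm=None):
    """Gradient descent on f(x) = x**4. Without clipping, the cubic
    gradient 4*x**3 can overshoot and grow explosively; with max_norm
    set, the (scalar) gradient is capped before each update."""
    x = x0
    for _ in range(steps):
        g = 4 * x ** 3
        if max_norm is not None and abs(g) > max_norm:
            g = max_norm if g > 0 else -max_norm
        x -= lr * g
    return x

print(abs(descend(3.0, 0.1, 5)))                 # magnitude explodes within a few steps
print(abs(descend(3.0, 0.1, 50, max_norm=1.0)))  # stays bounded, approaches the minimum at 0
```

The unclipped run diverges because each overshoot produces an even larger gradient on the next step; clipping breaks that feedback loop at the cost of slower initial progress.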
Common Applications
The technique is standard in natural language processing models, particularly sequence-to-sequence architectures and transformers. It is also employed in reinforcement learning policy-gradient methods and when training recurrent models on variable-length sequences.
Key Considerations
Aggressive clipping thresholds may impede gradient flow and slow convergence, whilst lenient thresholds offer minimal protection. The optimal threshold is dataset- and architecture-dependent, requiring empirical tuning guided by monitoring of gradient-norm statistics during training.