Overview
Direct Answer
Gradient clipping is a training-stabilisation technique that constrains the magnitude of gradients during backpropagation, preventing them from growing without bound. By capping gradients at a specified threshold, it stabilises training in deep networks prone to exploding gradients.
How It Works
During each backpropagation pass, gradients are computed through the network layers. If the gradient's norm or its individual values exceed a predefined threshold, the gradient is adjusted to remain within bounds. Common approaches include L2-norm clipping, which rescales the whole gradient vector proportionally while preserving its direction, and element-wise (value) clipping, which bounds each component independently and may change the direction.
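The two approaches above can be sketched in plain Python. The function names `clip_by_norm` and `clip_by_value` are illustrative, not from any particular library; frameworks such as PyTorch and TensorFlow provide equivalent utilities.

```python
import math

def clip_by_norm(grad, max_norm):
    """L2-norm clipping: rescale the whole gradient vector if its norm
    exceeds max_norm, preserving its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)

def clip_by_value(grad, bound):
    """Element-wise clipping: cap each component to [-bound, bound]
    independently, which can change the vector's direction."""
    return [max(-bound, min(bound, g)) for g in grad]

g = [3.0, 4.0]                 # L2 norm = 5
print(clip_by_norm(g, 1.0))    # [0.6, 0.8] -- norm scaled to 1, direction kept
print(clip_by_value(g, 1.0))   # [1.0, 1.0] -- each component capped separately
```

Note how norm clipping shrinks both components in proportion, whereas value clipping distorts the update direction; this is why norm clipping is the more common default.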
Why It Matters
Exploding gradients destabilise training, cause numerical overflow, and degrade convergence—particularly in recurrent neural networks and very deep architectures. Clipping enables reliable training in these scenarios, reduces the need for workarounds such as overly conservative learning rates, and improves robustness across diverse initialisation schemes.
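A minimal toy demonstration of the effect: gradient descent on f(x) = x⁴, whose gradient 4x³ explodes far from the minimum. This is a deliberately simplified one-parameter sketch, not a neural network, but it shows the same failure mode and its remedy.

```python
def descend(x0, lr, steps, max_norm=None):
    """Gradient descent on f(x) = x**4. Without clipping, the cubic
    gradient 4*x**3 can overshoot and grow explosively; with max_norm
    set, the (scalar) gradient is capped before each update."""
    x = x0
    for _ in range(steps):
        g = 4 * x ** 3
        if max_norm is not None and abs(g) > max_norm:
            g = max_norm if g > 0 else -max_norm
        x -= lr * g
    return x

print(abs(descend(3.0, 0.1, 5)))                 # magnitude explodes within a few steps
print(abs(descend(3.0, 0.1, 50, max_norm=1.0)))  # stays bounded, approaches the minimum at 0
```

The unclipped run diverges because each overshoot produces an even larger gradient on the next step; clipping breaks that feedback loop at the cost of slower initial progress.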
Common Applications
The technique is standard in natural language processing models, particularly sequence-to-sequence architectures and transformers. It is also employed in reinforcement learning policy-gradient methods and when training recurrent models on variable-length sequences.
Key Considerations
Aggressive clipping thresholds may impede gradient flow and slow convergence, whilst lenient thresholds offer minimal protection. The optimal threshold is dataset- and architecture-dependent, requiring empirical tuning guided by monitoring of gradient-norm statistics during training.