Overview
Direct Answer
Exploding gradient is a numerical instability during backpropagation in which gradients accumulate multiplicatively across layers, reaching excessively large values that cause weight updates to overshoot optimal parameters and destabilise training. It is the counterpart of the vanishing gradient problem (where repeated multiplication shrinks gradients toward zero) and occurs most frequently in recurrent neural networks and very deep feedforward architectures.
How It Works
During backpropagation, gradients are computed via the chain rule by multiplying partial derivatives across layers. When the per-layer factors (the products of weight matrices and activation-function derivatives) repeatedly have magnitude greater than one, successive multiplications produce exponentially growing gradient magnitudes. In recurrent networks, unrolling across many timesteps multiplies by the same weight matrix at every step, which amplifies this effect and leads to NaN or Inf values in weight updates that render the model untrainable within a few iterations.
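The multiplicative accumulation above can be illustrated with a toy sketch (the per-layer factor and layer counts here are hypothetical numbers chosen for illustration, not measurements from any real network):

```python
def gradient_magnitude(per_layer_factor: float, num_layers: int) -> float:
    """Magnitude of a gradient after chaining `num_layers` local derivatives,
    each with the same magnitude `per_layer_factor` (a simplification)."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= per_layer_factor  # chain rule: multiply one factor per layer
    return grad

# A factor only slightly above 1 already explodes over 100 layers/timesteps:
print(gradient_magnitude(1.5, 10))    # ~5.8e1
print(gradient_magnitude(1.5, 100))   # ~4.1e17
# while a factor below 1 vanishes instead:
print(gradient_magnitude(0.9, 100))   # ~2.7e-5
```

This is why unrolled recurrent networks are especially vulnerable: the number of multiplications equals the number of timesteps, so even a modest per-step factor compounds rapidly.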
Why It Matters
Training instability directly increases computational cost through failed training runs and necessitates careful hyperparameter selection. In production pipelines, unstable training reduces model reliability and increases time-to-deployment for sequential architectures used in natural language processing and time-series forecasting, where recurrence is fundamental.
Common Applications
The problem is prevalent in long short-term memory networks, gated recurrent units, and multi-layer perceptrons exceeding 10–20 layers. Applications include machine translation, speech recognition, and financial forecasting where temporal dependencies require deep or recurrent architectures.
Key Considerations
Gradient clipping and normalisation techniques (such as batch or layer normalisation) mitigate the issue but introduce hyperparameter tuning overhead. Severity also depends on the weight initialisation strategy and activation function choice, so practitioners must balance architectural expressiveness against training stability.
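Gradient clipping by global norm, the most common mitigation, can be sketched in a few lines (a minimal stdlib illustration treating the gradients as a flat list of floats; real frameworks such as PyTorch apply the same rescaling across parameter tensors):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most `max_norm`.

    Clipping preserves the gradient's direction and only shrinks its
    length, which keeps the update step bounded when gradients explode.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)          # within budget: leave untouched
    scale = max_norm / total_norm   # shrink factor, applied uniformly
    return [g * scale for g in grads]

# An exploded gradient vector is rescaled to unit norm, direction intact:
clipped = clip_by_global_norm([3e8, 4e8], max_norm=1.0)
print(clipped)  # approximately [0.6, 0.8]
```

The threshold `max_norm` is the tuning overhead mentioned above: too low and useful gradient signal is discarded, too high and the clipping never engages.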