Overview
Direct Answer
Exploding gradient is a numerical instability during backpropagation in which gradients accumulate multiplicatively across layers, reaching excessively large values that cause weight updates to overshoot optimal parameters and destabilise training. It is the counterpart of the vanishing gradient problem (where repeated multiplication shrinks gradients toward zero) and occurs most frequently in recurrent neural networks and very deep feedforward architectures.
How It Works
During backpropagation, gradients are computed via the chain rule by multiplying partial derivatives across layers. When the per-layer factors (the products of weight matrices and activation-function derivatives) repeatedly have magnitude greater than one, successive multiplications produce exponentially growing gradient magnitudes. In recurrent networks, unrolling across many timesteps multiplies by the same weight matrix at every step, which amplifies this effect and leads to NaN or Inf values in weight updates that render the model untrainable within a few iterations.
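The multiplicative accumulation above can be illustrated with a toy sketch (the per-layer factor and layer counts here are hypothetical numbers chosen for illustration, not measurements from any real network):

```python
def gradient_magnitude(per_layer_factor: float, num_layers: int) -> float:
    """Magnitude of a gradient after chaining `num_layers` local derivatives,
    each with the same magnitude `per_layer_factor` (a simplification)."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= per_layer_factor  # chain rule: multiply one factor per layer
    return grad

# A factor only slightly above 1 already explodes over 100 layers/timesteps:
print(gradient_magnitude(1.5, 10))    # ~5.8e1
print(gradient_magnitude(1.5, 100))   # ~4.1e17
# while a factor below 1 vanishes instead:
print(gradient_magnitude(0.9, 100))   # ~2.7e-5
```

This is why unrolled recurrent networks are especially vulnerable: the number of multiplications equals the number of timesteps, so even a modest per-step factor compounds rapidly.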
Why It Matters
Training instability directly increases computational cost through failed training runs and necessitates careful hyperparameter selection. In production pipelines, unstable training reduces model reliability and increases time-to-deployment for sequential architectures used in natural language processing and time-series forecasting, where recurrence is fundamental.
Common Applications
The problem is prevalent in long short-term memory networks, gated recurrent units, and multi-layer perceptrons exceeding 10–20 layers. Applications include machine translation, speech recognition, and financial forecasting where temporal dependencies require deep or recurrent architectures.
Key Considerations
Gradient clipping and normalisation techniques (such as batch or layer normalisation) mitigate the issue but introduce hyperparameter tuning overhead. Severity also depends on the weight initialisation strategy and activation function choice, so practitioners must balance architectural expressiveness against training stability.
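Gradient clipping by global norm, the most common mitigation, can be sketched in a few lines (a minimal stdlib illustration treating the gradients as a flat list of floats; real frameworks such as PyTorch apply the same rescaling across parameter tensors):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most `max_norm`.

    Clipping preserves the gradient's direction and only shrinks its
    length, which keeps the update step bounded when gradients explode.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)          # within budget: leave untouched
    scale = max_norm / total_norm   # shrink factor, applied uniformly
    return [g * scale for g in grads]

# An exploded gradient vector is rescaled to unit norm, direction intact:
clipped = clip_by_global_norm([3e8, 4e8], max_norm=1.0)
print(clipped)  # approximately [0.6, 0.8]
```

The threshold `max_norm` is the tuning overhead mentioned above: too low and useful gradient signal is discarded, too high and the clipping never engages.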