Overview
Direct Answer
Vanishing gradient is a training pathology in deep neural networks where gradients computed during backpropagation shrink exponentially as they propagate backwards through layers, approaching zero and effectively halting weight updates in earlier layers. This prevents shallow layers from learning meaningful representations and is particularly acute in recurrent and very deep feedforward architectures.
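The exponential shrinkage can be made precise. A minimal sketch in standard notation (introduced here for illustration, not taken from the article): for a feedforward network with activations a_l = σ(z_l), z_l = W_l a_{l−1}, the gradient reaching the first layer is a product of per-layer Jacobians:

```latex
% Jacobian of layer l with respect to its input:
\frac{\partial a_l}{\partial a_{l-1}}
  = \operatorname{diag}\!\big(\sigma'(z_l)\big)\, W_l,
  \qquad z_l = W_l\, a_{l-1}

% The gradient reaching layer 1 is a product of L-1 such Jacobians:
\frac{\partial \mathcal{L}}{\partial a_1}
  = \left( \prod_{l=2}^{L} \frac{\partial a_l}{\partial a_{l-1}} \right)^{\!\top}
    \frac{\partial \mathcal{L}}{\partial a_L}
```

When each Jacobian has norm below one (the sigmoid derivative never exceeds 0.25), the norm of this product decays geometrically with depth L.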
How It Works
During backpropagation, the chain rule expresses each layer's gradient as a product of the derivatives of all subsequent layers. Saturating activation functions such as sigmoid and tanh have small derivatives over most of their range (the sigmoid derivative never exceeds 0.25), so multiplying many such factors together drives the result towards zero. In recurrent networks, the same weight matrix is applied repeatedly across time steps, compounding this attenuation and leaving the network unable to learn dependencies that span distant time steps.
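The compounding effect can be sketched numerically. This toy example (an assumption for illustration, not any particular framework's implementation) multiplies the sigmoid derivative at its maximum through a chain of layers, which is the *best* case for sigmoid:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25, attained at x = 0

def backprop_gradient(depth, weight=1.0):
    """Propagate a gradient of 1.0 backwards through `depth` layers,
    multiplying by the sigmoid derivative at its maximum and a unit
    weight at each layer -- the most optimistic case for sigmoid."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight * sigmoid_grad(0.0)
    return grad

print(backprop_gradient(10))  # ~9.5e-07
print(backprop_gradient(50))  # ~7.9e-31
```

Even under these generous assumptions the gradient is effectively zero after a few dozen layers, which is why early layers stop learning.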
Why It Matters
Training convergence becomes prohibitively slow or stalls entirely, increasing computational cost and training time without improving accuracy. This directly limits the feasibility of training deeper architectures that could capture more complex patterns, capping model capacity and performance on tasks that require hierarchical feature learning.
Common Applications
Deep convolutional networks for image recognition, recurrent networks for sequence modelling in natural language processing and time-series forecasting, and encoder-decoder architectures for machine translation and speech recognition suffer most acutely from this problem.
Key Considerations
Modern mitigation techniques, including ReLU activation functions, batch normalisation, residual connections, and careful weight initialisation, have substantially reduced its prevalence; gradient clipping addresses the closely related exploding-gradient problem. The underlying issue nonetheless remains relevant for architecture design and hyperparameter selection in very deep models.
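Two of these techniques can be sketched in a few lines. The example below (illustrative values and helper names are assumptions, not from any library) shows why ReLU's unit derivative keeps the gradient chain alive where sigmoid's does not, and sketches norm-based gradient clipping, which is more often used against the related exploding-gradient problem:

```python
# Sigmoid's derivative never exceeds 0.25; ReLU's is exactly 1
# for any positive input, so the derivative chain does not shrink
# for active units.
SIGMOID_GRAD_MAX = 0.25

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

depth = 30
sigmoid_chain = SIGMOID_GRAD_MAX ** depth  # vanishes: ~8.7e-19
relu_chain = relu_grad(1.0) ** depth       # stays exactly 1.0

def clip_by_norm(grads, max_norm):
    """Norm-based gradient clipping: rescale the gradient vector
    so its Euclidean norm never exceeds `max_norm`."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

print(clip_by_norm([3.0, 4.0], max_norm=1.0))  # rescaled to unit norm
```

Residual connections attack the same problem structurally: by adding the layer input to its output, they give the gradient an identity path around each layer's derivative.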
More in Deep Learning
Weight Decay (Architectures)
A regularisation technique that penalises large model weights during training by adding a fraction of the weight magnitude to the loss function, preventing overfitting.

Representation Learning (Architectures)
The automatic discovery of data representations needed for feature detection or classification from raw data.

Embedding (Architectures)
A learned dense vector representation of discrete data (like words or categories) in a continuous vector space.

Weight Initialisation (Architectures)
The strategy for setting initial parameter values in a neural network before training begins.

Mixed Precision Training (Training & Optimisation)
Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.

Mixture of Experts (Architectures)
An architecture where different specialised sub-networks (experts) are selectively activated based on the input.

Key-Value Cache (Architectures)
An optimisation in autoregressive transformer inference that stores previously computed key and value tensors to avoid redundant computation during sequential token generation.

Prefix Tuning (Language Models)
A parameter-efficient method that prepends trainable continuous vectors to the input of each transformer layer, guiding model behaviour without altering base parameters.