Deep LearningArchitectures

Vanishing Gradient

Overview

Direct Answer

Vanishing gradient is a training pathology in deep neural networks where gradients computed during backpropagation shrink exponentially as they propagate backwards through layers, approaching zero and effectively halting weight updates in earlier layers. This prevents shallow layers from learning meaningful representations and is particularly acute in recurrent and very deep feedforward architectures.

How It Works

During backpropagation, gradients are multiplied together across layers via the chain rule. When activation functions like sigmoid or tanh compress outputs to small ranges and have small derivatives, successive multiplications produce increasingly tiny values. In recurrent networks, the same weight matrix is applied repeatedly across time steps, compounding this attenuation effect and leaving parameters from distant time steps unable to adjust.

Why It Matters

Training convergence becomes prohibitively slow or stalls entirely, increasing computational cost and time-to-model without improving accuracy. This directly impacts feasibility of training deeper architectures that could capture more complex patterns, limiting model capacity and performance on tasks requiring hierarchical feature learning.

Common Applications

Deep convolutional networks for image recognition, recurrent networks for sequence modelling in natural language processing and time-series forecasting, and encoder-decoder architectures for machine translation and speech recognition suffer most acutely from this problem.

Key Considerations

Modern mitigation techniques including ReLU activation functions, batch normalisation, residual connections, and gradient clipping have substantially reduced prevalence, though the underlying issue remains relevant for architecture design and hyperparameter selection in very deep models.

Cross-References(1)

Machine Learning

More in Deep Learning

See Also