Overview
Direct Answer
Gradient checkpointing is a memory optimisation technique that reduces peak GPU memory consumption during neural network training by selectively discarding intermediate activations during the forward pass and recomputing them on-demand during backpropagation. This approach trades increased computational cost for substantially lower memory requirements, enabling training of larger models or larger batch sizes on fixed hardware.
How It Works
During the forward pass, designated checkpoint layers store only their input activations whilst discarding intermediate values. During backpropagation, the forward computation is re-executed for selected segments to regenerate the discarded activations needed for gradient calculation. This selective recomputation strategy—typically applied to deep residual or transformer architectures—reduces memory scaling from linear to sub-linear with network depth whilst introducing modest computational overhead.
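The store-boundaries-then-recompute strategy described above can be sketched in plain Python. This is an illustrative toy, not a framework implementation: it uses a hypothetical chain of identical scalar layers with a hand-written derivative, keeps only one activation per segment during the forward pass, and recomputes the activations inside each segment during the backward pass before applying the chain rule.

```python
import math

# Toy "layer": a differentiable scalar function with a known derivative.
# layer(x) = x + sin(x), so layer'(x) = 1 + cos(x). Purely illustrative.
def layer_fwd(x):
    return x + math.sin(x)

def layer_grad(x):
    return 1.0 + math.cos(x)

def checkpointed_backward(x0, n_layers, seg_size):
    """Run a chain of n_layers, storing only one activation per segment
    (the segment's input). The backward pass recomputes the discarded
    activations inside each segment before applying the chain rule."""
    # Forward pass: keep only segment-boundary activations.
    saved = {0: x0}
    x = x0
    for i in range(n_layers):
        x = layer_fwd(x)
        if (i + 1) % seg_size == 0 and (i + 1) < n_layers:
            saved[i + 1] = x
    output = x

    # Backward pass: walk the segments from last to first.
    grad = 1.0  # d(output)/d(output)
    for start in sorted(saved.keys(), reverse=True):
        end = min(start + seg_size, n_layers)
        # Recompute this segment's activations from its saved input.
        acts = [saved[start]]
        for i in range(start, end - 1):
            acts.append(layer_fwd(acts[-1]))
        # Chain rule through the segment, last layer first.
        for i in range(end - 1, start - 1, -1):
            grad *= layer_grad(acts[i - start])
    return output, grad
```

A real framework such as PyTorch exposes the same idea through `torch.utils.checkpoint`, where the recomputation is handled by re-running the wrapped module's forward function under autograd during the backward pass.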
Why It Matters
Training state-of-the-art large language models and vision transformers often exceeds available GPU memory. Checkpointing enables teams to train larger models, or use larger batch sizes, within existing infrastructure budgets, avoiding costly hardware upgrades. This is particularly valuable in resource-constrained environments where memory, rather than compute, is the binding constraint.
Common Applications
The technique is widely employed in training transformer models, large vision transformers, and deep convolutional networks where memory is the limiting factor. It is built into mainstream frameworks for large-scale model training, for example PyTorch's torch.utils.checkpoint module and DeepSpeed's activation checkpointing.
Key Considerations
The computational overhead typically ranges from 20–50% additional forward-pass computation, making the optimisation most effective when memory is the critical bottleneck rather than compute. Checkpoint granularity must be carefully selected to balance memory savings against recomputation cost; suboptimal choices can degrade wall-clock training speed despite reducing memory usage.
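The granularity trade-off above can be made concrete with back-of-envelope arithmetic. Assuming one unit of memory per stored activation, segment-wise checkpointing over n layers with segment size s must hold roughly n/s segment-boundary inputs plus the s activations recomputed inside the segment currently being backpropagated; this sum is minimised near s = sqrt(n), giving roughly 2*sqrt(n) stored activations instead of n. This sketch ignores per-layer differences in activation size, which a real granularity choice must account for.

```python
import math

def peak_stored(n_layers, seg_size):
    """Peak number of live activations under segment-wise checkpointing:
    one saved input per segment, plus the activations recomputed inside
    the segment currently being backpropagated."""
    n_segments = math.ceil(n_layers / seg_size)
    return n_segments + seg_size

n = 100
baseline = n  # storing every activation: memory linear in depth
best = min(range(1, n + 1), key=lambda s: peak_stored(n, s))
# For n = 100, segments of size ~sqrt(100) = 10 minimise peak memory:
# peak_stored(100, 10) = 10 + 10 = 20 activations, versus 100 without
# checkpointing, at the cost of roughly one extra forward pass.
```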