Overview
Direct Answer
Gradient checkpointing is a memory optimisation technique that reduces peak GPU memory consumption during neural network training by selectively discarding intermediate activations during the forward pass and recomputing them on-demand during backpropagation. This approach trades increased computational cost for substantially lower memory requirements, enabling training of larger models or larger batch sizes on fixed hardware.
How It Works
During the forward pass, designated checkpoint layers store only their input activations whilst discarding intermediate values. During backpropagation, the forward computation is re-executed for selected segments to regenerate the discarded activations needed for gradient calculation. This selective recomputation strategy—typically applied to deep residual or transformer architectures—reduces memory scaling from linear to sub-linear with network depth whilst introducing modest computational overhead.
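The store-boundaries-then-recompute strategy described above can be sketched in plain Python. This is an illustrative toy, not a framework implementation: it uses a hypothetical chain of identical scalar layers with a hand-written derivative, keeps only one activation per segment during the forward pass, and recomputes the activations inside each segment during the backward pass before applying the chain rule.

```python
import math

# Toy "layer": a differentiable scalar function with a known derivative.
# layer(x) = x + sin(x), so layer'(x) = 1 + cos(x). Purely illustrative.
def layer_fwd(x):
    return x + math.sin(x)

def layer_grad(x):
    return 1.0 + math.cos(x)

def checkpointed_backward(x0, n_layers, seg_size):
    """Run a chain of n_layers, storing only one activation per segment
    (the segment's input). The backward pass recomputes the discarded
    activations inside each segment before applying the chain rule."""
    # Forward pass: keep only segment-boundary activations.
    saved = {0: x0}
    x = x0
    for i in range(n_layers):
        x = layer_fwd(x)
        if (i + 1) % seg_size == 0 and (i + 1) < n_layers:
            saved[i + 1] = x
    output = x

    # Backward pass: walk the segments from last to first.
    grad = 1.0  # d(output)/d(output)
    for start in sorted(saved.keys(), reverse=True):
        end = min(start + seg_size, n_layers)
        # Recompute this segment's activations from its saved input.
        acts = [saved[start]]
        for i in range(start, end - 1):
            acts.append(layer_fwd(acts[-1]))
        # Chain rule through the segment, last layer first.
        for i in range(end - 1, start - 1, -1):
            grad *= layer_grad(acts[i - start])
    return output, grad
```

A real framework such as PyTorch exposes the same idea through `torch.utils.checkpoint`, where the recomputation is handled by re-running the wrapped module's forward function under autograd during the backward pass.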
Why It Matters
Training state-of-the-art large language models and vision transformers often exceeds available GPU memory. Checkpointing enables teams to train larger models, or use larger batch sizes, within existing infrastructure budgets, avoiding costly hardware upgrades. This is particularly valuable in resource-constrained environments where memory, rather than compute, is the binding constraint.
Common Applications
The technique is widely employed in training transformer models, large vision transformers, and deep convolutional networks where memory is the limiting factor. It is built into mainstream frameworks for large-scale model training, for example PyTorch's torch.utils.checkpoint module and DeepSpeed's activation checkpointing.
Key Considerations
The computational overhead typically ranges from 20–50% additional forward-pass computation, making the optimisation most effective when memory is the critical bottleneck rather than compute. Checkpoint granularity must be carefully selected to balance memory savings against recomputation cost; suboptimal choices can degrade wall-clock training speed despite reducing memory usage.
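The granularity trade-off above can be made concrete with back-of-envelope arithmetic. Assuming one unit of memory per stored activation, segment-wise checkpointing over n layers with segment size s must hold roughly n/s segment-boundary inputs plus the s activations recomputed inside the segment currently being backpropagated; this sum is minimised near s = sqrt(n), giving roughly 2*sqrt(n) stored activations instead of n. This sketch ignores per-layer differences in activation size, which a real granularity choice must account for.

```python
import math

def peak_stored(n_layers, seg_size):
    """Peak number of live activations under segment-wise checkpointing:
    one saved input per segment, plus the activations recomputed inside
    the segment currently being backpropagated."""
    n_segments = math.ceil(n_layers / seg_size)
    return n_segments + seg_size

n = 100
baseline = n  # storing every activation: memory linear in depth
best = min(range(1, n + 1), key=lambda s: peak_stored(n, s))
# For n = 100, segments of size ~sqrt(100) = 10 minimise peak memory:
# peak_stored(100, 10) = 10 + 10 = 20 activations, versus 100 without
# checkpointing, at the cost of roughly one extra forward pass.
```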