Overview
Direct Answer
Pipeline parallelism is a distributed training technique that partitions neural network layers across multiple devices and processes overlapping micro-batches through sequential stages to reduce idle time and maximise device utilisation. Unlike data parallelism, which replicates the full model across devices, this approach divides the model itself into stages that operate concurrently on different micro-batches.
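The layer-partitioning idea above can be sketched in a few lines. This is a minimal illustration, not a framework API: the model is represented as a plain list of layers, and a hypothetical `partition_layers` helper splits it into contiguous stages, one per device.

```python
def partition_layers(layers, num_stages):
    """Split a list of layers into num_stages contiguous stages.

    Earlier stages receive the remainder layers so sizes differ by at most one.
    """
    per_stage, rem = divmod(len(layers), num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = per_stage + (1 if s < rem else 0)
        stages.append(layers[start:start + size])
        start += size
    return stages

# A 10-layer model split across 4 devices: stage sizes 3, 3, 2, 2.
layers = [f"layer_{i}" for i in range(10)]
print(partition_layers(layers, 4))
```

In practice frameworks balance stages by estimated compute or memory cost rather than raw layer count, but the contiguity constraint is the same.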
How It Works
Each device holds a distinct set of consecutive layers, forming a pipeline stage. During forward propagation, micro-batch 1 advances through stage 2 while micro-batch 2 enters stage 1 and micro-batch 3 waits at the pipeline entrance. Backward propagation follows similarly, allowing devices to compute gradients while upstream stages process new data, thereby overlapping computation and communication to reduce bubble time—periods when devices remain idle waiting for dependencies.
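The staggered forward schedule described above can be simulated directly. The sketch below assumes a simple GPipe-style fill: at clock tick t, stage s processes micro-batch t − s if that micro-batch has reached it. The stage and micro-batch counts are illustrative.

```python
def forward_schedule(num_stages, num_microbatches):
    """Return, for each clock tick, a dict mapping stage -> micro-batch.

    Stages with no valid micro-batch at a tick are idle (the "bubble").
    """
    schedule = []
    for t in range(num_stages + num_microbatches - 1):
        step = {}
        for s in range(num_stages):
            mb = t - s
            if 0 <= mb < num_microbatches:
                step[s] = mb
        schedule.append(step)
    return schedule

# 3 stages, 4 micro-batches: the pipeline fills, runs full, then drains.
for t, step in enumerate(forward_schedule(3, 4)):
    busy = ", ".join(f"stage {s} -> mb {m}" for s, m in step.items())
    print(f"t={t}: {busy}")
```

At t=0 only stage 0 is busy; by t=2 all three stages work concurrently on different micro-batches, which is exactly the overlap the text describes.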
Why It Matters
This approach enables training of extremely large models that exceed single-device memory capacity, directly reducing training time and hardware costs for organisations developing large language models and vision transformers. It addresses the memory bottleneck that prevents scaling beyond device VRAM limits, making feasible the training of multi-billion-parameter systems that would otherwise require prohibitively expensive hardware.
Common Applications
Pipeline parallelism is widely deployed in large-scale language model training by research institutions and cloud providers. It is essential for transformer-based architectures with 10+ billion parameters, particularly in natural language processing and multimodal AI development where models exceed individual GPU or TPU memory constraints.
Key Considerations
Pipeline bubble—idle device time between forward and backward passes—remains a fundamental efficiency loss; bubble fraction increases with deeper pipelines and smaller micro-batches. Practitioners must balance micro-batch size, pipeline depth, and gradient accumulation steps to optimise throughput whilst maintaining convergence behaviour and numerical stability.
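The bubble trade-off above can be quantified with the standard GPipe estimate: with p pipeline stages and m micro-batches per step, the idle fraction of the schedule is roughly (p − 1) / (m + p − 1). The stage and micro-batch values below are illustrative.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Approximate idle fraction of a GPipe-style pipeline schedule."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# More micro-batches shrink the bubble; deeper pipelines enlarge it.
for m in (4, 16, 64):
    print(f"p=8, m={m}: bubble = {bubble_fraction(8, m):.1%}")
```

This is why practitioners raise the micro-batch count (via gradient accumulation) as pipelines deepen: at p = 8, going from 4 to 64 micro-batches cuts the bubble from roughly 64% to under 10%.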
More in Deep Learning
Positional Encoding
Training & Optimisation: A technique that injects information about the position of tokens in a sequence into transformer architectures.
Flash Attention
Architectures: An IO-aware attention algorithm that reduces memory reads and writes by tiling the attention computation, enabling faster training of long-context transformer models.
Data Parallelism
Architectures: A distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.
Weight Initialisation
Architectures: The strategy for setting initial parameter values in a neural network before training begins.
Key-Value Cache
Architectures: An optimisation in autoregressive transformer inference that stores previously computed key and value tensors to avoid redundant computation during sequential token generation.
Contrastive Learning
Architectures: A self-supervised learning approach that trains models by comparing similar and dissimilar pairs of data representations.
Gradient Checkpointing
Architectures: A memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.
Vanishing Gradient
Architectures: A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.