Overview
Direct Answer
Model parallelism is a distributed training technique that splits a neural network's layers or components across multiple devices (GPUs, TPUs, or accelerators), allowing training of architectures that exceed the memory capacity of any single device. This contrasts with data parallelism, which replicates the full model across devices and partitions input data.
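The split described above can be sketched in a few lines. This is a minimal simulation, not a real multi-GPU program: the two "devices" are just separate function scopes holding their own layer weights, standing in for GPUs that would each hold only part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer MLP with each layer assigned to a different
# "device". On real hardware W1 and W2 would live in separate GPUs'
# memory; here they are ordinary arrays to keep the sketch runnable.
W1 = rng.standard_normal((8, 16))   # layer 1 weights, held by device 0
W2 = rng.standard_normal((16, 4))   # layer 2 weights, held by device 1

def forward_device0(x):
    # Device 0 computes only its assigned layer (ReLU activation)...
    return np.maximum(x @ W1, 0.0)

def forward_device1(h):
    # ...then the activation is transferred, and device 1 finishes the pass.
    return h @ W2

x = rng.standard_normal((2, 8))
h = forward_device0(x)   # this hand-off is the inter-device communication
y = forward_device1(h)
print(y.shape)           # (2, 4)
```

Neither device ever holds the full set of weights, which is the point: the model's memory footprint is divided across devices rather than replicated on each, as it would be under data parallelism.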
How It Works
During the forward and backward passes, activations and gradients flow between devices as data moves through the model's segments. Each device computes gradients only for its assigned layers, so inter-device communication is limited to the activations and gradients crossing partition boundaries. Pipeline parallelism variants (such as GPipe) further improve throughput by splitting each batch into microbatches and overlapping their computation across the device stages.
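The staggered schedule that pipeline variants use can be made concrete with a small sketch. The stage and microbatch counts below are illustrative, and the function is a simplification (forward pass only, uniform stage times), not GPipe's actual implementation.

```python
# GPipe-style forward schedule: at each timestep, every stage that has a
# microbatch available works on it, so different devices process
# different microbatches concurrently.
def pipeline_schedule(num_stages, num_microbatches):
    """Return, per timestep, the (stage, microbatch) pairs running in parallel."""
    schedule = []
    for t in range(num_stages + num_microbatches - 1):
        step = [(s, t - s) for s in range(num_stages)
                if 0 <= t - s < num_microbatches]
        schedule.append(step)
    return schedule

sched = pipeline_schedule(num_stages=3, num_microbatches=4)
# With 3 stages and 4 microbatches the forward pass finishes in
# 3 + 4 - 1 = 6 steps, versus 3 * 4 = 12 if microbatches ran one at a time.
print(len(sched))   # 6
print(sched[2])     # [(0, 2), (1, 1), (2, 0)] — all three stages busy
```

The ramp-up and ramp-down steps, where some stages sit idle, are the "pipeline bubble" that scheduling strategies try to shrink by using more microbatches per batch.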
Why It Matters
As transformer models and vision architectures grow to billions or trillions of parameters, single-device training becomes infeasible. Model partitioning enables organisations to train state-of-the-art language models and multimodal systems within existing hardware budgets, reducing infrastructure costs whilst maintaining competitive model quality.
Common Applications
Large language model training (billions of parameters), multimodal foundation models combining vision and language components, and federated learning scenarios where partial model segments run on resource-constrained edge devices. Research institutions and cloud providers employ these techniques for transformer pre-training pipelines.
Key Considerations
Communication overhead between devices can significantly reduce efficiency, and device utilisation may be uneven if layer computation times differ substantially, leaving some stages idle while they wait on others. Careful pipeline scheduling and gradient accumulation strategies are essential to mitigate these bottlenecks.
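Gradient accumulation, mentioned above, reduces synchronisation cost by summing gradients from several microbatches locally and applying them in a single update. A minimal sketch (the gradient values here are made up for illustration):

```python
import numpy as np

# Toy parameters and per-microbatch gradients (hypothetical values).
w = np.array([1.0, -2.0])
lr = 0.1
microbatch_grads = [np.array([0.2, 0.1]),
                    np.array([0.4, -0.1]),
                    np.array([0.0, 0.3])]

# Accumulate locally — cheap, no cross-device traffic per microbatch...
accum = np.zeros_like(w)
for g in microbatch_grads:
    accum += g

# ...then apply one averaged update, so synchronisation happens once
# per batch rather than once per microbatch.
w -= lr * accum / len(microbatch_grads)
```

The arithmetic mirrors training on the full batch at once, but the communication pattern is what changes: one synchronisation point instead of three.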
Referenced By — 1 term mentions Model Parallelism
Other entries in the wiki whose definition references Model Parallelism — useful for understanding how this concept connects across Deep Learning and adjacent domains.
More in Deep Learning
Gradient Clipping
Training & Optimisation — A technique that caps gradient values during training to prevent the exploding gradient problem.
Activation Function
Training & Optimisation — A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.
Mixed Precision Training
Training & Optimisation — Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.
Prefix Tuning
Language Models — A parameter-efficient method that prepends trainable continuous vectors to the input of each transformer layer, guiding model behaviour without altering base parameters.
Skip Connection
Architectures — A neural network shortcut that allows the output of one layer to bypass intermediate layers and be added to a later layer's output.
Self-Attention
Training & Optimisation — An attention mechanism where each element in a sequence attends to all other elements to compute its representation.
Embedding
Architectures — A learned dense vector representation of discrete data (like words or categories) in a continuous vector space.
Data Parallelism
Architectures — A distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.