Overview
Direct Answer
Model parallelism is a distributed training technique that splits a neural network's layers or components across multiple devices (GPUs, TPUs, or accelerators), allowing training of architectures that exceed the memory capacity of any single device. This contrasts with data parallelism, which replicates the full model across devices and partitions input data.
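The split described above can be sketched in a few lines. This is a minimal simulation, not a real multi-GPU program: the two "devices" are just separate function scopes holding their own layer weights, standing in for GPUs that would each hold only part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer MLP with each layer assigned to a different
# "device". On real hardware W1 and W2 would live in separate GPUs'
# memory; here they are ordinary arrays to keep the sketch runnable.
W1 = rng.standard_normal((8, 16))   # layer 1 weights, held by device 0
W2 = rng.standard_normal((16, 4))   # layer 2 weights, held by device 1

def forward_device0(x):
    # Device 0 computes only its assigned layer (ReLU activation)...
    return np.maximum(x @ W1, 0.0)

def forward_device1(h):
    # ...then the activation is transferred, and device 1 finishes the pass.
    return h @ W2

x = rng.standard_normal((2, 8))
h = forward_device0(x)   # this hand-off is the inter-device communication
y = forward_device1(h)
print(y.shape)           # (2, 4)
```

Neither device ever holds the full set of weights, which is the point: the model's memory footprint is divided across devices rather than replicated on each, as it would be under data parallelism.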
How It Works
During the forward and backward passes, activations and gradients flow between devices as data moves through the model's segments. Each device computes gradients only for its assigned layers, so inter-device communication is limited to the activations and gradients crossing partition boundaries. Pipeline parallelism variants (such as GPipe) further improve throughput by splitting each batch into microbatches and overlapping their computation across the device stages.
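The staggered schedule that pipeline variants use can be made concrete with a small sketch. The stage and microbatch counts below are illustrative, and the function is a simplification (forward pass only, uniform stage times), not GPipe's actual implementation.

```python
# GPipe-style forward schedule: at each timestep, every stage that has a
# microbatch available works on it, so different devices process
# different microbatches concurrently.
def pipeline_schedule(num_stages, num_microbatches):
    """Return, per timestep, the (stage, microbatch) pairs running in parallel."""
    schedule = []
    for t in range(num_stages + num_microbatches - 1):
        step = [(s, t - s) for s in range(num_stages)
                if 0 <= t - s < num_microbatches]
        schedule.append(step)
    return schedule

sched = pipeline_schedule(num_stages=3, num_microbatches=4)
# With 3 stages and 4 microbatches the forward pass finishes in
# 3 + 4 - 1 = 6 steps, versus 3 * 4 = 12 if microbatches ran one at a time.
print(len(sched))   # 6
print(sched[2])     # [(0, 2), (1, 1), (2, 0)] — all three stages busy
```

The ramp-up and ramp-down steps, where some stages sit idle, are the "pipeline bubble" that scheduling strategies try to shrink by using more microbatches per batch.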
Why It Matters
As transformer models and vision architectures grow to billions or trillions of parameters, single-device training becomes infeasible. Model partitioning enables organisations to train state-of-the-art language models and multimodal systems within existing hardware budgets, reducing infrastructure costs whilst maintaining competitive model quality.
Common Applications
Large language model training (billions of parameters), multimodal foundation models combining vision and language components, and federated learning scenarios where partial model segments run on resource-constrained edge devices. Research institutions and cloud providers employ these techniques for transformer pre-training pipelines.
Key Considerations
Communication overhead between devices can significantly reduce efficiency, and device utilisation may be uneven if layer computation times differ substantially, leaving some stages idle while they wait on others. Careful pipeline scheduling and gradient accumulation strategies are essential to mitigate these bottlenecks.
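Gradient accumulation, mentioned above, reduces synchronisation cost by summing gradients from several microbatches locally and applying them in a single update. A minimal sketch (the gradient values here are made up for illustration):

```python
import numpy as np

# Toy parameters and per-microbatch gradients (hypothetical values).
w = np.array([1.0, -2.0])
lr = 0.1
microbatch_grads = [np.array([0.2, 0.1]),
                    np.array([0.4, -0.1]),
                    np.array([0.0, 0.3])]

# Accumulate locally — cheap, no cross-device traffic per microbatch...
accum = np.zeros_like(w)
for g in microbatch_grads:
    accum += g

# ...then apply one averaged update, so synchronisation happens once
# per batch rather than once per microbatch.
w -= lr * accum / len(microbatch_grads)
```

The arithmetic mirrors training on the full batch at once, but the communication pattern is what changes: one synchronisation point instead of three.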
Referenced By — 1 term mentions Model Parallelism
Other entries in the wiki whose definition references Model Parallelism — useful for understanding how this concept connects across Deep Learning and adjacent domains.
More in Deep Learning
Gradient Clipping
Training & Optimisation — A technique that caps gradient values during training to prevent the exploding gradient problem.
Activation Function
Training & Optimisation — A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.
Mixed Precision Training
Training & Optimisation — Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.
Prefix Tuning
Language Models — A parameter-efficient method that prepends trainable continuous vectors to the input of each transformer layer, guiding model behaviour without altering base parameters.
Skip Connection
Architectures — A neural network shortcut that allows the output of one layer to bypass intermediate layers and be added to a later layer's output.
Self-Attention
Training & Optimisation — An attention mechanism where each element in a sequence attends to all other elements to compute its representation.
Embedding
Architectures — A learned dense vector representation of discrete data (like words or categories) in a continuous vector space.
Data Parallelism
Architectures — A distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.