Overview
Direct Answer
Tensor parallelism is a distributed training strategy that partitions individual weight matrices and activation tensors across multiple devices along specific dimensions, so that the computation of a single model layer runs in parallel. Unlike data parallelism, which replicates the entire model on every device, this approach reduces the per-device memory footprint by distributing the matrix multiplications themselves.
How It Works
During forward and backward propagation, weight matrices are split column-wise or row-wise across devices. Each device computes a partial result on its assigned tensor slice, and the partial results are then combined through collective operations (e.g. all-reduce). Communication is overlapped with computation where feasible to minimise synchronisation overhead. The axis of partitioning depends on the layer type: in a transformer MLP block, for example, the first weight matrix is typically split column-wise and the second row-wise, so that only a single all-reduce is needed per block.
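The column-then-row split described above can be sketched in a few lines of numpy, simulating two "devices" in a single process. All shapes and variable names here are illustrative assumptions, not taken from any particular framework:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations: (batch, hidden)
w1 = rng.standard_normal((8, 16))  # first MLP weight: (hidden, ffn)
w2 = rng.standard_normal((16, 8))  # second MLP weight: (ffn, hidden)

# Column-parallel split of w1: each device holds half the output columns.
w1_a, w1_b = np.split(w1, 2, axis=1)
# Row-parallel split of w2: each device holds the matching input rows.
w2_a, w2_b = np.split(w2, 2, axis=0)

# Each device computes its slice end-to-end, including the elementwise
# nonlinearity (ReLU here), with no communication...
partial_a = np.maximum(x @ w1_a, 0) @ w2_a
partial_b = np.maximum(x @ w1_b, 0) @ w2_b

# ...and a single all-reduce (here simply a sum) recovers the result
# of the unpartitioned computation.
y_parallel = partial_a + partial_b
y_reference = np.maximum(x @ w1, 0) @ w2
assert np.allclose(y_parallel, y_reference)
```

The column-first ordering matters: because the nonlinearity is elementwise, it can be applied to each column slice independently, which is what lets the two matrix multiplications run back-to-back with only one collective at the end.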
Why It Matters
This approach enables training of models that would otherwise exceed single-device memory constraints, directly impacting capability and cost-efficiency in large language model and vision transformer development. Organisations prioritise it when model scale exceeds the practical limits of other parallelism strategies, particularly when batch sizes cannot be increased freely.
Common Applications
Tensor parallelism is widely deployed in training large transformer-based language models and multimodal systems where model dimension is the primary scaling factor. It is frequently combined with pipeline and data parallelism in systems handling billions of parameters.
Key Considerations
Communication bandwidth between devices becomes a critical bottleneck; synchronous all-reduce operations can introduce substantial latency on slower interconnects. The strategy is most effective on high-bandwidth clusters and less suitable for models with small embedding or hidden dimensions relative to device count.
More in Deep Learning
Pre-Training
Language Models
The initial phase of training a deep learning model on a large unlabelled corpus using self-supervised objectives, establishing general-purpose representations for downstream adaptation.
Pooling Layer
Architectures
A neural network layer that reduces spatial dimensions by aggregating values, commonly using max or average operations.
State Space Model
Architectures
A sequence modelling architecture based on continuous-time dynamical systems that processes long sequences with linear complexity, offering an alternative to attention-based transformers.
Fine-Tuning
Language Models
The process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.
Word Embedding
Language Models
Dense vector representations of words where semantically similar words are mapped to nearby points in vector space.
Weight Decay
Architectures
A regularisation technique that penalises large model weights during training by adding a fraction of the weight magnitude to the loss function, preventing overfitting.
Gradient Clipping
Training & Optimisation
A technique that caps gradient values during training to prevent the exploding gradient problem.
Self-Attention
Training & Optimisation
An attention mechanism where each element in a sequence attends to all other elements to compute its representation.