Overview
Direct Answer
Multi-head attention is a neural network mechanism that applies multiple independent attention operations—each with different learned transformations—across the same input simultaneously, enabling the model to attend to different positional and semantic relationships in parallel.
How It Works
The mechanism projects the input into multiple lower-dimensional subspaces through learned linear projections, applies scaled dot-product attention independently within each subspace, then concatenates the per-head outputs and applies a final learned projection. This allows each head to specialise in different types of dependencies: some heads capture syntactic relationships whilst others focus on semantic associations.
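The three steps described above (project, attend per head, concatenate and re-project) can be sketched as follows. This is a minimal NumPy illustration, not a production implementation: the weight matrices `W_q`, `W_k`, `W_v`, and `W_o` are random stand-ins for parameters that would be learned during training, and masking, dropout, and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Multi-head self-attention over x of shape (seq_len, d_model).

    Weight matrices are random stand-ins for learned parameters.
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    # Learned linear projections (randomly initialised here for illustration).
    W_q, W_k, W_v, W_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    # Split each projection into heads: (num_heads, seq_len, d_head).
    def split_heads(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

    # Scaled dot-product attention, applied independently per head.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    heads = weights @ Vh                      # (num_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model) and project the result.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o
```

Note that splitting `d_model` across `num_heads` heads keeps the total cost close to a single full-width attention operation, which is why adding heads does not proportionally increase computation.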
Why It Matters
Multi-head attention substantially improves model expressiveness and convergence speed compared to single-head variants, directly enhancing accuracy on sequence-to-sequence tasks without proportional increases in computational cost. Enterprise applications benefit from more robust natural language understanding and improved cross-domain transfer learning performance.
Common Applications
The mechanism is fundamental to transformer-based models used in machine translation systems, large language models, question-answering platforms, and document summarisation services. Speech recognition and protein structure prediction systems also rely on this architectural component.
Key Considerations
Practitioners must balance the number of heads against computational overhead and memory consumption; excessive heads yield diminishing accuracy returns. Interpretability of individual attention heads remains challenging, complicating debugging and validation in safety-critical applications.
Cross-References
Referenced By: 1 term mentions Multi-Head Attention
Other entries in the wiki whose definition references Multi-Head Attention — useful for understanding how this concept connects across Deep Learning and adjacent domains.
More in Deep Learning
Adapter Layers
Language Models: Small trainable modules inserted between frozen transformer layers that enable task-specific adaptation without modifying the original model weights.
Model Parallelism
Architectures: A distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.
Neural Network
Architectures: A computing system inspired by biological neural networks, consisting of interconnected nodes that process information in layers.
Deep Learning
Architectures: A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.
Weight Initialisation
Architectures: The strategy for setting initial parameter values in a neural network before training begins.
Fine-Tuning
Language Models: The process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.
Data Parallelism
Architectures: A distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.
Gradient Checkpointing
Architectures: A memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.