Overview
Direct Answer
Multi-head attention is a neural network mechanism that applies multiple independent attention operations—each with different learned transformations—across the same input simultaneously, enabling the model to attend to different positional and semantic relationships in parallel.
How It Works
The mechanism projects the input into multiple lower-dimensional subspaces through learned linear projections, applies scaled dot-product attention independently within each subspace, then concatenates the per-head outputs and applies a final learned projection. This allows each head to specialise in different types of dependencies: some heads capture syntactic relationships whilst others focus on semantic associations.
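The three steps described above (project, attend per head, concatenate and re-project) can be sketched as follows. This is a minimal NumPy illustration, not a production implementation: the weight matrices `W_q`, `W_k`, `W_v`, and `W_o` are random stand-ins for parameters that would be learned during training, and masking, dropout, and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Multi-head self-attention over x of shape (seq_len, d_model).

    Weight matrices are random stand-ins for learned parameters.
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    # Learned linear projections (randomly initialised here for illustration).
    W_q, W_k, W_v, W_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    # Split each projection into heads: (num_heads, seq_len, d_head).
    def split_heads(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

    # Scaled dot-product attention, applied independently per head.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    heads = weights @ Vh                      # (num_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model) and project the result.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o
```

Note that splitting `d_model` across `num_heads` heads keeps the total cost close to a single full-width attention operation, which is why adding heads does not proportionally increase computation.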
Why It Matters
Multi-head attention substantially improves model expressiveness and convergence speed compared to single-head variants, directly enhancing accuracy on sequence-to-sequence tasks without proportional increases in computational cost. Enterprise applications benefit from more robust natural language understanding and improved cross-domain transfer learning performance.
Common Applications
The mechanism is fundamental to transformer-based models used in machine translation systems, large language models, question-answering platforms, and document summarisation services. Speech recognition and protein structure prediction systems also rely on this architectural component.
Key Considerations
Practitioners must balance the number of heads against computational overhead and memory consumption; excessive heads yield diminishing accuracy returns. Interpretability of individual attention heads remains challenging, complicating debugging and validation in safety-critical applications.
Cross-References
Referenced By: 1 term mentions Multi-Head Attention
Other entries in the wiki whose definition references Multi-Head Attention — useful for understanding how this concept connects across Deep Learning and adjacent domains.
More in Deep Learning
Adapter Layers
Language Models: Small trainable modules inserted between frozen transformer layers that enable task-specific adaptation without modifying the original model weights.
Model Parallelism
Architectures: A distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.
Neural Network
Architectures: A computing system inspired by biological neural networks, consisting of interconnected nodes that process information in layers.
Deep Learning
Architectures: A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.
Weight Initialisation
Architectures: The strategy for setting initial parameter values in a neural network before training begins.
Fine-Tuning
Language Models: The process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.
Data Parallelism
Architectures: A distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.
Gradient Checkpointing
Architectures: A memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.