Overview
Direct Answer
An attention head is an individual computational unit within a multi-head attention mechanism. It projects its input through learned query, key, and value matrices and uses the resulting vectors to compute weighted relevance scores across the input sequence. Each head independently learns to attend to different positional and semantic relationships, and the heads' outputs are concatenated to form richer contextual representations.
How It Works
Each attention head performs scaled dot-product attention by computing compatibility scores between query vectors and key vectors, normalising these scores via softmax, and using them to weight value vectors. Multiple heads operate in parallel with separate learned parameter matrices, allowing the model to simultaneously capture syntactic patterns, long-range dependencies, and semantic features from different representation subspaces.
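The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices are random stand-ins for learned parameters, and the function names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads, rng):
    """Toy multi-head self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads  # each head attends in a smaller subspace
    heads = []
    for _ in range(n_heads):
        # Separate projections per head (random here in place of learned weights).
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)
    # Concatenate per-head outputs back to the full d_model width.
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))              # 4 tokens, model width 8
out = multi_head_attention(x, n_heads=2, rng=rng)
print(out.shape)                             # (4, 8)
```

Note that each head's attention weights form a row-stochastic matrix (each row sums to 1), so every output position is a convex combination of the value vectors.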
Why It Matters
Multiple independent heads improve model capacity and interpretability whilst maintaining computational efficiency through parallelisation. This architecture has become foundational for state-of-the-art performance in language understanding, translation, and sequence modelling tasks, directly impacting accuracy and convergence speed in production NLP systems.
Common Applications
Attention heads are integral to transformer models used in machine translation systems, large language models for text generation, and multimodal systems combining vision and language. They enable models like BERT and GPT to achieve strong performance on classification, question-answering, and summarisation tasks across enterprise applications.
Key Considerations
Practitioners must balance the number of heads against computational cost and memory requirements; too few heads may limit representational capacity whilst excessive heads introduce redundancy without proportional performance gains. Attention patterns across heads often show correlation, suggesting some redundancy is inherent to the design.
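The cost trade-off can be made concrete with a back-of-the-envelope calculation. In the common configuration where the per-head width is d_model / n_heads, the projection parameter count is independent of the head count, while the memory for materialised attention maps grows linearly with it. The sketch below assumes that configuration; the function and its exact accounting are illustrative, not taken from any specific library.

```python
def attention_costs(d_model, n_heads, seq_len):
    """Rough per-layer costs for multi-head attention (illustrative only)."""
    d_head = d_model // n_heads
    # Per-head Q/K/V projections are d_model x d_head; stacked over n_heads
    # they total d_model x d_model each, plus one output projection W_O.
    params = 4 * d_model * d_model
    # One seq_len x seq_len attention map is materialised per head.
    attn_map_floats = n_heads * seq_len * seq_len
    return params, attn_map_floats

for h in (4, 8, 16):
    print(h, attention_costs(d_model=512, n_heads=h, seq_len=1024))
```

Running this shows the parameter count staying fixed while attention-map memory doubles with each doubling of heads, which is why adding heads beyond the point of redundancy costs memory without adding capacity.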