Overview
Direct Answer
Self-attention is a neural network mechanism that allows each position in a sequence to compute a weighted representation by attending to all other positions, including itself. It enables the model to dynamically learn which parts of the input are most relevant for processing each element, without relying on positional proximity or recurrence.
How It Works
The mechanism operates through three learnable projections—query, key, and value—that transform the input sequence into corresponding representations. For each position, its query is compared against all keys via scaled dot products; a softmax over these scores produces attention weights, which are then used to form a weighted sum of the values, yielding a context-aware output vector. This computation occurs in parallel across all sequence positions.
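The steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation; the function and weight names (`self_attention`, `W_q`, `W_k`, `W_v`) are illustrative, and the weights would normally be learned rather than sampled randomly.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # project inputs to queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # scaled dot-product scores, shape (n, n)
    weights = softmax(scores, axis=-1)    # each row sums to 1 over all positions
    return weights @ V                    # context-aware outputs, shape (n, d_v)

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

Note that every output row is computed from the same matrix products, which is why the whole sequence can be processed in one parallel pass.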
Why It Matters
Self-attention underpins the transformer architectures that have become foundational to large language models and multimodal systems, delivering strong performance on sequential tasks whilst enabling efficient parallelisation during training, in contrast to recurrent models, which must process tokens one at a time. Organisations benefit from markedly improved accuracy on language understanding, translation, and generation tasks, along with far higher training throughput than recurrent alternatives at scale.
Common Applications
This mechanism powers natural language processing applications including machine translation, text classification, and question-answering systems. It is also integral to vision transformers for image classification, multimodal models for cross-modal alignment, and time-series forecasting in financial and IoT contexts.
Key Considerations
Computational complexity scales quadratically with sequence length, creating bottlenecks for very long documents or high-resolution images. Attention patterns can also be difficult to interpret, and the mechanism requires sufficient training data to learn meaningful alignment patterns effectively.
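A quick back-of-the-envelope sketch makes the quadratic cost concrete. The byte figures below assume a single float32 attention-score matrix per head; the numbers are illustrative, not measurements of any particular model.

```python
# The attention-score matrix has n * n entries for a sequence of length n,
# so doubling the sequence length quadruples memory and compute for scores.
for n in [512, 2048, 8192]:
    entries = n * n
    mib = entries * 4 / 2**20  # float32 = 4 bytes per score
    print(f"n={n:>5}: {entries:>12,} scores, ~{mib:8.1f} MiB per head")
```

Going from 512 to 8192 tokens multiplies the score matrix by 256x, which is why long-document and high-resolution workloads often turn to sparse, windowed, or linear-complexity attention variants.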
More in Deep Learning
Neural Network (Architectures): A computing system inspired by biological neural networks, consisting of interconnected nodes that process information in layers.
Representation Learning (Architectures): The automatic discovery of data representations needed for feature detection or classification from raw data.
State Space Model (Architectures): A sequence modelling architecture based on continuous-time dynamical systems that processes long sequences with linear complexity, offering an alternative to attention-based transformers.
Pooling Layer (Architectures): A neural network layer that reduces spatial dimensions by aggregating values, commonly using max or average operations.
Generative Adversarial Network (Generative Models): A framework where two neural networks compete: a generator creates synthetic data while a discriminator evaluates its authenticity.
Exploding Gradient (Architectures): A problem where gradients grow exponentially during backpropagation, causing unstable weight updates and training failure.
Contrastive Learning (Architectures): A self-supervised learning approach that trains models by comparing similar and dissimilar pairs of data representations.
Pre-Training (Language Models): The initial phase of training a deep learning model on a large unlabelled corpus using self-supervised objectives, establishing general-purpose representations for downstream adaptation.