Overview
Direct Answer
Key-value (KV) caching is a memory optimisation technique in transformer-based models that stores previously computed key and value tensors during autoregressive generation, eliminating redundant recalculation as each new token is produced. This mechanism significantly reduces computational overhead during inference without altering model outputs.
How It Works
During token generation, the transformer computes a query for the current token whilst reusing cached key-value pairs from prior positions rather than recomputing them. The cache is extended by one entry per layer as each token is generated, so each step projects only the new token instead of re-running the full prefix: per-step key and value computation drops from linear in sequence length to constant, although the attention operation itself still reads every cached position. Modern implementations store these tensors in GPU memory or system RAM, depending on batch size and model dimensions.
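The mechanism can be sketched for a single attention head with NumPy. This is an illustrative toy, not a production implementation: the projection matrices are omitted and random vectors stand in for the per-token keys, values, and queries. Note that each step appends one K/V row and attends over everything cached so far.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only store of key/value tensors for one attention head."""
    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        # Extend the cache with the current token's key and value.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def attend(q, cache):
    # q is the query for the current token only; attention scans all
    # cached positions, so each step costs O(n), not O(n^2).
    scores = cache.keys @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache.values

rng = np.random.default_rng(0)
head_dim = 8
cache = KVCache(head_dim)
for step in range(4):                # simulate generating 4 tokens
    k, v, q = rng.normal(size=(3, head_dim))
    cache.append(k, v)               # K/V computed once, then reused
    out = attend(q, cache)           # attends over all positions so far
```

Without the cache, every step would recompute keys and values for the entire prefix; with it, prior positions are looked up rather than reprojected.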
Why It Matters
Key-value caching reduces inference latency by 2–3× on typical sequence lengths, directly lowering operational costs for production language models and enabling real-time interactive applications. For resource-constrained environments and large-scale deployments, this optimisation often determines whether transformer inference is practical at all.
Common Applications
Used extensively in conversational AI systems, real-time code generation tools, and streaming text summarisation services. Dialogue systems relying on multi-turn interactions particularly benefit from avoiding reprocessing of prior conversation history.
Key Considerations
Cache memory consumption scales linearly with batch size and sequence length, creating practical limits on concurrency and maximum context window. Careful management is required to prevent memory exhaustion, and cache invalidation strategies vary across frameworks and hardware configurations.
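The linear scaling above can be made concrete with a back-of-the-envelope estimate. The configuration below is hypothetical (loosely shaped like a 7B-class model with fp16 storage); the point is the formula, not the specific numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Total cache size: 2 (one K and one V tensor) x layers x batch
    x sequence length x heads x head dimension x bytes per element."""
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * dtype_bytes

# Hypothetical 7B-class configuration, fp16 (2 bytes per element):
gb = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                    seq_len=4096, batch=8) / 2**30
print(f"{gb:.1f} GiB")  # -> 16.0 GiB
```

Doubling either the batch size or the context length doubles this figure, which is why techniques such as grouped-query attention (fewer KV heads) and cache quantisation are common mitigations.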
Cross-References
More in Deep Learning
Rotary Positional Encoding
Training & Optimisation: A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.
Multi-Head Attention
Training & Optimisation: An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.
Pretraining
Architectures: Training a model on a large general dataset before fine-tuning it on a specific downstream task.
Fine-Tuning
Architectures: The process of taking a pretrained model and further training it on a smaller, task-specific dataset.
Parameter-Efficient Fine-Tuning
Language Models: Methods for adapting large pretrained models to new tasks by only updating a small fraction of their parameters.
Convolutional Layer
Architectures: A neural network layer that applies learnable filters across input data to detect local patterns and features.
Capsule Network
Architectures: A neural network architecture that groups neurons into capsules to better capture spatial hierarchies and part-whole relationships.
LoRA
Language Models: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that adds trainable low-rank matrices to frozen pretrained weights.