Overview
Direct Answer
Key-value (KV) caching is a memory optimisation technique in transformer-based models that stores previously computed key and value tensors during autoregressive generation, eliminating redundant recalculation as each new token is produced. This mechanism significantly reduces computational overhead during inference without altering model outputs.
How It Works
During token generation, the transformer computes a query for the current token whilst reusing cached key-value pairs from prior positions rather than recomputing them. The cache is extended by one entry per layer as each token is generated, so each step projects only the new token instead of re-running the full prefix: per-step key and value computation drops from linear in sequence length to constant, although the attention operation itself still reads every cached position. Modern implementations store these tensors in GPU memory or system RAM, depending on batch size and model dimensions.
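The mechanism can be sketched for a single attention head with NumPy. This is an illustrative toy, not a production implementation: the projection matrices are omitted and random vectors stand in for the per-token keys, values, and queries. Note that each step appends one K/V row and attends over everything cached so far.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only store of key/value tensors for one attention head."""
    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        # Extend the cache with the current token's key and value.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def attend(q, cache):
    # q is the query for the current token only; attention scans all
    # cached positions, so each step costs O(n), not O(n^2).
    scores = cache.keys @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache.values

rng = np.random.default_rng(0)
head_dim = 8
cache = KVCache(head_dim)
for step in range(4):                # simulate generating 4 tokens
    k, v, q = rng.normal(size=(3, head_dim))
    cache.append(k, v)               # K/V computed once, then reused
    out = attend(q, cache)           # attends over all positions so far
```

Without the cache, every step would recompute keys and values for the entire prefix; with it, prior positions are looked up rather than reprojected.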
Why It Matters
Key-value caching reduces inference latency by 2–3× on typical sequence lengths, directly lowering operational costs for production language models and enabling real-time interactive applications. For resource-constrained environments and large-scale deployments, this optimisation often determines whether transformer inference is practical at all.
Common Applications
Used extensively in conversational AI systems, real-time code generation tools, and streaming text summarisation services. Dialogue systems relying on multi-turn interactions particularly benefit from avoiding reprocessing of prior conversation history.
Key Considerations
Cache memory consumption scales linearly with batch size and sequence length, creating practical limits on concurrency and maximum context window. Careful management is required to prevent memory exhaustion, and cache invalidation strategies vary across frameworks and hardware configurations.
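The linear scaling above can be made concrete with a back-of-the-envelope estimate. The configuration below is hypothetical (loosely shaped like a 7B-class model with fp16 storage); the point is the formula, not the specific numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Total cache size: 2 (one K and one V tensor) x layers x batch
    x sequence length x heads x head dimension x bytes per element."""
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * dtype_bytes

# Hypothetical 7B-class configuration, fp16 (2 bytes per element):
gb = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                    seq_len=4096, batch=8) / 2**30
print(f"{gb:.1f} GiB")  # -> 16.0 GiB
```

Doubling either the batch size or the context length doubles this figure, which is why techniques such as grouped-query attention (fewer KV heads) and cache quantisation are common mitigations.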
Cross-References
More in Deep Learning
Rotary Positional Encoding
Training & Optimisation: A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.
Multi-Head Attention
Training & Optimisation: An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.
Pretraining
Architectures: Training a model on a large general dataset before fine-tuning it on a specific downstream task.
Fine-Tuning
Architectures: The process of taking a pretrained model and further training it on a smaller, task-specific dataset.
Parameter-Efficient Fine-Tuning
Language Models: Methods for adapting large pretrained models to new tasks by only updating a small fraction of their parameters.
Convolutional Layer
Architectures: A neural network layer that applies learnable filters across input data to detect local patterns and features.
Capsule Network
Architectures: A neural network architecture that groups neurons into capsules to better capture spatial hierarchies and part-whole relationships.
LoRA
Language Models: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that adds trainable low-rank matrices to frozen pretrained weights.