Overview
Direct Answer
An encoder-decoder architecture is a neural network framework in which an encoder network compresses variable-length input into a context representation (a fixed-size vector in classic designs, or a sequence of hidden states in attention-based variants), and a decoder network reconstructs or generates output from that representation. This design enables processing of sequential data where input and output lengths differ.
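As a concrete, drastically simplified illustration, the sketch below mean-pools made-up token embeddings into a fixed-size context vector and greedily decodes from it. The vocabulary, embedding values, and similarity rule are all invented for the example; no learning is involved.

```python
# Toy sketch, not a trained model: the encoder mean-pools embeddings
# into ONE fixed-size context vector regardless of input length; the
# decoder greedily emits tokens conditioned on that vector.
# The vocabulary and embedding values below are invented.

EMB = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [0.5, 0.5]}

def encode(tokens):
    """Mean-pool token embeddings into a fixed-size context vector."""
    ctx = [0.0, 0.0]
    for t in tokens:
        ctx = [c + e for c, e in zip(ctx, EMB[t])]
    return [c / len(tokens) for c in ctx]

def decode(context, steps=2):
    """Greedy decoding: pick the token most similar to the current state,
    then shift the state so successive outputs can differ."""
    state, out = list(context), []
    for _ in range(steps):
        tok = max(EMB, key=lambda t: sum(a * b for a, b in zip(EMB[t], state)))
        out.append(tok)
        state = [s - e for s, e in zip(state, EMB[tok])]
    return out
```

Note that `encode` returns a vector of the same length for a one-token and a three-token input; that fixed size is exactly the bottleneck discussed under Key Considerations below.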
How It Works
The encoder processes input tokens through recurrent or transformer layers, encoding their meaning into a dense vector or a sequence of hidden states. The decoder then conditions on this context representation, generating output tokens one at a time from conditional probability distributions. Attention mechanisms often bridge encoder and decoder, letting the decoder focus selectively on the most relevant input positions at each generation step.
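The attention step described above can be sketched as scaled dot-product attention in plain Python; the shapes and vectors here are illustrative, not taken from any particular model.

```python
# Scaled dot-product attention over encoder hidden states, written in
# plain Python for readability (a real model would use tensor libraries).
import math

def attention(query, keys, values):
    """Return a context vector: encoder values weighted by how well
    each key matches the decoder's query."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                                # stabilise the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

A query aligned with one encoder state pulls the context toward that state's value, which is the "focus selectively" behaviour described above.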
Why It Matters
This architecture fundamentally enables sequence-to-sequence tasks in which input and output have mismatched structures, improving accuracy in machine translation, summarisation, and dialogue systems. Organisations benefit from a unified way of handling variable-length problems without task-specific feature engineering, reducing development time and operational complexity.
Common Applications
Applications include machine translation (translating between languages), automatic speech recognition (audio to text), image captioning (visual input to textual description), and abstractive summarisation. Medical transcription, customer support automation, and code generation systems rely on this approach.
Key Considerations
The fixed-size bottleneck in traditional designs can lose information from long sequences, a limitation mitigated by attention mechanisms and hierarchical encoders. Computational cost grows with sequence length (quadratically for standard self-attention), and inference speed may constrain real-time applications.
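A back-of-envelope sketch of the scaling point: standard self-attention scores every query position against every key position, so the score matrix has n × n entries and doubling the sequence length quadruples the work.

```python
# Illustrative cost model only: standard self-attention compares every
# query position with every key position, so the score matrix is n x n.

def score_matrix_entries(n):
    """Pairwise query-key scores computed for a length-n sequence."""
    return n * n
```

This quadratic growth is one reason doubling the context length roughly quadruples attention compute.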