Overview
Direct Answer
Long Short-Term Memory (LSTM) is a specialised recurrent neural network architecture that addresses the vanishing gradient problem by employing gating mechanisms—input, forget, and output gates—to selectively retain or discard information across extended sequences. This design enables the network to capture dependencies spanning hundreds or thousands of time steps, a capability essential for tasks requiring long-range contextual understanding.
How It Works
LSTMs maintain a cell state that acts as a memory conduit, with three gate structures regulating information flow. The forget gate determines what information to discard from the previous cell state, the input gate controls new information entry, and the output gate decides what cell state information becomes the next hidden state. Because the cell state is updated additively rather than through repeated matrix multiplication, this gating mechanism mitigates the vanishing gradient problem during backpropagation through time (exploding gradients are typically handled separately, via gradient clipping), enabling stable learning across long sequences.
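The gate computations above can be sketched as a single forward time step. This is a minimal illustrative NumPy implementation, not a production LSTM; the weight layout and gate ordering (forget, input, candidate, output) are assumptions chosen for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x: input vector (D,); h_prev, c_prev: previous hidden and cell states (H,)
    W: (4H, D) input weights; U: (4H, H) recurrent weights; b: (4H,) bias.
    Gate order assumed here: [forget, input, candidate, output].
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # all four gate pre-activations at once
    f = sigmoid(z[0:H])               # forget gate: what to discard from c_prev
    i = sigmoid(z[H:2*H])             # input gate: how much new information enters
    g = np.tanh(z[2*H:3*H])           # candidate cell update
    o = sigmoid(z[3*H:4*H])           # output gate: what the hidden state exposes
    c = f * c_prev + i * g            # additive cell-state update (stable gradient path)
    h = o * np.tanh(c)                # new hidden state
    return h, c
```

Note how the new cell state `c` is a gated sum of the old state and the candidate update; it is this additive path that lets gradients flow across many time steps without repeatedly shrinking.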
Why It Matters
Organisations rely on LSTMs for applications demanding accurate temporal pattern recognition, where feedforward networks fail because they retain no memory of previous inputs. Compared with plain recurrent networks, the gated design delivers higher accuracy on language and time-series problems and converges more reliably, since gradients propagate stably across long sequences instead of vanishing after a few dozen steps.
Common Applications
LSTMs power machine translation systems, speech recognition engines, and financial time-series forecasting. Natural language processing tasks including sentiment analysis, named entity recognition, and text generation depend heavily on this architecture. Stock price prediction, sensor anomaly detection, and video action recognition leverage LSTMs' ability to model temporal relationships.
Key Considerations
Training complexity and computational cost increase substantially with sequence length, and because LSTM computation is inherently sequential across time steps, training cannot be parallelised the way transformer training can; transformers have consequently displaced LSTMs in many modern applications. Hyperparameter tuning, particularly layer depth, hidden unit count, and dropout rates, significantly influences performance and requires careful experimentation.
More in Deep Learning
Word Embedding (Language Models): Dense vector representations of words where semantically similar words are mapped to nearby points in vector space.
Sigmoid Function (Training & Optimisation): An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.
Gradient Checkpointing (Architectures): A memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.
Activation Function (Training & Optimisation): A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.
Diffusion Model (Generative Models): A generative model that learns to reverse a gradual noising process, generating high-quality samples from random noise.
Exploding Gradient (Architectures): A problem where gradients grow exponentially during backpropagation, causing unstable weight updates and training failure.
Pooling Layer (Architectures): A neural network layer that reduces spatial dimensions by aggregating values, commonly using max or average operations.
Key-Value Cache (Architectures): An optimisation in autoregressive transformer inference that stores previously computed key and value tensors to avoid redundant computation during sequential token generation.