Overview
Direct Answer
Long Short-Term Memory (LSTM) is a specialised recurrent neural network architecture that addresses the vanishing gradient problem by employing gating mechanisms—input, forget, and output gates—to selectively retain or discard information across extended sequences. This design enables the network to capture dependencies spanning hundreds or thousands of time steps, a capability essential for tasks requiring long-range contextual understanding.
How It Works
LSTMs maintain a cell state that acts as a memory conduit, with three gate structures regulating information flow. The forget gate determines what information to discard from the previous cell state, the input gate controls new information entry, and the output gate decides what cell state information becomes the next hidden state. Because the cell state is updated additively rather than through repeated matrix multiplication, this gating mechanism mitigates the vanishing gradient problem during backpropagation through time (exploding gradients are typically handled separately, via gradient clipping), enabling stable learning across long sequences.
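The gate computations above can be sketched as a single forward time step. This is a minimal illustrative NumPy implementation, not a production LSTM; the weight layout and gate ordering (forget, input, candidate, output) are assumptions chosen for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x: input vector (D,); h_prev, c_prev: previous hidden and cell states (H,)
    W: (4H, D) input weights; U: (4H, H) recurrent weights; b: (4H,) bias.
    Gate order assumed here: [forget, input, candidate, output].
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # all four gate pre-activations at once
    f = sigmoid(z[0:H])               # forget gate: what to discard from c_prev
    i = sigmoid(z[H:2*H])             # input gate: how much new information enters
    g = np.tanh(z[2*H:3*H])           # candidate cell update
    o = sigmoid(z[3*H:4*H])           # output gate: what the hidden state exposes
    c = f * c_prev + i * g            # additive cell-state update (stable gradient path)
    h = o * np.tanh(c)                # new hidden state
    return h, c
```

Note how the new cell state `c` is a gated sum of the old state and the candidate update; it is this additive path that lets gradients flow across many time steps without repeatedly shrinking.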
Why It Matters
Organisations rely on LSTMs for applications demanding accurate temporal pattern recognition, where feedforward networks fail because they retain no memory of previous inputs. Compared with plain recurrent networks, the gated design delivers higher accuracy on language and time-series problems and converges more reliably, since gradients propagate stably across long sequences instead of vanishing after a few dozen steps.
Common Applications
LSTMs power machine translation systems, speech recognition engines, and financial time-series forecasting. Natural language processing tasks including sentiment analysis, named entity recognition, and text generation depend heavily on this architecture. Stock price prediction, sensor anomaly detection, and video action recognition leverage LSTMs' ability to model temporal relationships.
Key Considerations
Training complexity and computational cost increase substantially with sequence length, and because LSTM computation is inherently sequential across time steps, training cannot be parallelised the way transformer training can; transformers have consequently displaced LSTMs in many modern applications. Hyperparameter tuning, particularly layer depth, hidden unit count, and dropout rates, significantly influences performance and requires careful experimentation.
More in Deep Learning
Word Embedding (Language Models): Dense vector representations of words where semantically similar words are mapped to nearby points in vector space.
Sigmoid Function (Training & Optimisation): An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.
Gradient Checkpointing (Architectures): A memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.
Activation Function (Training & Optimisation): A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.
Diffusion Model (Generative Models): A generative model that learns to reverse a gradual noising process, generating high-quality samples from random noise.
Exploding Gradient (Architectures): A problem where gradients grow exponentially during backpropagation, causing unstable weight updates and training failure.
Pooling Layer (Architectures): A neural network layer that reduces spatial dimensions by aggregating values, commonly using max or average operations.
Key-Value Cache (Architectures): An optimisation in autoregressive transformer inference that stores previously computed key and value tensors to avoid redundant computation during sequential token generation.