Overview
Direct Answer
Stochastic Gradient Descent (SGD) is an optimisation algorithm that updates model parameters using the gradient computed from a single training example or small batch, rather than the entire dataset. This stochastic sampling trades some convergence stability for computational efficiency and faster iteration cycles.
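The single-example update can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a linear model with squared-error loss; the names `sgd_step` and `squared_error_grad` are illustrative, not from any framework:

```python
import numpy as np

def squared_error_grad(theta, x, y):
    """Gradient of the squared error (theta . x - y)^2 with respect to theta."""
    return 2.0 * (theta @ x - y) * x

def sgd_step(theta, grad_fn, x_i, y_i, lr=0.1):
    """One stochastic update: theta <- theta - lr * gradient on one example."""
    return theta - lr * grad_fn(theta, x_i, y_i)

# One update from a single sampled example (x_i, y_i).
theta = np.zeros(2)
theta = sgd_step(theta, squared_error_grad, np.array([1.0, 2.0]), 3.0)
```

Each call touches only one example, which is what makes the per-step cost independent of dataset size.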
How It Works
At each iteration, SGD samples a single instance or mini-batch randomly from the training set, computes the loss gradient with respect to that sample, and adjusts parameters in the direction opposite to the gradient by a step size called the learning rate. The stochastic nature—randomness in sample selection—introduces noise into the parameter trajectory, which can help escape local minima and reduce memory requirements compared to full-batch methods.
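The loop described above can be sketched end to end on a toy linear-regression problem. The synthetic data and hyperparameters here are assumptions chosen for illustration, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x0 - x1 plus a little noise.
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=200)

theta = np.zeros(2)
lr, batch_size = 0.1, 16

for step in range(500):
    # Randomly sample a mini-batch: this is the "stochastic" part.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of mean squared error computed on the mini-batch only.
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ theta - yb)
    # Step in the direction opposite to the gradient.
    theta -= lr * grad
```

Because each gradient is estimated from 16 of the 200 examples, the trajectory of `theta` is noisy, yet it still settles near the true weights.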
Why It Matters
SGD enables training on datasets too large to fit in memory and reduces wall-clock time per iteration significantly, making it essential for modern deep learning at scale. The noise-induced exploration properties often lead to better generalisation on unseen data, whilst the reduced computational footprint per step allows practitioners to iterate on model design rapidly.
Common Applications
SGD is the foundation for training neural networks across computer vision, natural language processing, and recommendation systems. It consumes the gradients computed by backpropagation in deep learning frameworks and remains standard in federated learning environments, where data partitioned across devices necessitates sample-wise or batch-wise updates.
Key Considerations
The learning rate becomes critical: with noisy gradients, a fixed step size that is too large risks divergence, while one that is too small stalls progress. Adaptive variants like Adam and RMSprop address this by adjusting step sizes per parameter. Convergence guarantees weaken compared to batch gradient descent, and practitioners must balance batch size, learning rate scheduling, and epoch count empirically.
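The per-parameter scaling that adaptive variants perform can be sketched in the spirit of RMSprop. This is a simplified sketch under assumed defaults; real framework implementations differ in details such as initialisation and momentum:

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=0.05, rho=0.9, eps=1e-8):
    """Scale each parameter's step by a running RMS of its own gradients."""
    cache = rho * cache + (1.0 - rho) * grad**2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

# Usage on a toy quadratic loss f(theta) = theta^2, whose gradient is 2*theta.
theta, cache = np.array([5.0]), np.zeros(1)
for _ in range(300):
    theta, cache = rmsprop_step(theta, 2.0 * theta, cache)
```

Dividing by the running RMS roughly normalises each step to the learning rate, so parameters with persistently large gradients take proportionally smaller steps.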
Cross-References
Referenced by: 1 wiki entry mentions Stochastic Gradient Descent.
These are other entries in the wiki whose definitions reference Stochastic Gradient Descent, useful for understanding how this concept connects across Machine Learning and adjacent domains.
More in Machine Learning
Ensemble Methods
MLOps & Production: Machine learning techniques that combine multiple models to produce better predictive performance than any single model, including bagging, boosting, and stacking approaches.
DBSCAN
Unsupervised Learning: Density-Based Spatial Clustering of Applications with Noise, a clustering algorithm that finds arbitrarily shaped clusters based on density.
SHAP Values
MLOps & Production: A game-theoretic approach to explaining individual model predictions by computing each feature's marginal contribution, based on Shapley values from cooperative game theory.
Model Registry
MLOps & Production: A versioned catalogue of trained machine learning models with metadata, lineage, and approval workflows, enabling reproducible deployment and governance at enterprise scale.
Collaborative Filtering
Unsupervised Learning: A recommendation technique that makes predictions based on the collective preferences and behaviour of many users.
XGBoost
Supervised Learning: An optimised distributed gradient boosting library designed for speed and performance in machine learning competitions and production.
Self-Supervised Learning
Advanced Methods: A learning paradigm where models generate their own supervisory signals from unlabelled data through pretext tasks.
Mini-Batch
Training Techniques: A subset of the training data used to compute a gradient update during stochastic gradient descent.