Overview
Direct Answer
Stochastic Gradient Descent (SGD) is an optimisation algorithm that updates model parameters using the gradient computed from a single training example or small batch, rather than the entire dataset. This stochastic sampling trades some convergence stability for computational efficiency and faster iteration cycles.
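The single-example update can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a linear model with squared-error loss; the names `sgd_step` and `squared_error_grad` are illustrative, not from any framework:

```python
import numpy as np

def squared_error_grad(theta, x, y):
    """Gradient of the squared error (theta . x - y)^2 with respect to theta."""
    return 2.0 * (theta @ x - y) * x

def sgd_step(theta, grad_fn, x_i, y_i, lr=0.1):
    """One stochastic update: theta <- theta - lr * gradient on one example."""
    return theta - lr * grad_fn(theta, x_i, y_i)

# One update from a single sampled example (x_i, y_i).
theta = np.zeros(2)
theta = sgd_step(theta, squared_error_grad, np.array([1.0, 2.0]), 3.0)
```

Each call touches only one example, which is what makes the per-step cost independent of dataset size.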
How It Works
At each iteration, SGD samples a single instance or mini-batch randomly from the training set, computes the loss gradient with respect to that sample, and adjusts parameters in the direction opposite to the gradient by a step size called the learning rate. The stochastic nature—randomness in sample selection—introduces noise into the parameter trajectory, which can help escape local minima and reduce memory requirements compared to full-batch methods.
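The loop described above can be sketched end to end on a toy linear-regression problem. The synthetic data and hyperparameters here are assumptions chosen for illustration, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x0 - x1 plus a little noise.
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=200)

theta = np.zeros(2)
lr, batch_size = 0.1, 16

for step in range(500):
    # Randomly sample a mini-batch: this is the "stochastic" part.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of mean squared error computed on the mini-batch only.
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ theta - yb)
    # Step in the direction opposite to the gradient.
    theta -= lr * grad
```

Because each gradient is estimated from 16 of the 200 examples, the trajectory of `theta` is noisy, yet it still settles near the true weights.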
Why It Matters
SGD enables training on datasets too large to fit in memory and reduces wall-clock time per iteration significantly, making it essential for modern deep learning at scale. The noise-induced exploration properties often lead to better generalisation on unseen data, whilst the reduced computational footprint per step allows practitioners to iterate on model design rapidly.
Common Applications
SGD is the foundation for training neural networks across computer vision, natural language processing, and recommendation systems. It consumes the gradients computed by backpropagation in deep learning frameworks and remains standard in federated learning environments, where data partitioned across devices necessitates sample-wise or batch-wise updates.
Key Considerations
The learning rate becomes critical: with noisy gradients, a fixed step size that is too large risks divergence, while one that is too small stalls progress. Adaptive variants like Adam and RMSprop address this by adjusting step sizes per parameter. Convergence guarantees weaken compared to batch gradient descent, and practitioners must balance batch size, learning rate scheduling, and epoch count empirically.
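The per-parameter scaling that adaptive variants perform can be sketched in the spirit of RMSprop. This is a simplified sketch under assumed defaults; real framework implementations differ in details such as initialisation and momentum:

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=0.05, rho=0.9, eps=1e-8):
    """Scale each parameter's step by a running RMS of its own gradients."""
    cache = rho * cache + (1.0 - rho) * grad**2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

# Usage on a toy quadratic loss f(theta) = theta^2, whose gradient is 2*theta.
theta, cache = np.array([5.0]), np.zeros(1)
for _ in range(300):
    theta, cache = rmsprop_step(theta, 2.0 * theta, cache)
```

Dividing by the running RMS roughly normalises each step to the learning rate, so parameters with persistently large gradients take proportionally smaller steps.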
Cross-References
Referenced by: 1 wiki entry mentions Stochastic Gradient Descent.
These are other entries in the wiki whose definitions reference Stochastic Gradient Descent, useful for understanding how this concept connects across Machine Learning and adjacent domains.
More in Machine Learning
Ensemble Methods
MLOps & Production: Machine learning techniques that combine multiple models to produce better predictive performance than any single model, including bagging, boosting, and stacking approaches.
DBSCAN
Unsupervised Learning: Density-Based Spatial Clustering of Applications with Noise, a clustering algorithm that finds arbitrarily shaped clusters based on density.
SHAP Values
MLOps & Production: A game-theoretic approach to explaining individual model predictions by computing each feature's marginal contribution, based on Shapley values from cooperative game theory.
Model Registry
MLOps & Production: A versioned catalogue of trained machine learning models with metadata, lineage, and approval workflows, enabling reproducible deployment and governance at enterprise scale.
Collaborative Filtering
Unsupervised Learning: A recommendation technique that makes predictions based on the collective preferences and behaviour of many users.
XGBoost
Supervised Learning: An optimised distributed gradient boosting library designed for speed and performance in machine learning competitions and production.
Self-Supervised Learning
Advanced Methods: A learning paradigm where models generate their own supervisory signals from unlabelled data through pretext tasks.
Mini-Batch
Training Techniques: A subset of the training data used to compute a gradient update during stochastic gradient descent.