Overview
Direct Answer
Adam (Adaptive Moment Estimation) is a first-order gradient-based optimisation algorithm that maintains per-parameter adaptive learning rates by computing exponential moving averages of both gradients and squared gradients. It combines the benefits of momentum-based methods with element-wise adaptive learning rate scaling, making it particularly effective for training deep neural networks with sparse or noisy gradients.
How It Works
The algorithm maintains two moment estimates for each parameter: the first moment (mean of gradients, analogous to momentum) and the second moment (mean of squared gradients, similar to RMSProp). At each iteration, these moving averages are updated using exponential decay rates, then bias-corrected to account for initialisation at zero. The parameter update is computed by dividing the first moment by the square root of the second moment plus a small epsilon term, producing an effective adaptive step size that varies per dimension.
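The update rule described above can be sketched in plain Python. This is a minimal scalar version for illustration only; the function name `adam_step` and the demo objective are hypothetical, while the default hyperparameter values match the ones commonly quoted for Adam.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m, v are the running first/second moment estimates;
    t is the 1-based iteration count.
    """
    # Update the biased moment estimates with exponential decay.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias-correct: both estimates start at zero, so the early
    # averages underestimate the true moments.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step: first moment divided by the square root of
    # the second moment, plus epsilon for numerical stability.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimising f(x) = x**2 from x = 1.0 (the gradient is 2x):
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```

Note that the effective step size is bounded by roughly `lr` per iteration regardless of the raw gradient magnitude, which is one reason Adam behaves predictably on poorly scaled problems.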
Why It Matters
The Adam optimiser has become the de facto standard for training deep learning models because it typically converges faster than vanilla stochastic gradient descent and works well with little hyperparameter tuning. Its adaptive per-parameter learning rates reduce sensitivity to the learning rate schedule, cutting the number of tuning runs and enabling faster experimentation cycles. These are critical factors for organisations developing large-scale machine learning systems, where training time directly impacts cost and deployment velocity.
Common Applications
The optimiser is widely used in computer vision tasks such as convolutional neural network training, in natural language processing models including transformer-based architectures, and in reinforcement learning agent training. It is the default choice in most deep learning frameworks and is standard practice in both research and production environments across the financial services, healthcare, and technology sectors.
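In practice this means Adam is available as a one-line choice in the major frameworks. The sketch below assumes PyTorch is installed; `torch.optim.Adam` is the framework's built-in implementation, and the tiny linear model is purely illustrative.

```python
import torch

torch.manual_seed(0)

# Illustrative model; Adam's defaults (lr=1e-3, betas=(0.9, 0.999),
# eps=1e-8) match the values usually quoted for the algorithm.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()
loss.backward()

before = model.weight.detach().clone()
optimizer.step()
changed = not torch.equal(before, model.weight.detach())
```

The same three steps (construct optimiser, backpropagate, call `step()`) apply whatever the model architecture, which is part of why Adam is so often the default starting point.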
Key Considerations
While computationally efficient, the algorithm requires additional memory to store moment estimates for each parameter, which can be prohibitive for extremely large models. The bias-correction mechanism is essential for convergence in early training iterations, and the method's performance remains sensitive to the exponential decay rates and epsilon hyperparameters in certain problem domains.
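The memory point can be made concrete with back-of-the-envelope arithmetic. The 7-billion-parameter figure below is just an illustrative model size, not one taken from this article.

```python
# Adam keeps two fp32 moment estimates (m and v) per parameter,
# on top of the fp32 parameters and gradients themselves.
params = 7_000_000_000          # illustrative model size
bytes_per_value = 4             # fp32

optimizer_state = params * 2 * bytes_per_value    # m and v
weights_and_grads = params * 2 * bytes_per_value  # params + grads

print(f"Adam state alone: {optimizer_state / 1e9:.0f} GB")
```

For this hypothetical model the moment estimates alone occupy 56 GB, doubling the memory needed for weights plus gradients, which is why memory-reduced variants and optimizer-state sharding are common for very large models.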
More in Machine Learning
Lasso Regression (Feature Engineering & Selection): A regularised regression technique that adds an L1 penalty, enabling feature selection by driving some coefficients to zero.
A/B Testing (Training Techniques): A controlled experiment comparing two variants to determine which performs better against a defined metric.
Multi-Task Learning (MLOps & Production): A machine learning approach where a model is simultaneously trained on multiple related tasks to improve generalisation.
Active Learning (MLOps & Production): A machine learning approach where the algorithm interactively queries a user or oracle to label new data points.
Random Forest (Supervised Learning): An ensemble learning method that constructs multiple decision trees during training and outputs the mode of their predictions.
Boosting (Supervised Learning): An ensemble technique that sequentially trains models, each focusing on correcting the errors of previous models.
Logistic Regression (Supervised Learning): A classification algorithm that models the probability of a binary outcome using a logistic function.
Principal Component Analysis (Unsupervised Learning): A dimensionality reduction technique that transforms data into orthogonal components ordered by the amount of variance they explain.