
Adam Optimiser

Overview

Adam (Adaptive Moment Estimation) is a first-order gradient-based optimisation algorithm that maintains per-parameter adaptive learning rates by computing exponential moving averages of both gradients and squared gradients. It combines the benefits of momentum-based methods with element-wise adaptive learning rate scaling, making it particularly effective for training deep neural networks with sparse or noisy gradients.
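The moving averages described above follow the standard Adam formulation (with learning rate α, decay rates β₁ and β₂, gradient gₜ, and small constant ε):

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
\qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
```
```latex
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```

Here mₜ and vₜ are the first and second moment estimates, and the hatted quantities are their bias-corrected versions used in the parameter update.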

How It Works

The algorithm maintains two moment estimates for each parameter: the first moment (an exponential moving average of gradients, analogous to momentum) and the second moment (an exponential moving average of squared gradients, similar to RMSProp). At each iteration, these moving averages are updated using exponential decay rates, then bias-corrected to compensate for their initialisation at zero. The parameter update is computed by dividing the bias-corrected first moment by the square root of the bias-corrected second moment plus a small epsilon term, then scaling by the learning rate, producing an effective adaptive step size that varies per dimension.
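The steps above can be sketched in plain Python. This is a minimal illustrative implementation over flat lists of parameters and gradients (the function name and default hyperparameter values follow common convention, not any particular library):

```python
import math

def adam_step(params, grads, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place; t is the 1-based iteration count."""
    for i, g in enumerate(grads):
        # Update exponential moving averages of the gradient and its square
        m[i] = beta1 * m[i] + (1 - beta1) * g          # first moment (momentum-like)
        v[i] = beta2 * v[i] + (1 - beta2) * g * g      # second moment (RMSProp-like)
        # Bias correction: counteracts initialisation of m and v at zero
        m_hat = m[i] / (1 - beta1 ** t)
        v_hat = v[i] / (1 - beta2 ** t)
        # Per-parameter adaptive step
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params
```

For example, repeatedly calling `adam_step` with the gradient of f(x) = x² (namely 2x) drives the parameter toward the minimum at zero, with the effective step size adapting to the gradient history.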

Why It Matters

The Adam optimiser has become the de facto standard for training deep learning models because it typically converges faster than vanilla stochastic gradient descent and requires minimal hyperparameter tuning. Its adaptive per-parameter learning rates reduce sensitivity to learning rate scheduling, lowering tuning overhead and enabling faster experimentation cycles—critical factors in organisations developing large-scale machine learning systems, where training time directly impacts cost and deployment velocity.

Common Applications

The optimiser is widely employed in computer vision tasks such as convolutional neural network training, natural language processing models including transformer-based architectures, and reinforcement learning agent training. It is a default choice in many deep learning frameworks and tutorials, and has become standard practice in both research and production environments across financial services, healthcare, and technology sectors.

Key Considerations

While computationally efficient, the algorithm requires additional memory to store moment estimates for each parameter, which can be prohibitive for extremely large models. The bias-correction mechanism is essential for convergence in early training iterations, and the method's performance remains sensitive to the exponential decay rates and epsilon hyperparameters in certain problem domains.
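The memory overhead is easy to estimate: Adam keeps two extra values (the first and second moment estimates) for every trainable parameter. A rough back-of-envelope sketch, assuming 32-bit floating-point optimiser state (the function name and the 4-bytes-per-value assumption are illustrative, not from any specific framework):

```python
def adam_extra_memory_bytes(n_params, bytes_per_value=4):
    """Estimate additional optimiser-state memory for Adam.

    Adam stores two extra tensors (m and v) matching the parameters,
    so the overhead is 2 * n_params values of optimiser state.
    """
    return 2 * n_params * bytes_per_value

# A hypothetical 7-billion-parameter model at fp32 state would need
# roughly 2 * 7e9 * 4 bytes = 56 GB of extra memory for m and v alone.
```

This is why memory-reduced variants and lower-precision optimiser state are often used for very large models.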

Cross-References

Machine Learning
Deep Learning
