Overview
Direct Answer
Adam (Adaptive Moment Estimation) is a first-order gradient-based optimisation algorithm that maintains per-parameter adaptive learning rates by computing exponential moving averages of both gradients and squared gradients. It combines the benefits of momentum-based methods with element-wise adaptive learning rate scaling, making it particularly effective for training deep neural networks with sparse or noisy gradients.
How It Works
The algorithm maintains two moment estimates for each parameter: the first moment (mean of gradients, analogous to momentum) and the second moment (mean of squared gradients, similar to RMSProp). At each iteration, these moving averages are updated using exponential decay rates, then bias-corrected to account for initialisation at zero. The parameter update is computed by dividing the first moment by the square root of the second moment plus a small epsilon term, producing an effective adaptive step size that varies per dimension.
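The update rule described above can be sketched in plain Python. This is a minimal scalar version for illustration only; the function name `adam_step` and the demo objective are hypothetical, while the default hyperparameter values match the ones commonly quoted for Adam.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m, v are the running first/second moment estimates;
    t is the 1-based iteration count.
    """
    # Update the biased moment estimates with exponential decay.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias-correct: both estimates start at zero, so the early
    # averages underestimate the true moments.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step: first moment divided by the square root of
    # the second moment, plus epsilon for numerical stability.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimising f(x) = x**2 from x = 1.0 (the gradient is 2x):
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```

Note that the effective step size is bounded by roughly `lr` per iteration regardless of the raw gradient magnitude, which is one reason Adam behaves predictably on poorly scaled problems.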
Why It Matters
The Adam optimiser has become the de facto standard for training deep learning models because it typically converges faster than vanilla stochastic gradient descent and works well with little hyperparameter tuning. Its adaptive per-parameter learning rates reduce sensitivity to the learning rate schedule, cutting the number of tuning runs and enabling faster experimentation cycles. These are critical factors for organisations developing large-scale machine learning systems, where training time directly impacts cost and deployment velocity.
Common Applications
The optimiser is widely used in computer vision tasks such as convolutional neural network training, in natural language processing models including transformer-based architectures, and in reinforcement learning agent training. It is the default choice in most deep learning frameworks and is standard practice in both research and production environments across the financial services, healthcare, and technology sectors.
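In practice this means Adam is available as a one-line choice in the major frameworks. The sketch below assumes PyTorch is installed; `torch.optim.Adam` is the framework's built-in implementation, and the tiny linear model is purely illustrative.

```python
import torch

torch.manual_seed(0)

# Illustrative model; Adam's defaults (lr=1e-3, betas=(0.9, 0.999),
# eps=1e-8) match the values usually quoted for the algorithm.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()
loss.backward()

before = model.weight.detach().clone()
optimizer.step()
changed = not torch.equal(before, model.weight.detach())
```

The same three steps (construct optimiser, backpropagate, call `step()`) apply whatever the model architecture, which is part of why Adam is so often the default starting point.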
Key Considerations
While computationally efficient, the algorithm requires additional memory to store moment estimates for each parameter, which can be prohibitive for extremely large models. The bias-correction mechanism is essential for convergence in early training iterations, and the method's performance remains sensitive to the exponential decay rates and epsilon hyperparameters in certain problem domains.
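The memory point can be made concrete with back-of-the-envelope arithmetic. The 7-billion-parameter figure below is just an illustrative model size, not one taken from this article.

```python
# Adam keeps two fp32 moment estimates (m and v) per parameter,
# on top of the fp32 parameters and gradients themselves.
params = 7_000_000_000          # illustrative model size
bytes_per_value = 4             # fp32

optimizer_state = params * 2 * bytes_per_value    # m and v
weights_and_grads = params * 2 * bytes_per_value  # params + grads

print(f"Adam state alone: {optimizer_state / 1e9:.0f} GB")
```

For this hypothetical model the moment estimates alone occupy 56 GB, doubling the memory needed for weights plus gradients, which is why memory-reduced variants and optimizer-state sharding are common for very large models.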
More in Machine Learning
Lasso Regression (Feature Engineering & Selection): A regularised regression technique that adds an L1 penalty, enabling feature selection by driving some coefficients to zero.
A/B Testing (Training Techniques): A controlled experiment comparing two variants to determine which performs better against a defined metric.
Multi-Task Learning (MLOps & Production): A machine learning approach where a model is simultaneously trained on multiple related tasks to improve generalisation.
Active Learning (MLOps & Production): A machine learning approach where the algorithm interactively queries a user or oracle to label new data points.
Random Forest (Supervised Learning): An ensemble learning method that constructs multiple decision trees during training and outputs the mode of their predictions.
Boosting (Supervised Learning): An ensemble technique that sequentially trains models, each focusing on correcting the errors of previous models.
Logistic Regression (Supervised Learning): A classification algorithm that models the probability of a binary outcome using a logistic function.
Principal Component Analysis (Unsupervised Learning): A dimensionality reduction technique that transforms data into orthogonal components ordered by the amount of variance they explain.