Overview
Active learning is a machine learning paradigm in which an algorithm selectively queries an oracle (typically a human annotator) to label the most informative unlabelled data points, rather than passively consuming a pre-labelled dataset. This approach reduces annotation effort whilst maintaining or improving model performance.
How It Works
The algorithm trains on an initial small labelled set, then iteratively identifies which unlabelled samples would provide the greatest reduction in model uncertainty or error if annotated. Selection strategies include uncertainty sampling (highest entropy predictions), query-by-committee (disagreement among ensemble members), and expected model change. The newly labelled samples are incorporated into the training set, and the process repeats until a stopping criterion is met.
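The loop described above can be sketched as a minimal pool-based active learner using uncertainty sampling. This is an illustrative example, not a reference implementation: it assumes scikit-learn and NumPy are available, and the synthetic dataset, seed-set size, and annotation budget are arbitrary choices.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# Assumes scikit-learn and NumPy; dataset and budget are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Initial small labelled set; everything else sits in the unlabelled pool.
labelled = list(rng.choice(len(X), size=10, replace=False))
pool = [i for i in range(len(X)) if i not in labelled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):  # stopping criterion: a fixed annotation budget
    model.fit(X[labelled], y[labelled])
    probs = model.predict_proba(X[pool])
    # Uncertainty sampling: query the pool point with the
    # highest predictive entropy.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    query = pool.pop(int(np.argmax(entropy)))
    labelled.append(query)  # the oracle supplies y[query]

model.fit(X[labelled], y[labelled])  # retrain on the final labelled set
accuracy = model.score(X, y)
```

In practice the `y[query]` lookup would be replaced by a call to a human annotator, and the fixed budget by a convergence-based stopping criterion.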
Why It Matters
Organisations face significant costs when acquiring expert labels, particularly in domains requiring specialist knowledge such as medical imaging, compliance review, or scientific research. Empirical studies have reported labelling-cost reductions of roughly 50–80 per cent relative to random sampling at equivalent model accuracy, although the savings achieved depend on the task, the data distribution, and the selection strategy. Where it applies, active learning accelerates deployment timelines and reduces expenses for resource-constrained teams.
Common Applications
Applications include medical diagnosis systems where radiologist annotations are expensive, sentiment analysis in low-resource languages, anomaly detection in security systems, and biological sequence classification. Legal technology firms employ active learning to optimise document review workflows by prioritising uncertain cases for human review.
Key Considerations
The effectiveness of active learning depends heavily on the quality of the selection strategy and the availability of reliable oracles; poor query design can waste annotations. Additionally, active learning introduces complexity in model validation and may exhibit suboptimal performance in highly imbalanced datasets or when the initial sample is unrepresentative.
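Because the selection strategy matters so much, it is worth seeing how an alternative to uncertainty sampling works. The sketch below illustrates query-by-committee, mentioned earlier: a small bootstrap ensemble votes on each pool point, and the point with the greatest disagreement (vote entropy) is queried. All names and sizes are illustrative assumptions, and it assumes a binary classification task.

```python
# Hedged sketch of query-by-committee selection via vote entropy.
# Assumes scikit-learn and NumPy; committee size and data are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
labelled = rng.choice(len(X), size=20, replace=False)
pool = np.setdiff1d(np.arange(len(X)), labelled)

# Build a committee of trees, each trained on a bootstrap
# resample of the labelled set.
committee = []
for seed in range(5):
    idx = rng.choice(labelled, size=len(labelled), replace=True)
    committee.append(DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx]))

votes = np.stack([m.predict(X[pool]) for m in committee])  # (members, pool)
# Vote entropy for binary labels: fraction of members voting for class 1.
p1 = votes.mean(axis=0)
p = np.stack([1 - p1, p1])
vote_entropy = -(p * np.log(p + 1e-12)).sum(axis=0)
query_index = pool[int(np.argmax(vote_entropy))]  # next point to label
```

A committee that always agrees yields zero vote entropy everywhere, which is one way poor query design can waste annotations: the strategy then degenerates to arbitrary selection.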