Overview
Direct Answer
SMOTE (Synthetic Minority Over-sampling Technique) is a data preprocessing technique that addresses class imbalance by generating synthetic training examples in the feature space of the minority class, rather than simply duplicating existing minority instances. It uses k-nearest neighbours to create new synthetic samples along the line segments connecting minority class examples.
How It Works
The algorithm identifies minority class samples and, for each one, locates its k-nearest neighbours (typically k=5) within the same class. New synthetic samples are then generated by randomly interpolating between a minority instance and one of its selected neighbours, positioning them at random points along the connecting line in feature space. This process is repeated until the desired balance ratio is achieved.
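The interpolation step above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not a production implementation (the function name `smote_sample` and its signature are our own; real libraries handle edge cases such as categorical features and tiny classes):

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from minority-class rows X_min
    by interpolating between a chosen point and one of its k nearest
    same-class neighbours (a minimal sketch of the SMOTE idea)."""
    rng = np.random.default_rng(rng)
    # Pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest indices per point
    synth = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))            # pick a minority instance
        nb = X_min[rng.choice(neighbours[j])]   # pick one of its neighbours
        gap = rng.random()                      # random position on the segment
        synth[i] = X_min[j] + gap * (nb - X_min[j])
    return synth
```

Because each synthetic point is a convex combination of two real minority points, it always lies on the segment between them, never outside the minority class's convex hull.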
Why It Matters
Class imbalance severely degrades classifier performance on minority classes, leading to poor recall and F1-scores in critical domains such as fraud detection, disease diagnosis, and anomaly identification. By synthesising new examples rather than replicating existing ones, the technique increases the effective training set size and lets classifiers learn broader minority-class decision regions instead of overfitting to exact copies of the same few instances.
Common Applications
Applications include credit card fraud detection, medical diagnosis with rare diseases, network intrusion detection, and manufacturing defect identification. Telecommunications and banking sectors regularly employ the technique to improve detection of rare but costly adverse events.
Key Considerations
The method assumes minority class samples are sufficiently dense to form meaningful neighbourhoods; sparse or highly scattered minority data may produce poor-quality synthetics. Generated samples exist in interpolated regions that may not reflect true underlying data distribution, and parameter tuning (particularly k and over-sampling ratio) significantly influences results.