Machine LearningFeature Engineering & Selection

SMOTE

Overview

Direct Answer

SMOTE is a data preprocessing technique that addresses class imbalance by generating synthetic training examples in the feature space of the minority class, rather than simply duplicating existing minority instances. It uses k-nearest neighbours to create new synthetic samples along the line segments connecting minority class examples.

How It Works

The algorithm identifies minority class samples and, for each one, locates its k-nearest neighbours (typically k=5) within the same class. New synthetic samples are then generated by randomly interpolating between a minority instance and one of its selected neighbours, positioning them at random points along the connecting line in feature space. This process is repeated until the desired balance ratio is achieved.

Why It Matters

Class imbalance severely degrades classifier performance on minority classes, leading to poor recall and F1-scores in critical domains such as fraud detection, disease diagnosis, and anomaly identification. By synthesising rather than replicating examples, the technique increases effective training set size whilst enabling classifiers to learn decision boundaries more effectively without overfitting to genuine minority patterns.

Common Applications

Applications include credit card fraud detection, medical diagnosis with rare diseases, network intrusion detection, and manufacturing defect identification. Telecommunications and banking sectors regularly employ the technique to improve detection of rare but costly adverse events.

Key Considerations

The method assumes minority class samples are sufficiently dense to form meaningful neighbourhoods; sparse or highly scattered minority data may produce poor-quality synthetics. Generated samples exist in interpolated regions that may not reflect true underlying data distribution, and parameter tuning (particularly k and over-sampling ratio) significantly influences results.

Cross-References(1)

Machine Learning

More in Machine Learning