Overview
Direct Answer
SMOTE (Synthetic Minority Over-sampling Technique) is a data preprocessing technique that addresses class imbalance by generating synthetic training examples in the feature space of the minority class, rather than simply duplicating existing minority instances. It uses k-nearest neighbours to create new synthetic samples along the line segments connecting minority class examples.
How It Works
The algorithm identifies minority class samples and, for each one, locates its k-nearest neighbours (typically k=5) within the same class. New synthetic samples are then generated by randomly interpolating between a minority instance and one of its selected neighbours, positioning them at random points along the connecting line in feature space. This process is repeated until the desired balance ratio is achieved.
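The interpolation step above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not a production implementation (the function name `smote_sample` and its signature are our own; real libraries handle edge cases such as categorical features and tiny classes):

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from minority-class rows X_min
    by interpolating between a chosen point and one of its k nearest
    same-class neighbours (a minimal sketch of the SMOTE idea)."""
    rng = np.random.default_rng(rng)
    # Pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest indices per point
    synth = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))            # pick a minority instance
        nb = X_min[rng.choice(neighbours[j])]   # pick one of its neighbours
        gap = rng.random()                      # random position on the segment
        synth[i] = X_min[j] + gap * (nb - X_min[j])
    return synth
```

Because each synthetic point is a convex combination of two real minority points, it always lies on the segment between them, never outside the minority class's convex hull.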
Why It Matters
Class imbalance severely degrades classifier performance on minority classes, leading to poor recall and F1-scores in critical domains such as fraud detection, disease diagnosis, and anomaly identification. By synthesising new examples rather than replicating existing ones, the technique increases the effective training set size and lets classifiers learn broader minority-class decision regions instead of overfitting to exact copies of the same few instances.
Common Applications
Applications include credit card fraud detection, medical diagnosis with rare diseases, network intrusion detection, and manufacturing defect identification. Telecommunications and banking sectors regularly employ the technique to improve detection of rare but costly adverse events.
Key Considerations
The method assumes minority class samples are sufficiently dense to form meaningful neighbourhoods; sparse or highly scattered minority data may produce poor-quality synthetics. Generated samples exist in interpolated regions that may not reflect true underlying data distribution, and parameter tuning (particularly k and over-sampling ratio) significantly influences results.