Overview
Direct Answer
Class imbalance occurs when training datasets contain a disproportionate number of examples from certain classes relative to others, such as rare disease diagnosis datasets where negative cases vastly outnumber positive cases. This skewed distribution causes standard machine learning algorithms to develop biased models that achieve high overall accuracy whilst performing poorly on minority classes.
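The skew described above is often summarised as an imbalance ratio (majority-class count over minority-class count). A minimal sketch, using hypothetical label counts for a rare-disease screening dataset:

```python
from collections import Counter

# Hypothetical labels: 990 negative cases, 10 positive cases.
labels = ["negative"] * 990 + ["positive"] * 10

counts = Counter(labels)
majority = max(counts.values())
minority = min(counts.values())
imbalance_ratio = majority / minority

print(counts)           # Counter({'negative': 990, 'positive': 10})
print(imbalance_ratio)  # 99.0
```

A ratio of 99:1 like this one is typical of the fraud and diagnosis settings mentioned later in this entry.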
How It Works
During training, loss functions and evaluation metrics optimise for overall accuracy rather than per-class performance. Algorithms learn to favour the dominant class because correctly predicting the majority class contributes disproportionately to minimising total error. As a result, the model learns minimal or superficial patterns for minority classes, often defaulting to predicting the most frequent class regardless of input features.
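The failure mode above can be demonstrated with a toy baseline: on a 99:1 dataset, a model that always predicts the majority class scores 99% accuracy while catching none of the minority cases. The labels below are illustrative, not from any real dataset:

```python
# 0 = majority class, 1 = minority class (hypothetical 99:1 split).
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # "model" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = (
    sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    / sum(t == 1 for t in y_true)
)

print(accuracy)         # 0.99
print(minority_recall)  # 0.0
```

This is exactly the "defaulting to the most frequent class" behaviour: total error is nearly minimised, yet the model is useless for the class that matters.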
Why It Matters
Imbalanced datasets produce models with poor real-world performance in critical applications where minority classes represent high-value or high-risk outcomes—fraud detection, equipment failure prediction, and medical diagnosis all depend on accurately identifying rare but consequential events. Poor minority-class performance directly impacts cost, reliability, and regulatory compliance in production systems.
Common Applications
Applications include fraud detection where fraudulent transactions represent <1% of volume, credit risk assessment with sparse default cases, medical imaging where disease-positive scans are rare, cybersecurity threat detection, and manufacturing quality control identifying defective units within normal production batches.
Key Considerations
Practitioners must choose appropriate mitigation strategies—resampling, cost-sensitive learning, threshold adjustment, and ensemble methods—based on data size and business objectives. Evaluation metrics must shift from accuracy to F1-score, precision-recall curves, or area under the receiver operating characteristic curve to properly assess minority-class performance.
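The metric shift described above can be made concrete by computing precision, recall, and F1 for the minority class from a confusion matrix. The counts below are assumed for illustration (class 1 is the minority/positive class):

```python
# Hypothetical confusion-matrix counts on a 1000-example test set.
tp, fp, fn, tn = 8, 40, 2, 950

precision = tp / (tp + fp)                       # how many flagged cases are real
recall = tp / (tp + fn)                          # how many real cases are caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(accuracy, 3))  # 0.958 -- looks strong
print(round(f1, 3))        # 0.276 -- reveals the weak minority-class performance
```

The gap between 0.958 accuracy and 0.276 F1 is why accuracy alone cannot be trusted on imbalanced data.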