Class Imbalance

Overview

Direct Answer

Class imbalance occurs when training datasets contain a disproportionate number of examples from certain classes relative to others, such as rare disease diagnosis datasets where negative cases vastly outnumber positive cases. This skewed distribution causes standard machine learning algorithms to develop biased models that achieve high overall accuracy whilst performing poorly on minority classes.
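A small worked example makes the gap between accuracy and minority-class performance concrete. The sketch below assumes a hypothetical dataset of 1,000 cases with 10 positives and a trivial model that always predicts the negative class. The counts and the helper functions `accuracy` and `recall` are illustrative only.

```python
# Hypothetical screening dataset: 10 positive cases, 990 negative cases.
y_true = [1] * 10 + [0] * 990

# A degenerate "model" that always predicts the majority (negative) class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of all examples labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    actual = sum(1 for t in y_true if t == positive)
    found = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    return found / actual if actual else 0.0


acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy = {acc:.2%}, minority-class recall = {rec:.2%}")
# prints: accuracy = 99.00%, minority-class recall = 0.00%
```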

How It Works

During training, most loss functions average the error over all examples, so every example counts equally and the majority class accounts for most of the total loss. Algorithms learn to favour the dominant class because correctly predicting it contributes disproportionately to minimising that total, and overall accuracy, the usual default metric, does not reveal the problem. As a result, the model may learn only minimal or superficial patterns for minority classes and, in extreme cases, defaults to predicting the most frequent class regardless of input features.
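To see why an unweighted loss pulls predictions towards the majority class, consider the simplest possible model: one that outputs the same positive-class probability for every input. The sketch below, using hypothetical counts of 10 positives and 990 negatives, searches for the constant probability that minimises mean cross-entropy. The function names and the grid search are assumptions made for illustration, not a description of any particular library. For contrast, it repeats the search with inverse-frequency class weights, one simple form of the cost-sensitive learning listed under Key Considerations.

```python
import math


def mean_log_loss(p, n_pos, n_neg, w_pos=1.0, w_neg=1.0):
    """Weighted mean cross-entropy when every example receives the same
    predicted positive-class probability p."""
    total = -(w_pos * n_pos * math.log(p) + w_neg * n_neg * math.log(1.0 - p))
    return total / (w_pos * n_pos + w_neg * n_neg)


def best_constant(n_pos, n_neg, w_pos=1.0, w_neg=1.0):
    """Grid-search p over 0.001 .. 0.999 for the lowest mean loss."""
    grid = [i / 1000 for i in range(1, 1000)]
    return min(grid, key=lambda p: mean_log_loss(p, n_pos, n_neg, w_pos, w_neg))


n_pos, n_neg = 10, 990  # hypothetical counts: 1% positives

# Unweighted loss: the optimum is the base rate, far below a 0.5 cut-off,
# so thresholding at 0.5 labels every example as negative.
p_plain = best_constant(n_pos, n_neg)

# Inverse-frequency weighting gives both classes the same total weight,
# which moves the optimum to 0.5.
p_weighted = best_constant(n_pos, n_neg, w_pos=n_neg / n_pos, w_neg=1.0)

print(p_plain, p_weighted)
```

A real model conditions on features rather than emitting a constant, but the same pull applies: wherever the features are uninformative, the unweighted optimum drifts towards the base rate.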

Why It Matters

Models trained on imbalanced data without correction often perform poorly in exactly those applications where the minority class represents a high-value or high-risk outcome: fraud detection, equipment failure prediction, and medical diagnosis all depend on accurately identifying rare but consequential events. Poor minority-class performance directly impacts cost, reliability, and regulatory compliance in production systems.

Common Applications

Applications include fraud detection where fraudulent transactions represent <1% of volume, credit risk assessment with sparse default cases, medical imaging where disease-positive scans are rare, cybersecurity threat detection, and manufacturing quality control identifying defective units within normal production batches.

Key Considerations

Practitioners must choose among mitigation strategies (resampling, cost-sensitive learning, threshold adjustment, and ensemble methods) based on data size and business objectives. Evaluation should move from accuracy to metrics that expose minority-class performance, such as precision, recall, F1-score, and precision-recall curves. Area under the receiver operating characteristic curve is also widely used, but it can look optimistic when the positive class is very rare, so it is best read alongside precision-recall metrics.
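Threshold adjustment and the shift in metrics can be illustrated together. The sketch below assumes a hypothetical set of 20 labelled examples with made-up model scores and compares the default 0.5 decision threshold with a lowered one. The data, the `evaluate` helper, and the choice of 0.3 are assumptions for illustration; in practice the threshold would be tuned on held-out data against the relevant business costs.

```python
# Hypothetical labels and model scores: 4 positives among 20 examples.
y_true = [1, 1, 1, 1] + [0] * 16
scores = [0.90, 0.40, 0.35, 0.20] + [0.45, 0.30] + [0.10] * 14


def evaluate(y_true, scores, threshold):
    """Return accuracy, precision, recall and F1 at a decision threshold."""
    tp = fp = fn = tn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


default = evaluate(y_true, scores, threshold=0.5)
lowered = evaluate(y_true, scores, threshold=0.3)

for name, result in (("threshold 0.5", default), ("threshold 0.3", lowered)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

On this toy data accuracy is 0.85 at both thresholds, while recall rises from 0.25 to 0.75 and F1 from 0.40 to about 0.67, at the cost of precision falling from 1.00 to 0.60. Accuracy alone would not distinguish the two operating points.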
