
Data Augmentation

Overview

Direct Answer

Data augmentation encompasses techniques that synthetically expand training datasets by applying domain-relevant transformations to existing samples, thereby increasing both volume and distributional diversity without collecting new raw data. Common transformations include geometric operations (rotation, translation, scaling), colour/brightness adjustments, and noise injection.
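The transformation types above can be sketched with NumPy. This is a minimal illustration, not a production pipeline: the `augment` helper and its parameters (flip probability, noise scale) are hypothetical choices for demonstration.

```python
import numpy as np

def augment(image, rng):
    """Apply simple label-preserving transformations to a 2-D image array:
    a random horizontal flip, a random 90-degree rotation, and mild
    Gaussian noise injection. (Illustrative sketch; parameters are
    arbitrary assumptions, not tuned values.)"""
    if rng.random() < 0.5:
        image = np.fliplr(image)                     # geometric: horizontal flip
    image = np.rot90(image, k=int(rng.integers(0, 4)))  # geometric: rotation
    noise = rng.normal(0.0, 0.05, size=image.shape)  # noise injection
    return np.clip(image + noise, 0.0, 1.0)          # keep pixel range valid

rng = np.random.default_rng(0)
img = np.full((4, 4), 0.5)   # toy greyscale "image"
aug = augment(img, rng)
print(aug.shape)             # shape (and hence the label) is preserved
```

Each call produces a different variant of the same sample, which is why augmentation is typically applied on the fly rather than precomputed.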

How It Works

The mechanism operates by applying parameterised transformations to individual training examples, generating new variants that preserve semantic labels whilst introducing controlled variance. For image data, transformations are applied during training loops; for text, techniques include back-translation and token replacement. The augmented dataset passes through the standard training pipeline, exposing the model to greater input variability without manual data collection.
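The "applied during training loops" mechanism can be illustrated with a batch generator that transforms samples as they are drawn, so each epoch sees fresh variants while labels travel unchanged. The `augmented_batches` function and the single flip transform are hypothetical simplifications.

```python
import numpy as np

def augmented_batches(images, labels, batch_size, rng):
    """Yield shuffled batches in which each image is independently flipped
    with probability 0.5. Labels are passed through untouched, preserving
    semantic correctness. (Illustrative sketch.)"""
    order = rng.permutation(len(images))
    for start in range(0, len(images), batch_size):
        idx = order[start:start + batch_size]
        batch = images[idx].copy()
        flip = rng.random(len(idx)) < 0.5        # per-sample coin flip
        batch[flip] = batch[flip][:, :, ::-1]    # horizontal flip on axis 2
        yield batch, labels[idx]

rng = np.random.default_rng(1)
X = rng.random((8, 4, 4))    # 8 toy images
y = np.arange(8)             # one label per image
batches = list(augmented_batches(X, y, batch_size=4, rng=rng))
print(len(batches))          # 2 batches of 4
```

A real pipeline would hand each `(batch, labels)` pair to the optimiser step; the key property shown here is that the transformation is stochastic per draw, not baked into the stored dataset.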

Why It Matters

Augmentation directly addresses data scarcity, a primary constraint in machine learning projects, reducing annotation costs and accelerating model development cycles. Exposure to transformed variants typically improves generalisation, reducing overfitting and enhancing robustness to real-world input variations, which is critical for production deployment.

Common Applications

Medical imaging relies heavily on rotation and elastic deformation to expand limited patient datasets. Computer vision systems employ augmentation for object detection and classification tasks. Natural language processing applications use paraphrasing and back-translation to strengthen text classifiers and machine translation models.
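For the text side, the token-replacement technique mentioned above can be sketched as simple synonym substitution. The `SYNONYMS` table and `synonym_replace` function are hypothetical stand-ins; a real system would draw candidates from a thesaurus such as WordNet or from embedding-space neighbours.

```python
import random

# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "pleased"],
}

def synonym_replace(sentence, rng, p=0.5):
    """Replace each known token with a listed synonym with probability p,
    keeping the sentence's label (e.g. its sentiment class) intact."""
    out = []
    for tok in sentence.split():
        if tok in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return " ".join(out)

rng = random.Random(42)
print(synonym_replace("the quick fox was happy", rng, p=1.0))
```

As with image transforms, the substitution must be semantically faithful: replacing a sentiment-bearing word with a near-antonym would corrupt the label rather than augment the data.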

Key Considerations

Augmentation must remain semantically faithful to preserve label correctness; aggressive or inappropriate transformations introduce label noise and degrade performance. Domain expertise is essential, because transformations effective for one modality can prove counterproductive in another: vertically flipping a handwritten digit, for instance, can turn a 6 into a 9 and silently corrupt the label.
