Deep Learning Architectures

Weight Decay

Overview

Weight decay is a regularisation technique that penalises model parameters by adding a scaled penalty on their magnitude (typically the squared L2 norm) to the loss function during optimisation. This discourages neural networks from learning excessively large weights, thereby mitigating overfitting and improving generalisation to unseen data.

How It Works

The mechanism adds a term proportional to the squared L2 norm of the weights (or the L1 norm in some variants) to the total loss. During backpropagation, this penalty contributes an extra gradient component that shrinks weights towards zero, creating an implicit bias towards simpler parameter configurations. The strength of regularisation is controlled via a hyperparameter (the decay coefficient), which balances model expressiveness against constraint severity.
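The shrinkage effect can be seen in a minimal sketch of a single SGD step. All names here (sgd_step_with_decay, lr, decay, grad_loss) are illustrative, not from any particular library:

```python
# Minimal sketch of L2 weight decay acting on one scalar weight.
# The penalised loss is loss(w) + (decay / 2) * w**2, whose gradient
# adds the term decay * w to the task-loss gradient.

def sgd_step_with_decay(w, grad_loss, lr=0.1, decay=0.01):
    """One SGD step on the penalised loss; the decay term pulls w towards zero."""
    return w - lr * (grad_loss + decay * w)

# With a zero task gradient, the weight shrinks geometrically by a
# factor of (1 - lr * decay) per step:
w = 1.0
for _ in range(3):
    w = sgd_step_with_decay(w, grad_loss=0.0)
# w is now 1.0 * (1 - 0.1 * 0.01) ** 3, i.e. 0.999 ** 3
```

The geometric shrinkage is the "decay" in the name: absent any task gradient, each update multiplies the weight by a constant slightly below one.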

Why It Matters

Practitioners employ weight decay to improve generalisation and robustness, particularly in large, overparameterised networks that would otherwise fit noise in the training data. In production systems, regularised models tend to exhibit more stable and predictable behaviour on unseen inputs, reducing the risk of erratic outputs after deployment.

Common Applications

Weight decay is standard practice in computer vision tasks including image classification and object detection, in natural language processing architectures, and in reinforcement learning agents. It is supported directly by standard optimiser implementations, including SGD and Adam variants, in frameworks such as PyTorch and TensorFlow.

Key Considerations

The decay rate requires careful tuning relative to learning rate and batch size; excessive regularisation suppresses model capacity unnecessarily, whilst insufficient regularisation fails to prevent overfitting. Practitioners should distinguish weight decay from L2 regularisation in adaptive optimisers, where decoupled weight decay (AdamW) provides more consistent performance across hyperparameter configurations.
