Overview
Direct Answer
An activation function is a mathematical operation applied to the weighted sum of inputs at each neuron, introducing non-linearity to enable neural networks to learn complex, non-linear relationships in data. Without it, stacked layers would collapse into a single linear transformation, severely limiting representational capacity.
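The collapse of stacked linear layers can be seen directly: composing two weight matrices without an activation is equivalent to a single matrix multiply. A minimal NumPy sketch (matrix sizes and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# Two linear layers applied in sequence, with no activation between them.
W1 = rng.standard_normal((4, 4))
W2 = rng.standard_normal((4, 4))

two_layer = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x          # one equivalent linear layer
assert np.allclose(two_layer, collapsed)

# Inserting a non-linearity (here ReLU) between the layers breaks this
# equivalence, which is what gives depth its representational power.
relu = lambda z: np.maximum(z, 0.0)
nonlinear = W2 @ relu(W1 @ x)
```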
How It Works
During forward propagation, each neuron computes a weighted sum of its inputs plus a bias term, then passes this value through the chosen function (such as ReLU, sigmoid, or tanh) before outputting to the next layer. This non-linear transformation allows the network to approximate arbitrary functions. During backpropagation, the derivative of the function is used to compute gradients for weight updates.
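The forward step and its derivative can be sketched for a single neuron; the input, weights, and bias values below are arbitrary assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b):
    """One neuron: weighted sum of inputs plus bias, then activation."""
    z = np.dot(w, x) + b          # pre-activation
    a = sigmoid(z)                # non-linear transform passed to the next layer
    return z, a

def sigmoid_grad(z):
    """Derivative used in backpropagation: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
z, a = forward(x, w, b)
```

During the backward pass, `sigmoid_grad(z)` scales the gradient flowing from the next layer into the weight updates for `w` and `b`.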
Why It Matters
Selection of the appropriate function directly impacts training speed, convergence behaviour, and final model accuracy. Poor choices can cause vanishing or exploding gradients, slowing training significantly or preventing learning altogether. Efficient functions like ReLU reduce computational overhead, lowering inference costs in production systems.
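The vanishing-gradient risk follows from the chain rule: each layer multiplies the gradient by its activation's derivative. A short sketch of the best-case shrinkage through a deep sigmoid stack (the depth of 20 is an arbitrary assumption):

```python
# The sigmoid derivative never exceeds 0.25 (its value at z = 0), so a
# gradient passed back through many sigmoid layers shrinks geometrically.
depth = 20
vanished = 0.25 ** depth          # best-case scaling after 20 sigmoid layers

# ReLU's derivative is exactly 1 for active units, so the same chain of
# layers leaves the gradient magnitude unchanged for those units.
relu_factor = 1.0 ** depth
```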
Common Applications
ReLU is standard in convolutional neural networks for image recognition tasks. Sigmoid and tanh remain prevalent in recurrent networks for time-series forecasting. Softmax is essential in multi-class classification layers across natural language processing and computer vision applications.
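For the multi-class case, softmax converts a vector of raw scores (logits) into a probability distribution over classes. A minimal sketch with assumed example logits:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)   # probabilities that sum to 1, preserving the ranking
```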
Key Considerations
ReLU units can suffer from the 'dying ReLU' problem: a neuron whose pre-activation stays negative outputs zero and receives zero gradient, so it stops learning permanently. The choice must also align with the output layer's requirements: sigmoid for binary classification, softmax for multi-class, and linear for regression tasks.
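The dying-ReLU failure mode can be made concrete: once pre-activations go negative, both the output and the gradient are zero. The leaky-ReLU variant shown at the end is a common mitigation; the example values are assumptions for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0).astype(float)

# A neuron whose pre-activations are pushed negative outputs zero and
# receives zero gradient, so its weights can never recover.
z_dead = np.array([-3.0, -1.5, -0.2])
dead_out = relu(z_dead)        # all zeros
dead_grad = relu_grad(z_dead)  # all zeros: no learning signal

# Leaky ReLU keeps a small slope for negative inputs, so the gradient
# is never exactly zero and the unit can still update.
def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)
```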
Cross-References (1)
Referenced By: 3 terms mention Activation Function.
Other entries in the wiki whose definition references Activation Function — useful for understanding how this concept connects across Deep Learning and adjacent domains.
More in Deep Learning
Diffusion Model (Generative Models): A generative model that learns to reverse a gradual noising process, generating high-quality samples from random noise.
Recurrent Neural Network (Architectures): A neural network architecture where connections between nodes form directed cycles, enabling processing of sequential data.
Contrastive Learning (Architectures): A self-supervised learning approach that trains models by comparing similar and dissimilar pairs of data representations.
Long Short-Term Memory (Architectures): A recurrent neural network architecture designed to learn long-term dependencies by using gating mechanisms to control information flow.
Generative Adversarial Network (Generative Models): A framework where two neural networks compete — a generator creates synthetic data while a discriminator evaluates its authenticity.
Graph Neural Network (Architectures): A neural network designed to operate on graph-structured data, learning representations of nodes, edges, and entire graphs.
Gradient Checkpointing (Architectures): A memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.
Vision Transformer (Architectures): A transformer architecture adapted for image recognition that divides images into patches and processes them as sequences, rivalling convolutional networks in visual tasks.