Overview
Direct Answer
A skip connection is an architectural pattern that creates a direct pathway for the output of an earlier layer to be added element-wise to the output of a deeper layer, bypassing the intermediate layers. This lets each block learn a residual transformation relative to the identity mapping, rather than having to learn the full mapping from scratch.
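As a minimal NumPy sketch (function and weight names are illustrative, not any library's API), a residual block computes y = F(x) + x; when the learned weights are zero, the residual vanishes and the block reduces to the identity:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    # The block learns a residual F(x) = W2 @ relu(W1 @ x);
    # the skip connection adds the input back element-wise.
    return W2 @ relu(W1 @ x) + x
```

With W1 = W2 = 0 the residual term is zero and the block passes x through unchanged, which is the identity mapping the definition above refers to.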
How It Works
During forward propagation, the activation tensor from layer n is added directly to the output of layer n+k, where k is the number of skipped layers. Backpropagation then routes gradients through both the skip pathway and the standard computational path, creating multiple gradient flow routes. This dual-path architecture mitigates the vanishing gradient problem: the identity term in the skip path ensures that gradients retain sufficient magnitude even in very deep networks.
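To see the effect on gradient magnitude, consider a toy chain of scalar layers (an illustrative sketch under simplifying assumptions, not a real network): without skips the end-to-end gradient is a product of per-layer gains and collapses when those gains are small, while an additive skip at each layer contributes an identity term that keeps the gradient from vanishing:

```python
def plain_grad(w, depth):
    # Chain of layers y -> w * y: the end-to-end gradient is
    # w ** depth, which vanishes whenever |w| < 1.
    return w ** depth

def residual_grad(w, depth):
    # Chain of residual layers y -> w * y + y: the skip path
    # contributes an identity term, so the gradient is (w + 1) ** depth.
    return (w + 1.0) ** depth
```

For example, with a per-layer gain of 0.01 over 50 layers, `plain_grad` is 1e-100 while `residual_grad` stays above 1.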
Why It Matters
Skip connections substantially improve training stability and convergence speed in networks deeper than roughly 50 layers, reducing computational costs and wall-clock training time. They enable organisations to train significantly deeper models that achieve superior accuracy on vision and sequence tasks whilst remaining practically trainable on standard hardware infrastructure.
Common Applications
Residual networks (ResNets) in image classification and object detection; transformer architectures in natural language processing and large language models, where every attention and feed-forward sublayer is wrapped in a residual connection; U-Net style encoder-decoder architectures in medical image segmentation, where the skip typically concatenates encoder features onto the decoder path rather than adding them; and very deep convolutional networks in autonomous vehicle perception systems.
Key Considerations
Skip connections require compatible tensor dimensions between source and target layers; dimensional mismatches necessitate learnable projection layers, which add computational overhead. The benefits diminish in shallow networks, where vanishing gradients are rarely a problem; in recurrent architectures, gating mechanisms such as those in LSTMs play an analogous role in preserving gradient flow.
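A minimal NumPy sketch of the projection case (names are illustrative): when the main path changes the feature dimension, the shortcut is passed through a small learnable projection, analogous to the 1x1-convolution shortcut in ResNets, so the element-wise addition remains shape-compatible:

```python
import numpy as np

def projected_skip(x, W_main, W_proj):
    # The main path changes the feature dimension (e.g. 4 -> 2),
    # so the shortcut needs a learnable projection to match shapes.
    return W_main @ x + W_proj @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W_main = rng.standard_normal((2, 4))
W_proj = rng.standard_normal((2, 4))   # projection instead of identity
y = projected_skip(x, W_main, W_proj)  # shape (2,), addition is valid
```

The projection weights are trained along with the rest of the network, which is the computational overhead the paragraph above refers to.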
Cross-References (1)
Referenced by: 1 term mentions Skip Connection.
Other entries in the wiki whose definition references Skip Connection — useful for understanding how this concept connects across Deep Learning and adjacent domains.
More in Deep Learning
Sigmoid Function
Training & Optimisation: An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.
ReLU
Training & Optimisation: Rectified Linear Unit, an activation function that outputs the input directly if positive, otherwise outputs zero.
Positional Encoding
Training & Optimisation: A technique that injects information about the position of tokens in a sequence into transformer architectures.
Self-Attention
Training & Optimisation: An attention mechanism where each element in a sequence attends to all other elements to compute its representation.
Prefix Tuning
Language Models: A parameter-efficient method that prepends trainable continuous vectors to the input of each transformer layer, guiding model behaviour without altering base parameters.
Model Parallelism
Architectures: A distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.
Data Parallelism
Architectures: A distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.
Capsule Network
Architectures: A neural network architecture that groups neurons into capsules to better capture spatial hierarchies and part-whole relationships.