
Model Pruning

Overview

Direct Answer

Model pruning is a compression technique that removes weights, neurons, or entire layers from a trained neural network based on their contribution to model performance. This reduces model size and inference latency whilst typically preserving accuracy within acceptable thresholds.

How It Works

Pruning algorithms identify and eliminate parameters below importance thresholds calculated through magnitude-based scoring, gradient analysis, or sensitivity measurement. Weights close to zero are removed first, followed by fine-tuning to recover any accuracy loss. Structured pruning removes entire filters or channels; unstructured pruning removes individual weights.
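The magnitude-based variant described above can be sketched in a few lines. This is an illustrative NumPy sketch, not a production implementation: the `magnitude_prune` function, the 50% sparsity target, and the weight shapes are all assumptions chosen for demonstration.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured magnitude pruning: zero out the fraction `sparsity`
    of weights with the smallest absolute value (illustrative sketch)."""
    # Importance score is simply |w|; the threshold is the sparsity-quantile.
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold   # keep only weights above threshold
    return weights * mask                # pruned weights are set to zero

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))              # a toy 4x4 weight matrix
pruned = magnitude_prune(w, sparsity=0.5)
```

In practice this mask-then-zero step would be followed by fine-tuning on the original task to recover accuracy, and repeated iteratively at increasing sparsity levels.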

Why It Matters

Reduced model size enables deployment on edge devices, mobile platforms, and resource-constrained environments where memory and power consumption are critical constraints. Faster inference directly decreases latency and operational costs in cloud-hosted inference services. This accessibility expands deep learning adoption across embedded systems and real-time applications.

Common Applications

Computer vision models for mobile deployment, natural language processing systems for edge inference, recommendation systems optimised for low-latency serving, and autonomous vehicle perception modules operating under strict computational budgets benefit from this technique.

Key Considerations

Aggressive pruning can degrade model accuracy or introduce instability; practitioners must balance compression gains against performance requirements. Unstructured pruning typically preserves accuracy better at a given sparsity level, but the resulting irregular sparse matrices require specialised hardware or sparse kernels to realise real speedups; structured pruning sacrifices more accuracy but produces smaller dense layers that run efficiently on standard hardware.
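The structured alternative can be sketched the same way. This NumPy example is a hedged illustration: the row-wise L1 importance score, the `structured_prune_rows` helper, and the toy dimensions are assumptions, but they show why structured pruning is hardware-friendly, since the output is a genuinely smaller dense matrix rather than a sparse one.

```python
import numpy as np

def structured_prune_rows(weights: np.ndarray, keep: int) -> np.ndarray:
    """Structured pruning sketch: keep the `keep` rows (e.g. neurons or
    filters) with the largest L1 norm and drop the rest entirely."""
    norms = np.abs(weights).sum(axis=1)        # one importance score per row
    survivors = np.sort(np.argsort(norms)[-keep:])  # indices of kept rows
    return weights[survivors]                  # smaller *dense* matrix

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 16))                   # toy layer: 8 neurons, 16 inputs
small = structured_prune_rows(w, keep=4)       # dense 4x16 result
```

Unlike the unstructured case, `small` needs no special sparse kernels: every downstream matrix multiply simply operates on a smaller dense array, which is why structured approaches deploy cleanly on commodity CPUs and GPUs.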

Cross-References

Deep Learning
