Overview
Direct Answer
Clustering is an unsupervised learning technique that partitions datasets into groups of similar data points without requiring predefined class labels. It identifies inherent patterns and structures within data by measuring similarity or distance between observations.
How It Works
Clustering algorithms compute similarity metrics (such as Euclidean distance or cosine similarity) between data points and iteratively assign observations to groups that minimise within-group variance or maximise cohesion. Common approaches include centroid-based methods like K-means, density-based methods like DBSCAN, and hierarchical approaches that build dendrograms of nested partitions.
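The centroid-based loop described above can be sketched in a few lines. This is a minimal, illustrative K-means in pure Python (not a production implementation): it initialises centroids from the first k points for determinism — real implementations typically use random or k-means++ initialisation — then alternates the assignment and update steps.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans(points, k, iters=50):
    """Minimal K-means sketch: alternate assignment and centroid update."""
    # Deterministic init for illustration; k-means++ is the usual choice.
    centroids = list(points[:k])
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Two well-separated 2-D blobs; K-means recovers one cluster per blob.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
centroids, clusters = kmeans(data, k=2)
```

Each iteration can only lower (or keep equal) the total within-cluster variance, which is why the loop converges; the result still depends on initialisation, as noted under Key Considerations.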
Why It Matters
Organisations use clustering to discover hidden customer segments, reduce dimensionality for downstream analysis, and identify anomalies without manual labelling costs. It enables data-driven decision-making in scenarios where ground truth is unavailable or expensive to obtain.
Common Applications
Applications include customer segmentation in retail and marketing, genomic sequence grouping in bioinformatics, document organisation in information retrieval, and anomaly detection in cybersecurity. Clustering also supports image segmentation in computer vision and helps identify disease subtypes in medical research.

Key Considerations
Practitioners must select appropriate distance metrics and algorithm families based on data geometry, as results are sensitive to initialisation and feature scaling. Determining the optimal number of clusters remains a fundamental challenge requiring domain expertise and validation metrics like silhouette scores.
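One of the validation metrics mentioned above, the silhouette score, can be computed directly from pairwise distances. The sketch below is a plain-Python version of the standard formula: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the point's silhouette is (b - a) / max(a, b). Scores near +1 indicate compact, well-separated clusters; scores near 0 or below suggest the clustering (or the chosen number of clusters) is a poor fit.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette_score(clusters):
    """Mean silhouette over all points, given clusters as lists of points."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # convention: singleton clusters score 0
                continue
            # a: mean distance to the other points in p's own cluster.
            a = sum(dist(p, q) for q in cluster if q is not p) / (len(cluster) - 1)
            # b: smallest mean distance from p to any other cluster.
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci and other)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# A well-separated partition scores high; a mixed-up one scores low.
tight = [[(0.0, 0.0), (0.1, 0.0)], [(5.0, 5.0), (5.1, 5.0)]]
loose = [[(0.0, 0.0), (5.0, 5.0)], [(0.1, 0.0), (5.1, 5.0)]]
```

In practice the score is computed for several candidate values of k and the value with the highest mean silhouette is preferred, alongside domain judgement.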
Cross-References (1)
Referenced By: 4 terms mention Clustering
Other entries in the wiki whose definition references Clustering — useful for understanding how this concept connects across Machine Learning and adjacent domains.
More in Machine Learning
Overfitting
Training Techniques: When a model learns the training data too well, including noise, resulting in poor performance on unseen data.
Random Forest
Supervised Learning: An ensemble learning method that constructs multiple decision trees during training and outputs the mode of their predictions.
Stochastic Gradient Descent
Training Techniques: A variant of gradient descent that updates parameters using a randomly selected subset of training data each iteration.
Model Monitoring
MLOps & Production: Continuous observation of deployed machine learning models to detect performance degradation, data drift, anomalous predictions, and infrastructure issues in production.
Backpropagation
Training Techniques: The algorithm for computing gradients of the loss function with respect to network weights, enabling neural network training.
Lasso Regression
Feature Engineering & Selection: A regularised regression technique that adds an L1 penalty, enabling feature selection by driving some coefficients to zero.
Bandit Algorithm
Advanced Methods: An online learning algorithm that balances exploration of new options with exploitation of known good options to maximise reward.
A/B Testing
Training Techniques: A controlled experiment comparing two variants to determine which performs better against a defined metric.