Overview
Direct Answer
K-Nearest Neighbours (KNN) is a non-parametric, instance-based learning algorithm that classifies a data point by identifying the k closest training examples in feature space and assigning the majority class label among those neighbours. Unlike parametric models, it makes no assumptions about the underlying data distribution.
How It Works
The algorithm calculates distances (typically Euclidean or Manhattan) between a query point and all training samples, then selects the k nearest instances. Classification is determined by majority voting among these k neighbours; regression variants average their target values. Distance metric and k value selection directly influence model behaviour and accuracy.
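The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation; the function name `knn_predict` and the toy two-cluster dataset are invented for the example.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Compute the distance from the query to every training sample,
    # then sort the (point, label) pairs by that distance.
    neighbours = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], query))
    # Vote among the labels of the k closest samples.
    top_k = [label for _, label in neighbours[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Toy 2-D dataset: two well-separated clusters labelled "a" and "b".
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (0.5, 0.5), k=3))  # → a
print(knn_predict(X, y, (5.5, 5.5), k=3))  # → b
```

Swapping `math.dist` for a Manhattan distance (`sum(abs(u - v) for u, v in zip(p, q))`) or averaging neighbour targets instead of voting gives the metric and regression variants described above.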
Why It Matters
KNN remains valuable for rapid prototyping and for problems with non-linear decision boundaries where linear assumptions fail. Its interpretability—decisions trace directly to nearest examples—supports explainability requirements in regulated sectors. Its simplicity and lack of a training phase also make it a standard baseline for comparison, though performance depends heavily on feature scaling and the choice of k.
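The sensitivity to feature scaling is easy to demonstrate: when one feature spans a much larger numeric range, it dominates the Euclidean distance. The income/age figures below are invented purely for illustration.

```python
from math import dist

# Two features on very different scales: income (in currency units) and age (years).
a = (50_000, 25)
b = (51_000, 60)
query = (50_500, 26)

# Unscaled, the income axis dominates: the query looks roughly
# equidistant from a and b even though its age matches a, not b.
print(dist(a, query), dist(b, query))

def minmax(points):
    """Rescale each feature (column) to the [0, 1] range."""
    cols = list(zip(*points))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) for v, l, h in zip(p, lo, hi))
            for p in points]

# After min-max scaling, age carries comparable weight and the
# query is clearly closer to a.
a_s, b_s, q_s = minmax([a, b, query])
print(dist(a_s, q_s), dist(b_s, q_s))
```

In practice the scaler's parameters should be fitted on the training set only and then applied to queries, rather than fitted on all points together as in this compressed sketch.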
Common Applications
The method is widely deployed in recommendation systems, medical diagnosis support (identifying similar patient cases), credit scoring, and image recognition. Collaborative filtering systems use distance-based neighbour selection to suggest content, whilst spatial analysis applications leverage its natural handling of geometric relationships.
Key Considerations
Computational cost scales linearly with training set size, since all distances must be calculated at prediction time, making the method impractical for massive datasets without optimisation techniques such as KD-trees or ball trees. The curse of dimensionality severely degrades performance in high-dimensional spaces, where distance metrics become less meaningful.
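The curse of dimensionality can be made concrete with a small experiment: as the number of dimensions grows, the gap between the nearest and farthest random point shrinks relative to the nearest distance, so "nearest" becomes less informative. The function `relative_contrast` is an illustrative name coined for this sketch.

```python
import random
from math import dist

random.seed(0)

def relative_contrast(dim, n=1000):
    """(d_max - d_min) / d_min over distances from a random query
    to n uniformly random points in the unit hypercube."""
    query = [random.random() for _ in range(dim)]
    ds = [dist(query, [random.random() for _ in range(dim)]) for _ in range(n)]
    return (max(ds) - min(ds)) / min(ds)

# The contrast collapses as dimensionality grows.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 2))
```

This is why dimensionality reduction or feature selection is usually applied before KNN in high-dimensional settings.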
More in Machine Learning
- Active Learning (MLOps & Production): A machine learning approach where the algorithm interactively queries a user or oracle to label new data points.
- Bagging (Advanced Methods): Bootstrap Aggregating, an ensemble method that trains multiple models on random subsets of data and averages their predictions.
- DBSCAN (Unsupervised Learning): Density-Based Spatial Clustering of Applications with Noise, a clustering algorithm that finds arbitrarily shaped clusters based on density.
- Lasso Regression (Feature Engineering & Selection): A regularised regression technique that adds an L1 penalty, enabling feature selection by driving some coefficients to zero.
- t-SNE (Unsupervised Learning): t-Distributed Stochastic Neighbour Embedding, a technique for visualising high-dimensional data in two or three dimensions.
- K-Means Clustering (Unsupervised Learning): A partitioning algorithm that divides data into k clusters by minimising the distance between points and their cluster centroids.
- Ensemble Learning (MLOps & Production): Combining multiple machine learning models to produce better predictive performance than any single model.
- UMAP (Unsupervised Learning): Uniform Manifold Approximation and Projection, a dimensionality reduction technique for visualisation and general non-linear reduction.