Overview
Direct Answer
K-Nearest Neighbours (KNN) is a non-parametric, instance-based learning algorithm that classifies a data point by identifying the k closest training examples in feature space and assigning the majority class label among those neighbours. Unlike parametric models, it makes no assumptions about the underlying data distribution.
How It Works
The algorithm calculates distances (typically Euclidean or Manhattan) between a query point and all training samples, then selects the k nearest instances. Classification is determined by majority voting among these k neighbours; regression variants average their target values. Distance metric and k value selection directly influence model behaviour and accuracy.
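The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation; the function name `knn_predict` and the toy two-cluster dataset are invented for the example.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Compute the distance from the query to every training sample,
    # then sort the (point, label) pairs by that distance.
    neighbours = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], query))
    # Vote among the labels of the k closest samples.
    top_k = [label for _, label in neighbours[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Toy 2-D dataset: two well-separated clusters labelled "a" and "b".
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (0.5, 0.5), k=3))  # → a
print(knn_predict(X, y, (5.5, 5.5), k=3))  # → b
```

Swapping `math.dist` for a Manhattan distance (`sum(abs(u - v) for u, v in zip(p, q))`) or averaging neighbour targets instead of voting gives the metric and regression variants described above.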
Why It Matters
KNN remains valuable for rapid prototyping and for problems with non-linear decision boundaries where linear assumptions fail. Its interpretability—decisions trace directly to nearest examples—supports explainability requirements in regulated sectors. Its simplicity and lack of a training phase also make it a standard baseline for comparison, though performance depends heavily on feature scaling and the choice of k.
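The sensitivity to feature scaling is easy to demonstrate: when one feature spans a much larger numeric range, it dominates the Euclidean distance. The income/age figures below are invented purely for illustration.

```python
from math import dist

# Two features on very different scales: income (in currency units) and age (years).
a = (50_000, 25)
b = (51_000, 60)
query = (50_500, 26)

# Unscaled, the income axis dominates: the query looks roughly
# equidistant from a and b even though its age matches a, not b.
print(dist(a, query), dist(b, query))

def minmax(points):
    """Rescale each feature (column) to the [0, 1] range."""
    cols = list(zip(*points))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) for v, l, h in zip(p, lo, hi))
            for p in points]

# After min-max scaling, age carries comparable weight and the
# query is clearly closer to a.
a_s, b_s, q_s = minmax([a, b, query])
print(dist(a_s, q_s), dist(b_s, q_s))
```

In practice the scaler's parameters should be fitted on the training set only and then applied to queries, rather than fitted on all points together as in this compressed sketch.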
Common Applications
The method is widely deployed in recommendation systems, medical diagnosis support (identifying similar patient cases), credit scoring, and image recognition. Collaborative filtering systems use distance-based neighbour selection to suggest content, whilst spatial analysis applications leverage its natural handling of geometric relationships.
Key Considerations
Computational cost scales linearly with training set size, since all distances must be calculated at prediction time, making the method impractical for massive datasets without optimisation techniques such as KD-trees or ball trees. The curse of dimensionality severely degrades performance in high-dimensional spaces, where distance metrics become less meaningful.
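The curse of dimensionality can be made concrete with a small experiment: as the number of dimensions grows, the gap between the nearest and farthest random point shrinks relative to the nearest distance, so "nearest" becomes less informative. The function `relative_contrast` is an illustrative name coined for this sketch.

```python
import random
from math import dist

random.seed(0)

def relative_contrast(dim, n=1000):
    """(d_max - d_min) / d_min over distances from a random query
    to n uniformly random points in the unit hypercube."""
    query = [random.random() for _ in range(dim)]
    ds = [dist(query, [random.random() for _ in range(dim)]) for _ in range(n)]
    return (max(ds) - min(ds)) / min(ds)

# The contrast collapses as dimensionality grows.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 2))
```

This is why dimensionality reduction or feature selection is usually applied before KNN in high-dimensional settings.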
More in Machine Learning
- Active Learning (MLOps & Production): A machine learning approach where the algorithm interactively queries a user or oracle to label new data points.
- Bagging (Advanced Methods): Bootstrap Aggregating, an ensemble method that trains multiple models on random subsets of data and averages their predictions.
- DBSCAN (Unsupervised Learning): Density-Based Spatial Clustering of Applications with Noise, a clustering algorithm that finds arbitrarily shaped clusters based on density.
- Lasso Regression (Feature Engineering & Selection): A regularised regression technique that adds an L1 penalty, enabling feature selection by driving some coefficients to zero.
- t-SNE (Unsupervised Learning): t-Distributed Stochastic Neighbour Embedding, a technique for visualising high-dimensional data in two or three dimensions.
- K-Means Clustering (Unsupervised Learning): A partitioning algorithm that divides data into k clusters by minimising the distance between points and their cluster centroids.
- Ensemble Learning (MLOps & Production): Combining multiple machine learning models to produce better predictive performance than any single model.
- UMAP (Unsupervised Learning): Uniform Manifold Approximation and Projection, a dimensionality reduction technique for visualisation and general non-linear reduction.