Overview
Direct Answer
UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique that preserves both local and global structure in high-dimensional data, enabling effective visualisation and feature engineering. It constructs a weighted k-nearest-neighbour graph in high-dimensional space, then optimises a low-dimensional representation to maintain topological relationships.
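The first stage, the weighted k-nearest-neighbour graph, can be illustrated with a short NumPy sketch. This is a simplified illustration, not umap-learn's implementation: the real algorithm tunes the per-point scale sigma by binary search and then symmetrises the graph, whereas here sigma is just a crude local scale.

```python
import numpy as np

def fuzzy_knn_weights(X, k=5):
    """Sketch of UMAP's first stage: weight each point's k nearest
    neighbours with exp(-(d - rho) / sigma), where rho is the distance
    to the closest neighbour. (Simplified: the real algorithm chooses
    sigma per point by binary search and symmetrises the result.)"""
    # Exact pairwise distances; self-distances excluded via +inf.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]               # neighbour indices
    nd = np.take_along_axis(d, nn, axis=1)          # neighbour distances
    rho = nd[:, :1]                                 # distance to 1st neighbour
    sigma = nd.mean(axis=1, keepdims=True) - rho + 1e-8  # crude local scale
    w = np.exp(-np.maximum(nd - rho, 0.0) / sigma)
    return nn, w
```

Because the nearest neighbour sits at distance rho, its weight is always 1.0, and weights decay for neighbours further out; this is what makes the graph's connectivity locally adaptive rather than fixed by a global distance threshold.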
How It Works
UMAP builds a fuzzy topological representation of input data by computing local connectivity metrics around each point, then uses stochastic gradient descent to position points in a lower-dimensional space whilst preserving the manifold structure. The algorithm balances attraction between nearby points and repulsion between distant ones, leveraging theoretical foundations in Riemannian geometry and algebraic topology to guide the embedding process.
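The attraction and repulsion loop described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the real algorithm: it omits the fuzzy edge weights, the fitted output curve, negative-sampling schedules, the spectral initialisation, and the approximate nearest-neighbour search that umap-learn uses.

```python
import numpy as np

def toy_embedding(X, n_neighbors=5, n_components=2,
                  n_epochs=200, lr=0.1, seed=0):
    """Toy version of UMAP's optimisation loop: attract each point
    towards its high-dimensional k-nearest neighbours and repel it
    from one randomly sampled point per epoch."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # k-nearest-neighbour graph from exact pairwise distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :n_neighbors]

    # Random low-dimensional initialisation.
    Y = rng.normal(scale=1e-2, size=(n, n_components))

    for _ in range(n_epochs):
        for i in range(n):
            for j in knn[i]:
                # Attraction: move towards graph neighbours.
                Y[i] += lr * (Y[j] - Y[i])
            # Repulsion: push away from one random point, with a
            # force that decays as the points grow apart.
            k = rng.integers(n)
            diff = Y[i] - Y[k]
            Y[i] += lr * 0.01 * diff / (diff @ diff + 1e-4)
    return Y
```

With two well-separated blobs as input, the neighbour graph splits into two components, so attraction collapses each blob while repulsion keeps the blobs apart in the 2-D output, which is the basic mechanism behind UMAP's cluster-revealing embeddings.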
Why It Matters
Organisations rely on UMAP for exploratory data analysis and cluster visualisation because it typically runs much faster than t-SNE, particularly on datasets with millions of samples. The technique significantly reduces computational burden whilst maintaining interpretability, enabling data scientists to identify patterns, detect anomalies, and validate preprocessing decisions before downstream modelling.
Common Applications
Applications span single-cell RNA-sequencing visualisation of gene expression in bioinformatics, image dataset exploration in computer vision, and clustering validation across the finance and healthcare sectors. The method also supports feature extraction in recommendation systems and embedding-space analysis in natural language processing tasks.
Key Considerations
UMAP introduces hyperparameters (minimum distance, number of neighbours) that significantly influence output structure and require careful tuning for domain-specific objectives. Results remain sensitive to data preprocessing, scaling choices, and random initialisation, necessitating validation against multiple runs and complementary analysis methods rather than relying solely on visual inspection.
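One way to act on the multiple-runs advice is to quantify run-to-run stability directly rather than eyeballing plots. The hypothetical helper below, written in plain NumPy, measures how many of each point's k nearest neighbours two embeddings of the same data share; a score near 1.0 means the runs agree on local structure.

```python
import numpy as np

def knn_indices(Z, k):
    """Indices of each point's k nearest neighbours (excluding itself)."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def neighbourhood_agreement(Y1, Y2, k=10):
    """Mean fraction of shared k-nearest neighbours between two
    embeddings of the same points; 1.0 means perfect local agreement."""
    a, b = knn_indices(Y1, k), knn_indices(Y2, k)
    overlap = [len(set(a[i]) & set(b[i])) / k for i in range(len(a))]
    return float(np.mean(overlap))
```

Running the same pipeline with several random seeds and comparing the resulting embeddings with a score like this, alongside sensitivity checks over the neighbour count and minimum-distance settings, gives a more defensible basis for downstream decisions than a single visually appealing plot.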