UMAP-Learn
Overview
UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique suited both to visualization and to general-purpose dimension reduction. Apply this skill for fast, scalable embeddings that preserve local structure and much of the global structure, for supervised or semi-supervised dimension reduction, and for clustering preprocessing.
Quick Start
Installation
Requires Python 3.9+. Pin to a verified release:
uv pip install umap-learn==0.5.12
Basic Usage
UMAP follows scikit-learn conventions and can be used as a drop-in replacement for t-SNE or PCA.
import umap
from sklearn.preprocessing import StandardScaler
# Prepare data (standardization is essential)
scaled_data = StandardScaler().fit_transform(data)
# Method 1: Single step (fit and transform)
embedding = umap.UMAP().fit_transform(scaled_data)
# Method 2: Separate steps (for reusing trained model)
reducer = umap.UMAP(random_state=42)
reducer.fit(scaled_data)
embedding = reducer.embedding_ # Access the trained embedding
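Critical preprocessing requirement: Standardize numeric features to comparable scales before applying UMAP; otherwise features with large numeric ranges dominate the distance computation. Data that is already on a common scale, or binary data used with a metric such as hamming, may not need this step.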
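To see why scaling matters, the following minimal sketch measures how much of a squared Euclidean distance comes from one feature before and after standardization. The two-feature rows (age in years, income in dollars) are made-up illustrative values, and the pure-Python `standardize` and `feature_share` helpers are hypothetical stand-ins written out only to keep the example dependency-free; in practice `StandardScaler` does the scaling.

```python
import statistics


def standardize(rows):
    """Z-score each column using the population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def feature_share(a, b, index):
    """Fraction of the squared Euclidean distance between a and b due to one feature."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared[index] / sum(squared)


# Hypothetical rows: [age in years, income in dollars]
rows = [[25.0, 50_000.0], [65.0, 50_500.0], [26.0, 58_000.0]]

raw_income_share = feature_share(rows[0], rows[1], index=1)
scaled = standardize(rows)
scaled_income_share = feature_share(scaled[0], scaled[1], index=1)

print(f"income share of squared distance, raw:    {raw_income_share:.3f}")
print(f"income share of squared distance, scaled: {scaled_income_share:.3f}")
```

On the raw rows, the income column accounts for nearly all of the distance between the first two rows even though they differ by 40 years in age; after standardization the age difference dominates instead.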
Typical Workflow
import umap
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# 1. Preprocess data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(raw_data)
# 2. Create and fit UMAP
reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(scaled_data)
# 3. Visualize
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Embedding')
plt.show()
Parameter Tuning Guide
UMAP has four primary parameters that control the embedding behavior. Understanding these is crucial for effective usage.
n_neighbors (default: 15)
Purpose: Balances local versus global structure in the embedding.
How it works: Controls the size of the local neighborhood UMAP examines when learning manifold structure.
Effects by value:
- Low values (2-5): Emphasizes fine local detail but may fragment data into disconnected components
- Medium values (15-20): Balanced view of both local structure and global relationships (recommended starting point)
- High values (50-200): Prioritizes broad topological structure at the expense of fine-grained details
Recommendation: Start with 15 and adjust based on results. Increase for more global structure, decrease for more local detail.
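One way to follow this recommendation is to fit several embeddings and compare them side by side. The sketch below is illustrative only: `candidate_n_neighbors` and `sweep_n_neighbors` are hypothetical helper names, the candidate values are arbitrary picks from the ranges listed above, and the filter assumes that a useful `n_neighbors` lies between 2 and one less than the number of samples. The `umap` import is deferred so the filtering helper can be used without umap-learn installed.

```python
def candidate_n_neighbors(n_samples, candidates=(5, 15, 50, 200)):
    """Keep only candidate values that make sense for a dataset of this size."""
    return [k for k in candidates if 2 <= k < n_samples]


def sweep_n_neighbors(scaled_data, random_state=42):
    """Fit one 2-D embedding per candidate value, keyed by n_neighbors."""
    import umap  # deferred import: only needed when embeddings are actually fitted

    n_samples = len(scaled_data)
    return {
        k: umap.UMAP(n_neighbors=k, random_state=random_state).fit_transform(scaled_data)
        for k in candidate_n_neighbors(n_samples)
    }


print(candidate_n_neighbors(100))
```

Calling `sweep_n_neighbors(scaled_data)` returns a dictionary of embeddings that can be plotted next to each other to judge where local detail gives way to global structure.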
min_dist (default: 0.1)
Purpose: Controls how tightly points cluster in the low-dimensional space.
How it works: Sets the minimum distance allowed between points in the low-dimensional representation.
Effects by value:
- Low values (0.0-0.1): Creates clumped embeddings useful for clustering; reveals fine topological details
- High values (0.5-0.99): Prevents tight packing; emphasizes broad topological preservation over local structure
Recommendation: Use 0.0 for clustering applications, 0.1-0.3 for visualization, 0.5+ for loose structure.
n_components (default: 2)
Purpose: Determines the dimensionality of the embedded output space.
Key feature: Unlike t-SNE, UMAP scales well in the embedding dimension, enabling use beyond visualization.
Common uses:
- 2-3 dimensions: Visualization
- 5-10 dimensions: Clustering preprocessing (better preserves density than 2D)
- 10-50 dimensions: Feature engineering for downstream ML models
Recommendation: Use 2 for visualization, 5-10 for clustering, higher for ML pipelines.
metric (default: 'euclidean')
Purpose: Specifies how distance is calculated between input data points.
Supported metrics:
- Minkowski variants: euclidean, manhattan, chebyshev
- Spatial metrics: canberra, braycurtis, haversine
- Angular and correlation metrics: cosine, correlation (cosine is a good fit for text/document embeddings)
- Binary data metrics: hamming, jaccard, dice, russellrao, kulsinski, rogerstanimoto, sokalmichener, sokalsneath, yule
- Custom metrics: User-defined distance functions via Numba
Recommendation: Use euclidean for numeric data, cosine for text/document vectors, hamming for binary data.
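The custom-metric option listed above can be sketched as follows. `weighted_manhattan` is a hypothetical distance invented for this illustration, and `build_reducer_with_custom_metric` is likewise an illustrative helper; the only library behavior assumed is that UMAP accepts a Numba-compiled function of two 1-D arrays as `metric`. Imports are deferred into the helper so the distance itself can be checked without Numba or umap-learn installed.

```python
def weighted_manhattan(a, b):
    """Hypothetical metric: Manhattan distance with the first feature counted twice."""
    total = 2.0 * abs(a[0] - b[0])
    for i in range(1, len(a)):
        total += abs(a[i] - b[i])
    return total


def build_reducer_with_custom_metric():
    """Compile the metric with Numba and hand it to UMAP."""
    import numba  # deferred imports: needed only when the reducer is built
    import umap

    jitted_metric = numba.njit(weighted_manhattan)
    return umap.UMAP(metric=jitted_metric)


# Sanity-check the plain-Python version on two small vectors.
print(weighted_manhattan([1.0, 2.0, 3.0], [2.0, 4.0, 3.0]))
```

Checking the uncompiled function on a few hand-computed pairs first is a cheap way to catch mistakes before fitting an embedding with it.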
Parameter Tuning Example
# For general-purpose visualization (these are the library defaults)
viz_reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean')
# For clustering preprocessing
cluster_reducer = umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=10, metric='euclidean')
# For document embeddings
doc_reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='cosine')
# For preserving global structure
global_reducer = umap.UMAP(n_neighbors=100, min_dist=0.5, n_components=2, metric='euclidean')
Supervised and Semi-Supervised Dimension Reduction
UMAP supports incorporating label information to guide the embedding process, enabling class separation while preserving internal structure.
Supervised UMAP
Pass target labels via the y parameter when fitting:
# Supervised dimension reduction
embedding = umap.UMAP().fit_transform(data, y=labels)
Key benefits:
- Achieves cleanly separated classes
- Preserves internal structure within each class
- Maintains global relationships between classes
When to use: When you have labeled data and want to separate known classes while keeping meaningful point embeddings.
Semi-Supervised UMAP
For partial labels, mark unlabeled points with -1 following scikit-learn convention:
# Create semi-supervised labels
semi_labels = labels.copy()
semi_labels[unlabeled_indices] = -1
# Fit with partial labels
embedding = umap.UMAP().fit_transform(data, y=semi_labels)
When to use: When labeling is expensive or you have more data than labels available.
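When experimenting with how many labels are needed, it can help to hide a fraction of known labels deliberately. The sketch below shows one way to do that; `mask_labels` and its `hide_fraction` and `seed` parameters are hypothetical names introduced for this illustration, and the label list is made up.

```python
import random


def mask_labels(labels, hide_fraction, seed=0):
    """Return a copy of labels with a random subset replaced by -1 (unlabeled)."""
    n_hide = round(hide_fraction * len(labels))
    hidden = set(random.Random(seed).sample(range(len(labels)), n_hide))
    return [-1 if i in hidden else label for i, label in enumerate(labels)]


labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
semi_labels = mask_labels(labels, hide_fraction=0.3)
print(semi_labels.count(-1), "of", len(semi_labels), "points are unlabeled")
```

The resulting list can be converted with `numpy.asarray` and passed as `y` in the same way as `semi_labels` above.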
Metric Learning with UMAP
Train a supervised embedding on labeled data, then apply it to new, unlabeled data:
# Train on labeled data
mapper = umap.UMAP().fit(train_data, train_labels)
# Transform unlabeled test data
test_embedding = mapper.transform(test_data)
# Use as feature engineering for downstream classifier
from sklearn.svm import SVC
clf = SVC().fit(mapper.embedding_, train_labels)
predictions = clf.predict(test_embedding)
When to use: For supervised feature engineering in machine learning pipelines.
UMAP for Clustering
UMAP is an effective preprocessing step for density-based clustering algorithms such as HDBSCAN, because reducing the dimension first mitigates the curse of dimensionality that degrades density estimates in high-dimensional spaces.
Best Practices for Clustering
Key principle: Configure UMAP differently for clustering than for visualization.
Recommended parameters:
- n_neighbors: Increase to ~30 (default 15 is too local and can create artificial fine-grained clusters)
- min_dist: Set to 0.0 (pack points densely within clusters for clearer boundaries)
- n_components: Use 5-10 dimensions (maintains performance while improving density preservation vs. 2D)
Clustering Workflow
Install HDBSCAN separately for density-based clustering:
uv pip install hdbscan
import umap
import hdbscan
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler
# 1. Preprocess data
scaled_data = StandardScaler().fit_transform(data)
# 2. UMAP with clustering-optimized parameters
reducer = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=10,  # Higher than 2 for better density preservation
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(scaled_data)
# 3. Apply HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    min_samples=5,
    metric='euclidean'
)
labels = clusterer.fit_predict(embedding)
# 4. Evaluate (true_labels: ground-truth labels, assumed to be available)
score = adjusted_rand_score(true_labels, labels)
print(f"Adjusted Rand Score: {score:.3f}")
# HDBSCAN marks noise points with the label -1, so exclude it from the cluster count
print(f"Number of clusters: {len(set(labels)) - (1 if -1 in labels else 0)}")
print(f"Noise poin