UMAP-Learn

Overview

UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique used both for visualization and as a general-purpose preprocessing step. Apply this skill when you need fast, scalable embeddings that preserve local and global structure, supervised or semi-supervised dimension reduction, or preprocessing for clustering.

Quick Start

Installation

Requires Python 3.9+. Pin to a verified release:

uv pip install umap-learn==0.5.12

Basic Usage

UMAP follows the scikit-learn estimator API (fit, transform, fit_transform), so it can replace scikit-learn's TSNE or PCA in most existing code with minimal changes.

import umap
from sklearn.preprocessing import StandardScaler

# Prepare data (standardization is essential)
scaled_data = StandardScaler().fit_transform(data)

# Method 1: Single step (fit and transform)
embedding = umap.UMAP().fit_transform(scaled_data)

# Method 2: Separate steps (for reusing trained model)
reducer = umap.UMAP(random_state=42)
reducer.fit(scaled_data)
embedding = reducer.embedding_  # Access the trained embedding

Critical preprocessing requirement: Standardize features to comparable scales before applying UMAP. With distance metrics such as euclidean, features with larger numeric ranges would otherwise dominate the embedding.
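To see why scaling matters for distance-based methods, the following minimal sketch uses two made-up features (height_m and income are hypothetical, not taken from any dataset in this guide) and measures how much each one contributes to Euclidean distance before and after standardization. It uses plain NumPy and applies the same zero-mean, unit-variance transformation that StandardScaler does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features measured on very different scales.
height_m = rng.normal(loc=1.7, scale=0.1, size=500)
income = rng.normal(loc=50_000, scale=15_000, size=500)
X = np.column_stack([height_m, income])


def variance_share(features):
    """Fraction of the total variance contributed by each column.

    The expected squared Euclidean distance between two independent rows
    is twice the sum of the column variances, so this is also each
    column's share of that distance.
    """
    variances = features.var(axis=0)
    return variances / variances.sum()


raw_share = variance_share(X)

# Zero mean and unit variance per column, as StandardScaler would produce.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_share = variance_share(X_scaled)

print("raw share:   ", raw_share)
print("scaled share:", scaled_share)
```

On the raw data the income column accounts for almost all of the distance. After scaling, both columns contribute equally.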

Typical Workflow

import umap
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# 1. Preprocess data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(raw_data)

# 2. Create and fit UMAP
reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(scaled_data)

# 3. Visualize
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Embedding')
plt.show()

Parameter Tuning Guide

UMAP has four primary parameters that control the embedding behavior. Understanding these is crucial for effective usage.

n_neighbors (default: 15)

Purpose: Balances local versus global structure in the embedding.

How it works: Controls the size of the local neighborhood UMAP examines when learning manifold structure.

Effects by value:

Low values (2-5): Emphasizes fine local detail but may fragment data into disconnected components
Medium values (15-20): Balanced view of both local structure and global relationships (recommended starting point)
High values (50-200): Prioritizes broad topological structure at the expense of fine-grained details

Recommendation: Start with 15 and adjust based on results. Increase for more global structure, decrease for more local detail.

min_dist (default: 0.1)

Purpose: Controls how tightly points cluster in the low-dimensional space.

How it works: Sets the minimum distance allowed between points in the low-dimensional representation.

Effects by value:

Low values (0.0-0.1): Creates clumped embeddings useful for clustering; reveals fine topological details
High values (0.5-0.99): Prevents tight packing; emphasizes broad topological preservation over local structure

Recommendation: Use 0.0 for clustering applications, 0.1-0.3 for visualization, 0.5+ for loose structure.

n_components (default: 2)

Purpose: Determines the dimensionality of the embedded output space.

Key feature: Unlike t-SNE, UMAP scales well in the embedding dimension, enabling use beyond visualization.

Common uses:

2-3 dimensions: Visualization
5-10 dimensions: Clustering preprocessing (better preserves density than 2D)
10-50 dimensions: Feature engineering for downstream ML models

Recommendation: Use 2 for visualization, 5-10 for clustering, higher for ML pipelines.

metric (default: 'euclidean')

Purpose: Specifies how distance is calculated between input data points.

Supported metrics:

Minkowski variants: euclidean, manhattan, chebyshev
Spatial metrics: canberra, braycurtis, haversine
Correlation metrics: cosine, correlation (good for text/document embeddings)
Binary data metrics: hamming, jaccard, dice, russellrao, kulsinski, rogerstanimoto, sokalmichener, sokalsneath, yule
Custom metrics: User-defined distance functions via Numba

Recommendation: Use euclidean for numeric data, cosine for text/document vectors, hamming for binary data.

Parameter Tuning Example

import umap

# General-purpose visualization (these are the library defaults)
viz_reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean')

# For clustering preprocessing
cluster_reducer = umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=10, metric='euclidean')

# For document embeddings
doc_reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='cosine')

# For preserving global structure
global_reducer = umap.UMAP(n_neighbors=100, min_dist=0.5, n_components=2, metric='euclidean')

Supervised and Semi-Supervised Dimension Reduction

UMAP supports incorporating label information to guide the embedding process, enabling class separation while preserving internal structure.

Supervised UMAP

Pass target labels via the y parameter when fitting:

# Supervised dimension reduction
embedding = umap.UMAP().fit_transform(data, y=labels)

Key benefits:

Achieves cleanly separated classes
Preserves internal structure within each class
Maintains global relationships between classes

When to use: When you have labeled data and want to separate known classes while keeping meaningful point embeddings.

Semi-Supervised UMAP

For partial labels, mark unlabeled points with -1 following scikit-learn convention:

# Create semi-supervised labels
semi_labels = labels.copy()
semi_labels[unlabeled_indices] = -1

# Fit with partial labels
embedding = umap.UMAP().fit_transform(data, y=semi_labels)

When to use: When labeling is expensive or you have more data than labels available.

Metric Learning with UMAP

Train a supervised embedding on labeled data, then apply to new unlabeled data:

# Train on labeled data
mapper = umap.UMAP().fit(train_data, train_labels)

# Transform unlabeled test data
test_embedding = mapper.transform(test_data)

# Use as feature engineering for downstream classifier
from sklearn.svm import SVC
clf = SVC().fit(mapper.embedding_, train_labels)
predictions = clf.predict(test_embedding)

When to use: For supervised feature engineering in machine learning pipelines.

UMAP for Clustering

UMAP is an effective preprocessing step for density-based clustering algorithms like HDBSCAN, which tend to perform poorly on high-dimensional data.

Best Practices for Clustering

Key principle: Configure UMAP differently for clustering than for visualization.

Recommended parameters:

n_neighbors: Increase to ~30 (the default of 15 can be too local and create artificial fine-grained clusters)
min_dist: Set to 0.0 (pack points densely within clusters for clearer boundaries)
n_components: Use 5-10 dimensions (maintains performance while improving density preservation vs. 2D)

Clustering Workflow

Install HDBSCAN separately for density-based clustering:

uv pip install hdbscan

import umap
import hdbscan
from sklearn.preprocessing import StandardScaler

# 1. Preprocess data
scaled_data = StandardScaler().fit_transform(data)

# 2. UMAP with clustering-optimized parameters
reducer = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=10,  # Higher than 2 for better density preservation
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(scaled_data)

# 3. Apply HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    min_samples=5,
    metric='euclidean'
)
labels = clusterer.fit_predict(embedding)

# 4. Evaluate
from sklearn.metrics import adjusted_rand_score
score = adjusted_rand_score(true_labels, labels)
print(f"Adjusted Rand Score: {score:.3f}")
print(f"Number of clusters: {len(set(labels)) - (1 if -1 in labels else 0)}")
print(f"Noise points: {(labels == -1).sum()}")
