PyTorch Geometric (PyG)
Overview
PyTorch Geometric is a library built on PyTorch for developing and training Graph Neural Networks (GNNs). Apply this skill for deep learning on graphs and irregular structures, including mini-batch processing, multi-GPU training, and geometric deep learning applications.
When to Use This Skill
This skill should be used when working with:
- Graph-based machine learning: Node classification, graph classification, link prediction
- Molecular property prediction: Drug discovery, chemical property prediction
- Social network analysis: Community detection, influence prediction
- Citation networks: Paper classification, recommendation systems
- 3D geometric data: Point clouds, meshes, molecular structures
- Heterogeneous graphs: Multi-type nodes and edges (e.g., knowledge graphs)
- Large-scale graph learning: Neighbor sampling, distributed training
Quick Start
Installation
uv pip install torch_geometric
For additional dependencies (sparse operations, clustering):
uv pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html
Basic Graph Creation
import torch
from torch_geometric.data import Data
# Create a simple graph with 3 nodes
edge_index = torch.tensor([[0, 1, 1, 2], # source nodes
[1, 0, 2, 1]], dtype=torch.long) # target nodes
x = torch.tensor([[-1], [0], [1]], dtype=torch.float) # node features
data = Data(x=x, edge_index=edge_index)
print(f"Nodes: {data.num_nodes}, Edges: {data.num_edges}")
Loading a Benchmark Dataset
from torch_geometric.datasets import Planetoid
# Load Cora citation network
dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0] # Get the first (and only) graph
print(f"Dataset: {dataset}")
print(f"Nodes: {data.num_nodes}, Edges: {data.num_edges}")
print(f"Features: {data.num_node_features}, Classes: {dataset.num_classes}")
Core Concepts
Data Structure
PyG represents graphs using the torch_geometric.data.Data class with these key attributes:
data.x: Node feature matrix[num_nodes, num_node_features]data.edge_index: Graph connectivity in COO format[2, num_edges]data.edge_attr: Edge feature matrix[num_edges, num_edge_features](optional)data.y: Target labels for nodes or graphsdata.pos: Node spatial positions[num_nodes, num_dimensions](optional)- Custom attributes: Can add any attribute (e.g.,
data.train_mask,data.batch)
Important: These attributes are not mandatory—extend Data objects with custom attributes as needed.
Edge Index Format
Edges are stored in COO (coordinate) format as a [2, num_edges] tensor:
- First row: source node indices
- Second row: target node indices
# Edge list: (0→1), (1→0), (1→2), (2→1)
edge_index = torch.tensor([[0, 1, 1, 2],
[1, 0, 2, 1]], dtype=torch.long)
Mini-Batch Processing
PyG handles batching by creating block-diagonal adjacency matrices, concatenating multiple graphs into one large disconnected graph:
- Adjacency matrices are stacked diagonally
- Node features are concatenated along the node dimension
- A
batchvector maps each node to its source graph - No padding needed—computationally efficient
from torch_geometric.loader import DataLoader
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
print(f"Batch size: {batch.num_graphs}")
print(f"Total nodes: {batch.num_nodes}")
# batch.batch maps nodes to graphs
Building Graph Neural Networks
Message Passing Paradigm
GNNs in PyG follow a neighborhood aggregation scheme:
- Transform node features
- Propagate messages along edges
- Aggregate messages from neighbors
- Update node representations
Using Pre-Built Layers
PyG provides 40+ convolutional layers. Common ones include:
GCNConv (Graph Convolutional Network):
from torch_geometric.nn import GCNConv
import torch.nn.functional as F
class GCN(torch.nn.Module):
def __init__(self, num_features, num_classes):
super().__init__()
self.conv1 = GCNConv(num_features, 16)
self.conv2 = GCNConv(16, num_classes)
def forward(self, data):
x, edge_index = data.x, data.edge_index
x = self.conv1(x, edge_index)
x = F.relu(x)
x = F.dropout(x, training=self.training)
x = self.conv2(x, edge_index)
return F.log_softmax(x, dim=1)
GATConv (Graph Attention Network):
from torch_geometric.nn import GATConv
class GAT(torch.nn.Module):
def __init__(self, num_features, num_classes):
super().__init__()
self.conv1 = GATConv(num_features, 8, heads=8, dropout=0.6)
self.conv2 = GATConv(8 * 8, num_classes, heads=1, concat=False, dropout=0.6)
def forward(self, data):
x, edge_index = data.x, data.edge_index
x = F.dropout(x, p=0.6, training=self.training)
x = F.elu(self.conv1(x, edge_index))
x = F.dropout(x, p=0.6, training=self.training)
x = self.conv2(x, edge_index)
return F.log_softmax(x, dim=1)
GraphSAGE:
from torch_geometric.nn import SAGEConv
class GraphSAGE(torch.nn.Module):
def __init__(self, num_features, num_classes):
super().__init__()
self.conv1 = SAGEConv(num_features, 64)
self.conv2 = SAGEConv(64, num_classes)
def forward(self, data):
x, edge_index = data.x, data.edge_index
x = self.conv1(x, edge_index)
x = F.relu(x)
x = F.dropout(x, training=self.training)
x = self.conv2(x, edge_index)
return F.log_softmax(x, dim=1)
Custom Message Passing Layers
For custom layers, inherit from MessagePassing:
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree
class CustomConv(MessagePassing):
def __init__(self, in_channels, out_channels):
super().__init__(aggr='add') # "add", "mean", or "max"
self.lin = torch.nn.Linear(in_channels, out_channels)
def forward(self, x, edge_index):
# Add self-loops to adjacency matrix
edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
# Transform node features
x = self.lin(x)
# Compute normalization
row, col = edge_index
deg = degree(col, x.size(0), dtype=x.dtype)
deg_inv_sqrt = deg.pow(-0.5)
norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
# Propagate messages
return self.propagate(edge_index, x=x, norm=norm)
def message(self, x_j, norm):
# x_j: features of source nodes
return norm.view(-1, 1) * x_j
Key methods:
forward(): Main entry pointmessage(): Constructs messages from source to target nodesaggregate(): Aggregates messages (usually don't override—setaggrparameter)update(): Updates node embeddings after aggregation
Variable naming convention: Appending _i or _j to tensor names automatically maps them to target or source nodes.
Working with Datasets
Loading Built-in Datasets
PyG provides extensive benchmark datasets:
# Citation networks (node classification)
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='/tmp/Cora', name='Cora') # or 'CiteSeer', 'PubMed'
# Graph classification
from torch_geometric.datasets import TUDataset
dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')
# Molecular datasets
from torch_geometric.datasets import QM9
dataset = QM9(root='/tmp/QM9')
# Large-scale datasets
from torch_geometric.datasets import Reddit
dataset = Reddit(root='/tmp/Reddit')
Check references/datasets_reference.md for a comprehensive list.
Creating Custom Datasets
For datasets that fit in memory, inherit