RDKit Cheminformatics Toolkit
Overview
RDKit is a comprehensive cheminformatics library providing Python APIs for molecular analysis and manipulation. This skill provides guidance for reading/writing molecular structures, calculating descriptors, fingerprinting, substructure searching, chemical reactions, 2D/3D coordinate generation, and molecular visualization. Use this skill for drug discovery, computational chemistry, and cheminformatics research tasks.
Core Capabilities
1. Molecular I/O and Creation
Reading Molecules:
Read molecular structures from various formats:
from rdkit import Chem
# From SMILES strings
mol = Chem.MolFromSmiles('Cc1ccccc1') # Returns Mol object or None
# From MOL files
mol = Chem.MolFromMolFile('path/to/file.mol')
# From MOL blocks (string data)
mol = Chem.MolFromMolBlock(mol_block_string)
# From InChI
mol = Chem.MolFromInchi('InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H')
Writing Molecules:
Convert molecules to text representations:
# To canonical SMILES
smiles = Chem.MolToSmiles(mol)
# To MOL block
mol_block = Chem.MolToMolBlock(mol)
# To InChI
inchi = Chem.MolToInchi(mol)
Batch Processing:
For processing multiple molecules, use Supplier/Writer objects:
# Read SDF files
suppl = Chem.SDMolSupplier('molecules.sdf')
for mol in suppl:
if mol is not None: # Check for parsing errors
# Process molecule
pass
# Read SMILES files
suppl = Chem.SmilesMolSupplier('molecules.smi', titleLine=False)
# For large files or compressed data
import gzip
with gzip.open('molecules.sdf.gz') as f:
suppl = Chem.ForwardSDMolSupplier(f)
for mol in suppl:
# Process molecule
pass
# Multithreaded processing for large datasets
suppl = Chem.MultithreadedSDMolSupplier('molecules.sdf')
# Write molecules to SDF
writer = Chem.SDWriter('output.sdf')
for mol in molecules:
writer.write(mol)
writer.close()
Important Notes:
- All
MolFrom*functions returnNoneon failure with error messages - Always check for
Nonebefore processing molecules - Molecules are automatically sanitized on import (validates valence, perceives aromaticity)
2. Molecular Sanitization and Validation
RDKit automatically sanitizes molecules during parsing, executing 13 steps including valence checking, aromaticity perception, and chirality assignment.
Sanitization Control:
# Disable automatic sanitization
mol = Chem.MolFromSmiles('C1=CC=CC=C1', sanitize=False)
# Manual sanitization
Chem.SanitizeMol(mol)
# Detect problems before sanitization
problems = Chem.DetectChemistryProblems(mol)
for problem in problems:
print(problem.GetType(), problem.Message())
# Partial sanitization (skip specific steps)
from rdkit.Chem import rdMolStandardize
Chem.SanitizeMol(mol, sanitizeOps=Chem.SANITIZE_ALL ^ Chem.SANITIZE_PROPERTIES)
Common Sanitization Issues:
- Atoms with explicit valence exceeding maximum allowed will raise exceptions
- Invalid aromatic rings will cause kekulization errors
- Radical electrons may not be properly assigned without explicit specification
3. Molecular Analysis and Properties
Accessing Molecular Structure:
# Iterate atoms and bonds
for atom in mol.GetAtoms():
print(atom.GetSymbol(), atom.GetIdx(), atom.GetDegree())
for bond in mol.GetBonds():
print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())
# Ring information
ring_info = mol.GetRingInfo()
ring_info.NumRings()
ring_info.AtomRings() # Returns tuples of atom indices
# Check if atom is in ring
atom = mol.GetAtomWithIdx(0)
atom.IsInRing()
atom.IsInRingSize(6) # Check for 6-membered rings
# Find smallest set of smallest rings (SSSR)
from rdkit.Chem import GetSymmSSSR
rings = GetSymmSSSR(mol)
Stereochemistry:
# Find chiral centers
from rdkit.Chem import FindMolChiralCenters
chiral_centers = FindMolChiralCenters(mol, includeUnassigned=True)
# Returns list of (atom_idx, chirality) tuples
# Assign stereochemistry from 3D coordinates
from rdkit.Chem import AssignStereochemistryFrom3D
AssignStereochemistryFrom3D(mol)
# Check bond stereochemistry
bond = mol.GetBondWithIdx(0)
stereo = bond.GetStereo() # STEREONONE, STEREOZ, STEREOE, etc.
Fragment Analysis:
# Get disconnected fragments
frags = Chem.GetMolFrags(mol, asMols=True)
# Fragment on specific bonds
from rdkit.Chem import FragmentOnBonds
frag_mol = FragmentOnBonds(mol, [bond_idx1, bond_idx2])
# Count ring systems
from rdkit.Chem.Scaffolds import MurckoScaffold
scaffold = MurckoScaffold.GetScaffoldForMol(mol)
4. Molecular Descriptors and Properties
Basic Descriptors:
from rdkit.Chem import Descriptors
# Molecular weight
mw = Descriptors.MolWt(mol)
exact_mw = Descriptors.ExactMolWt(mol)
# LogP (lipophilicity)
logp = Descriptors.MolLogP(mol)
# Topological polar surface area
tpsa = Descriptors.TPSA(mol)
# Number of hydrogen bond donors/acceptors
hbd = Descriptors.NumHDonors(mol)
hba = Descriptors.NumHAcceptors(mol)
# Number of rotatable bonds
rot_bonds = Descriptors.NumRotatableBonds(mol)
# Number of aromatic rings
aromatic_rings = Descriptors.NumAromaticRings(mol)
Batch Descriptor Calculation:
# Calculate all descriptors at once
all_descriptors = Descriptors.CalcMolDescriptors(mol)
# Returns dictionary: {'MolWt': 180.16, 'MolLogP': 1.23, ...}
# Get list of available descriptor names
descriptor_names = [desc[0] for desc in Descriptors._descList]
Lipinski's Rule of Five:
# Check drug-likeness
mw = Descriptors.MolWt(mol) <= 500
logp = Descriptors.MolLogP(mol) <= 5
hbd = Descriptors.NumHDonors(mol) <= 5
hba = Descriptors.NumHAcceptors(mol) <= 10
is_drug_like = mw and logp and hbd and hba
5. Fingerprints and Molecular Similarity
Fingerprint Types:
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import MACCSkeys
# RDKit topological fingerprint
rdk_gen = rdFingerprintGenerator.GetRDKitFPGenerator(minPath=1, maxPath=7, fpSize=2048)
fp = rdk_gen.GetFingerprint(mol)
# Morgan fingerprints (circular fingerprints, similar to ECFP)
# Modern API using rdFingerprintGenerator
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fp = morgan_gen.GetFingerprint(mol)
# Count-based fingerprint
fp_count = morgan_gen.GetCountFingerprint(mol)
# MACCS keys (166-bit structural key)
fp = MACCSkeys.GenMACCSKeys(mol)
# Atom pair fingerprints
ap_gen = rdFingerprintGenerator.GetAtomPairGenerator()
fp = ap_gen.GetFingerprint(mol)
# Topological torsion fingerprints
tt_gen = rdFingerprintGenerator.GetTopologicalTorsionGenerator()
fp = tt_gen.GetFingerprint(mol)
# Avalon fingerprints (if available)
from rdkit.Avalon import pyAvalonTools
fp = pyAvalonTools.GetAvalonFP(mol)
Similarity Calculation:
from rdkit import DataStructs
from rdkit.Chem import rdFingerprintGenerator
# Generate fingerprints using generator
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fp1 = mfpgen.GetFingerprint(mol1)
fp2 = mfpgen.GetFingerprint(mol2)
# Calculate Tanimoto similarity
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
# Calculate similarity for multiple molecules
fps = [mfpgen.GetFingerprint(m) for m in [mol2, mol3, mol4]]
similarities = DataStructs.BulkTanimotoSimilarity(fp1, fps)
# Other similarity metrics
dice = DataStructs.DiceSimilarity(fp1, fp2)
cosine = DataStructs.CosineSimilarity(fp1, fp2)
Clustering and Diversity:
# Butina clustering based on fingerprint similarity
from rdkit.ML.Cluster import Butina
# Calculate distance matrix
dists = []
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [mfpgen.GetFingerprint(mol) for mol in mols]
for i in range(len(fps)):
sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
dists.extend([1-sim for sim in sims])
# Cluster with distance cutoff
clusters = Butina.