RDKit Cheminformatics Toolkit
Overview
RDKit is a comprehensive cheminformatics library providing Python APIs for molecular analysis and manipulation. This skill provides guidance for reading/writing molecular structures, calculating descriptors, fingerprinting, substructure searching, chemical reactions, 2D/3D coordinate generation, and molecular visualization. Use this skill for drug discovery, computational chemistry, and cheminformatics research tasks.
Core Capabilities
1. Molecular I/O and Creation
Reading Molecules:
Read molecular structures from various formats:
from rdkit import Chem
# From SMILES strings
mol = Chem.MolFromSmiles('Cc1ccccc1') # Returns Mol object or None
# From MOL files
mol = Chem.MolFromMolFile('path/to/file.mol')
# From MOL blocks (string data)
mol = Chem.MolFromMolBlock(mol_block_string)
# From InChI
mol = Chem.MolFromInchi('InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H')
Writing Molecules:
Convert molecules to text representations:
# To canonical SMILES
smiles = Chem.MolToSmiles(mol)
# To MOL block
mol_block = Chem.MolToMolBlock(mol)
# To InChI
inchi = Chem.MolToInchi(mol)
Batch Processing:
For processing multiple molecules, use Supplier/Writer objects:
# Read SDF files
suppl = Chem.SDMolSupplier('molecules.sdf')
for mol in suppl:
if mol is not None: # Check for parsing errors
# Process molecule
pass
# Read SMILES files
suppl = Chem.SmilesMolSupplier('molecules.smi', titleLine=False)
# For large files or compressed data
with gzip.open('molecules.sdf.gz') as f:
suppl = Chem.ForwardSDMolSupplier(f)
for mol in suppl:
# Process molecule
pass
# Multithreaded processing for large datasets
suppl = Chem.MultithreadedSDMolSupplier('molecules.sdf')
# Write molecules to SDF
writer = Chem.SDWriter('output.sdf')
for mol in molecules:
writer.write(mol)
writer.close()
Important Notes:
- All
MolFrom*functions returnNoneon failure with error messages - Always check for
Nonebefore processing molecules - Molecules are automatically sanitized on import (validates valence, perceives aromaticity)
2. Molecular Sanitization and Validation
RDKit automatically sanitizes molecules during parsing, executing 13 steps including valence checking, aromaticity perception, and chirality assignment.
Sanitization Control:
# Disable automatic sanitization
mol = Chem.MolFromSmiles('C1=CC=CC=C1', sanitize=False)
# Manual sanitization
Chem.SanitizeMol(mol)
# Detect problems before sanitization
problems = Chem.DetectChemistryProblems(mol)
for problem in problems:
print(problem.GetType(), problem.Message())
# Partial sanitization (skip specific steps)
from rdkit.Chem import rdMolStandardize
Chem.SanitizeMol(mol, sanitizeOps=Chem.SANITIZE_ALL ^ Chem.SANITIZE_PROPERTIES)
Common Sanitization Issues:
- Atoms with explicit valence exceeding maximum allowed will raise exceptions
- Invalid aromatic rings will cause kekulization errors
- Radical electrons may not be properly assigned without explicit specification
3. Molecular Analysis and Properties
Accessing Molecular Structure:
# Iterate atoms and bonds
for atom in mol.GetAtoms():
print(atom.GetSymbol(), atom.GetIdx(), atom.GetDegree())
for bond in mol.GetBonds():
print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())
# Ring information
ring_info = mol.GetRingInfo()
ring_info.NumRings()
ring_info.AtomRings() # Returns tuples of atom indices
# Check if atom is in ring
atom = mol.GetAtomWithIdx(0)
atom.IsInRing()
atom.IsInRingSize(6) # Check for 6-membered rings
# Find smallest set of smallest rings (SSSR)
from rdkit.Chem import GetSymmSSSR
rings = GetSymmSSSR(mol)
Stereochemistry:
# Find chiral centers
from rdkit.Chem import FindMolChiralCenters
chiral_centers = FindMolChiralCenters(mol, includeUnassigned=True)
# Returns list of (atom_idx, chirality) tuples
# Assign stereochemistry from 3D coordinates
from rdkit.Chem import AssignStereochemistryFrom3D
AssignStereochemistryFrom3D(mol)
# Check bond stereochemistry
bond = mol.GetBondWithIdx(0)
stereo = bond.GetStereo() # STEREONONE, STEREOZ, STEREOE, etc.
Fragment Analysis:
# Get disconnected fragments
frags = Chem.GetMolFrags(mol, asMols=True)
# Fragment on specific bonds
from rdkit.Chem import FragmentOnBonds
frag_mol = FragmentOnBonds(mol, [bond_idx1, bond_idx2])
# Count ring systems
from rdkit.Chem.Scaffolds import MurckoScaffold
scaffold = MurckoScaffold.GetScaffoldForMol(mol)
4. Molecular Descriptors and Properties
Basic Descriptors:
from rdkit.Chem import Descriptors
# Molecular weight
mw = Descriptors.MolWt(mol)
exact_mw = Descriptors.ExactMolWt(mol)
# LogP (lipophilicity)
logp = Descriptors.MolLogP(mol)
# Topological polar surface area
tpsa = Descriptors.TPSA(mol)
# Number of hydrogen bond donors/acceptors
hbd = Descriptors.NumHDonors(mol)
hba = Descriptors.NumHAcceptors(mol)
# Number of rotatable bonds
rot_bonds = Descriptors.NumRotatableBonds(mol)
# Number of aromatic rings
aromatic_rings = Descriptors.NumAromaticRings(mol)
Batch Descriptor Calculation:
# Calculate all descriptors at once
all_descriptors = Descriptors.CalcMolDescriptors(mol)
# Returns dictionary: {'MolWt': 180.16, 'MolLogP': 1.23, ...}
# Get list of available descriptor names
descriptor_names = [desc[0] for desc in Descriptors._descList]
Lipinski's Rule of Five:
# Check drug-likeness
mw = Descriptors.MolWt(mol) <= 500
logp = Descriptors.MolLogP(mol) <= 5
hbd = Descriptors.NumHDonors(mol) <= 5
hba = Descriptors.NumHAcceptors(mol) <= 10
is_drug_like = mw and logp and hbd and hba
5. Fingerprints and Molecular Similarity
Fingerprint Types:
from rdkit.Chem import AllChem, RDKFingerprint
from rdkit.Chem.AtomPairs import Pairs, Torsions
from rdkit.Chem import MACCSkeys
# RDKit topological fingerprint
fp = Chem.RDKFingerprint(mol)
# Morgan fingerprints (circular fingerprints, similar to ECFP)
fp = AllChem.GetMorganFingerprint(mol, radius=2)
fp_bits = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
# MACCS keys (166-bit structural key)
fp = MACCSkeys.GenMACCSKeys(mol)
# Atom pair fingerprints
fp = Pairs.GetAtomPairFingerprint(mol)
# Topological torsion fingerprints
fp = Torsions.GetTopologicalTorsionFingerprint(mol)
# Avalon fingerprints (if available)
from rdkit.Avalon import pyAvalonTools
fp = pyAvalonTools.GetAvalonFP(mol)
Similarity Calculation:
from rdkit import DataStructs
# Calculate Tanimoto similarity
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, radius=2)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, radius=2)
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
# Calculate similarity for multiple molecules
similarities = DataStructs.BulkTanimotoSimilarity(fp1, [fp2, fp3, fp4])
# Other similarity metrics
dice = DataStructs.DiceSimilarity(fp1, fp2)
cosine = DataStructs.CosineSimilarity(fp1, fp2)
Clustering and Diversity:
# Butina clustering based on fingerprint similarity
from rdkit.ML.Cluster import Butina
# Calculate distance matrix
dists = []
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]
for i in range(len(fps)):
sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
dists.extend([1-sim for sim in sims])
# Cluster with distance cutoff
clusters = Butina.ClusterData(dists, len(fps), distThresh=0.3, isDistData=True)
6. Substructure Searching and SMARTS
Basic Substructure Matching:
# Define query using SMARTS
query = Chem.MolFromSmarts('[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1') # Benzene ring
# Check if molecule contains substructure
has_match = mol.HasSubstructMatch(query)
# Get all matches (returns tuple of tuples with atom indices)
matches = mol.GetSubstructMatches(query)
# Get only first