Datamol Cheminformatics Skill
Overview
Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native rdkit.Chem.Mol instances, ensuring full compatibility with the RDKit ecosystem.
Version note: Examples target datamol 0.12.x (PyPI stable: 0.12.5, June 2024). Since 0.10.0, modules are lazy-loaded by default (set DATAMOL_DISABLE_LAZY_LOADING=1 to disable). Since 0.12.2, RDKit is a direct PyPI dependency of datamol. Fingerprints use RDKit's rdFingerprintGenerator API (0.12.5+).
Key capabilities:
- Molecular format conversion (SMILES, SELFIES, InChI)
- Structure standardization and sanitization
- Molecular descriptors and fingerprints
- 3D conformer generation and analysis
- Clustering and diversity selection
- Scaffold and fragment analysis
- Chemical reaction application
- Visualization and alignment
- Batch processing with parallelization
- Cloud storage support via fsspec
Installation and Setup
Guide users to install datamol:
uv pip install datamol
RDKit is installed automatically with datamol. For remote file paths (S3, GCS, HTTP), install the matching fsspec backend:
uv pip install s3fs # AWS S3
uv pip install gcsfs # Google Cloud Storage
Import convention:
import datamol as dm
Core Workflows
1. Basic Molecule Handling
Creating molecules from SMILES:
import datamol as dm
# Single molecule
mol = dm.to_mol("CCO") # Ethanol
# From list of SMILES
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [dm.to_mol(smi) for smi in smiles_list]
# Error handling
mol = dm.to_mol("invalid_smiles") # Returns None
if mol is None:
    print("Failed to parse SMILES")
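When parsing many SMILES at once, it can help to keep the failed inputs next to the successful ones rather than dropping them silently. The following is a minimal sketch, assuming a parser that returns None on failure in the way dm.to_mol does. The names partition_parsed and fake_parse are hypothetical and exist only for this illustration:

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


def partition_parsed(
    smiles_list: Iterable[str],
    parse: Callable[[str], Optional[Any]],
) -> Tuple[List[Tuple[str, Any]], List[str]]:
    """Split inputs into (smiles, parsed object) pairs and a list of failures.

    `parse` is assumed to return None when parsing fails, like dm.to_mol.
    """
    parsed: List[Tuple[str, Any]] = []
    failed: List[str] = []
    for smi in smiles_list:
        mol = parse(smi)
        if mol is None:
            failed.append(smi)
        else:
            parsed.append((smi, mol))
    return parsed, failed


def fake_parse(smi: str) -> Optional[str]:
    """Stand-in parser (hypothetical) so the sketch runs without RDKit."""
    return None if smi == "invalid_smiles" else f"<mol {smi}>"


parsed, failed = partition_parsed(["CCO", "invalid_smiles", "c1ccccc1"], fake_parse)
print(f"{len(parsed)} parsed, failed inputs: {failed}")
```

In real use, dm.to_mol would be passed as the parse argument in place of the stand-in.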
Converting molecules to SMILES:
# Canonical SMILES
smiles = dm.to_smiles(mol)
# Isomeric SMILES (includes stereochemistry; isomeric=True is already the default)
smiles = dm.to_smiles(mol, isomeric=True)
# Other formats
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
Standardization and sanitization (always recommend for user-provided molecules):
# Sanitize molecule
mol = dm.sanitize_mol(mol)
# Full standardization (recommended for datasets)
mol = dm.standardize_mol(
    mol,
    disconnect_metals=True,
    normalize=True,
    reionize=True,
)
# For SMILES strings directly
clean_smiles = dm.standardize_smiles(smiles)
2. Reading and Writing Molecular Files
Refer to references/io_module.md for comprehensive I/O documentation.
Reading files:
# SDF files (most common in chemistry); as_df=True returns a DataFrame instead of a list of molecules
df = dm.read_sdf("compounds.sdf", as_df=True, mol_column='mol')
# SMILES files (returns a list of molecules)
mols = dm.read_smi("molecules.smi")
# CSV with SMILES column
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
# Excel files
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
# Universal reader/writer (auto-detects format; supports compression)
df = dm.open_df("file.sdf") # .sdf, .csv, .xlsx, .parquet, .json, .gz, etc.
dm.save_df(df, "output.parquet")
Writing files:
# Save as SDF
dm.to_sdf(mols, "output.sdf")
# Or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")
# Save as SMILES file
dm.to_smi(mols, "output.smi")
# Excel with rendered molecule images
dm.to_xlsx(df, "output.xlsx", mol_column="mol")
Remote file support (S3, GCS, HTTP via fsspec):
Only use cloud paths when the user explicitly requests them. Confirm the destination before writing.
# Read from cloud storage or HTTPS (user-provided URLs only)
mols = dm.read_sdf("s3://bucket/compounds.sdf") # list of molecules; pass as_df=True for a DataFrame
df = dm.read_csv("https://example.com/data.csv")
# Write to cloud storage — confirm path with user first
dm.to_sdf(mols, "s3://bucket/output.sdf")
Cloud backends read credentials from the standard provider environment (for example AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, or GOOGLE_APPLICATION_CREDENTIALS). Datamol passes these to fsspec locally; it does not collect or transmit environment variables to third-party endpoints. Scope credential access to the named provider variables only.
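Before attempting a remote read or write, it can be useful to check that the expected variables are present without ever printing their values. The sketch below does this under stated assumptions: the variable names are the ones listed above, while the grouping into "aws" and "gcs" and the helper name credential_status are hypothetical choices for this example.

```python
import os
from typing import Dict, Mapping

# Variable names come from the paragraph above; the provider grouping is an assumption.
PROVIDER_VARS = {
    "aws": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"),
    "gcs": ("GOOGLE_APPLICATION_CREDENTIALS",),
}


def credential_status(provider: str, env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Report which of the named variables are set, never returning their values."""
    return {name: bool(env.get(name)) for name in PROVIDER_VARS[provider]}


# Demonstration with a placeholder environment rather than the real one.
demo_env = {"AWS_ACCESS_KEY_ID": "placeholder", "AWS_DEFAULT_REGION": "us-east-1"}
status = credential_status("aws", env=demo_env)
for name, is_set in status.items():
    print(f"{name}: {'set' if is_set else 'missing'}")
```

Only the listed names are looked up, which keeps the check scoped to the provider in question.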
3. Molecular Descriptors and Properties
Refer to references/descriptors_viz.md for detailed descriptor documentation.
Computing descriptors for a single molecule:
# Get standard descriptor set
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Returns a dict keyed by descriptor name, e.g. 'mw', 'clogp', 'tpsa',
# 'n_lipinski_hbd', 'n_lipinski_hba', 'n_rings', ...
Batch descriptor computation (recommended for datasets):
# Compute for all molecules in parallel
desc_df = dm.descriptors.batch_compute_many_descriptors(
    mols,
    n_jobs=-1, # Use all CPU cores
    progress=True # Show progress bar
)
Specific descriptors:
# Aromaticity
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
# Stereochemistry
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)
# Flexibility
n_rigid = dm.descriptors.n_rigid_bonds(mol)
Drug-likeness filtering (Lipinski's Rule of Five):
# Filter compounds
def is_druglike(mol):
    desc = dm.descriptors.compute_many_descriptors(mol)
    return (
        desc['mw'] <= 500 and
        desc['clogp'] <= 5 and
        desc['n_lipinski_hbd'] <= 5 and
        desc['n_lipinski_hba'] <= 10
    )
druglike_mols = [mol for mol in mols if is_druglike(mol)]
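The filter above rejects a compound on any single violation. The Rule of Five is commonly applied more leniently, tolerating at most one violation. A minimal sketch of that variant follows; it takes plain numbers rather than a descriptor dictionary, and the function names and example values are hypothetical:

```python
def lipinski_violations(mw: float, logp: float, hbd: int, hba: int) -> int:
    """Count how many of the four Rule of Five thresholds are exceeded."""
    return sum([mw > 500, logp > 5, hbd > 5, hba > 10])


def passes_ro5(mw: float, logp: float, hbd: int, hba: int, max_violations: int = 1) -> bool:
    """Return True when the number of violations does not exceed max_violations."""
    return lipinski_violations(mw, logp, hbd, hba) <= max_violations


# Hypothetical descriptor values, for illustration only.
one_violation = passes_ro5(mw=520.0, logp=3.2, hbd=2, hba=7)   # only mw exceeds its limit
two_violations = passes_ro5(mw=650.0, logp=6.1, hbd=2, hba=7)  # mw and logp exceed their limits
print(one_violation, two_violations)
```

Setting max_violations=0 reproduces the strict behaviour of the filter above.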
4. Molecular Fingerprints and Similarity
Generating fingerprints:
Datamol defaults to ECFP6 (radius=3, n_bits=2048). Pass radius=2 explicitly for ECFP4.
# ECFP4 (common in similarity screening)
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)
# Other fingerprint types
fp_maccs = dm.to_fp(mol, fp_type='maccs')
fp_topological = dm.to_fp(mol, fp_type='topological')
fp_atompair = dm.to_fp(mol, fp_type='atompair')
fp_rdkit = dm.to_fp(mol, fp_type='rdkit')
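The number in an ECFP name is the fingerprint diameter, which is twice the Morgan radius. That is why radius=2 corresponds to ECFP4 and radius=3 to ECFP6. The two small helpers below are hypothetical and not part of datamol; they only make the conversion explicit:

```python
def ecfp_name(radius: int) -> str:
    """Return the conventional ECFP name for a Morgan radius (diameter = 2 * radius)."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    return f"ECFP{2 * radius}"


def ecfp_radius(name: str) -> int:
    """Return the Morgan radius implied by a name such as 'ECFP4'."""
    prefix = "ECFP"
    if not name.upper().startswith(prefix):
        raise ValueError(f"not an ECFP name: {name!r}")
    diameter = int(name[len(prefix):])
    if diameter % 2:
        raise ValueError("ECFP diameters are even numbers")
    return diameter // 2


print(ecfp_name(2), ecfp_name(3))
print(ecfp_radius("ECFP4"), ecfp_radius("ECFP6"))
```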
Similarity calculations:
# Pairwise distances within a set
distance_matrix = dm.pdist(mols, n_jobs=-1)
# Distances between two sets
distances = dm.cdist(query_mols, library_mols, n_jobs=-1)
# dm.pdist returns a square matrix by default (squareform=True), so no scipy conversion is needed
dist_matrix = dm.pdist(mols)
# Lower distance = higher similarity (Tanimoto distance = 1 - Tanimoto similarity)
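To make the relationship between Tanimoto similarity and distance concrete, the sketch below computes both from sets of "on" bit indices and then looks up each item's nearest neighbour in the resulting square matrix. It is a pure-Python illustration rather than datamol's implementation. The bit sets are made up, and treating two empty fingerprints as identical is a convention chosen for this example.

```python
from typing import List, Set, Tuple


def tanimoto_similarity(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by the total number of distinct on-bits."""
    union = len(a | b)
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(a & b) / union


def distance_matrix(fps: List[Set[int]]) -> List[List[float]]:
    """Square matrix of Tanimoto distances (1 - similarity)."""
    return [[1.0 - tanimoto_similarity(a, b) for b in fps] for a in fps]


def nearest_neighbour(dist: List[List[float]], i: int) -> Tuple[int, float]:
    """Index and distance of the item closest to item i, excluding i itself."""
    j = min((k for k in range(len(dist)) if k != i), key=lambda k: dist[i][k])
    return j, dist[i][j]


# Made-up on-bit sets standing in for three fingerprints.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8}]
dist = distance_matrix(fps)
neighbour, d = nearest_neighbour(dist, 0)
print(f"closest to item 0: item {neighbour} at distance {d:.2f}")
```

The first two sets share three of five distinct bits, giving a similarity of 0.6 and a distance of 0.4. The third set shares no bits with either of the others and therefore sits at distance 1.0.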
5. Clustering and Diversity Selection
Refer to references/core_api.md for clustering details.
Butina clustering:
# Cluster molecules by structural similarity
# Returns a tuple: (clusters of molecule indices, clusters of molecules)
cluster_indices, cluster_mols = dm.cluster_mols(
    mols,
    cutoff=0.2, # Tanimoto distance threshold (0=identical, 1=completely different)
    n_jobs=-1 # Parallel processing
)
# Each entry of cluster_indices holds the indices of the molecules in that cluster
for i, cluster in enumerate(cluster_indices):
    print(f"Cluster {i}: {len(cluster)} molecules")
    members = cluster_mols[i] # same molecules as [mols[idx] for idx in cluster]
Important: Butina clustering builds a full pairwise distance matrix, so it is suitable for roughly 1,000 molecules but not for 10,000 or more.
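The scaling limit follows from simple arithmetic: n molecules have n(n-1)/2 distinct pairs, so the work and memory grow quadratically. The sketch below estimates the size of a condensed distance matrix, assuming 8 bytes per value. That byte size is an assumption for illustration, and the actual storage used by the underlying libraries may differ.

```python
def n_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n items."""
    return n * (n - 1) // 2


def condensed_megabytes(n: int, bytes_per_value: int = 8) -> float:
    """Approximate size in MB of a condensed distance matrix (assumes 8-byte floats)."""
    return n_pairs(n) * bytes_per_value / 1e6


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} molecules: {n_pairs(n):>13,} pairs, ~{condensed_megabytes(n):,.0f} MB")
```

Going from 1,000 to 10,000 molecules multiplies the number of pairs by about 100, from 499,500 to 49,995,000.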
Diversity selection:
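# Pick diverse subset (returns the picked indices and the picked molecules)
diverse_indices, diverse_mols = dm.pick_diverse(
    mols,
    npick=100 # Select 100 diverse molecules
)
# Pick cluster centroids (returns the picked indices and the picked molecules)
centroid_indices, centroids = dm.pick_centroids(
    mols,
    npick=50 # Select 50 representative molecules
)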
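Diversity picking is often done with a greedy MaxMin strategy: start from one item, then repeatedly add the item whose distance to the already-picked set is largest. The sketch below shows that idea on a small precomputed distance matrix. It illustrates the general technique and is not a copy of datamol's implementation; the function name maxmin_pick and the toy data are hypothetical.

```python
from typing import List, Sequence


def maxmin_pick(dist: Sequence[Sequence[float]], npick: int, first: int = 0) -> List[int]:
    """Greedy MaxMin selection over a square distance matrix."""
    n = len(dist)
    if not 0 < npick <= n:
        raise ValueError("npick must be between 1 and the number of items")
    picked = [first]
    # min_dist[j] is the distance from item j to its closest picked item.
    min_dist = list(dist[first])
    while len(picked) < npick:
        # Choose the unpicked item that is farthest from the picked set.
        nxt = max((j for j in range(n) if j not in picked), key=lambda j: min_dist[j])
        picked.append(nxt)
        min_dist = [min(min_dist[j], dist[nxt][j]) for j in range(n)]
    return picked


# Toy data: four points on a line, with distances as absolute differences.
positions = [0.0, 1.0, 5.0, 10.0]
dist = [[abs(a - b) for b in positions] for a in positions]
selection = maxmin_pick(dist, npick=3)
print(selection)
```

Starting from the first point, the sketch picks the farthest point next and then the one that best fills the remaining gap, leaving out the point that lies close to the starting one.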
6. Scaffold Analysis
Refer to references/fragments_scaffolds.md for complete scaffold documentation.
Extracting Murcko scaffolds:
# Get Bemis-Murcko scaffold (core structure)
scaffold = dm.to_scaffold_murcko(mol)
scaffold_smiles = dm.to_smiles(scaffold)
Scaffold-based analysis:
# Group compounds by scaffold
from coll