When this skill is activated, always start your first response with the 🧢 emoji.
Data Science
A practitioner's guide for exploratory data analysis, statistical inference, and predictive modeling. Covers the full analytical workflow - from raw data to reproducible conclusions - with an emphasis on when to apply each technique, not just how. Designed for engineers and analysts who can code but need opinionated guidance on statistical rigor and common traps.
When to use this skill
Trigger this skill when the user:
- Loads a new dataset and wants to understand its structure and distributions
- Needs to clean, reshape, or impute missing data in a pandas DataFrame
- Runs a hypothesis test (t-test, chi-square, ANOVA, Mann-Whitney)
- Analyzes an A/B test or experiment result for statistical significance
- Builds a correlation matrix or investigates feature relationships
- Plots distributions, trends, or model diagnostics with matplotlib or seaborn
- Engineers features for a machine learning model
- Fits a linear or logistic regression and needs to interpret coefficients
- Calculates confidence intervals, p-values, or effect sizes
- Needs to choose the right statistical test for their data type
Do NOT trigger this skill for:
- Deep learning / neural network architecture (use an ML engineering skill)
- Data engineering pipelines, ETL, or streaming (use a data engineering skill)
Key principles
-
Visualize before modeling - Plot every variable before fitting anything. Distributions, outliers, and relationships invisible in summary statistics leap out in charts. A histogram takes 2 seconds; debugging a model trained on bad assumptions takes days.
-
Check your assumptions - Every statistical test has assumptions (normality, equal variance, independence). Violating them silently produces misleading results. Run the assumption check first, then choose the test.
-
Correlation is not causation - A strong correlation between X and Y might mean X causes Y, Y causes X, a third variable Z causes both, or pure coincidence. Never state causation from observational data without a causal framework.
-
Validate on holdout data - Any model evaluated on the same data it was trained on is measuring memorization, not learning. Always split before fitting; never peek at the test set to tune parameters.
-
Reproducible notebooks - Set random seeds (
np.random.seed,random_state), pin library versions, and document every data transformation in order. A result you cannot reproduce is not a result.
Core concepts
Distributions describe how values are spread: normal (bell curve), skewed, bimodal, uniform. Knowing the shape tells you which statistics are meaningful (mean vs. median) and which tests are valid.
Central Limit Theorem - the mean of a large enough sample is approximately normally distributed regardless of the population distribution. This is why t-tests work on non-normal data with n > 30.
p-values measure the probability of observing your data (or more extreme) if the null hypothesis were true. They do NOT measure the probability the null is true, the effect size, or practical significance. A p-value < 0.05 is a threshold, not a truth detector.
Confidence intervals give the range of plausible values for a parameter. A 95% CI means: if you repeated the experiment 100 times, ~95 intervals would contain the true value. Always report CIs alongside p-values - a significant result with a CI spanning near-zero means the effect is tiny.
Bias-variance tradeoff - underfitting (high bias) means the model is too simple to capture the signal; overfitting (high variance) means it captures noise too. Cross-validation is the primary tool for diagnosing which problem you have.
Common tasks
EDA workflow
Load data and profile it systematically before any analysis:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("data.csv")
# Shape, types, missing values
print(df.shape)
print(df.dtypes)
print(df.isnull().sum().sort_values(ascending=False))
# Numeric summary
print(df.describe())
# Categorical value counts
for col in df.select_dtypes("object"):
print(f"\n{col}:\n{df[col].value_counts().head(10)}")
# Distribution of each numeric feature
df.hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.show()
# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
df.select_dtypes("number").corr(),
annot=True, fmt=".2f", cmap="coolwarm", center=0
)
plt.show()
Always check
df.duplicated().sum()anddf.dtypes- columns that should be numeric but areobjecttype signal parsing issues or mixed data.
Data cleaning pipeline
Build a repeatable cleaning function rather than inline mutations:
def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy() # Never mutate the original
# 1. Standardize column names
df.columns = df.columns.str.lower().str.replace(r"\s+", "_", regex=True)
# 2. Drop duplicates
df = df.drop_duplicates()
# 3. Handle missing values
numeric_cols = df.select_dtypes("number").columns
categorical_cols = df.select_dtypes("object").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
df[categorical_cols] = df[categorical_cols].fillna("unknown")
# 4. Remove outliers (IQR method - only for numeric targets)
for col in numeric_cols:
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df[col] >= q1 - 1.5 * iqr) & (df[col] <= q3 + 1.5 * iqr)]
return df
The
df.copy()guard is critical. Pandas operations on slices can silently modify the original viaSettingWithCopyWarning. Always copy first.
Hypothesis testing
Choose the test based on data type and group count (see
references/statistical-tests.md), then check assumptions:
from scipy import stats
# Independent samples t-test (two groups, continuous outcome)
group_a = df[df["variant"] == "control"]["revenue"]
group_b = df[df["variant"] == "treatment"]["revenue"]
# Check normality (Shapiro-Wilk - only reliable for n < 5000)
_, p_norm_a = stats.shapiro(group_a.sample(min(len(group_a), 500)))
_, p_norm_b = stats.shapiro(group_b.sample(min(len(group_b), 500)))
print(f"Normality p-values: A={p_norm_a:.4f}, B={p_norm_b:.4f}")
# If p_norm < 0.05 on small samples, prefer Mann-Whitney U
if p_norm_a < 0.05 or p_norm_b < 0.05:
stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney U: stat={stat:.2f}, p={p_value:.4f}")
else:
stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-test: t={stat:.2f}, p={p_value:.4f}")
# Effect size (Cohen's d)
pooled_std = np.sqrt((group_a.std() ** 2 + group_b.std() ** 2) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_std
print(f"Cohen's d: {cohens_d:.3f}") # < 0.2 small, 0.5 medium, > 0.8 large
# Chi-square test for categorical outcomes
contingency = pd.crosstab(df["variant"], df["converted"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency)
print(f"Chi-square: chi2={chi2:.2f}, p={p_chi2:.4f}, dof={dof}")
A/B test analysis with sample size planning
Always calculate required sample size before running an experiment:
from statsmodels.stats.power import TTestIndPower, NormalIndPower
from statsmodels.stats.proportion import proportions_ztest
# Sample size for conversion rate test
# effect_size = (p2 - p1) / sqrt(p_pooled * (1 - p_pooled))
baseline_rate = 0.05 # current conversion
minimum_detectable = 0.01 # smallest change worth detecting
alpha = 0.05 # false positive rate
power = 0.80 # 1 - false negative rate
p1, p2 = baseline_rate, baseline_rate + minimum_detectable
p_pool = (p1 + p2) / 2
effect_size = (p2 - p1) / np.sqrt(p_pool * (1 - p_pool))
analysis = No