A/B Testing Framework
Design, run, and analyze conversion experiments with statistical rigor.
Install
git clone https://github.com/thatrebeccarae/claude-marketing.git && cp -r claude-marketing/skills/ab-testing-framework ~/.claude/skills/
Test Design Process
Step 1: Hypothesis
Template: If we [change X], then [metric Y] will [increase/decrease] by [Z%] because [reason].
Good hypothesis: "If we change the CTA from 'Get Started' to 'Start Free Trial', then signup rate will increase by 15% because it reduces uncertainty about cost."
Bad hypothesis: "If we change the button color, conversions will improve." (No reasoning, no expected magnitude.)
Step 2: Sample Size Calculation
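To determine how long to run a test, estimate the sample each variation needs, then divide by the daily traffic each variation receives:
Required sample per variation = 16 * (p * (1-p)) / (MDE^2)
Where:
p = baseline conversion rate (as decimal)
MDE = minimum detectable effect as an absolute difference in conversion rate (as decimal); for a relative lift, MDE = p * relative lift
This is a rule of thumb for roughly 80% power at a two-sided 5% significance level. The MDE columns in the table below are relative lifts, and the table's figures run about 1.6 times higher than the rule of thumb gives, so treat them as conservative planning numbers.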
| Baseline Rate | 10% Relative MDE | 20% Relative MDE | 30% Relative MDE |
|---|---|---|---|
| 1% | 253,414 | 63,354 | 28,157 |
| 3% | 82,369 | 20,592 | 9,152 |
| 5% | 48,640 | 12,160 | 5,404 |
| 10% | 23,040 | 5,760 | 2,560 |
| 20% | 10,240 | 2,560 | 1,138 |
Minimum test duration: 2 full business weeks (to capture day-of-week effects), even if sample size is reached sooner.
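As a minimal sketch of the calculation above, the rule of thumb and the two-week floor can be combined as follows. The function names, the `daily_visitors` parameter, and the assumption that MDE is supplied as a relative lift are illustrative choices, not part of the skill itself.

```python
import math


def sample_size_per_variation(baseline_rate: float, relative_mde: float) -> int:
    """Rule-of-thumb sample size per variation: 16 * p * (1 - p) / MDE^2.

    Assumes relative_mde is a relative lift (0.10 = 10%), which is
    converted to an absolute difference before applying the formula.
    """
    absolute_mde = baseline_rate * relative_mde
    n = 16 * baseline_rate * (1 - baseline_rate) / absolute_mde ** 2
    # Round first so floating-point noise does not push ceil() up by one.
    return math.ceil(round(n, 6))


def duration_days(n_per_variation: int, variations: int,
                  daily_visitors: int, min_days: int = 14) -> int:
    """Days needed to fill every variation, never less than min_days."""
    total_needed = n_per_variation * variations
    return max(math.ceil(total_needed / daily_visitors), min_days)


if __name__ == "__main__":
    n = sample_size_per_variation(0.10, 0.10)
    print(n, duration_days(n, variations=2, daily_visitors=2000))
```

With a 10% baseline, a 10% relative MDE, and a hypothetical 2,000 visitors per day, this gives 14,400 per variation and a 15-day test. The rule of thumb yields smaller numbers than the table, so use whichever figure the team has agreed to plan against.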
Step 3: Test Execution Rules
- Random assignment — visitors must be randomly assigned to control/variant
- No peeking — do not check results before reaching sample size
- No mid-test changes — do not modify variants during the test
- Even traffic split — 50/50 for A/B, even splits for multivariate
- Single variable — change only one thing per test (unless multivariate)
- Full duration — run for the pre-calculated duration, not until significance
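One common way to satisfy the random-assignment and even-split rules is deterministic hashing, so a returning visitor always sees the same variation. The sketch below assumes a stable `user_id` string is available; `assign_variant` and the experiment name are hypothetical, not part of any particular testing tool.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("control", "variant")) -> str:
    """Map a user to a variation with an approximately even split.

    Hashing the experiment name together with the user ID keeps
    assignments stable per user and independent across experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    print(assign_variant("user-123", "cta-copy-test"))
```

Passing three or more names in `variants` gives an even split across all of them, which matches the multivariate rule above.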
Step 4: Statistical Analysis
Frequentist Approach
Z-test for proportions:
Z = (p1 - p2) / sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
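Where:
p1, p2 = conversion rates of control and variant (x1/n1, x2/n2)
x1, x2 = number of conversions in control and variant
p_pooled = (x1 + x2) / (n1 + n2)
n1, n2 = sample sizes (visitors) of control and variant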
p-value interpretation:
- p < 0.05: Statistically significant at the 5% level (95% confidence level)
- p < 0.01: Highly significant at the 1% level (99% confidence level)
- p >= 0.05: Not significant — do not declare a winner
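The Z-test above can be sketched with the standard library alone. The function name and the example counts are illustrative; the two-sided p-value uses the normal approximation, which is reasonable only when each arm has plenty of conversions.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Return (z, two-sided p-value) for control (x1, n1) vs variant (x2, n2)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # No conversions (or all conversions) in both arms: no evidence either way.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 500/10,000 for control, 580/10,000 for variant.
    z, p = two_proportion_z_test(500, 10_000, 580, 10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

Because Z is defined as control minus variant, a negative Z means the variant converted better.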
Bayesian Approach
When to use Bayesian:
- Low traffic (small sample sizes)
- Need to make decisions faster
- Want probability of each variant being best (not just "significant or not")
Interpretation: "There is a 94% probability that Variant B is better than Control" vs frequentist "We reject the null hypothesis at 95% confidence."
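A probability statement like the one above can be approximated with a Beta-Binomial model and Monte Carlo sampling. This is a minimal sketch assuming a uniform Beta(1, 1) prior on each conversion rate; the function name, the prior, and the draw count are illustrative choices rather than a prescribed method.

```python
import random


def prob_variant_beats_control(x_control: int, n_control: int,
                               x_variant: int, n_variant: int,
                               draws: int = 20_000, seed: int = 0) -> float:
    """Estimate P(variant rate > control rate) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions).
        control = rng.betavariate(1 + x_control, 1 + n_control - x_control)
        variant = rng.betavariate(1 + x_variant, 1 + n_variant - x_variant)
        wins += variant > control
    return wins / draws


if __name__ == "__main__":
    # Hypothetical counts: 50/1,000 for control, 70/1,000 for variant.
    print(prob_variant_beats_control(50, 1000, 70, 1000))
```

The decision threshold (for example, shipping at 95% probability) is a business choice and should be fixed before the test starts, just like the frequentist sample size.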
Step 5: Decision Framework
| Result | Significance | Action |
|---|---|---|
| Variant wins | p < 0.05 | Implement variant |
| Control wins | p < 0.05 | Keep control, learn from failure |
| No difference | p >= 0.05 | Keep control, test something bigger |
| Variant ahead, not significant | p = 0.05-0.10 | Do not declare a winner; consider traffic, the test may need more time |
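The table can be encoded as a small helper so that every test is judged by the same rules. This sketch assumes the test has already run its full planned duration; `decide` is a hypothetical name, and the 0.05 and 0.10 thresholds are taken from the table.

```python
def decide(control_rate: float, variant_rate: float, p_value: float) -> str:
    """Map a finished test's result to the action in the decision table."""
    variant_ahead = variant_rate > control_rate
    if p_value < 0.05:
        if variant_ahead:
            return "Implement variant"
        return "Keep control, learn from failure"
    if p_value < 0.10 and variant_ahead:
        return "Inconclusive: consider traffic, may need more time"
    return "Keep control, test something bigger"


if __name__ == "__main__":
    print(decide(control_rate=0.050, variant_rate=0.058, p_value=0.012))
```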
Common Testing Pitfalls
- Peeking: repeatedly checking results and stopping at the first significant reading inflates the false positive rate well above the nominal 5% (26%+ with frequent checks)
- Stopping early: reaching significance is not the same as reaching the required sample size
- Testing too many variants: each variant needs the full sample size
- Ignoring segments: the overall winner may be a loser for key segments
- Too small an effect: testing for a 2% lift needs enormous sample sizes
- Not accounting for seasonality: run full weeks, avoid holidays
- Multiple metrics: the primary metric must be pre-declared; secondary metrics are directional only
- Survivorship bias: measuring only users who complete, not those who abandon
- Simpson's paradox: a variant that wins in each segment can lose in the aggregate (or the reverse) when the segment mix differs between arms
- Novelty effect: new designs can get a temporary lift; re-test after 2-4 weeks
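The peeking pitfall is easy to demonstrate with an A/A simulation, where both arms share the same true rate and every "significant" result is a false positive. The sketch below is illustrative only: the rate, sample size, look interval, and run count are arbitrary assumptions, and the exact rates depend on them.

```python
import math
import random


def is_significant(x1: int, n1: int, x2: int, n2: int, z_crit: float = 1.96) -> bool:
    """Two-sided pooled Z-test at the 5% level."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return False
    return abs((x1 / n1 - x2 / n2) / se) > z_crit


def simulate_peeking(rate: float = 0.10, n_per_arm: int = 1000,
                     look_every: int = 50, runs: int = 400, seed: int = 1):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    flagged_any_look = 0
    flagged_final = 0
    for _ in range(runs):
        x1 = x2 = 0
        flagged = False
        for n in range(1, n_per_arm + 1):
            x1 += rng.random() < rate
            x2 += rng.random() < rate
            # A peeker would stop and ship at the first significant look.
            if n % look_every == 0 and is_significant(x1, n, x2, n):
                flagged = True
        flagged_any_look += flagged
        flagged_final += is_significant(x1, n_per_arm, x2, n_per_arm)
    return flagged_any_look / runs, flagged_final / runs


if __name__ == "__main__":
    peeking, final_only = simulate_peeking()
    print(f"peeking: {peeking:.1%}, single final look: {final_only:.1%}")
```

The single-look rate should sit near 5%, while the peeking rate is several times higher. That gap is why the execution rules require a pre-calculated duration.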
What to Test (Prioritized by Impact)
High Impact
- Value proposition / headline
- CTA text and placement
- Pricing and offer structure
- Form length (fields removed)
- Page layout (single column vs multi)
- Social proof presence and placement
Medium Impact
- Hero media (video vs static image)
- Testimonial format (text vs video)
- Navigation presence on landing pages
- Trust badges and security signals
- Urgency elements (countdown, stock)
Low Impact (Usually Not Worth Testing)
- Button color (unless extreme contrast issue)
- Font changes
- Minor copy tweaks
- Icon styles
- Footer content
Integration with Other Skills
- cro-auditor — CRO audit generates test hypotheses; this skill designs the experiments
- google-analytics — GA4 for experiment data and segment analysis
- copywriting-frameworks — Generate variant copy using proven frameworks