Data Warehouse Experimentation
A senior data scientist's playbook for running experiments natively out of BigQuery, Snowflake, or any modern data warehouse, with metric definitions in dbt and statistical analysis in SQL or Python.
Most companies that run experiments at scale use a dedicated platform. Statsig, Optimizely, LaunchDarkly with experimentation, PostHog, Amplitude Experiment. The platforms are good. They handle assignment, instrumentation, and analysis in one product, and the SQL-savvy data team does not have to reinvent the variance reduction wheel.
There is a different operational model that mature data teams increasingly choose: warehouse-native experimentation. Assignment happens in code or via feature flags. Exposure events fire to the warehouse like any other event. Metrics are defined as dbt models. Statistical analysis runs as SQL or in a Python notebook against warehouse data. The "experiment platform" is just your existing data stack.
This skill covers when warehouse-native is the right call, the architecture, and the specific techniques that make it work: assignment patterns, exposure logging discipline, metric definitions in dbt, t-tests and CUPED in SQL, sequential testing, and the pitfalls that take down homegrown setups.
When to use this skill: deciding between platform vs warehouse-native, building a warehouse-native experiment infrastructure, auditing an existing one, or running a specific experiment when the platform of record cannot handle a custom metric or segmentation.
What this skill is for
This skill spans the operational execution model for warehouse-native experimentation. It does not replace the methodology and interpretation skills; it composes with them.
experiment-designcovers methodology: hypotheses, sample size, randomization unit, primary metric. Tool-agnostic. Read it first to design the experiment correctly regardless of where it runs.experimentation-analyticscovers interpretation: confidence intervals, p-values, effect size, decision frameworks. Tool-agnostic. Read it when results land.experimentation-platform-orchestratorcovers the platform-vs-warehouse decision in detail. Read it to decide whether to use a platform or this skill.feature-flaggingcovers assignment infrastructure when not running through a platform. Read it for the flag-management discipline that this skill assumes.- This skill (
data-warehouse-experimentation) covers the operational execution: SQL-based assignment, exposure logging, metric definitions in dbt, statistical analysis in SQL or Python, variance reduction, sequential testing.
The distinction is between "what to do" (the methodology and interpretation skills) and "how to do it without a vendor platform" (this skill). Read this skill after you have decided warehouse-native is the right call. If you are still deciding, start with experimentation-platform-orchestrator.
When warehouse-native is the right call
Six factors push the decision toward warehouse-native.
- Cost at volume. Platforms charge per MAU or per event. At 10K MAU the platform is cheap; at 1M MAU the bill becomes a real budget item. Warehouse-native runs on infrastructure you already pay for.
- Custom metrics. If your primary metric is a complex business metric (revenue with refund-aware logic, cohort LTV, retention bracket, multi-event composites), platforms can struggle. Warehouse-native expresses any metric you can write in SQL.
- Custom segmentation. Enterprise customers, account-tier crosscuts, complex behavioral segments. Platforms have segmentation features; the depth varies. dbt models compose without limit.
- Trust requirements. Regulated industries (healthcare, finance, government) need full transparency into the math. Warehouse-native gives you every step of the calculation auditable in SQL.
- Existing data team strength. If you have data engineers and data scientists, you have most of the infrastructure. Adding experimentation discipline on top costs less than adopting a new platform.
- Iteration on metric definitions. Platforms ship metric updates on their own cadence. Warehouse-native iterates as fast as your dbt deployments.
Five factors push toward platform.
- Frontend visual experiments. Optimizely's bread and butter. Variant code injected via a script tag, with WYSIWYG editing.
- Sub-week iteration speed. Some platforms set up an experiment in 30 minutes; warehouse-native often takes a day or more for the first run of a new metric pattern.
- Teams without strong data infrastructure. If you do not have a warehouse, dbt, and analysts, do not start with warehouse-native. The platform is the right call.
- Mobile experimentation. SDK-based assignment with offline support is the platform's job, not the warehouse's.
- Out-of-the-box sequential testing with strict guarantees. Statsig and Eppo ship mSPRT with calibrated alpha-spending. Building this in-house is real work.
Detail and a decision tree in references/warehouse-vs-platform-decision.md. Many mature teams use both; warehouse-native for the hard cases, platform for fast iteration on standard experiments.
The architecture
Four components, in order of data flow.
- Assignment. How users get bucketed into variants. Hash function, feature flag, or randomized assignment table.
- Exposure logging. A discrete event fired the first time a user is exposed to the experiment, written to the warehouse like any other event.
- Metric definitions. SQL queries (or dbt models) that compute the primary and secondary metrics from warehouse events.
- Analysis. Statistical computation in SQL or Python that joins exposure to metrics and produces effect estimates with confidence intervals.
The flow. User visits the product. Assignment determines the bucket (control or treatment). If the user is exposed to the variant (sees the treatment-specific behavior), an exposure event fires to the warehouse. The user takes actions, generating metric events to the same warehouse. At analysis time, exposure joins to metrics on the assignment unit (typically user_id); the analysis computes lift and produces a decision.
The exposure-event pattern is critical. Without it you can compute only an "intent-to-treat" analysis (everyone assigned, regardless of whether they saw the variant). With it you compute the "exposed" analysis on the population that actually experienced the variant. The latter is usually what you want, especially when the variant only affects a subset of the assigned users (e.g., users who reached a specific page).
Assignment patterns
Three approaches.
Deterministic hash assignment. The default for warehouse-native.
MOD(ABS(FARM_FINGERPRINT(CONCAT(user_id, 'exp_button_color_v1'))), 100) < 50
The salt ('exp_button_color_v1') ensures different experiments produce uncorrelated assignments for the same user. Reproducible (same input always produces the same bucket), no service dependency, salt isolation across experiments. The assignment can be computed inline in any SQL query.
Feature flag assignment. Rely on a feature flag service (LaunchDarkly, Statsig flags, Unleash, internal) to do bucketing; the warehouse just records the assignment that the flag service chose.
-- Read assignment from the flag service's logs
SELECT user_id, variant_id, assigned_at
FROM flag_service.assignments
WHERE flag_key = 'exp_button_color_v1'
This works when the flag service is the source of truth for assignment and the warehouse mirrors the assignment table. Useful when assignment must respect flag-service rules (e.g., percentage rollouts, targeting rules) that are inconvenient to replicate in SQL.
Randomized assignment table. Pre-randomize users into a table at experiment start.
CREATE TABLE exp_button_color_v1_assignments AS
SELECT
use