Experimentation Platform Orchestrator
A senior product and engineering leader's playbook for making the experimentation platform decision and recovering when you got it wrong.
Picking an experimentation platform is one of those decisions that looks easy at the start and compounds for years afterward. The wrong choice costs you in lost experiments (because the team avoids the painful workflow), in cost (because the wrong pricing model penalizes your usage shape), in vendor lock-in (because migration is real engineering work, not a config change), and in cultural drift (because the platform's defaults shape what your team thinks experimentation is).
This skill is the discipline for making the decision well the first time, and the migration plan for when you didn't.
When to use this skill: choosing a platform from scratch, evaluating whether to switch, deciding whether to consolidate from multi-platform to single, or planning a migration that has already been approved.
What this skill is for
This skill spans platform selection, multi-platform decisions, migration planning, and governance setup. It does not cover experiment design (use experiment-design), result interpretation (use experimentation-analytics), or feature flag operations (use feature-flagging). Pair this skill with the relevant integrations microsite when you need platform-specific MCP details.
The audience is a PM, engineering leader, or data lead who is making the decision or recovering from a previous one. The voice is decisive. There is no "it depends, evaluate them all yourself." The decision space has real shape, and a senior advisor can map your situation to a defensible answer in an afternoon.
The 7 considerations for the platform decision
Every platform evaluation walks the same seven questions. Answer them honestly first, then read the per-platform profiles, then consult the decision matrix. The order matters: data architecture and statistical rigor are foundational; the rest are layered on top.
- Data architecture. Where does experiment data live? Three patterns. Vendor-native (Statsig, Optimizely) keeps the data in the vendor's storage. Product-suite (PostHog, Amplitude) combines analytics and experiments behind one event pipeline. Warehouse-native (GrowthBook, Eppo) runs SQL on your existing data warehouse. The pattern dictates security review depth, residency, statistical depth, and cost shape.
- Statistical rigor. Does the platform implement CUPED, sequential testing, the delta method for ratio metrics, and multiple testing corrections? Cheap to verify in a sales call: ask "what variance estimator do you use for ratio metrics?" and "do you support always-valid p-values?" Modern platforms (Statsig, Eppo, parts of PostHog) have these. Older or homegrown platforms often do not.
- MCP availability. All seven platforms covered here have a first-party or hosted MCP except Eppo (as of May 2026). MCP availability matters more for agentic workflows where AI agents create and read experiments end to end. It matters less for traditional human-driven experimentation. Worth weighting if your team is AI-forward.
- Feature flag integration. Do experiments and feature flags live in the same platform? Statsig, Optimizely, GrowthBook, and PostHog all unify them. Eppo is experiment-only. Kameleoon is creative-personalization-focused. If you also need feature flag operations as production infrastructure, check feature-flagging for the operational discipline; the platform choice has to support both surfaces or you accept a second tool.
- Analytics depth. Can you see funnels, retention, and cohorts in the same surface as experiment results? PostHog and Amplitude are strongest here (analytics-first products). Statsig has a strong analytics overlay. Optimizely and GrowthBook are experiment-first, with analytics as a supplementary feature.
- Governance and audit. Who can change targeting in production, who can ship experiments, who can read sensitive metrics? Enterprise tiers (Optimizely, LaunchDarkly Federal) handle this with maturity. Open-source platforms (GrowthBook, PostHog self-hosted) require self-built governance. For regulated industries (healthcare, finance, public sector), this question is the deciding factor.
- Cost shape. Vendor-native scales with events. Warehouse-native scales with seats and warehouse compute. Product-suite scales with combined event volume across all features. Match the pricing shape to your actual usage shape. A high-traffic startup pays vendor-native pricing differently from a lower-traffic enterprise; pick the shape that is friendly to your trajectory, not just your current month.
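The sales-call question about ratio metrics is easier to ask well once you have seen what a correct answer looks like. The sketch below is a minimal illustration of the delta method for a ratio such as clicks per session, where the randomization unit is the user and both numerator and denominator vary per user. The function name and sample data are hypothetical, and this is not any vendor's implementation. A platform that treats each session as independent instead will typically understate the variance.

```python
from math import sqrt


def ratio_with_delta_se(numerators, denominators):
    """Return (ratio, standard_error) for sum(numerators) / sum(denominators).

    Each index is one randomization unit (for example, one user).
    The standard error uses the first-order delta method approximation.
    """
    n = len(numerators)
    if n != len(denominators) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 units")

    mean_x = sum(numerators) / n
    mean_y = sum(denominators) / n
    if mean_y == 0:
        raise ValueError("denominator mean is zero; ratio is undefined")

    # Unbiased sample variances and covariance across units.
    var_x = sum((x - mean_x) ** 2 for x in numerators) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in denominators) / (n - 1)
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(numerators, denominators)
    ) / (n - 1)

    ratio = mean_x / mean_y
    # Var(R) ~= (var_x - 2*R*cov_xy + R^2*var_y) / (n * mean_y^2)
    variance = (var_x - 2 * ratio * cov_xy + ratio ** 2 * var_y) / (n * mean_y ** 2)
    # Guard against tiny negative values caused by floating-point rounding.
    return ratio, sqrt(max(variance, 0.0))


# Hypothetical per-user data: clicks and sessions for five users.
clicks = [3, 0, 5, 2, 4]
sessions = [10, 4, 12, 8, 9]
ratio, se = ratio_with_delta_se(clicks, sessions)
print(f"clicks per session = {ratio:.4f} +/- {1.96 * se:.4f}")
```

If a vendor cannot describe something equivalent to the covariance term above, treat their ratio-metric confidence intervals with caution.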
Statsig
Modern experimentation and feature management combined in one platform. CUPED and sequential testing built in. Strong PM-led ergonomics. Used by OpenAI, Notion, Brex, Figma.
Strengths. Fast time to first experiment. Combined experiments and feature flags eliminate the second-tool tax. Statistical rigor is current with the literature. The MCP exposes full CRUD across experiments, gates, dynamic configs, and metrics.
Gotchas. Pricing scales with events, which can become expensive at high scale. The platform has strong opinions about how experiments should run; teams that want a custom statistical workflow will fight the defaults. Self-host is not a first-class option.
Ideal customer. Fast-growing SaaS that wants one platform for flags and experiments, values out-of-the-box statistical depth, and is comfortable with vendor-native data architecture.
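For readers evaluating what "CUPED built in" buys them, here is a minimal sketch of the core adjustment, under stated assumptions: one pre-experiment covariate per user, and a single pooled coefficient. The function name and data are hypothetical. This illustrates the general technique, not Statsig's implementation, which may differ in covariate choice and estimation details.

```python
from statistics import mean, variance


def cuped_adjust(metric, covariate):
    """Return the CUPED-adjusted metric: y - theta * (x - mean(x)).

    metric: in-experiment values per user.
    covariate: pre-experiment values for the same users, in the same order.
    theta = cov(x, y) / var(x). In practice it is estimated on data pooled
    across all variants so the adjustment does not bias the comparison.
    """
    n = len(metric)
    if n != len(covariate) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 users")

    mean_x = mean(covariate)
    mean_y = mean(metric)
    var_x = variance(covariate)
    if var_x == 0:
        # A constant covariate carries no information; leave the metric as-is.
        return list(metric)

    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric)
    ) / (n - 1)
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]


# Hypothetical data: spend per user before and during the experiment.
pre_period = [10.0, 22.0, 31.0, 18.0, 40.0, 25.0]
in_experiment = [12.0, 25.0, 30.0, 20.0, 44.0, 24.0]
adjusted = cuped_adjust(in_experiment, pre_period)
print(f"variance before: {variance(in_experiment):.2f}")
print(f"variance after:  {variance(adjusted):.2f}")
```

The adjusted metric keeps the same mean and has lower variance whenever the covariate is correlated with the metric, which is why experiments reach significance with fewer users.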
PostHog
Open-source product OS combining product analytics, experiments, feature flags, surveys, session replays, error tracking, and LLM analytics. Free tier available.
Strengths. Full-funnel context. Experiments live next to the analytics that contextualize them. The MCP exposes 200+ tools (use scoping like ?features= to keep the agent context tight). Self-host option is mature. Open-source license keeps you outside the vendor lock-in trap.
Gotchas. Combined event volume across all features can make pricing surprising at scale. The breadth of features can make onboarding feel busy. Statistical depth in experiments is good but a step behind dedicated experiment platforms (Statsig, Eppo) on advanced features like CUPED defaults.
Ideal customer. Product-led-growth SaaS that wants analytics, experiments, and feature flags in one surface, values open-source flexibility, and is comfortable with the breadth.
GrowthBook
Open-source warehouse-native experimentation. Data stays in your warehouse (Snowflake, BigQuery, Redshift, Postgres). Self-host or cloud.
Strengths. Data sovereignty. Cost control at scale because the warehouse is already paid for. First open-source production MCP for experimentation per their announcement. Mature statistical defaults including Bayesian and frequentist toggles. Bring-your-own metrics from the warehouse means experiment metrics match analytics metrics by definition.
Gotchas. Setup overhead is higher than vendor-native. You own the warehouse compute cost (this is usually a feature, not a bug, but it shows up in different finance line items). Smaller community than Statsig or PostHog. UI polish lags vendor platforms by half a step.
Ideal customer. Data-mature teams that already run a warehouse, regulated industries that need data residency, and teams that prefer open-source for the exit option.
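To make "warehouse-native" concrete, the sketch below shows the general shape of the query such a platform runs: join an assignment table to an event table that already lives in your warehouse, then aggregate per variant. It uses an in-memory SQLite database as a stand-in for Snowflake, BigQuery, Redshift, or Postgres. The table names, column names, and the `variant_summary` helper are all hypothetical, and this is not the SQL GrowthBook actually generates.

```python
import sqlite3


def variant_summary(conn):
    """Return {variant: (assigned_users, converting_users)}.

    Only purchase events at or after the assignment time count, so
    pre-exposure behavior cannot leak into the experiment metric.
    """
    query = """
        SELECT a.variant,
               COUNT(DISTINCT a.user_id) AS users,
               COUNT(DISTINCT e.user_id) AS converters
        FROM assignments AS a
        LEFT JOIN events AS e
          ON e.user_id = a.user_id
         AND e.event_name = 'purchase'
         AND e.ts >= a.assigned_at
        GROUP BY a.variant
        ORDER BY a.variant
    """
    return {variant: (users, conv) for variant, users, conv in conn.execute(query)}


# In-memory database standing in for the real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assignments (user_id INTEGER, variant TEXT, assigned_at INTEGER)")
conn.execute("CREATE TABLE events (user_id INTEGER, event_name TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO assignments VALUES (?, ?, ?)",
    [(1, "control", 10), (2, "control", 10), (3, "treatment", 10), (4, "treatment", 10)],
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "purchase", 12),
        (2, "purchase", 5),    # before assignment, so it is excluded
        (2, "page_view", 12),  # not a purchase, so it is excluded
        (3, "purchase", 11),
        (4, "purchase", 15),
        (4, "purchase", 16),   # repeat purchase, user still counted once
    ],
)

summary = variant_summary(conn)
print(summary)
```

Because the metric definition is plain SQL over tables the analytics team already queries, the experiment's conversion number and the dashboard's conversion number come from the same source.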
Optimizely
Long-time enterprise leader. Web Experimentation plus Feature Experimentation. Strong personalization and visual editing for non-technical users.
Strengths. Enterprise governance is mature. Visual editing lets marketers ship experiments without engineer help. Strong customer success organization. Hosted Remote MCP works in browser-based ChatGPT and Claude.ai.
Gotchas. Expensive. Pricing targets marketing budgets, not engineering budgets. The produ