Conjoint Design Expert
Instructions
Worked example (attribute table → power calculation → PAP tier assignment): see reference/example.md.
1. Attribute Architecture
- Orthogonality: Randomize every attribute independently of every other attribute so that the causal effect of each component can be estimated separately.
- Randomization of Order: Order attributes randomly at the respondent level (not the task level) to prevent "primacy" or "recency" effects while avoiding the cognitive overload of finding information in different orders across tasks (Stantcheva 2023). A specific logical flow may override this if theoretically required.
- D-Optimal Designs: Consider D-optimal or constrained randomization schemes rather than pure randomization. D-optimal designs choose the sets of administered conditions that maximize statistical power and may be preferable when the number of possible attribute combinations is large relative to the sample size (Auspurg & Hinz 2015; Stantcheva 2023).
- Attribute Density: Monitor for respondent fatigue. Stefanelli and Lukac (2020) cite evidence that conjoint results remain stable with up to 10 attributes; Bansak et al. (2018) find that response quality does not degrade with up to 30 tasks on MTurk, and Bansak et al. (2021, "Beyond the Breaking Point") extend this to the attribute dimension, reporting stability at many attributes. These are the canonical sources for the task- and attribute-count claims respectively. Still evaluate whether the complexity of the levels increases cognitive load beyond the attribute count alone.
- Nested/Constrained Randomization: Not all attributes need to be fully crossed. When ecological validity demands it, certain attribute levels can be linked or nested within other attributes (e.g., origin countries nested within policy domain). This is acceptable when: (a) the nesting is theoretically justified, (b) the primary attributes of interest remain fully independently randomized, and (c) the analyst acknowledges that nested attributes cannot be cleanly separated from their parent attribute. See Auspurg & Hinz (2015) on restricted randomization in factorial surveys.
- Attribute-Level Restrictions: Implausible combinations can be excluded when they would confuse respondents or produce artifactual responses, but this is a judgment call, not a mandate. Eye-tracking evidence from Bansak & Jenke (2025) shows that odd (incongruent or nonsensical) attribute combinations have minimal, inconsistent effects on respondent attention, search, and choice, so decisions to include or exclude them should be driven primarily by statistical, substantive, and theoretical considerations (e.g., whether the randomization distribution should reflect a real-world target distribution per De la Cuesta, Egami & Imai 2022) rather than by assumed cognitive-burden concerns. Document all restrictions in the pre-analysis plan.
- Medium-Level Specificity: Attribute levels should be concrete enough to be meaningful but not so specific that they introduce unintended confounds. Describe treatments at a "medium level of specificity" -- "fully described but not overly described" (Sniderman 2018). Avoid vague descriptions (e.g., "a policy that helps the economy") and overly narrow ones (e.g., "a $2.3B infrastructure bill for Route 95 in Pennsylvania").
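The respondent-level order randomization and independent level randomization described in this section can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the attribute names, their levels, and the function `build_respondent_tasks` are hypothetical, and uniform draws with no restrictions or nesting are assumed.

```python
import random

# Hypothetical attribute table; names and levels are illustrative only.
ATTRIBUTES = {
    "policy_domain": ["health", "education", "infrastructure"],
    "cost": ["low", "medium", "high"],
    "sponsor": ["government", "ngo"],
}


def build_respondent_tasks(attributes, n_tasks, n_profiles=2, seed=None):
    """Return (attribute_order, tasks) for a single respondent.

    The attribute order is shuffled once per respondent and reused in
    every task. Levels are drawn independently and uniformly for every
    attribute of every profile (assumption: no restrictions or nesting).
    """
    rng = random.Random(seed)
    order = list(attributes)
    rng.shuffle(order)  # one draw per respondent, not per task

    tasks = []
    for _ in range(n_tasks):
        profiles = []
        for _ in range(n_profiles):
            # Building the dict in `order` keeps the display order fixed.
            profile = {name: rng.choice(attributes[name]) for name in order}
            profiles.append(profile)
        tasks.append(profiles)
    return order, tasks


order, tasks = build_respondent_tasks(ATTRIBUTES, n_tasks=5, seed=1)
print("Attribute order for this respondent:", order)
print("First task:", tasks[0])
```

Restrictions or nested attributes would replace the uniform `rng.choice` call with a draw that conditions on the parent attribute; any such rule should be documented in the pre-analysis plan.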
2. Statistical Power and Error Logic
- Effective N (N_eff): Calculate sample size as $N_{\text{eff}} = \text{Respondents} \times \text{Tasks} \times \text{Profiles}$. Throughout this section, N_eff refers to this effective number of profile evaluations. However, respondents and tasks are not interchangeable -- adding respondents improves precision more than adding tasks per respondent due to within-respondent correlation. When in doubt, prioritize more respondents over more tasks (Stefanelli and Lukac 2020).
- Closed-Form Formula: The standard error of an AMCE is approximately: SE = $\sqrt{\text{Var}(Y) \times L / N_{\text{eff}}}$, where $L$ is the number of levels for the attribute and N_eff is as defined above (Schuessler and Freitag 2020). This provides a quick diagnostic for whether precision is adequate.
- Interaction Power: Estimating interaction effects requires approximately twice the sample size needed for main AMCEs, in the canonical balanced two-level-by-two-level case; the exact multiplier scales with the number of levels on each interacting attribute. The standard error of an interaction coefficient is approximately $\sqrt{2}$ times the SE of the corresponding main effect in that balanced case (Schuessler and Freitag 2020). Budget accordingly when interaction hypotheses are confirmatory.
- Empirical AMCE Benchmarks: Typical AMCEs in published conjoint studies range from 0.02 to 0.10 on the probability scale (that is, 2 to 10 percentage-point changes in selection probability), with a median around 0.05 (Stefanelli and Lukac 2020). Very large AMCEs (> 0.15) are rare. Use these benchmarks when setting the smallest effect size of interest (SESOI) if no prior data are available.
- Minimum Detectable Effect (MDE): Set the MDE based on the attribute with the highest number of levels, as the AMCEs for this attribute will be the most difficult to estimate precisely. Report whether the MDE falls within the range of plausible AMCEs given prior literature.
- Type S and Type M Errors: When power is low, beware of "Type S" (Sign) errors (getting the direction wrong) and "Type M" (Magnitude) errors (exaggerating the effect size). At 50% power for a true effect of d = 0.5, the probability of a Type S error is approximately 1/18, and the expected Type M error (exaggeration ratio) is approximately 1.5 (Lakens 2025, citing Gelman and Carlin 2014).
- Low-N_eff Danger Zone (rule of thumb): Designs with fewer than ~3,000 effective profile evaluations are at high risk of being underpowered for detecting typical AMCE magnitudes (0.02–0.05). This threshold is a pragmatic rule of thumb derived from plugging median-AMCE benchmarks into the Schuessler and Freitag (2020) formula, not a research finding; adjust based on the expected AMCE magnitude, number of levels, and design. Below this threshold, conduct an explicit sensitivity analysis showing what effects can be detected.
- Levels-Power Tradeoff: Each additional level for an attribute reduces the effective number of observations per level. As an illustration, going from 4 to 5 levels reduces per-level N by about 20%, with a corresponding precision loss (Schuessler and Freitag 2020). Only add levels when each is theoretically necessary.
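The arithmetic in the items above can be sketched in a few lines. The standard-error expression is the closed-form approximation quoted in this section. The MDE conversion, $(z_{1-\alpha/2} + z_{\text{power}}) \times \text{SE}$, is a standard normal-approximation assumption that is not taken from the cited sources, and Var(Y) = 0.25 assumes a binary forced choice with a selection probability of 0.5. The function names are hypothetical, and the sketch ignores within-respondent clustering.

```python
import math
from statistics import NormalDist


def effective_n(respondents, tasks, profiles=2):
    """N_eff = respondents x tasks x profiles."""
    return respondents * tasks * profiles


def amce_se(var_y, n_levels, n_eff):
    """Approximate AMCE standard error: sqrt(Var(Y) * L / N_eff)."""
    return math.sqrt(var_y * n_levels / n_eff)


def mde(se, alpha=0.05, power=0.80):
    """Minimum detectable effect under a two-sided normal approximation
    (assumption; not part of the closed-form formula quoted above)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se


def interaction_se(main_se):
    """Balanced 2x2 case: interaction SE is about sqrt(2) times the main SE."""
    return math.sqrt(2) * main_se


# Illustrative design: 1,000 respondents, 5 tasks, 2 profiles per task.
n_eff = effective_n(1000, 5, 2)              # 10000
se_4 = amce_se(0.25, n_levels=4, n_eff=n_eff)
se_5 = amce_se(0.25, n_levels=5, n_eff=n_eff)

print(f"N_eff = {n_eff}")
print(f"SE with 4 levels = {se_4:.4f}")                    # 0.0100
print(f"MDE at 80% power = {mde(se_4):.4f}")               # about 0.0280
print(f"Interaction SE   = {interaction_se(se_4):.4f}")    # about 0.0141
print(f"SE inflation, 4 -> 5 levels = {se_5 / se_4:.3f}")  # sqrt(5/4), about 1.118
```

Under these assumptions the illustrative design can detect AMCEs of roughly 0.028 and above, which sits inside the 0.02 to 0.10 benchmark range. Moving from 4 to 5 levels inflates the standard error by about 12%.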
- Multiple Testing: Conjoint designs estimate many AMCEs simultaneously, which inflates the family-wise false-positive rate. Even under the sharp null of no effects, a typical conjoint pipeline produces at least one significant AMCE in more than 90% of experimental trials (Liu and Shiraito 2023); this mirrors the broader "garden of forking paths" problem (Gelman and Loken 2014) and the undisclosed-flexibility findings of Simmons, Nelson, and Simonsohn (2011), which motivate pre-specification and correction. Pre-specify a correction method: Bonferroni (most conservative, guards against false positives), Benjamini-Hochberg (controls FDR, most lenient), or adaptive shrinkage (balanced; preferred in Liu and Shiraito's simulations). Report both corrected and uncorrected results for confirmatory hypotheses.
- Cohen's d Warning: Do not use Cohen's d benchmarks (small = 0.2, medium = 0.5, large = 0.8) to calibrate conjoint power analyses. AMCEs are measured in percentage-point changes in choice probability, not in standard deviation units. Translating between the two requires knowing Var(Y), which depends on the choice task structure.
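Two of the corrections named in the Multiple Testing item above, Bonferroni and Benjamini-Hochberg, can be sketched as follows. This is a plain-Python illustration with hypothetical function names and made-up p-values. Adaptive shrinkage is not shown because it requires a model-based estimator rather than a threshold on p-values.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate.

    Find the largest rank k such that p_(k) <= (k / m) * alpha and reject
    the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    ranked = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(ranked, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank  # keep the largest rank that passes
    reject = [False] * m
    for i in ranked[:cutoff]:
        reject[i] = True
    return reject


# Made-up p-values for five pre-registered AMCEs (illustration only).
p_vals = [0.001, 0.012, 0.025, 0.045, 0.20]
print(bonferroni(p_vals))          # [True, False, False, False, False]
print(benjamini_hochberg(p_vals))  # [True, True, True, False, False]
```

The example shows why Bonferroni is the more conservative choice: it retains one of the five effects, while Benjamini-Hochberg retains three.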
- Tools: Use the cjpowR package (Freitag 2021) or the associated Shiny app for simulation-based power analysis. These allow specification of the number of attributes, levels, tasks, and profiles, and return power curves for main effects and interactions. For a general declare-diagno