List Experiment Designer
Related skills: Use alongside hypothesis-building (state π and a SESOI before design choices), survey-design (mode effects, question ordering, and pre-testing of control items), and methods-reporting (deposit list wording, randomization seed, list package version, and ict.test / ict.hausman.test / ictreg() output).
Instructions
1. Pre-Design: Is a List Experiment Warranted?
- Assess sensitivity bias first: Before committing to a list experiment, consult domain-specific evidence on sensitivity bias. Blair, Coppock, and Moor's (2020) meta-analysis of 30 years of list experiments shows that sensitivity biases are typically smaller than 10 percentage points. A list experiment is not automatically the right choice for any sensitive topic.
- Social reference theory: Sensitivity bias is largest when (a) the social norm on the topic is strong, (b) the norm is clear and widely shared, and (c) respondents believe others can infer their true attitude from their response (Blair et al. 2020). Evaluate all three conditions before deciding.
- Precision cost: List experiments require approximately 10 times more respondents than a direct question to achieve equivalent precision. The trade-off is only favorable when the expected sensitivity bias exceeds the precision loss (Blair et al. 2020). If the topic is sensitive but the expected bias is small (< 5pp), a direct question with neutral framing is often preferable.
- Empirical benchmarks by domain: Voter turnout (~5–15pp overreport, wide confidence intervals), clientelism and vote-buying (~5–15pp underreport), racial prejudice (near-zero sensitivity bias — Blair et al. 2020 find little evidence respondents conceal prejudice on direct questions), authoritarian regime support (highly context-dependent and often dominated by artificial deflation rather than preference falsification). Use these as priors when no domain-specific estimates exist.
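The precision-cost trade-off above can be sketched as a root-mean-squared-error comparison. This is a minimal illustration, not a power analysis: the function names, the sample size, the true prevalence of 0.30, and the control-count variance of 0.8 are all assumptions made for the example, and the list estimate is treated as unbiased with the sensitive item independent of the control count.

```python
import math


def direct_rmse(true_prev, bias, n):
    """RMSE of a direct question whose reported prevalence is shifted by `bias`."""
    observed = true_prev + bias               # proportion respondents actually report
    variance = observed * (1 - observed) / n  # binomial sampling variance
    return math.sqrt(bias ** 2 + variance)


def list_rmse(true_prev, control_var, n):
    """RMSE of a list estimate assumed unbiased, with n/2 respondents per arm."""
    n_arm = n / 2
    # Assumption: the sensitive item is independent of the control count,
    # so its Bernoulli variance simply adds to the control-count variance.
    treat_var = control_var + true_prev * (1 - true_prev)
    return math.sqrt(treat_var / n_arm + control_var / n_arm)


if __name__ == "__main__":
    n, prev, control_var = 1000, 0.30, 0.8    # illustrative values only
    for bias in (-0.02, -0.05, -0.10):
        d = direct_rmse(prev, bias, n)
        l = list_rmse(prev, control_var, n)
        better = "list" if l < d else "direct"
        print(f"bias={bias:+.2f}  direct={d:.3f}  list={l:.3f}  lower RMSE: {better}")
```

Under these assumed inputs the direct question has the lower error at small biases and the list experiment only wins once the bias is large; the crossover point shifts with the sample size and the control-count variance, so rerun the comparison with your own priors.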
2. Basic Design
- Core logic: Assign respondents randomly to a control group (receives N baseline items) or a treatment group (receives N+1 items, including the sensitive item). Estimate prevalence as the difference in mean counts between treatment and control. This provides plausible deniability because no individual response can be traced to the sensitive item.
- Number of control items: Use 3–5 control items. Fewer items reduce plausible deniability (too easy to infer the sensitive item from a high count). More items increase cognitive load and floor/ceiling risk, and reduce statistical efficiency.
- Randomize item order: Randomize the order of all items within each respondent's list to prevent position effects from inflating or deflating specific items.
- Wording parity: Frame all items, including the sensitive item, in the same grammatical form and at the same abstraction level. Stylistic inconsistency makes the sensitive item stand out, undermining the design.
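The core logic above (random assignment, per-respondent item shuffling, difference in mean counts) can be sketched as a small simulation. The item labels, prevalences, sample size, and seed are hypothetical, and the simulated respondents answer truthfully with independent items, which real data will not guarantee.

```python
import math
import random
import statistics

# Hypothetical item labels and prevalences, for illustration only.
CONTROL_ITEMS = {"control A": 0.3, "control B": 0.5, "control C": 0.4, "control D": 0.6}
SENSITIVE_ITEM = "sensitive item"
TRUE_PREVALENCE = 0.25  # assumed value, used only to simulate answers


def assign_arms(n, rng):
    """Complete randomization: half of the sample to each arm."""
    arms = ["control"] * (n // 2) + ["treatment"] * (n - n // 2)
    rng.shuffle(arms)
    return arms


def build_list(arm, rng):
    """Return the items shown to one respondent, in a fresh random order."""
    items = list(CONTROL_ITEMS)
    if arm == "treatment":
        items.append(SENSITIVE_ITEM)
    rng.shuffle(items)
    return items


def simulated_count(items, rng):
    """Simulate a truthful count, treating items as independent (an assumption)."""
    count = 0
    for item in items:
        p = TRUE_PREVALENCE if item == SENSITIVE_ITEM else CONTROL_ITEMS[item]
        count += rng.random() < p
    return count


def difference_in_means(treated, control):
    """Point estimate and standard error for the prevalence of the sensitive item."""
    estimate = statistics.mean(treated) - statistics.mean(control)
    se = math.sqrt(statistics.variance(treated) / len(treated)
                   + statistics.variance(control) / len(control))
    return estimate, se


def run_simulation(n=4000, seed=12345):
    rng = random.Random(seed)  # record the seed so the assignment is reproducible
    counts = {"control": [], "treatment": []}
    for arm in assign_arms(n, rng):
        counts[arm].append(simulated_count(build_list(arm, rng), rng))
    return difference_in_means(counts["treatment"], counts["control"])


if __name__ == "__main__":
    estimate, se = run_simulation()
    print(f"estimated prevalence: {estimate:.3f} (SE {se:.3f})")
```

Only the count is ever recorded per respondent, which is what gives the design its plausible deniability; the estimate is recovered at the group level.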
3. Control List Design
- The floor/ceiling constraint: Select control items so that virtually no respondent would endorse all N control items (ceiling) or none of them (floor). In the treatment group, a respondent at the ceiling cannot report N+1 without revealing the sensitive attitude, and a respondent at the floor cannot report 1 without revealing it, so both have an incentive to underreport. Both violations bias estimates downward (artificial deflation) and compromise identification (Blair & Imai 2012).
- Target prevalence range for control items: Each control item should have expected prevalence between 20% and 80% in the population. The sum of expected control item endorsements should have low variance — ideally each respondent endorses roughly 1–3 of N control items.
- Item independence: Control items should be uncorrelated with the sensitive item. If control items tap dimensions that predict the sensitive attitude, the no-design-effect assumption is threatened.
- Conventional vs. placebo vs. mixed control list design: Three design options exist for managing measurement error from inattentive respondents (Agerberg & Tannenberg 2021):
  - Conventional: Standard baseline items, no placebo. Biased if nonstrategic error inflates or deflates counts.
  - Placebo: Add a universally false statement (one that every respondent should answer "no" to) to the control group's list so that treatment and control lists are the same length. Riambau & Ostwald (2021) show this reduces mechanical inflation, the tendency for treatment-group respondents to report more items simply because their list is longer, especially among low-education respondents. However, Agerberg & Tannenberg (2021) show placebo items do not universally reduce bias and can increase it under some conditions.
  - Mixed control list: Combines conventional and placebo items. Preferred when the expected direction and magnitude of sensitivity bias is uncertain.
- Choice rule: When mechanical inflation is the primary concern (inattentive or acquiescent respondents inflating counts), include a placebo item. When artificial deflation is the primary concern (ceiling effects), use a conventional design and select items with lower prevalence. When uncertain, use the mixed control list.
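The floor/ceiling and prevalence guidance above can be screened before fielding with a quick calculation. The sketch below assumes the control items are statistically independent, so the count follows a Poisson-binomial distribution; that assumption and the example prevalences are illustrative, and pilot data on the joint distribution of items would be a better basis where available.

```python
def count_distribution(prevalences):
    """P(count = k) for k = 0..N, assuming independent items (Poisson-binomial)."""
    dist = [1.0]  # with zero items, the count is 0 with probability 1
    for p in prevalences:
        updated = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            updated[k] += prob * (1 - p)   # item not endorsed
            updated[k + 1] += prob * p     # item endorsed
        dist = updated
    return dist


def screen_control_list(prevalences):
    """Summarize floor/ceiling risk and the variance of the control count."""
    dist = count_distribution(prevalences)
    mean = sum(k * prob for k, prob in enumerate(dist))
    variance = sum((k - mean) ** 2 * prob for k, prob in enumerate(dist))
    return {
        "floor": dist[0],            # share expected to endorse no control item
        "ceiling": dist[-1],         # share expected to endorse every control item
        "mean": mean,
        "variance": variance,
        "out_of_range": [p for p in prevalences if not 0.2 <= p <= 0.8],
    }


if __name__ == "__main__":
    # Hypothetical prevalences for a four-item control list.
    for label, prevs in {"candidate 1": [0.3, 0.5, 0.4, 0.6],
                         "candidate 2": [0.2, 0.8, 0.3, 0.7]}.items():
        summary = screen_control_list(prevs)
        print(label, {k: (round(v, 3) if isinstance(v, float) else v)
                      for k, v in summary.items()})
```

Because independence is assumed, the sketch cannot show the benefit of negatively correlated control items, which would lower both the count variance and the floor/ceiling shares further.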
4. Design Variants
- Single list experiment: Standard design. Simplest to implement; most commonly used. The sensitive item appears only in the treatment list.
- Double list experiment (DLE): Each respondent answers two lists built from different control items. The sensitive item is added to one of the two lists, so every respondent serves as a treatment case for one list and a control case for the other. This reduces variance by using all respondents for estimation. However, Diaz (2024) shows that DLEs require additional diagnostic testing for carryover design effects: if respondents answer control items differently after having seen the sensitive item in another list, the DLE identification assumption fails.
  - Design preconditions for diagnostics: Diaz's (2024) tests apply only to fixed-randomized or randomized-randomized DLEs (the location of the sensitive item must be randomized across respondents); they cannot diagnose carryover in fixed-fixed DLEs.
  - Two tests, both to be reported: (i) a difference-in-differences test on the paired list means (equivalent to the Chuang et al. 2021 consistency test in the fixed-randomized case) and (ii) Stephenson's signed-rank test (Rosenbaum 2007, 2020) on the paired within-respondent differences. The difference-in-differences test has more power under response deflation than under inflation. The signed-rank statistic can be positive under either inflation or non-zero true prevalence, so interpret it only against one-sided deflation alternatives.
- Placebo-item design (Riambau & Ostwald 2021): A universally false statement (e.g., "I have been invited to have dinner with PM Lee at Sri Temasek next week") is added to the control group's list so that both arms see J+1 items, where J is the number of baseline control items (N in §2). The placebo should be implausible enough to be false for all respondents but not so disruptive that it degrades attention on the remaining items. This is the design the placebo/mixed options in §3 rely on; do not confuse it with respondent-tailored ("piped-in") placebos, which are a separate and not-yet-validated variant.
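For the double list design described above, a common point estimator averages the two per-list differences in means. The sketch below assumes a record layout of (list that carried the sensitive item, count on list A, count on list B); that layout and the function name are hypothetical. It does not implement the Diaz (2024) carryover diagnostics, and it returns no standard error because the two per-list estimates come from the same respondents and are therefore not independent.

```python
import statistics


def double_list_estimate(records):
    """Per-list and pooled difference-in-means estimates for a DLE.

    records: iterable of (sensitive_list, count_a, count_b), where
    sensitive_list is "A" or "B" and marks the list that included the
    sensitive item for that respondent.
    """
    records = list(records)
    # List A: respondents with the sensitive item in A are treated, the rest are controls.
    a_treated = [a for s, a, _ in records if s == "A"]
    a_control = [a for s, a, _ in records if s == "B"]
    # List B: the roles are reversed.
    b_treated = [b for s, _, b in records if s == "B"]
    b_control = [b for s, _, b in records if s == "A"]

    est_a = statistics.mean(a_treated) - statistics.mean(a_control)
    est_b = statistics.mean(b_treated) - statistics.mean(b_control)
    return {"list_a": est_a, "list_b": est_b, "pooled": (est_a + est_b) / 2}


if __name__ == "__main__":
    # Tiny made-up data set, only to show the call.
    toy = [("A", 2, 1), ("A", 1, 2), ("B", 1, 2), ("B", 1, 1)]
    print(double_list_estimate(toy))
```

A large gap between the two per-list estimates is a reason to look more closely at the design, but it is not a substitute for the formal diagnostics; uncertainty for the pooled estimate would need a method that respects the within-respondent pairing, such as a respondent-level bootstrap.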
5. Estimators
- Difference-in-means: $\hat{\pi} = \bar{Y}_{T} - \bar{Y}_{C}$. Unbiased estimator of the target estimand π, the population prevalence of the sensitive behavior or attitude (state π explicitly before design choices, per Lundberg, Johnson & Stewart 2021). Standard errors: $SE \approx \sqrt{\mathrm{Var}(Y_T)/n_T + \mathrm{Var}(Y_C)/n_C}$. Simple and transparent; appropriate when no covariate adjustment is needed and the research question is purely about aggregate prevalence.
- Multivariate NLSreg: Nonlinear least squares regression that models the list experiment outcome as a function of covariates. More robust to nonstrategic measurement error than MLreg when respondents answer inattentively or randomly — NLSreg is more robust to misspecification because it does not imp