Topic Modeling for Survey and Experimental Text Data
Instructions
1. Model Selection
- Default to Structural Topic Models (STM) when analyzing text from surveys or experiments. STM incorporates document-level metadata — treatment conditions, respondent demographics, country — directly into estimation, allowing prevalence and content to vary with covariates (Roberts et al. 2014).
- Use standard LDA only when no document-level covariates are needed and the corpus is large enough for unsupervised discovery (Blei, Ng & Jordan 2003).
- Consider BERTopic when working with short texts where word co-occurrence is sparse, or when multilingual embedding-based similarity is required (Grootendorst 2022). BERTopic clusters document embeddings via HDBSCAN, which yields hard single-topic-per-document assignments (unlike STM/LDA mixed membership), c-TF-IDF topic words that can be unstable on small corpora, and no native covariate framework. Use it when embedding-based semantic similarity matters more than mixed-membership structure or covariate estimation.
- Recognize the distinction between topic categorization and attitude inference: topic models reveal what respondents discussed, not what they believe. If the goal is inferring latent attitudes rather than categorizing surface content, complement topic modeling with an embeddings-based scaling of contextually common words or a separate supervised classifier (Hobbs & Green 2025; see also the companion text-classification skill).
2. Preprocessing
- Make preprocessing decisions explicit and justify each choice. Preprocessing is not neutral — stemming, stopword removal, and term-frequency thresholds all affect which topics emerge (Denny & Spirling 2018).
- Lowercase all text. Remove punctuation and numbers unless they carry substantive meaning in the domain.
- Remove stopwords using a standard list, but inspect the list for domain-relevant terms that should be retained (e.g., "foreign" in immigration research).
- Apply stemming only after checking that it does not collapse substantively distinct terms. Compare results with and without stemming as a robustness check (Denny & Spirling 2018).
- Set a lower-frequency threshold to prune very rare terms, and express it as a fraction of documents rather than an absolute count so it scales with corpus size. For small open-ended corpora a fixed count of 2–5 documents is often appropriate; for larger corpora, pruning terms that appear in fewer than roughly 0.5%–1% of documents is a common starting point (Grimmer & Stewart 2013). In stm, use plotRemoved() to visualize how many documents and terms are dropped across candidate thresholds before committing to one, then pass the chosen threshold to prepDocuments(). Report the threshold chosen and the counts of terms and documents retained.
- For translated text, preprocess after translation. Translation artifacts (e.g., inconsistent phrasing across translators) may affect topic coherence, so document any translation pipeline.
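The fraction-based pruning rule above can be sketched in a few lines. This is a minimal illustration in Python, not the stm implementation: the function name prune_by_doc_frequency, its parameters, and the toy documents are hypothetical, and the inclusive "at least threshold documents" convention is an assumption that may differ from the one prepDocuments() uses.

```python
import math
from collections import Counter


def prune_by_doc_frequency(docs, min_fraction=0.01, min_count=2):
    """Drop terms that appear in too few documents and report what is kept.

    docs: list of token lists (already lowercased and stopword-filtered).
    The effective threshold is the larger of min_count and
    ceil(min_fraction * number_of_documents), so it scales with corpus size
    but never falls below a small absolute floor.
    """
    n_docs = len(docs)
    threshold = max(min_count, math.ceil(min_fraction * n_docs))

    # Document frequency: number of documents containing each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    kept_vocab = {term for term, df in doc_freq.items() if df >= threshold}

    pruned = [[term for term in doc if term in kept_vocab] for doc in docs]
    kept_docs = [doc for doc in pruned if doc]  # documents left empty are dropped

    return {
        "threshold": threshold,
        "terms_before": len(doc_freq),
        "terms_kept": len(kept_vocab),
        "docs_before": n_docs,
        "docs_kept": len(kept_docs),
        "docs": kept_docs,
    }


# Toy corpus (hypothetical responses, already tokenized).
docs = [
    ["immigration", "jobs", "economy"],
    ["immigration", "culture"],
    ["jobs", "economy", "wages"],
    ["border"],
]
report = prune_by_doc_frequency(docs, min_fraction=0.01, min_count=2)
print(report["threshold"], report["terms_kept"], report["docs_kept"])
```

The returned counts are the figures the reporting guidance asks for: the threshold used, and how many terms and documents survive it.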
3. Model Specification
- Specify the prevalence formula to include theoretically relevant covariates. For survey experiments, include treatment conditions; for cross-national data, include country. Example: prevalence = ~ treatment + country (Roberts et al. 2014). For experimental applications, pre-register the prevalence-covariate hypotheses (see the companion pre-registration-writing skill) so that estimateEffect() outputs are confirmatory rather than exploratory by default (Nosek et al. 2018).
- Optionally specify a content formula when you expect the words associated with a topic (not just its prevalence) to vary by covariate, e.g., different framings of the same topic across countries or treatment arms. Content covariates parameterize word distributions via SAGE-style deviations from a baseline, which complicates direct comparison of β across groups; inspect group-specific vocabulary with sageLabels() and interpret with care (Roberts, Stewart & Tingley 2014; 2019).
- Use spectral initialization (init.type = "Spectral") for reproducibility. Spectral initialization is deterministic on a given machine/BLAS configuration, unlike random initialization which requires multiple runs; cross-machine numerical precision can still produce minor differences, so record the hardware/OS when sharing replication materials (Roberts, Stewart & Tingley 2019).
- Set a random seed and record it regardless of initialization method. Also record the stm package version and R session info for DA-RT-compliant replication.
4. Selecting the Number of Topics
- Do not rely on a single metric. Evaluate candidate models across a range of K values (e.g., K = 5 to 30) using multiple diagnostics: semantic coherence, exclusivity, held-out likelihood, and residuals (Roberts, Stewart & Tingley 2019). In stm, searchK() sweeps over K and returns all four diagnostics at once; selectModel() fits multiple initializations at a fixed K and returns the coherence-exclusivity frontier (useful when spectral initialization is not being used).
- Semantic coherence measures whether high-probability words within a topic co-occur in the corpus. Higher is better, but coherence alone tends to favor topics dominated by very common words (Mimno et al. 2011). When a topic scores poorly, inspect the failure mode: Mimno et al. describe chained (two concepts linked by a shared word), intruded (unrelated words mixed in), random (no coherent connections), and unbalanced (mixes very general and very specific terms) topics. The diagnosis shapes whether the fix is K, preprocessing, or covariate specification.
- Exclusivity measures whether high-probability words are distinctive to one topic. The stm package reports FREX, a weighted harmonic mean of frequency and exclusivity ranks (default weight ω = 0.7). The coherence-FREX frontier identifies models that balance both (Roberts, Stewart & Tingley 2019).
- Treat held-out likelihood as one input among several, not the deciding metric. Chang et al. (2009) show that predictive likelihood can be negatively correlated with human interpretability: topic models that fit held-out words better are not necessarily more semantically coherent. When coherence, FREX, and held-out likelihood disagree, prefer interpretability.
- After narrowing candidates statistically, read the top words and representative documents for each topic. The final K should be interpretable — topics should be substantively meaningful and distinguishable from one another (Chang et al. 2009).
- Report the range of K values considered, the diagnostics used, and the rationale for the final choice.
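To make the two word-level diagnostics above concrete, the sketch below computes a Mimno-style semantic coherence score and a FREX score by hand. It is an illustrative Python sketch under stated assumptions, not the stm code: the function names, the toy documents, and the toy topic-word matrix beta are hypothetical, and the empirical CDF is computed naively as the fraction of values less than or equal to each entry.

```python
import math


def semantic_coherence(top_words, docs):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words.

    top_words: a topic's top words, most probable first.
    docs: list of token lists. D(.) counts documents containing all given words.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_count(top_words[l])
            if d_l == 0:
                continue  # word absent from the corpus; skip to avoid dividing by zero
            score += math.log((doc_count(top_words[m], top_words[l]) + 1) / d_l)
    return score


def ecdf(values):
    """Empirical CDF of each value within the list (fraction of values <= it)."""
    n = len(values)
    return [sum(1 for u in values if u <= v) / n for v in values]


def frex(beta, topic, weight=0.7):
    """Weighted harmonic mean of exclusivity and frequency ranks for one topic.

    beta: list of topics, each a list of word probabilities in the same vocab order.
    weight: weight on exclusivity (the omega in the text above).
    """
    freq = beta[topic]
    excl = [freq[v] / sum(row[v] for row in beta) for v in range(len(freq))]
    return [
        1.0 / (weight / e + (1.0 - weight) / f)
        for e, f in zip(ecdf(excl), ecdf(freq))
    ]


docs = [
    ["jobs", "economy", "wages"],
    ["jobs", "economy"],
    ["border", "security"],
    ["border", "culture"],
]
coherent = semantic_coherence(["jobs", "economy"], docs)  # words co-occur
mixed = semantic_coherence(["jobs", "border"], docs)      # words never co-occur
print(coherent > mixed)

# Two topics over a three-word vocabulary; word 0 is shared, word 1 is distinctive.
beta = [[0.5, 0.4, 0.1],
        [0.5, 0.1, 0.4]]
scores = frex(beta, topic=0)
print(max(range(len(scores)), key=scores.__getitem__))
```

In the toy matrix, the word shared equally by both topics has the highest probability in topic 0 but not the highest FREX score, which is why FREX lists tend to be more distinctive than probability lists.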
5. Interpretation and Validation
- For each topic, report the top 10-15 words by probability and by FREX (frequency-exclusivity weighted). FREX words are often more interpretable because they are both common within the topic and rare outside it (Roberts, Stewart & Tingley 2019).
- Extract and read 3-5 representative documents per topic using findThoughts() or equivalent. Do not interpret topics from word lists alone; the documents provide essential context.
- Assign short substantive labels to each topic based on both word lists and representative documents. Labels should be descriptive (e.g., "economic contribution concerns"), not technical (e.g., "Topic 7").
- Estimate covariate effects on topic prevalence using estimateEffect(). For experimental data, test whether treatment conditions significantly shift which topics respondents discuss. For cross-national data, test whether topic prevalence differs by country. Plot these effects with confidence intervals (Roberts et al. 2014). When testing many topics against a treatment, report corrections for multiple comparisons (e.g., FDR) alongside uncorrected estimates.
- To guard against spurious treatment-on-topic effects, use permutationTest() (Roberts, Stewart & Tingley 2014; 2019). The function permutes the treatment label and refits the STM, returning a null