LLM-Based Text Classification for Social Science Research
Instructions
1. Codebook Design
- Before drafting the codebook, specify the population, sampling frame, and (for experimental data) the treatment condition each response is drawn from. These constrain which categories can plausibly exist and which demographic subgroups any bias assessment must cover. LLM classification extends, rather than replaces, the longer open-ended coding tradition in survey methodology (Geer 1988; Lupia 2018).
- Treat codebook design as the most consequential decision in the classification pipeline. LLMs struggle with loose instructions and revert to general-purpose definitions rather than following researcher-specific operationalizations (Halterman & Keith 2025).
- Structure each code with the following components (adapted from Halterman & Keith 2025):
  - Label: The exact output string the model should return
  - Definition: A single-sentence operationalization of the construct
  - Clarification: What IS included (boundary cases that belong in this category)
  - Negative clarification: What is NOT included (common confusions and adjacent categories)
  - Examples: 2-3 positive examples (correctly classified) and 2-3 negative examples (common misclassifications)
- Keep the number of codes small (3-6) for initial classification. Larger coding schemes increase ambiguity and reduce inter-annotator agreement for both humans and LLMs (Chae & Davidson 2025).
- Allow multi-label assignment when responses may reflect more than one construct. Specify this explicitly in the prompt — models default to single-label output unless instructed otherwise.
- Include a residual category (e.g., none_of_above or uncodeable) for responses that are too vague, too short, or off-topic. Define this category as precisely as the substantive codes (Halterman & Keith 2025).
- Iterate the codebook through pilot testing. Examine disagreements between LLM output and hand-coding to identify ambiguous definitions, then revise. Most codebook problems are definition problems, not model problems (Halterman & Keith 2025).
- For a fully worked example of a codebook with all five components filled in for a realistic three-category classification task, plus a matching system prompt that operationalizes it, see reference/example-codebook-and-prompt.md.
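The five-component code structure described above can be sketched as a small data structure that renders to prompt text. This is a minimal illustration, not the layout used in the referenced example file: the class name `Code`, the function `render_codebook`, the label `economic_concern`, and all example wording are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Code:
    """One codebook entry with the five components (hypothetical layout)."""
    label: str                   # exact output string the model should return
    definition: str              # single-sentence operationalization
    clarification: str           # what IS included
    negative_clarification: str  # what is NOT included
    positive_examples: list[str] = field(default_factory=list)
    negative_examples: list[str] = field(default_factory=list)


def render_codebook(codes: list[Code], multi_label: bool = True) -> str:
    """Render the codebook as plain text suitable for a system prompt."""
    blocks = []
    for code in codes:
        lines = [
            f"Label: {code.label}",
            f"Definition: {code.definition}",
            f"Clarification: {code.clarification}",
            f"Negative clarification: {code.negative_clarification}",
        ]
        lines += [f"Positive example: {ex}" for ex in code.positive_examples]
        lines += [f"Negative example: {ex}" for ex in code.negative_examples]
        blocks.append("\n".join(lines))
    # State the labeling rule explicitly, since single-label output is the default.
    if multi_label:
        rule = "Assign every label that applies, comma-separated."
    else:
        rule = "Assign exactly one label."
    return "\n\n".join(blocks) + "\n\n" + rule


# Illustrative entries only; the wording is invented for this sketch.
codes = [
    Code(
        label="economic_concern",
        definition="The response cites personal or national economic conditions.",
        clarification="Includes prices, jobs, wages, and housing costs.",
        negative_clarification="Excludes general distrust of government.",
        positive_examples=["Groceries cost twice what they used to."],
        negative_examples=["Politicians never listen to us."],
    ),
    Code(
        label="none_of_above",
        definition="The response is too vague, too short, or off-topic to code.",
        clarification="Includes one-word answers and 'don't know'.",
        negative_clarification="Excludes short but clearly on-topic answers.",
    ),
]
codebook_text = render_codebook(codes, multi_label=True)
```

Keeping the codebook as data rather than as a hand-edited prompt string makes it easier to revise a single definition after pilot testing and regenerate the prompt consistently.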
2. Choosing a Learning Regime
- Follow the decision framework from Chae & Davidson (2025), which maps document characteristics and available resources to the appropriate approach:
  - Zero-shot prompting: Use when classifying short documents with a large decoder model (GPT-4o, Llama3-70B+) and no labeled training data. Best for rapid prototyping and tasks where constructs are well-defined. GPT-4o achieves the best zero-shot performance across tasks (Chae & Davidson 2025).
  - Few-shot prompting: Add labeled examples to the prompt. Results are inconsistent: adding examples helps some models but degrades others (Chae & Davidson 2025). Always compare few-shot against zero-shot on a held-out sample before committing. Select diverse examples covering edge cases, not just prototypical instances.
  - Fine-tuning: Train a model on labeled data. Effective with as few as 100 hand-coded examples for smaller models (Chae & Davidson 2025). Fine-tuned smaller models (Llama3-8B, GPT-3 Davinci) can match GPT-4o zero-shot performance. Prefer this when you have labeled data and need cost-effective classification at scale.
  - Instruction-tuning: Combine detailed prompting with fine-tuning on paired instruction-output examples. Most powerful regime for complex tasks: instruction-tuned Llama3-70B surpasses GPT-4o zero-shot on stance detection (Chae & Davidson 2025). Requires more technical infrastructure but yields the highest accuracy.
  - Encoder-only fine-tuning: A distinct regime, separate from the four above and often omitted from generative-LLM discussions. Fine-tuning a smaller encoder-only model (BERT, DeBERTa, SBERT; ~86–110M parameters, personal-computer hardware) on modest labeled data can match or exceed zero-shot generative LLMs on many classification tasks at a fraction of the cost and with fully reproducible (deterministic) output (Chae & Davidson 2025, Table 1; Ziems et al. 2024 find fine-tuned RoBERTa rarely under-performs larger generative models across 20 tasks). Prefer encoder fine-tuning when the label set is fixed, labeled data exists, and reproducibility matters more than generative flexibility.
- When resources permit, test multiple regimes on the same pilot sample and select based on empirical performance, not assumptions.
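Comparing regimes on a shared pilot sample can be sketched as follows. The choice of macro-averaged F1 as the selection metric is an assumption of this sketch, not a recommendation from the cited work, and the function names and toy labels are hypothetical. It assumes single-label predictions.

```python
def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        # Every label occurs at least once in gold or pred, so denom > 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom)
    return sum(scores) / len(scores)


def best_regime(gold: list[str], preds_by_regime: dict[str, list[str]]) -> str:
    """Return the regime whose predictions score highest on the pilot sample."""
    return max(preds_by_regime, key=lambda name: macro_f1(gold, preds_by_regime[name]))


# Toy pilot sample: hand-coded labels and two regimes' outputs (invented data).
gold = ["a", "b", "a", "b"]
preds_by_regime = {
    "zero_shot": ["a", "b", "a", "a"],
    "few_shot": ["a", "b", "a", "b"],
}
scores = {name: macro_f1(gold, pred) for name, pred in preds_by_regime.items()}
winner = best_regime(gold, preds_by_regime)
```

On a real pilot sample the differences between regimes may be small relative to sampling noise, so it is worth inspecting the per-label scores as well as the single summary number before committing.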
3. Model Selection and Reproducibility
- Prefer open-weight models (Llama 3, Gemma, Mistral) for publishable research. Open-weight models run locally produce substantially lower and more predictable variance across runs, while proprietary models (GPT-4, Gemini) show high and unpredictable variance even with temperature=0 (Barrie, Palmer & Spirling 2025).
- If using proprietary models, document the exact model identifier (e.g., gpt-4o-2024-08-06), not the model family name. Commercial models are modified or deprecated without notice; for example, GPT-3 was withdrawn from OpenAI's API entirely (Barrie, Palmer & Spirling 2025; Chae & Davidson 2025).
- Set temperature to 0 for classification tasks. This reduces but does not eliminate stochastic variation in proprietary models (Barrie, Palmer & Spirling 2025).
- Run the same ~50 responses through the classifier twice, with meaningful separation (e.g., two weeks apart, or across a model-version change), and report the agreement rate between runs as a variance metric. These specific numbers (N = 50, ≥ 95% agreement as a "stable" threshold) are house defaults consistent with field conventions, not values established in a single cited study; Barrie, Palmer & Spirling (2025) motivate the test but do not fix the thresholds.
- When classifying across multiple languages or cultural contexts, validate per-language against hand-coded native-language ground truth. LLM classification accuracy is high across non-English settings but not uniform: GPT-4 tracks English accuracy (~90%) on Italian, German, and Chilean political tweets (Heseltine & Clemm von Hohenberg 2024), and remains above all supervised comparators across 11 countries on party-identification tasks, though absolute accuracy drops outside the United States (Tornberg 2025). Do not assume English-language validation carries over.
- Be aware that commercial models may refuse to classify politically sensitive content. Chae & Davidson (2025) found GPT-4o refused to process some Facebook comments about political candidates due to content moderation filters. For sensitive topics (immigration attitudes, extremism, hate speech), test for refusal rates before full deployment.
- Consider data privacy: survey responses sent to commercial APIs may be absorbed into training data (Chae & Davidson 2025). For data containing personally identifying information, use locally hosted open-weight models or confirm the API provider's data retention policy.
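The run-to-run stability check described above (the same responses classified twice, agreement reported as a variance metric) might be computed along these lines. This is a sketch under stated assumptions: each run is stored as a mapping from a response ID to the raw comma-separated label string, the 0.95 threshold is the house default mentioned above, and the function names and toy data are hypothetical.

```python
def label_set(raw: str) -> frozenset[str]:
    """Parse comma-separated labels; order and surrounding whitespace are ignored."""
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


def run_agreement(run_a: dict[str, str], run_b: dict[str, str]) -> float:
    """Share of responses that received an identical label set in both runs."""
    shared_ids = run_a.keys() & run_b.keys()
    if not shared_ids:
        raise ValueError("the two runs share no response IDs")
    matches = sum(label_set(run_a[i]) == label_set(run_b[i]) for i in shared_ids)
    return matches / len(shared_ids)


STABLE_THRESHOLD = 0.95  # house default, not a value fixed by a cited study

# Toy data: four responses classified in two separate runs (invented labels).
run_1 = {"r1": "economy", "r2": "economy, immigration", "r3": "none_of_above", "r4": "immigration"}
run_2 = {"r1": "economy", "r2": "immigration,economy", "r3": "economy", "r4": "immigration"}

agreement = run_agreement(run_1, run_2)
is_stable = agreement >= STABLE_THRESHOLD
```

Comparing label sets rather than raw strings keeps a multi-label response from being counted as a disagreement merely because the model listed the same labels in a different order.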
4. Prompt Construction
- Place the codebook in the system prompt. Include all components for each code (label, definition, clarification, negative clarification, examples).
- Specify the exact output format: code labels only, comma-separated if multi-label. Instruct the model to return no additional text. Smaller models in particular generate conversational preamble unless explicitly constrained (Chae & Davidson 2025).
- For structured or complex inputs, use JSON formatting for both input and expected output. LLMs trained on code corpora parse JSON reliably and produce more consistent structured output (Chae & Davidson 2025).
- Include the response text in the user message, separated clearly from instructions. Use a consistent delimiter (e.g., "Code this response:\n\n{text}").
- Do not include information in the prompt that the classifier should not use. If country of origin s