Conjoint Data Cleaning Expert
Instructions
1. Qualtrics Export Settings and Metadata
Export format: When exporting from Qualtrics, select "Use choice text" (not "Use numeric values") so that attribute levels appear as human-readable labels. If working with non-Latin scripts (Chinese, Korean, Arabic), export as XLSX rather than CSV to avoid UTF-8/ANSI encoding issues. On Windows with East Asian locales, read.csv() may still require Sys.setlocale() to match the file encoding before import (see the "East Asian Language Support" section of ?cjoint::read.qualtrics).
Metadata rows: Current Qualtrics CSV exports include 3 header rows before respondent data: (1) variable identifiers, (2) question text/descriptions, (3) ImportId JSON. Legacy exports have 2 rows. The cjoint::read.qualtrics() parameter new.format = TRUE (set explicitly; default is FALSE) handles the 3-row format. For manual import via readxl::read_excel() or readr::read_csv(), read the first row as column names, then drop the remaining metadata rows (two in current exports, one in legacy exports) before any analysis.
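The header-then-drop import described above can be sketched in base R. This is a minimal illustration, not a replacement for cjoint::read.qualtrics(); the helper name read_qualtrics_csv and its n_meta argument are hypothetical. Reading everything as character and dropping rows afterwards (rather than skipping raw lines) is a design choice that tolerates quoted question text containing line breaks.

```r
# Hypothetical helper: keep row 1 as column names, drop the metadata rows below it.
# n_meta = 2 for current 3-row exports, n_meta = 1 for legacy 2-row exports.
read_qualtrics_csv <- function(path, n_meta = 2L) {
  raw <- read.csv(path, header = TRUE, colClasses = "character",
                  check.names = FALSE, stringsAsFactors = FALSE)
  if (n_meta > 0L) raw <- raw[-seq_len(n_meta), , drop = FALSE]
  rownames(raw) <- NULL
  # Re-infer column types now that the metadata text is gone
  utils::type.convert(raw, as.is = TRUE)
}

# Toy file mimicking the current 3-row header layout
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  'ResponseId,Q1',
  'Response ID,"Which candidate do you prefer?"',
  '"{""ImportId"":""_recordId""}","{""ImportId"":""QID1""}"',
  'R_001,Candidate A',
  'R_002,Candidate B'
), tmp)

dat <- read_qualtrics_csv(tmp, n_meta = 2L)
```

check.names = FALSE matters here because conjoint columns such as F-1-1-1 contain hyphens that read.csv() would otherwise rewrite.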
Randomization order columns: If "Export viewing order data" is enabled, Qualtrics adds _DO_ columns (e.g., Block1_DO) containing pipe-separated integers showing element display order. These are useful for task-order robustness checks but are not needed for the core reshape.
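For the robustness checks mentioned above, the pipe-separated strings need to be split into integer vectors first. A minimal sketch, assuming a character column of values such as "2|1|3"; the helper name parse_display_order and the toy values are hypothetical.

```r
# Hypothetical helper: turn "2|1|3" into c(2L, 1L, 3L), one vector per respondent.
parse_display_order <- function(x) {
  lapply(strsplit(as.character(x), "|", fixed = TRUE), as.integer)
}

do_col <- c("2|1|3", "1|3|2", NA)   # toy stand-in for a column like Block1_DO
orders <- parse_display_order(do_col)

# Screen position at which element 1 was shown to each respondent (NA if missing)
pos_of_first <- vapply(orders, function(o) match(1L, o), integer(1))
```

fixed = TRUE is needed because "|" is a regex alternation operator in strsplit().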
2. Qualtrics Conjoint Implementation Methods
Qualtrics conjoint experiments use one of three implementation methods, each producing different column naming conventions:
Method A — Conjoint Survey Design Tool (Strezhnev): Generates JavaScript that Qualtrics executes to randomize profiles. Column naming follows F-{task}-{profile}-{attribute} for attribute levels and F-{task}-{attribute} for attribute names. The cjoint R package's read.qualtrics() function is purpose-built for this format.
Method B — Custom JavaScript + Embedded Data: Researchers write JavaScript to randomize attributes and store values in Qualtrics embedded data fields. Column naming is researcher-defined. Two common conventions: (i) C{x}-F-{task}-{idx} for attribute names and C{x}-F-{task}-{profile}-{idx} for profile values; (ii) the Graham (2020) convention, choice{task}_{attr}{profile} with fixed attribute order (e.g., choice1_bread1) or c{task}_attrib{pos}_name / c{task}_attrib{pos}_sand{profile} when attribute order is also randomized. Requires manual reshaping (Section 4).
Method C — Loop & Merge: Each loop iteration represents one conjoint task. Embedded data fields are piped into questions via ${e://Field/variable_name}, and the current iteration's Loop & Merge fields via ${lm://Field/N}. Column names reflect the embedded data field structure. Requires manual reshaping.
Before writing any cleaning code: Inspect the actual column headers, the QSF survey definition file, or any JavaScript in the survey to determine which method was used. Do not assume a column naming convention.
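A first pass over the column headers can be sketched as a set of regular expressions, one per convention listed above. The helper name detect_conjoint_convention and the pattern labels are hypothetical, and the patterns assume that task, profile, and attribute indices are digits. An empty result means none of these conventions matched, in which case the QSF or survey JavaScript is the next place to look.

```r
# Hypothetical helper: count header names matching each known convention.
detect_conjoint_convention <- function(cols) {
  patterns <- c(
    sdt           = "^F-[0-9]+-[0-9]+-[0-9]+$",            # Method A profile values
    prefixed_F    = "^.+-F-[0-9]+-[0-9]+-[0-9]+$",          # Method B (i), e.g. C1-F-1-2-3
    graham_fixed  = "^choice[0-9]+_[A-Za-z]+[0-9]+$",       # e.g. choice1_bread1
    graham_random = "^c[0-9]+_attrib[0-9]+_(name|[A-Za-z]+[0-9]+)$"
  )
  hits <- vapply(patterns, function(p) sum(grepl(p, cols)), integer(1))
  hits[hits > 0]
}

sdt_hits    <- detect_conjoint_convention(c("ResponseId", "F-1-1-1", "F-1-2-1", "F-1-1"))
graham_hits <- detect_conjoint_convention(c("choice1_bread1", "c1_attrib1_name", "c1_attrib1_sand2"))
```

In practice the input would be names(data) from the imported export; more than one label in the result is a signal to inspect the headers by hand.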
3. Existing R Packages for Conjoint Data Import
Before writing custom reshaping code, check whether an existing package handles the data format:
cjoint::read.qualtrics() — Purpose-built for Conjoint SDT exports (Method A). Reads Qualtrics CSV directly, handles metadata rows, outputs one row per profile with a selected column. Parameters: responses (choice column names), covariates (respondent-level variables), respondentID, new.format (TRUE for 3-row headers), ranks (for rank/rating/top-L designs). Supports binary forced choice, profile ranks, per-profile ratings, and top-L choices; see ?cjoint::read.qualtrics Details for the four response types. Requires PHP/JS output from the Conjoint Survey Design Tool.
cjdata::reshape_conjoint() — Lightweight alternative. Functions: read_Qualtrics() + reshape_conjoint(). Handles basic wide-to-long conversion. Requires the terminal character of each outcome string to be {"1","2"} or {"A","B"} (so "Candidate A" works; Japanese zenkaku digits supported). Respondent covariates merged separately.
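Because the terminal-character rule fails quietly on unexpected outcome strings, it can help to check the outcome columns before handing them to a package. The following is a sketch of that pre-check only, not the package's internal code; outcome_to_profile and the sample strings are hypothetical.

```r
# Hypothetical pre-check: map the last character of each outcome string to a
# profile index. NA marks strings the terminal-character rule cannot parse.
outcome_to_profile <- function(x, labels = c("A", "B")) {
  x <- as.character(x)
  last_char <- substr(x, nchar(x), nchar(x))
  match(last_char, labels)
}

by_letter <- outcome_to_profile(c("Candidate A", "Candidate B", "Neither"))
by_digit  <- outcome_to_profile(c("Profile 1", "Profile 2"), labels = c("1", "2"))
```

Any NA in the result for a non-missing response indicates an outcome label that needs recoding before reshaping.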
projoint::reshape_projoint() — For measurement-error-corrected analysis per Clayton, Horiuchi, Kaufman, King, and Komisarchik (2023). Built-in support for repeated tasks (IRR estimation), missing-agreement imputation (.fill = TRUE), and bias-corrected AMCEs. Outcome column names must contain task-ID digits, and the repeated-task outcome must be the last element of .outcomes. By default (.alphabet = "K") it expects wide columns named K-{task}-{attribute} and K-{task}-{profile}-{attribute}; the selected profile is parsed from the final character of each outcome string via .choice_labels (default c("A","B")). Specify .flipped = TRUE when the repeated task presents profiles in reversed left/right order (see exampleData1 vs. exampleData2 in the manual); this changes how agreement is computed. Trap: projoint::read_Qualtrics() hard-codes a 2-row metadata skip (legacy format). For current 3-row Qualtrics exports, pre-strip the third metadata row or read manually before calling reshape_projoint(), for example with readr::read_csv(skip = 3) plus col_names taken from the first row, because skip = 3 on its own also discards the header row.
cregg::cj_tidy() — Reshapes wide data across the three-level respondent/task/profile hierarchy via two named lists: profile_variables (features and profile-specific outcomes that vary within a task) and task_variables (variables that vary by task but not across profiles within it). Crucially, a choice variable that names the chosen profile ("left"/"right") goes in task_variables and must be recoded after reshaping, whereas per-profile "chosen" indicators go in profile_variables — getting this wrong silently corrupts the outcome. Constraint handling is not a cj_tidy feature; two-way design constraints are specified downstream via * in the amce()/cj() formula.
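The post-reshape recode of a task-level choice variable can be sketched in base R. The toy data frame stands in for reshaped long data and is not actual cj_tidy() output; the column names (id, task, profile, choice) and the left = "A" / right = "B" mapping are assumptions to adapt to the real design.

```r
# Toy long data: one row per respondent x task x profile, with the task-level
# choice ("left"/"right") repeated on both profile rows.
long <- data.frame(
  id      = rep(c("R_1", "R_2"), each = 2),
  task    = 1L,
  profile = rep(c("A", "B"), times = 2),
  choice  = rep(c("left", "right"), each = 2),
  stringsAsFactors = FALSE
)

# Assumed mapping from screen side to profile label
side_to_profile <- c(left = "A", right = "B")

# Per-profile binary outcome: 1 on the row whose profile was chosen, else 0
long$chosen <- as.integer(side_to_profile[long$choice] == long$profile)

# Sanity check: exactly one chosen profile per respondent-task
chosen_per_task <- tapply(long$chosen, list(long$id, long$task), sum)
```

If chosen_per_task contains anything other than 1 in a forced-choice design, the mapping or the variable placement is wrong.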
Package decision rule (default then escape hatch):
- Method A (SDT) export, no measurement-error correction needed → cjoint::read.qualtrics()
- Measurement-error / IRR correction via a repeated task → projoint::reshape_projoint()
- Simple Qualtrics CSV with binary outcome and default column naming → cjdata::reshape_conjoint()
- Complex wide data with non-standard column names that still map cleanly to profile/task variables → cregg::cj_tidy()
- Method B/C exports with custom embedded-data naming, language translation, attribute-level merges, or pilot-data exclusion that existing packages cannot handle → manual reshape (Section 4)
4. Manual Wide-to-Long Reshaping
When existing packages cannot handle the data format, reshape manually. The goal: one row per respondent x task x profile, one column per attribute.
Step 1: Build a long table of (ResponseId, task, profile, attribute_name, attribute_value)
Iterate over tasks, profiles, and attribute positions. For each combination, read the attribute name from the name column and the corresponding value from the value column. This naturally handles randomized attribute order.
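# Assumes `data` is the wide respondent-level data frame, `prefix` is the column
# prefix (e.g., "C1"), and T, P, K are the numbers of tasks, profiles, and
# attributes. Note that assigning T masks R's shorthand for TRUE in this scope.
rows <- vector("list", T * P * K)
i <- 0L
for (task in seq_len(T)) {
  # Columns holding the attribute names shown at each position in this task
  name_cols <- paste0(prefix, "-F-", task, "-", seq_len(K))
  for (profile in seq_len(P)) {
    # Columns holding this profile's level at each attribute position
    val_cols <- paste0(prefix, "-F-", task, "-", profile, "-", seq_len(K))
    for (idx in seq_len(K)) {
      i <- i + 1L
      # One block of rows (all respondents) per task x profile x position
      rows[[i]] <- data.frame(
        ResponseId = data$ResponseId,
        task = task,
        profile = profile,
        attribute_name = data[[name_cols[idx]]],
        attribute_value = data[[val_cols[idx]]],
        stringsAsFactors = FALSE
      )
    }
  }
}
# Stack all blocks into a single long table
long <- data.table::rbindlist(rows)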
The loop stacks its pieces with data.table::rbindlist() for performance on large datasets. For Graham-style embedded fields (e.g., choice1_bread1, c1_attrib1_name/c1_attrib1_sand1), a single tidyr::pivot_longer(names_pattern = ...) call is often cleaner than the triple loop: match the numeric indices into task/profile/attribute-position columns, then pivot back wide on attribute_name.
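A minimal sketch of the pivot_longer() approach for the fixed-order Graham convention (choice{task}_{attr}{profile}), assuming tidyr is installed. The toy data, the attribute names bread and meat, and the object names wide, long, and profiles are illustrative only.

```r
library(tidyr)

# Toy wide data: 2 respondents, 1 task, 2 profiles, 2 attributes
wide <- data.frame(
  ResponseId     = c("R_1", "R_2"),
  choice1_bread1 = c("rye", "white"),
  choice1_bread2 = c("wheat", "rye"),
  choice1_meat1  = c("ham", "turkey"),
  choice1_meat2  = c("turkey", "ham"),
  stringsAsFactors = FALSE
)

# One regex with three capture groups: task, attribute name, profile
long <- pivot_longer(
  wide,
  cols          = matches("^choice[0-9]+_[A-Za-z]+[0-9]+$"),
  names_to      = c("task", "attribute_name", "profile"),
  names_pattern = "^choice([0-9]+)_([A-Za-z]+)([0-9]+)$",
  values_to     = "attribute_value"
)
long$task    <- as.integer(long$task)
long$profile <- as.integer(long$profile)

# Back to one row per respondent x task x profile, one column per attribute
profiles <- pivot_wider(long,
                        names_from  = attribute_name,
                        values_from = attribute_value)
```

The cols pattern deliberately requires an attribute name between the underscore and the trailing digits, so outcome or covariate columns that do not follow the convention are left untouched.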
Step 2: Filter missing data. Remove rows where attribute_name or `attribute_value