Data Wrangler
Manipulate tabular data (CSV, Excel, JSON, Parquet, TSV) w/ pandas-powered scripts. Two scripts cover all operations: data_wrangler.py for data ops, excel_toolkit.py for Excel-specific features.
When to Use
- User asks to read, edit, filter, sort, or transform CSV/Excel/JSON/Parquet/TSV files
- User asks to merge/join datasets, deduplicate, fill missing values, or validate data
- User asks to create Excel workbooks w/ formatting, dropdowns, freeze panes, or multi-sheet
- User asks to pivot, unpivot, group-by, aggregate, sample, or split datasets
- User asks to add computed columns, rename columns, cast types, or apply formulas
- User asks to convert between data formats (CSV -> Excel, JSON -> Parquet, etc.)
- User asks to inspect/profile data structure, types, nulls, stats
Prerequisites
# Required
pip install pandas openpyxl
# Optional (per feature)
pip install pyarrow # Parquet support
pip install xlrd # Legacy .xls read
pip install pandasql # SQL queries on DataFrames
pip install fastparquet # Alternative Parquet engine
Quick Routing
| Task | Script | Command |
|---|---|---|
| Inspect/profile data | data_wrangler.py | inspect |
| Filter rows | data_wrangler.py | filter --where "expr" |
| Sort by columns | data_wrangler.py | sort --by Col --desc |
| Group & aggregate | data_wrangler.py | group --by Col --agg "Col:func" |
| Merge/join files | data_wrangler.py | merge f2 --on Key --how left |
| Pivot / unpivot | data_wrangler.py | pivot --index/--unpivot |
| Remove duplicates | data_wrangler.py | dedupe --subset "Col" |
| Fill missing values | data_wrangler.py | fill --column Col --strategy mean |
| Drop cols/rows | data_wrangler.py | drop --columns "A,B" |
| Rename columns | data_wrangler.py | rename --map "old:new" |
| Cast types | data_wrangler.py | cast --column Col --dtype datetime |
| Computed columns | data_wrangler.py | derive --formula "New = A + B" |
| Random sample | data_wrangler.py | sample --n 100 |
| Split by values | data_wrangler.py | split --by Region |
| Validate rules | data_wrangler.py | validate --rules rules.json |
| Apply formulas | data_wrangler.py | formula --expr "C=A+B" |
| Convert formats | data_wrangler.py | convert -o data.xlsx |
| SQL queries | data_wrangler.py | query --sql "SELECT..." |
| List Excel sheets | excel_toolkit.py | sheets |
| Extract sheet | excel_toolkit.py | extract --sheet Sales -o sales.csv |
| Combine -> xlsx | excel_toolkit.py | combine *.csv -o combined.xlsx |
| Format headers | excel_toolkit.py | format --header-style bold,blue --autowidth |
| Freeze panes | excel_toolkit.py | freeze --at B2 |
| Auto-filter | excel_toolkit.py | autofilter |
| Dropdown validation | excel_toolkit.py | validate --column Status --values "Open,Closed" |
| Protect sheet | excel_toolkit.py | protect --password secret |
| Create workbook | excel_toolkit.py | create --columns "Name,Age" -o template.xlsx |
Usage Patterns
Data Operations (data_wrangler.py)
All operations follow: python3 scripts/data_wrangler.py <op> <input> [options] [-o output]
# Inspect
python3 data_wrangler.py inspect sales.csv
python3 data_wrangler.py inspect data.xlsx --sheet "Q1 Sales" --nrows 1000
# Filter
python3 data_wrangler.py filter data.csv --where "Revenue > 10000" -o high_rev.csv
python3 data_wrangler.py filter data.csv --where 'Status == "active" and Age >= 25' -o active.csv
# Sort
python3 data_wrangler.py sort data.csv --by "Revenue,Name" --desc -o sorted.csv
# Group + Aggregate
python3 data_wrangler.py group data.csv --by Department --agg "Salary:mean,Salary:count,Revenue:sum" -o summary.csv
# Merge
python3 data_wrangler.py merge orders.csv customers.csv --on CustomerID --how left -o joined.csv
# Pivot
python3 data_wrangler.py pivot data.csv --index Name --columns Month --values Sales --aggfunc sum -o pivoted.csv
# Unpivot (melt)
python3 data_wrangler.py pivot wide.csv --index ID --unpivot --var-name Metric --value-name Value -o long.csv
# Deduplicate
python3 data_wrangler.py dedupe data.csv --subset "Email" --keep first -o clean.csv
# Fill nulls
python3 data_wrangler.py fill data.csv --column "Revenue,Profit" --strategy mean -o filled.csv
# Drop columns
python3 data_wrangler.py drop data.csv --columns "TempCol,Notes" -o trimmed.csv
python3 data_wrangler.py drop data.csv --null-threshold 0.5 -o cleaned.csv
# Rename
python3 data_wrangler.py rename data.csv --map "old_name:new_name,col2:Column2" -o renamed.csv
python3 data_wrangler.py rename data.csv --snake -o snake_case.csv
# Cast types
python3 data_wrangler.py cast data.csv --column Date --dtype datetime --date-format "%Y-%m-%d" -o typed.csv
# Computed columns
python3 data_wrangler.py derive data.csv --formula "Profit = Revenue - Cost" -o enriched.csv
# Sample
python3 data_wrangler.py sample large.csv --n 500 --seed 42 -o sample.csv
# Split by value
python3 data_wrangler.py split data.csv --by Region --output-dir ./by_region/
# Validate
python3 data_wrangler.py validate data.csv --rules validation_rules.json -o report.json
# Formula
python3 data_wrangler.py formula data.xlsx --expr "Total=Price*Quantity" -o calculated.xlsx
# Convert
python3 data_wrangler.py convert data.csv -o data.xlsx
python3 data_wrangler.py convert data.xlsx -o data.json
python3 data_wrangler.py convert data.json -o data.parquet
# SQL query
python3 data_wrangler.py query data.csv --sql "SELECT Name, AVG(Salary) FROM df WHERE Dept='Eng' GROUP BY Name"
Excel Operations (excel_toolkit.py)
All operations follow: python3 scripts/excel_toolkit.py <op> <input> [options] [-o output]
# List sheets
python3 excel_toolkit.py sheets workbook.xlsx
# Extract sheet
python3 excel_toolkit.py extract workbook.xlsx --sheet "Sales Q1" -o sales_q1.csv
# Combine multiple files into multi-sheet xlsx
python3 excel_toolkit.py combine sales.csv inventory.csv orders.csv -o report.xlsx
# Format
python3 excel_toolkit.py format data.xlsx --header-style bold,blue --autowidth --zebra -o styled.xlsx
# Freeze panes
python3 excel_toolkit.py freeze data.xlsx --at B2 -o frozen.xlsx
# Auto-filter
python3 excel_toolkit.py autofilter data.xlsx -o filtered.xlsx
# Dropdown validation
python3 excel_toolkit.py validate data.xlsx --column Status --values "Open,Closed,Pending" -o validated.xlsx
# Protect
python3 excel_toolkit.py protect data.xlsx --password mypass -o protected.xlsx
# Create template
python3 excel_toolkit.py create --columns "Name,Email,Department,Start Date,Salary" -o template.xlsx
Validation Rules Format
Create a JSON rules file for validate:
{
"rules": [
{"column": "Email", "type": "not_null"},
{"column": "Email", "type": "pattern", "regex": "^[^@]+@[^@]+\\.[^@]+$"},
{"column": "ID", "type": "unique"},
{"column": "Age", "type": "range", "min": 0, "max": 150},
{"column": "Status", "type": "enum", "values": ["active", "inactive", "pending"]}
]
}
Rule types: not_null, unique, range (min/max), pattern (regex), enum (allowed values).
Fill Strategies
| Strategy | Behavior |
|---|---|
mean | Fill w/ column mean (numeric) |
median | Fill w/ column median (numeric) |
mode | Fill w/ most frequent value |
zero | Fill w/ 0 |
empty | Fill w/ empty string |
ffill | Forward fill (carry last value) |
bfill | Backward fill |
drop | Drop rows w/ nulls in column |
value:<v> | Fill w/ specific value |
Supported Formats
| Format | Read | Write | Dependency |
|---|---|---|---|
| CSV | Y | Y | (builtin) |
| TSV | Y | Y | (builtin) |
| XLSX | Y | Y | openpyxl |
| XLS | Y | N | xlrd |
| JSON | Y | Y | (builtin) |
| JSONL | Y | Y | (builtin) |
| Parquet | Y | Y | pyarrow |
Integration w/ file-converter
Pipeline data between skills:
# 1. Convert YAM