CRM Hygiene Scanner
Scan a CSV export of CRM data (contacts, companies, or deals), detect duplicates, flag stale records, measure data completeness, calculate an overall quality score, and produce a prioritized cleanup plan. This is a read-only audit — the original file is never modified.
Tools Used
- Read — load the CSV file and any existing hygiene reports
- Write — save the hygiene report to docs/crm-hygiene-report.md
- Bash — run Python one-liners for CSV parsing, fuzzy matching, and statistical analysis
- Glob — check for existing reports before overwriting
Methodology
Follow these steps in order. Do not skip steps. Do not fabricate data.
Step 1: Input
Ask the user for four pieces of information before proceeding:
To run the CRM hygiene scan, I need:
- CRM export file path — path to the CSV file on disk
- Data type — is this contacts, companies, or deals/opportunities?
- CRM system — which CRM did this come from? (HubSpot, Salesforce, Pipedrive, or other)
- Critical fields — which fields matter most for your business? (e.g., email, phone, company name, deal stage, last activity date, owner)
Wait for all four answers. Do not assume defaults.
Once the user responds, use Read to confirm that the CSV file exists at the given path. If the file is not found, tell the user and ask for the correct path.
Step 2: Data Profiling
Read the CSV and build a profile of the dataset. Use Bash with Python one-liners for parsing.
Produce these metrics:
| Metric | How to Calculate |
|---|---|
| Total records | Row count (excluding header) |
| Column inventory | List every column name |
| Data types | Infer type per column: text, email, phone, date, number, boolean, URL |
| Fill rate per column | (non-empty cells / total rows) * 100 — report as percentage |
| Date range | Oldest and newest value across all date columns |
| Unique vs. total values | For each key field (email, company name, phone), count unique values vs. total rows |
| Row completeness | Average number of filled columns per row, as a percentage of total columns |
Present the profile to the user as a table before continuing. This gives them a chance to flag any column mapping issues early.
For CSVs with more than 10,000 rows: read the first 10,000 rows for profiling and extrapolate. Clearly note that results are based on a sample and state the sample size.
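The fill-rate and row-completeness metrics above can be sketched in Python roughly as follows. This is a minimal illustration, not a prescribed implementation: the helper names `load_rows` and `profile_rows` and the sample column names are assumptions, and it treats a cell as empty when it is blank after trimming whitespace.

```python
import csv
import io

SAMPLE_LIMIT = 10_000  # profiling sample size described in the step above


def load_rows(csv_text, limit=SAMPLE_LIMIT):
    """Parse CSV text into dict rows, stopping after `limit` data rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for index, row in enumerate(reader):
        if index >= limit:
            break
        rows.append(row)
    return rows, list(reader.fieldnames or [])


def profile_rows(rows, columns):
    """Return total records, fill rate per column, and row completeness (percentages)."""
    total = len(rows)
    filled = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            # DictReader yields None for cells missing from short rows.
            if (row.get(col) or "").strip():
                filled[col] += 1
    fill_rate = {
        col: (filled[col] / total * 100 if total else 0.0) for col in columns
    }
    # Average share of filled columns per row; with equal row counts this
    # equals the mean of the per-column fill rates.
    row_completeness = sum(fill_rate.values()) / len(columns) if columns else 0.0
    return {
        "total_records": total,
        "fill_rate": fill_rate,
        "row_completeness": row_completeness,
    }


# Hypothetical three-column export used only for illustration.
SAMPLE_CSV = "name,email,phone\nAda,ada@example.com,\nBob,,555-0100\n"
rows, columns = load_rows(SAMPLE_CSV)
profile = profile_rows(rows, columns)
print(profile)
```

For a real export, the same functions would be fed the file contents instead of `SAMPLE_CSV`; when the row limit is hit, the report should state that the figures come from a 10,000-row sample.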
Step 3: Duplicate Detection
Scan for three types of duplicates. Present each group with a confidence level.
3a. Exact Duplicates (Confidence: HIGH)
- Same email address (case-insensitive, whitespace-trimmed)
- Same company name AND same contact name (case-insensitive)
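One way to sketch the two exact-match rules is to bucket rows by a normalized key. The column names `email`, `name`, and `company` are assumptions about the export; map them to the actual headers before running anything like this.

```python
from collections import defaultdict


def exact_duplicate_groups(rows):
    """Return sorted groups of row indices that share an exact-match key."""
    by_key = defaultdict(list)
    for idx, row in enumerate(rows):
        email = (row.get("email") or "").strip().lower()
        name = (row.get("name") or "").strip().lower()
        company = (row.get("company") or "").strip().lower()
        if email:
            by_key[("email", email)].append(idx)
        if name and company:
            by_key[("company+name", company, name)].append(idx)
    # A pair can match on both rules; the set removes the repeated group.
    groups = {tuple(idxs) for idxs in by_key.values() if len(idxs) > 1}
    return sorted(groups)


# Hypothetical rows for illustration only.
example_rows = [
    {"name": "Ada Lovelace", "email": "Ada@Example.com ", "company": "Analytical Engines"},
    {"name": "A. Lovelace", "email": "ada@example.com", "company": "Analytical Engines"},
    {"name": "Bob Ray", "email": "bob@example.com", "company": "Acme"},
    {"name": "bob ray", "email": "b.ray@example.com", "company": "ACME"},
]
print(exact_duplicate_groups(example_rows))
```

Rows with an empty email are skipped for the email rule so that blank cells do not collapse into one large false group.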
3b. Fuzzy Duplicates (Confidence: MEDIUM)
- Similar company names after normalization. Normalize by:
  - Lowercasing
  - Stripping legal suffixes: Inc, Inc., Incorporated, LLC, Ltd, Ltd., GmbH, AG, Corp, Corp., Co, Co.
  - Stripping punctuation and extra whitespace
  - Comparing the normalized strings: flag if they match or if the edit distance is 2 or fewer characters
- Similar contact names accounting for:
  - First name vs. nickname (e.g., "Robert" vs. "Bob", "William" vs. "Will")
  - Reversed first/last name order
  - Middle name included vs. omitted
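The company-name rule can be sketched with a small normalizer and a plain Levenshtein distance. The function names are illustrative, and two choices here are assumptions rather than part of the method: punctuation is replaced with a space instead of being deleted, and legal suffixes are only stripped from the end of the name. Contact-name variants (nicknames, reversed order, middle names) are not covered by this sketch.

```python
import re

# Suffix list from the normalization rule, after lowercasing and removing dots.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "gmbh", "ag", "corp", "co"}


def normalize_company(name):
    """Lowercase, drop punctuation, collapse whitespace, strip trailing legal suffixes."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = cleaned.split()
    # Keep at least one token so a company literally named "Co" is not erased.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def edit_distance(a, b):
    """Levenshtein distance using a rolling row of the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_fuzzy_company_match(a, b, max_distance=2):
    """True when the normalized names match or differ by at most `max_distance` edits."""
    norm_a, norm_b = normalize_company(a), normalize_company(b)
    if not norm_a or not norm_b:
        return False
    return edit_distance(norm_a, norm_b) <= max_distance


print(is_fuzzy_company_match("Globex Corp.", "Globexx LLC"))
```

A fixed threshold of 2 is generous for very short names, so these matches are worth reviewing by hand, which fits the MEDIUM confidence label.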
3c. Cross-Field Duplicates (Confidence: MEDIUM)
- Different email but same phone number AND same name
- Different email but same company AND same job title (likely the same person who changed email)
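The first cross-field rule (same phone and name, different email) might look like the sketch below. It assumes columns named `name`, `phone`, and `email`, and it compares phone numbers by digits only, which does not reconcile country-code prefixes. The company and job title rule would follow the same pattern with a different key.

```python
import re
from collections import defaultdict


def digits_only(phone):
    """Reduce a phone string to its digits so formatting differences are ignored."""
    return re.sub(r"\D", "", phone or "")


def same_phone_and_name_groups(rows):
    """Group row indices that share name and phone but carry different emails."""
    by_key = defaultdict(list)
    for idx, row in enumerate(rows):
        name = (row.get("name") or "").strip().lower()
        phone = digits_only(row.get("phone"))
        if name and phone:
            by_key[(name, phone)].append(idx)
    groups = []
    for idxs in by_key.values():
        emails = {(rows[i].get("email") or "").strip().lower() for i in idxs}
        # More than one distinct email distinguishes this from an exact duplicate.
        if len(idxs) > 1 and len(emails) > 1:
            groups.append(idxs)
    return groups


# Hypothetical rows for illustration only.
cross_rows = [
    {"name": "Ada Lovelace", "phone": "(555) 010-0100", "email": "ada@old.example"},
    {"name": "ada lovelace", "phone": "555.010.0100", "email": "ada@new.example"},
    {"name": "Bob Ray", "phone": "555-010-0199", "email": "bob@example.com"},
]
print(same_phone_and_name_groups(cross_rows))
```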
Duplicate Output Format
Group duplicates together. For each group, present:
Duplicate Group #{n} — Confidence: {HIGH/MEDIUM}
| Row | Name | Email | Company | Phone | Last Activity | Completeness |
|-----|------|-------|---------|-------|---------------|--------------|
| ... | ... | ... | ... | ... | ... | {n}/{total} fields filled |
Recommendation: Keep Row {X} (most complete + most recent activity). Merge data from Row {Y} into Row {X} before deleting.
For each group, recommend which record to keep based on:
- Most fields filled (highest completeness)
- Most recent last activity date
- If tied, keep the one with a valid email address
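The keep rule can be expressed as a sort key, shown here as a rough sketch. It reads the three criteria as strict tie-breakers in the listed order, which is one interpretation. The `last_activity` column name, ISO-formatted dates, and the simple email pattern are all assumptions.

```python
import re
from datetime import date

# Deliberately loose pattern: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def keep_sort_key(row, columns):
    """Rank a record by completeness, then recency, then valid email."""
    filled = sum(1 for col in columns if (row.get(col) or "").strip())
    raw_date = (row.get("last_activity") or "").strip()
    # Assumes ISO dates (YYYY-MM-DD); records without a date sort last.
    last_activity = date.fromisoformat(raw_date) if raw_date else date.min
    has_valid_email = bool(EMAIL_RE.match((row.get("email") or "").strip()))
    return (filled, last_activity, has_valid_email)


def record_to_keep(group, columns):
    """Return the record from a duplicate group that the rule recommends keeping."""
    return max(group, key=lambda row: keep_sort_key(row, columns))


# Hypothetical duplicate group; "row" holds the CSV row number for reporting.
profile_columns = ["name", "email", "phone", "last_activity"]
duplicate_group = [
    {"row": 2, "name": "Ada", "email": "ada@example.com", "phone": "", "last_activity": "2024-01-10"},
    {"row": 7, "name": "Ada", "email": "ada@example.com", "phone": "555-0100", "last_activity": "2023-11-02"},
]
print(record_to_keep(duplicate_group, profile_columns)["row"])
```

Because the output is only a recommendation in a read-only audit, the other rows in the group should still be listed so the user can merge their data manually.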
Step 4: Stale Record Detection
Flag records based on inactivity. Adapt the criteria to the data type the user specified in Step 1.
For Contacts
| Classification | Criteria |
|---|---|
| Stale (90-180 days) | No activity in 90-180 days, or no email sent/received in 60-90 days |
| Dormant (180-365 days) | No activity in 180-365 days |
| Dead (365+ days) | No activity in over 365 days |
Also flag:
- Contacts with no associated company (orphaned contacts)
- Contacts with no email address AND no phone number (unreachable)
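A sketch of the contact classification and the two extra flags follows. The bands in the table share their boundary days, so this version assumes 180 days counts as Dormant and exactly 365 days counts as Dormant, with Dead meaning more than 365. The secondary email criterion (no email sent or received in 60-90 days) is not modeled, and the column names are hypothetical.

```python
from datetime import date, timedelta


def classify_contact(last_activity, today):
    """Map a last-activity date to Active, Stale, Dormant, Dead, or Unknown."""
    if last_activity is None:
        return "Unknown"
    days_inactive = (today - last_activity).days
    if days_inactive < 90:
        return "Active"
    if days_inactive < 180:
        return "Stale"
    if days_inactive <= 365:
        return "Dormant"
    return "Dead"


def contact_flags(row):
    """Return the additional flags for a contact row."""
    flags = []
    if not (row.get("company") or "").strip():
        flags.append("orphaned")
    has_email = bool((row.get("email") or "").strip())
    has_phone = bool((row.get("phone") or "").strip())
    if not has_email and not has_phone:
        flags.append("unreachable")
    return flags


# Fixed reference date so the example is reproducible.
scan_date = date(2024, 6, 30)
print(classify_contact(scan_date - timedelta(days=120), scan_date))
print(contact_flags({"company": "", "email": "", "phone": ""}))
```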
For Companies
| Classification | Criteria |
|---|---|
| Stale (90-180 days) | No associated contact activity in 90-180 days |
| Dormant (180-365 days) | No associated contact activity in 180-365 days |
| Dead (365+ days) | No associated contact activity in over 365 days |
Also flag:
- Companies with zero associated contacts (empty accounts)
For Deals/Opportunities
| Classification | Criteria |
|---|---|
| Stale (30-60 days) | Deal stuck in the same stage for 30-60 days |
| Dormant (60-180 days) | Deal stuck in the same stage for 60-180 days |
| Dead (180+ days) | Deal stuck in the same stage for 180+ days, or no activity for 180+ days |
Also flag:
- Deals with no associated contact
- Deals with no owner assigned
- Deals with a close date in the past but still marked as open
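The deal thresholds and flags could be sketched as below. It assumes the number of days in the current stage has already been derived from the export, that close dates are ISO-formatted, and that columns named `contact`, `owner`, `close_date`, and `status` exist; all of these are hypothetical and depend on the CRM.

```python
from datetime import date


def classify_deal(days_in_stage, days_since_activity=None):
    """Classify a deal by time stuck in stage, with the 180-day inactivity override."""
    inactive_too_long = days_since_activity is not None and days_since_activity >= 180
    if days_in_stage >= 180 or inactive_too_long:
        return "Dead"
    if days_in_stage >= 60:
        return "Dormant"
    if days_in_stage >= 30:
        return "Stale"
    return "Active"


def deal_flags(row, today):
    """Return the additional flags for a deal row."""
    flags = []
    if not (row.get("contact") or "").strip():
        flags.append("no contact")
    if not (row.get("owner") or "").strip():
        flags.append("no owner")
    close_raw = (row.get("close_date") or "").strip()
    is_open = (row.get("status") or "").strip().lower() == "open"
    # Assumes ISO dates (YYYY-MM-DD) in the close date column.
    if close_raw and is_open and date.fromisoformat(close_raw) < today:
        flags.append("past close date but open")
    return flags


example_deal = {"contact": "Ada", "owner": "", "close_date": "2024-01-15", "status": "Open"}
print(classify_deal(45), deal_flags(example_deal, date(2024, 6, 30)))
```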
If the CSV does not contain a last activity date or equivalent column, tell the user which column you looked for, report that staleness detection is limited, and skip to Step 5.
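Present staleness results as a summary table. Use the thresholds for the data type from Step 1; the rows below show the contact and company thresholds, and deals use 30, 60, and 180 days instead: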
| Classification | Count | % of Total | Action |
|---|---|---|---|
| Active (< 90 days) | ... | ...% | No action needed |
| Stale (90-180 days) | ... | ...% | Re-engage or update |
| Dormant (180-365 days) | ... | ...% | Archive or run win-back |
| Dead (365+ days) | ... | ...% | Archive immediately |
Step 5: Missing Data Analysis
For each field the user identified as critical in Step 1, report:
| Field | Records Missing | % Missing | Impact |
|---|---|---|---|
| {field name} | {count} | {percentage} | {impact assessment} |
Impact Assessment
Classify the impact of each missing field:
- Critical — This field is required for outreach or pipeline management. Missing it blocks action. Examples: email on a contact, deal stage on an opportunity, company name on an account.
- High — This field significantly improves targeting or personalization. Missing it reduces effectiveness. Examples: job title, industry, company size.
- Medium — This field is useful but not blocking. Examples: phone number, LinkedIn URL, address.
- Low — Nice to have. Examples: Twitter handle, secondary email.
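The missing-data table can be produced with a short helper like the sketch below. The `IMPACT_BY_FIELD` lookup is a hypothetical mapping built from the examples in the impact list above; the real classification depends on the data type and on what the user called critical in Step 1, so unknown fields are left as "Unclassified" rather than guessed.

```python
# Hypothetical field-to-impact lookup based on the examples in the impact list.
IMPACT_BY_FIELD = {
    "email": "Critical",
    "deal_stage": "Critical",
    "company": "Critical",
    "job_title": "High",
    "industry": "High",
    "company_size": "High",
    "phone": "Medium",
    "linkedin_url": "Medium",
    "address": "Medium",
    "twitter": "Low",
    "secondary_email": "Low",
}


def missing_report(rows, critical_fields):
    """Return one report entry per critical field with count, percentage, and impact."""
    total = len(rows)
    report = []
    for field in critical_fields:
        missing = sum(1 for row in rows if not (row.get(field) or "").strip())
        pct_missing = round(missing / total * 100, 1) if total else 0.0
        report.append(
            {
                "field": field,
                "missing": missing,
                "pct_missing": pct_missing,
                "impact": IMPACT_BY_FIELD.get(field, "Unclassified"),
            }
        )
    return report


# Hypothetical rows for illustration only.
audit_rows = [
    {"email": "ada@example.com", "phone": "", "job_title": "CTO"},
    {"email": "", "phone": "", "job_title": "VP Sales"},
    {"email": "cy@example.com", "phone": "555-0100", "job_title": ""},
    {"email": "di@example.com", "phone": "", "job_title": ""},
]
for entry in missing_report(audit_rows, ["email", "phone", "job_title"]):
    print(f"| {entry['field']} | {entry['missing']} | {entry['pct_missing']}% | {entry['impact']} |")
```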
Enrichment Recommendations
For each critical or high-impact missing field, suggest enrichment tools:
| Missing Field | Recommended Tools |
|---|---|
| Email | Apollo.io, Hunter.io, Clearbit, Snov.io |
| Phone | Apollo.io, Lusha, ZoomInfo |
| Company size / industry | Clearbit, Apollo.io, LinkedIn Sales Navigator |
| Job title | LinkedIn Sales Navigator, Apollo.io |
| LinkedIn URL | Apollo.io, Phantombuster |
| Company website | Clearbit, manual Google search |
| Revenue / funding | Crunchbase, PitchBook, Clearbit |
High-Impact Missing Records
List the top 10 records where missing data has the highest business impact. Prioritize by:
- Records in active deals or hot pipeline stages
- Records with recent activity but missing contact info
- Records matching ICP but missing fields needed for outreach
Step 6: Data Quality Score
Calculate an overall hygiene score on a 0-100 scale using four weighted components.
Component 1: Fill Rate of Critical Fields (40% weight)
Average the fill rates of the fields the user identified as critical in Step 1.
fill