CRM Hygiene Scanner

Scan a CSV export of CRM data (contacts, companies, or deals), detect duplicates, flag stale records, measure data completeness, calculate an overall quality score, and produce a prioritized cleanup plan. This is a read-only audit — the original file is never modified.

Tools Used

Read — load the CSV file and any existing hygiene reports
Write — save the hygiene report to docs/crm-hygiene-report.md
Bash — run Python one-liners for CSV parsing, fuzzy matching, and statistical analysis
Glob — check for existing reports before overwriting

Methodology

Follow these steps in order. Do not skip steps. Do not fabricate data.

Step 1: Input

Ask the user for four pieces of information before proceeding:

To run the CRM hygiene scan, I need:

CRM export file path — path to the CSV file on disk

Data type — is this contacts, companies, or deals/opportunities?

CRM system — which CRM did this come from? (HubSpot, Salesforce, Pipedrive, or other)

Critical fields — which fields matter most for your business? (e.g., email, phone, company name, deal stage, last activity date, owner)

Wait for all four answers. Do not assume defaults.

Once the user responds, validate the CSV path exists using Read. If the file is not found, tell the user and ask for the correct path.

Step 2: Data Profiling

Read the CSV and build a profile of the dataset. Use Bash with Python one-liners for parsing.

Produce these metrics:

| Metric | How to Calculate |
|---|---|
| Total records | Row count (excluding header) |
| Column inventory | List every column name |
| Data types | Infer type per column: text, email, phone, date, number, boolean, URL |
| Fill rate per column | `(non-empty cells / total rows) * 100`, reported as a percentage |
| Date range | Oldest and newest value across all date columns |
| Unique vs. total values | For each key field (email, company name, phone), count unique values vs. total rows |
| Row completeness | Average number of filled columns per row, as a percentage of total columns |

Present the profile to the user as a table before continuing. This gives them a chance to flag any column mapping issues early.

For CSVs with more than 10,000 rows: read the first 10,000 rows for profiling and extrapolate. Clearly note that results are based on a sample and state the sample size.
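
The fill-rate and row-completeness metrics can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `profile_rows`, `SAMPLE_LIMIT`, and the inline `SAMPLE_CSV` are hypothetical names and data, and it assumes that a cell containing only whitespace counts as empty.

```python
import csv
import io
from itertools import islice

SAMPLE_LIMIT = 10_000  # profiling sample size for large files


def profile_rows(rows, columns):
    """Return (fill_rate_by_column, row_completeness), both as percentages."""
    total = len(rows)
    filled = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            # Assumption: None and whitespace-only cells count as empty.
            if (row.get(col) or "").strip():
                filled[col] += 1
    fill_rate = {
        col: (filled[col] / total * 100 if total else 0.0) for col in columns
    }
    # Filled cells / all cells equals the mean of the per-column fill rates.
    row_completeness = sum(fill_rate.values()) / len(columns) if columns else 0.0
    return fill_rate, row_completeness


# Hypothetical sample standing in for a real CRM export.
SAMPLE_CSV = """name,email,phone
Ada Lovelace,ada@example.com,
Bob Stone,,555-0100
Cy Young,cy@example.com,555-0101
,,
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(islice(reader, SAMPLE_LIMIT))  # read at most the first 10,000 rows
fill_rate, row_completeness = profile_rows(rows, reader.fieldnames)

for column, rate in fill_rate.items():
    print(f"{column}: {rate:.1f}% filled")
print(f"Row completeness: {row_completeness:.1f}%")
```

For a real export, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle. If `len(rows)` reaches `SAMPLE_LIMIT`, report the figures as sample-based.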

Step 3: Duplicate Detection

Scan for three types of duplicates. Present each group with a confidence level.

3a. Exact Duplicates (Confidence: HIGH)

Same email address (case-insensitive, whitespace-trimmed)
Same company name AND same contact name (case-insensitive)
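
A minimal sketch of the two exact-match rules above. The column names `email`, `name`, and `company` are assumptions about the export, and row numbers assume the header sits on row 1 of the CSV.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join((value or "").lower().split())


def find_exact_duplicates(rows):
    """Return lists of row numbers that share an email or a company + name pair."""
    by_email = defaultdict(list)
    by_name_company = defaultdict(list)
    # Assumption: data rows start at CSV row 2 because row 1 is the header.
    for row_number, row in enumerate(rows, start=2):
        email = normalize(row.get("email"))
        name = normalize(row.get("name"))
        company = normalize(row.get("company"))
        if email:
            by_email[email].append(row_number)
        if name and company:
            by_name_company[(company, name)].append(row_number)
    groups = [g for g in by_email.values() if len(g) > 1]
    # Skip name/company groups that are already reported as email groups.
    groups += [
        g for g in by_name_company.values() if len(g) > 1 and g not in groups
    ]
    return groups


# Hypothetical records for illustration.
sample_rows = [
    {"name": "Ada Lovelace", "email": "Ada@Example.com ", "company": "Analytical Engines"},
    {"name": "Ada Lovelace", "email": "ada@example.com", "company": "Analytical Engines"},
    {"name": "bob stone", "email": "bob@example.com", "company": "Stone Co"},
    {"name": "Bob Stone", "email": "b.stone@example.com", "company": "stone co"},
]
exact_groups = find_exact_duplicates(sample_rows)
print(exact_groups)
```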

3b. Fuzzy Duplicates (Confidence: MEDIUM)

Similar company names after normalization. Normalize by:
- Lowercasing
- Stripping legal suffixes: Inc, Inc., Incorporated, LLC, Ltd, Ltd., GmbH, AG, Corp, Corp., Co, Co.
- Stripping punctuation and extra whitespace
Then compare the normalized strings: flag a pair if the strings match exactly or if their edit distance is 2 or less.
Similar contact names accounting for:
- First name vs. nickname (e.g., "Robert" vs. "Bob", "William" vs. "Will")
- Reversed first/last name order
- Middle name included vs. omitted
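
The company-name rule can be sketched with a plain Levenshtein distance. This is one possible reading of the rule, not a definitive implementation: the function names are hypothetical, and it assumes legal suffixes are stripped only at the end of the name. The contact-name checks (nicknames, reversed order, middle names) are not covered here, since they need a nickname lookup table that this document does not specify.

```python
import re

# Suffix list from the normalization rule, with punctuation already removed.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "gmbh", "ag", "corp", "co"}


def normalize_company(name):
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", (name or "").lower()).split()
    # Assumption: only strip suffixes at the end of the name.
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def edit_distance(a, b):
    """Levenshtein distance computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def is_fuzzy_company_match(name_a, name_b, max_distance=2):
    a, b = normalize_company(name_a), normalize_company(name_b)
    if not a or not b:
        return False  # never match on an empty normalized name
    return edit_distance(a, b) <= max_distance


print(is_fuzzy_company_match("Acme, Inc.", "ACME LLC"))
print(is_fuzzy_company_match("Globex Corp.", "Globax Co"))
print(is_fuzzy_company_match("Initech", "Initrode"))
```

A fixed distance of 2 is generous for very short names, so MEDIUM-confidence groups built this way still need manual review.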

3c. Cross-Field Duplicates (Confidence: MEDIUM)

Different email but same phone number AND same name
Different email but same company AND same job title (likely the same person who changed email)
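
The first cross-field rule (same phone and name, different email) might look like the sketch below. Column names are assumptions, and phones are compared on digits only, so a number written with and without a country code will not match.

```python
import re
from collections import defaultdict


def phone_digits(phone):
    """Keep digits only, so '(555) 010-0100' and '555.010.0100' compare equal."""
    return re.sub(r"\D", "", phone or "")


def cross_field_duplicates(rows):
    """Return row-number groups sharing phone + name but not a single email."""
    groups = defaultdict(list)
    # Assumption: data rows start at CSV row 2 because row 1 is the header.
    for row_number, row in enumerate(rows, start=2):
        phone = phone_digits(row.get("phone"))
        name = " ".join((row.get("name") or "").lower().split())
        email = (row.get("email") or "").strip().lower()
        if phone and name:
            groups[(phone, name)].append((row_number, email))
    result = []
    for members in groups.values():
        distinct_emails = {email for _, email in members}
        # Identical emails are already covered by the exact-duplicate rule.
        if len(members) > 1 and len(distinct_emails) > 1:
            result.append([row_number for row_number, _ in members])
    return result


# Hypothetical records for illustration.
contacts = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "phone": "(555) 010-0100"},
    {"name": "ada lovelace", "email": "ada@newco.example", "phone": "555.010.0100"},
    {"name": "Bob Stone", "email": "bob@example.com", "phone": "555-010-0199"},
]
cross_groups = cross_field_duplicates(contacts)
print(cross_groups)
```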

Duplicate Output Format

Group duplicates together. For each group, present:

Duplicate Group #{n} — Confidence: {HIGH/MEDIUM}
| Row | Name | Email | Company | Phone | Last Activity | Completeness |
|-----|------|-------|---------|-------|---------------|--------------|
| ... | ...  | ...   | ...     | ...   | ...           | {n}/{total} fields filled |

Recommendation: Keep Row {X} (most complete + most recent activity). Merge data from Row {Y} into Row {X} before deleting.

For each group, recommend which record to keep based on:

Most fields filled (highest completeness)
Most recent last activity date
If tied, keep the one with a valid email address
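
One way to apply these criteria is a lexicographic sort key, as in the sketch below. It assumes the three criteria are applied in the order listed, that `last_activity` is an ISO date string (`YYYY-MM-DD`), and that a loose pattern is good enough to call an email valid. All names are hypothetical.

```python
import re
from datetime import date

# Deliberately loose validity check: something@something.tld
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(row):
    """Count the non-empty fields in a record."""
    return sum(1 for value in row.values() if (value or "").strip())


def keep_priority(row):
    """Sort key: completeness first, then recency, then a valid email."""
    try:
        activity = date.fromisoformat((row.get("last_activity") or "").strip())
    except ValueError:
        activity = date.min  # missing or unparseable dates rank last
    has_valid_email = bool(EMAIL_PATTERN.match((row.get("email") or "").strip()))
    return (completeness(row), activity, has_valid_email)


def choose_record_to_keep(group):
    return max(group, key=keep_priority)


# Hypothetical duplicate group.
row_a = {"name": "Ada", "email": "ada@example.com", "phone": "", "last_activity": "2024-01-10"}
row_b = {"name": "Ada", "email": "ada@example.com", "phone": "555-0100", "last_activity": "2023-06-01"}
keeper = choose_record_to_keep([row_a, row_b])
print(keeper)
```

Here `row_b` wins because it has more filled fields, even though `row_a` has the more recent activity. If recency should outrank completeness for a given team, swap the first two elements of the key.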

Step 4: Stale Record Detection

Flag records based on inactivity. Adapt the criteria to the data type the user specified in Step 1.

For Contacts

| Classification | Criteria |
|---|---|
| Stale (90-180 days) | No activity in 90-180 days, or no email sent/received in 60-90 days |
| Dormant (180-365 days) | No activity in 180-365 days |
| Dead (365+ days) | No activity in over 365 days |

Also flag:

Contacts with no associated company (orphaned contacts)
Contacts with no email address AND no phone number (unreachable)

For Companies

| Classification | Criteria |
|---|---|
| Stale (90-180 days) | No associated contact activity in 90-180 days |
| Dormant (180-365 days) | No associated contact activity in 180-365 days |
| Dead (365+ days) | No associated contact activity in over 365 days |

Also flag:

Companies with zero associated contacts (empty accounts)

For Deals/Opportunities

| Classification | Criteria |
|---|---|
| Stale (30-60 days) | Deal stuck in the same stage for 30-60 days |
| Dormant (60-180 days) | Deal stuck in the same stage for 60-180 days |
| Dead (180+ days) | Deal stuck in the same stage for 180+ days, or no activity for 180+ days |

Also flag:

Deals with no associated contact
Deals with no owner assigned
Deals with a close date in the past but still marked as open

If the CSV does not contain a last activity date or equivalent column, tell the user which column you looked for, report that staleness detection is limited, and skip to Step 5.

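Present staleness results as a summary table. The template below uses the contact and company thresholds; for deals, substitute the 30-60, 60-180, and 180+ day ranges: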

| Classification | Count | % of Total | Action |
|---|---|---|---|
| Active (< 90 days) | ... | ...% | No action needed |
| Stale (90-180 days) | ... | ...% | Re-engage or update |
| Dormant (180-365 days) | ... | ...% | Archive or run win-back |
| Dead (365+ days) | ... | ...% | Archive immediately |
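
The classification thresholds can be sketched as a small lookup. The ranges in the tables share their endpoints (180 days appears in both Stale and Dormant), so this sketch assumes each boundary value belongs to the older bucket. `THRESHOLDS` and `classify_staleness` are hypothetical names. For deals, pass the date the deal entered its current stage instead of the last activity date.

```python
from datetime import date, timedelta

# (stale_at, dormant_at, dead_at) in days, taken from the tables above.
THRESHOLDS = {
    "contacts": (90, 180, 365),
    "companies": (90, 180, 365),
    "deals": (30, 60, 180),
}


def classify_staleness(reference_date, today, data_type="contacts"):
    """Classify a record by the days elapsed since its reference date."""
    stale_at, dormant_at, dead_at = THRESHOLDS[data_type]
    days = (today - reference_date).days
    # Assumption: a boundary value falls into the older bucket.
    if days >= dead_at:
        return "Dead"
    if days >= dormant_at:
        return "Dormant"
    if days >= stale_at:
        return "Stale"
    return "Active"


today = date(2024, 6, 30)  # fixed date so the example is reproducible
for days_ago in (10, 120, 200, 400):
    label = classify_staleness(today - timedelta(days=days_ago), today)
    print(f"{days_ago} days ago -> {label}")
```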

Step 5: Missing Data Analysis

For each field the user identified as critical in Step 1, report:

| Field | Records Missing | % Missing | Impact |
|---|---|---|---|
| {field name} | {count} | {percentage} | {impact assessment} |

Impact Assessment

Classify the impact of each missing field:

Critical — This field is required for outreach or pipeline management. Missing it blocks action. Examples: email on a contact, deal stage on an opportunity, company name on an account.
High — This field significantly improves targeting or personalization. Missing it reduces effectiveness. Examples: job title, industry, company size.
Medium — This field is useful but not blocking. Examples: phone number, LinkedIn URL, address.
Low — Nice to have. Examples: Twitter handle, secondary email.

Enrichment Recommendations

For each critical or high-impact missing field, suggest enrichment tools:

| Missing Field | Recommended Tools |
|---|---|
| Email | Apollo.io, Hunter.io, Clearbit, Snov.io |
| Phone | Apollo.io, Lusha, ZoomInfo |
| Company size / industry | Clearbit, Apollo.io, LinkedIn Sales Navigator |
| Job title | LinkedIn Sales Navigator, Apollo.io |
| LinkedIn URL | Apollo.io, Phantombuster |
| Company website | Clearbit, manual Google search |
| Revenue / funding | Crunchbase, PitchBook, Clearbit |

High-Impact Missing Records

List the top 10 records where missing data has the highest business impact. Prioritize by:

Records in active deals or hot pipeline stages
Records with recent activity but missing contact info
Records matching ICP but missing fields needed for outreach

Step 6: Data Quality Score

Calculate an overall hygiene score on a 0-100 scale using four weighted components.

Component 1: Fill Rate of Critical Fields (40% weight)

Average the fill rates of the fields the user identified as critical in Step 1.
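
As a rough illustration of this component only: the sketch below averages the critical-field fill rates and applies the 40% weight. The function name is hypothetical, and it assumes a critical field that has no column in the CSV counts as 0% filled.

```python
def critical_fill_component(fill_rates, critical_fields, weight=0.40):
    """Return (average fill rate, weighted contribution to the 0-100 score)."""
    # Assumption: a critical field with no matching column counts as 0% filled.
    rates = [fill_rates.get(field, 0.0) for field in critical_fields]
    average = sum(rates) / len(rates) if rates else 0.0
    return average, average * weight


# Hypothetical fill rates, in percent, as produced by the Step 2 profile.
fill_rates = {"email": 90.0, "phone": 50.0, "owner": 70.0}
average, contribution = critical_fill_component(fill_rates, ["email", "phone"])
print(f"Average critical fill rate: {average:.1f}%")
print(f"Contribution to the score: {contribution:.1f} of 40 points")
```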

fill
