Niche Signal Discovery
Discover differential signals between Closed Won and Closed Lost accounts by extracting multi-page website content and job listings, then computing Laplace-smoothed lift scores to identify what distinguishes buyers from non-buyers.
Prerequisites
- Deepline CLI — All enrichment runs through
deepline enrich. No separate API keys for exa/crustdata/apollo etc. - Python 3 stdlib only — no pip dependencies for any shipped script.
- Credits — ~0.47 credits/company (serper 0.02 + firecrawl 0.05 + crustdata 0.40). Step 7 contact discovery is additional. Always get user approval before paid steps.
Deepline-First Principle
Use deepline enrich for all enrichment, deepline tools execute for one-offs, deepline playground for inspection. Reruns are idempotent. Refer to deepline-gtm for command patterns and provider playbooks.
Input requirements
- Won and lost customer domain lists (≥20 won + ≥10 lost for statistical significance)
- Lookalikes can supplement Won if Closed Won < 15. Add a Dataset Caveat to the report.
- Target company context from Step 0 — what they sell, who they sell to, key personas.
Pipeline
0. Discover target company (what they sell, who they sell to)
0.5. Discover ecosystem (competitors, tech stack, buyer personas)
1. Prepare input CSV (deduplicate within won/lost groups)
1.0.5 Build "do not re-contact" index from user's existing list (scripts/dedupe_utils.py)
1.5. Generate vertical-specific configs (keywords, tools, job roles)
2. Multi-page website + job extraction (deepline enrich)
3. Quality gate — verify file completeness + coverage (>80%)
3.5. Review configs against enriched data
4. Differential analysis (scripts/analyze_signals.py)
5. Generate report — every top signal must include cited evidence
6. Signal interpretation review
7. Top 10 net-new prospects [REQUIRED] + contacts/emails [optional, costs credits]
Step 7 is required. A signal report without 10 actionable companies forces the reader to do their own prospecting pass — exactly the expensive thing they wanted to skip. Contacts/emails are optional only because they cost extra credits; always offer them.
Signal reliability hierarchy
Highest → lowest confidence:
- Job listings — active budget + acknowledged pain. Highest-intent.
- Analyst validation (Gartner/Forrester) — typically 4-7x lift, rare in lost.
- Compliance infrastructure (SOC2/GDPR/ISO) — procurement maturity.
- Buyer pain language on careers/blog — operational awareness.
- Tech stack tools (niche SaaS) — infrastructure readiness.
- Website product/marketing content — variable; can be buyer OR competitor.
When website signals fail: For B2B back-office tools (AR, billing, compliance), buyers don't publish their pain on marketing pages. Prioritize jobs + tech stack + firmographics for these verticals.
What NOT to use for scoring
CRM fields populated by AE activity — catalyst note count, OCR-derived counts (number_of_champions_c, number_of_decision_makers_c), MEDDPICC picklists, any "did the AE do X on this opp" field — correlate with win-rate as engagement artifacts, not causal signals. They get filled in after the AE decides an opp is worth working. Never use them as scoring inputs. On one real run, catalyst notes showed "109x lift" — almost made the TL;DR before we caught the direction of causality.
Rule of thumb: every scoring input must be observable BEFORE the AE touches the account. Read references/scoring-pitfalls.md for the full list and the "safer alternative read" for loss-reason data.
Step 0: Target company discovery
Do this FIRST. The entire pipeline (exa query, keywords, tech stack, job roles) adapts based on this discovery; skipping it produces generic/irrelevant signals.
deeplineagent: "Research {{company-domain}}. Summarize what the company sells, who they sell to, what makes them different, and any example customers."
Document: (1) product category, (2) target buyer persona, (3) key differentiation, (4) example customers.
Step 0.5: Ecosystem discovery
Three parallel deeplineagent queries:
- Competitors —
"{product category} software alternatives competitors"→ 3-5 names - Tech stack —
"{buyer persona} software stack"→ 10-15 tools by category - Job roles —
"{buyer persona} job titles"→ 10-15 title variations
These feed Step 1.5 config generation.
Step 1: Prepare input CSV
domain,status
customer1.com,won
non-customer1.com,lost
Deduplicate within the input. If a domain appears in BOTH won and lost (same company, multiple deals), Deepline only fetches job listings once — silently undercounting won_with_jobs. Remove ALL rows for cross-group domains:
from collections import Counter
counts = Counter(r['domain'] for r in rows)
duplicate_domains = {d for d, c in counts.items() if c > 1}
# Drop every row in duplicate_domains, not just one copy.
Step 1.0.5: Build "do not re-contact" index
Before any prospects ship in Step 7, dedupe candidates against whatever "already known" list the user provides — customers, CRM export, past outbound, a previous run's output. Always ask explicitly; if the user has no list, note it as a caveat in the final report rather than silently skipping.
Order: apex domain first, fuzzy company name as fallback. Use the shipped helper — it handles public-suffix multi-label TLDs (co.uk, co.jp, com.au) and corporate-suffix stripping:
python3 scripts/dedupe_utils.py --selftest # one-time sanity check
python3 scripts/dedupe_utils.py \
--existing customers.csv --candidates prospects_raw.csv \
--out-actionable prospects_actionable.csv --out-matched already_known.csv
Don't silently drop CRM matches — categorize them: Net-new / Account-only / Re-engage / Active-open / Current-customer.
Read references/dedupe.md for the failure modes (raw-string match missing amsynergy.nikon.com → nikon.com cost 24 of 50 prospects in one run), category definitions, and library usage.
Step 1.5: Generate vertical-specific configs
Create three JSON files in output/{{company}}/:
{{company}}-keywords.json # product category, pain language, competitor names, maturity terms
{{company}}-tools.json # niche SaaS tools by category
{{company}}-job-roles.json # buyer persona job titles
Read references/keyword-catalog.md for the JSON schema, generation patterns, and multi-vertical examples (creative ops, AR automation, sales engagement, developer tools).
Validation: Do the configs match the target's vertical and buyer persona? If not, refine based on Step 0/0.5 findings.
Step 2: Deepline enrichment
Never scrape just the homepage. Use Serper to discover relevant pages, Firecrawl to extract content.
Step 2a - Discover pages with Serper (0.02 credits/company):
deepline enrich \
--input output/{{company}}-icp-input.csv \
--output output/{{company}}-discovered.csv \
--with '{"alias":"pages","tool":"serper_google_search","payload":{"query":"site:{{domain}} product OR features OR integrations OR customers OR security OR pricing OR careers OR about"}}' \
--json
Adapt the query by vertical: add compliance OR audit for back-office, documentation OR api for developer tools, portfolio OR workflow for creative tools.
Step 2b - Scrape top 5 pages with Firecrawl (0.05 credits/company):
Extract URLs from Serper results, then scrape each:
deepline enrich \
--input output/{{company}}-urls.csv \
--output output/{{company}}-scraped.csv \
--with '{"alias":"content","tool":"firecrawl_scrape","payload":{"url":"{{url}}"}}' --json
Aggregate scraped pages back into one row per domain, formatted as {"data":{"results":[{url, title, text}]}} for the analysis script.
*Step 2c - Job listings with Crustdata (0.40 credits/company):