phdtaketaketake — Connection-first PhD advisor matcher

⚠️ CARDINAL RULE — REAL DATA ONLY

Every connection edge, every candidate fact, every signal value MUST trace back to a real source you actually fetched via web search. Fabrication is strictly forbidden — students use these rankings to decide where they spend years of their life. Made-up data is worse than no data.

The contract:

✅ Verified via web search → record value + structured EvidenceSource (URL + source_type + claim + supports_fields)
✅ Searched but found nothing → leave the field empty / set signal to "missing"
❌ Guessed from training memory → NOT ALLOWED
❌ Inferred from name patterns / school proximity / "feels likely" → NOT ALLOWED
❌ Estimated without any web search → NOT ALLOWED

Two enforcement layers:

Risk-adjusted ranking — wide confidence bands move candidates down the sort order. The agent literally cannot get a top rank with unsourced claims; the band widens and risk_adjusted_strength = strength − band/2 drops below better-evidenced peers.
--strict-evidence flag — when run with this flag, scripts/match.py rejects any candidate that has unsourced claims (a value set without an EvidenceEntry). Missing signals (no value, no evidence) are still allowed — they're honest "I couldn't verify" states. Use this when the user is making real application decisions.
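
As a rough illustration of the strict-evidence rule, the check could look like the sketch below. The `EvidenceSource` field names come from the contract above; the `Signal` container and both function names are hypothetical, and this is not the actual code in scripts/match.py.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvidenceSource:
    # Field names taken from the contract: URL + source_type + claim + supports_fields.
    url: str
    source_type: str
    claim: str
    supports_fields: list[str]


@dataclass
class Signal:
    # Hypothetical container for one signal; value=None means "missing".
    name: str
    value: Optional[float] = None
    evidence: list[EvidenceSource] = field(default_factory=list)


def is_unsourced(signal: Signal) -> bool:
    """A value that is set without any evidence is an unsourced claim."""
    return signal.value is not None and not signal.evidence


def passes_strict_evidence(signals: list[Signal]) -> bool:
    """Reject the candidate if any signal is unsourced; missing signals are allowed."""
    return not any(is_unsourced(s) for s in signals)
```

A signal with no value and no evidence passes the check, which matches the "searched but found nothing" state in the contract.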

The matcher's confidence band (±0.2 / 0.4 / 0.6 / 0.8; see §Confidence calibration below) handles missing data gracefully: the risk-adjusted ranking subtracts band/2 from the sort key, so wide bands move candidates down the list. A wide band on real data is far more useful than a narrow band on made-up data.

Full allowed-source list and forbidden-behavior catalog: references/data_integrity.md. Read it before doing any connection research.

This skill ranks candidate PhD advisors using a 5-layer deterministic pipeline — each layer composes the layer below; every score traces back to cited evidence:

1. CAPEG match_score  = w_C·C + w_A·A + w_P·P + w_E·E + w_G·G
                        (tier-adaptive weights; w_C > w_A in every tier
                         — connection-first invariant)
2. application_strength = clip(match_score + opportunity_adj, 0, 4.0)
3. risk_adjusted_strength       = application_strength − band/2
4. difficulty_adjusted_strength = max(0, risk_adjusted_strength − program_penalty)
                                  ← PRIMARY SORT KEY (post-#5)
5. strategy bucket  = bucket(difficulty_adjusted, evidence, …)
                      → priority / target / reach / only_if_space / drop
                      (purely derivative — never modifies any score)
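
Layers 1 to 4 above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: pillar values are taken to be on the 4.0 scale already, the weights in the example are illustrative (the real tier-adaptive weights are in docs/scoring.md), and `band` is treated as the listed magnitude (0.2 / 0.4 / 0.6 / 0.8). Function names are hypothetical.

```python
def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def capeg_match_score(pillars: dict[str, float], weights: dict[str, float]) -> float:
    """Layer 1: weighted sum over the C, A, P, E, G pillars."""
    # Connection-first invariant: w_C must exceed w_A in every tier.
    if weights["C"] <= weights["A"]:
        raise ValueError("connection-first invariant violated: w_C must exceed w_A")
    return sum(weights[k] * pillars[k] for k in ("C", "A", "P", "E", "G"))


def difficulty_adjusted_strength(
    match_score: float, opportunity_adj: float, band: float, program_penalty: float
) -> float:
    """Layers 2 to 4: clip, subtract half the band, subtract the program penalty."""
    application_strength = clip(match_score + opportunity_adj, 0.0, 4.0)
    risk_adjusted_strength = application_strength - band / 2
    return max(0.0, risk_adjusted_strength - program_penalty)


# Illustrative values only; none of these numbers come from the skill itself.
example_weights = {"C": 0.35, "A": 0.15, "P": 0.20, "E": 0.15, "G": 0.15}
example_pillars = {"C": 3.0, "A": 2.0, "P": 2.0, "E": 3.0, "G": 3.6}
score = capeg_match_score(example_pillars, example_weights)
print(round(difficulty_adjusted_strength(score, 0.1, 0.4, 0.3), 2))
```

Because the band enters only through the subtraction in layer 3, two candidates with the same application_strength are separated purely by how well their claims are evidenced.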

5 CAPEG pillars on a 4.0 scale:

Connection (C) — verified path between candidate PI ↔ student's current advisor: small-team coauthor, big-collab paper overlap, working group, analysis contact, genealogy, shared grant, co-mentored student, committee/exam, same center, prior-institution overlap, conference session. v2 aggregation: strongest + 0.10·second_strongest, capped at 1.0, scaled by recency.
Advisor influence (A) — PI reputation only (post-#6a): 0.40·influence + 0.30·elite_status + 0.30·grad_placement_quality. Funding and recruiting moved to Opportunity (O).
Publication (P) — field-aware tier × author-role × status × recency × contribution-bonus, with big-collab and consortium guardrails (min(0.10, n/100) cap on alphabetical co-authorship).
Experience (E) — 0.20·lab_prestige + 0.30·duration + 0.50·output, strongest single experience.
GPA (G) — direct on 4.0; 4.3 / 4.5 / 100 / UK honours normalized.
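
The Connection, Advisor influence, and Experience formulas above translate fairly directly into code. The sketch below assumes every input is already normalized to [0, 1] and that the recency factor is applied after the cap; both are assumptions, as are the function names. See docs/scoring.md for the authoritative definitions.

```python
def connection_score(edge_strengths: list[float], recency: float = 1.0) -> float:
    """v2 aggregation: strongest + 0.10 * second_strongest, capped at 1.0, scaled by recency."""
    ranked = sorted(edge_strengths, reverse=True)
    strongest = ranked[0] if ranked else 0.0
    second_strongest = ranked[1] if len(ranked) > 1 else 0.0
    return min(1.0, strongest + 0.10 * second_strongest) * recency


def advisor_influence(influence: float, elite_status: float, grad_placement_quality: float) -> float:
    """A pillar: PI reputation only."""
    return 0.40 * influence + 0.30 * elite_status + 0.30 * grad_placement_quality


def experience_score(experiences: list[tuple[float, float, float]]) -> float:
    """E pillar: score each (lab_prestige, duration, output) and keep the strongest one."""
    return max(
        (0.20 * prestige + 0.30 * duration + 0.50 * output
         for prestige, duration, output in experiences),
        default=0.0,
    )
```

Taking only the strongest single experience means a second, weaker lab stint neither helps nor hurts the E pillar, while a second connection edge adds at most a 10% bonus to C.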

3 non-CAPEG dimensions:

Opportunity (O) — admit-cycle availability: `recruiting_health + active_funding_quality + lab_capacity + grant_timing + availability`. Drives `opportunity_adj` (replaces v1 `pi_adj`); `not_recruiting` forces `application_strength = 0`.
Program difficulty (D) — per-program penalty 0–0.8 from school-tier admit rate + cohort size + admission model + funding structure + faculty count + international friendliness. Subtracted from risk_adjusted_strength to form difficulty_adjusted_strength (the primary sort key, replaces v1 tier_adj).
Research fit (R) — structured 6-axis tie-breaker: `0.30·topic + 0.20·method + 0.15·system + 0.15·temporal + 0.10·grant + 0.10·background`. Never a 6th pillar; sorts ties only.
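
To make the "sorts ties only" role of Research fit concrete, here is a small sketch of a two-part sort key. The six axis weights are the ones listed above; the `Ranked` class, the treatment of an unscored axis as 0.0, and the function names are assumptions for illustration.

```python
from dataclasses import dataclass, field

RESEARCH_FIT_WEIGHTS = {
    "topic": 0.30, "method": 0.20, "system": 0.15,
    "temporal": 0.15, "grant": 0.10, "background": 0.10,
}


def research_fit(axes: dict[str, float]) -> float:
    """Weighted sum over the six axes; an unscored axis counts as 0.0 (assumption)."""
    return sum(w * axes.get(name, 0.0) for name, w in RESEARCH_FIT_WEIGHTS.items())


@dataclass
class Ranked:
    # Hypothetical minimal candidate record.
    name: str
    difficulty_adjusted_strength: float
    fit_axes: dict[str, float] = field(default_factory=dict)


def rank(candidates: list[Ranked]) -> list[Ranked]:
    """Primary key is difficulty_adjusted_strength; research fit only breaks ties."""
    return sorted(
        candidates,
        key=lambda c: (-c.difficulty_adjusted_strength, -research_fit(c.fit_axes)),
    )
```

With a tuple key, research fit is consulted only when two candidates have exactly equal primary scores, so it can never lift a weaker candidate over a stronger one.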

Pipeline diagram: docs/scoring_pipeline.md. Full formulas: docs/scoring.md. Per-feature references in references/.

How users actually invoke this skill (natural language)

Users on QClaw / Claude Code / any agent platform will not write JSON themselves. The expected entry shape is conversational:

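"I'm applying for a Physics PhD for fall 2027, focusing on ATLAS Higgs / detector ML. Undergrad at UCI, GPA 3.85/4.0, two ATLAS big-collab papers, advisor is Prof. X. Please find matching PIs at US top 10–30 programs, and give me a ranking and application strategy following phdtaketaketake's evidence-first rules."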

"I'm applying for biology PhDs this fall, focusing on cancer immunology. Berkeley undergrad, GPA 3.9, one first-author Cell paper, advisor is Prof. Y. Find me 8 advisors at top US programs."

Your job as the agent: translate this into the structured StudentProfile + candidate-discovery workflow below. Do not ask the user to fill JSON. Do not ask for the schema upfront. Ask for missing facts in plain English, one round at a time.

Required information to ask for (if not given)

If the user's first message is missing any of these, ask before doing deep research — running the pipeline without them produces low-confidence output:

Field / subfield (e.g. "physics / HEP" → resolves to FieldProfile)
Undergrad institution + GPA (with scale: 4.0 / 4.3 / 4.5 / 100 / UK honours)
Research direction (1–2 sentences — the matcher uses this for research_fit)
Current advisor(s) (name + institution — drives the C pillar; without this, connection-first matching is degraded and the matcher prints a stderr warning)
Target school tier or list (top_10 / top_11_30 / top_31_60 / top_60_plus, OR a list of school names)

Optional but improves output quality

Papers: title, venue, status (published / accepted / submitted / preprint / in_prep), author position, total authors. Without this, P pillar floors out; user gets an honest "no publication evidence" rather than a guess.
Experiences: lab name, duration months, output (paper / poster / thesis). Without this, E pillar floors.
Specific candidate PIs: if the user already has a target list, skip Step 3 (discovery_plan) and feed candidates straight to collect_evidence. If not, run discovery_plan first.
Theory / experiment crossover preferences (physics-specific): affects research_fit.theory_experiment_fit signal.
International friendliness needs (visa / funding constraints): affects program_difficulty interpretation.

Minimum viable run

The smallest run that produces useful output:

field + undergrad + gpa + research_direction + 1 current_advisor
+ target tier (e.g. "top_10")

Even with no candidate list, the agent can run discovery-plan to generate per-school search queries, then collect-evidence on agent-discovered candidates, then match. Missing optional fields widen the confidence band but do not crash.
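
Before starting deep research, the agent can check which required facts are still missing and ask for them in plain English. The sketch below is one way to do that; the dictionary keys mirror the minimum-viable-run list above but are hypothetical, not the actual StudentProfile schema.

```python
# Hypothetical key names modelled on the minimum viable run; not the real schema.
REQUIRED_FIELDS = (
    "field",
    "undergrad",
    "gpa",
    "research_direction",
    "current_advisors",
    "target_tier",
)


def missing_required(profile: dict) -> list[str]:
    """Return the required keys that are absent or empty, in the order to ask for them."""
    return [key for key in REQUIRED_FIELDS if not profile.get(key)]


partial = {"field": "physics / HEP", "gpa": 3.85, "current_advisors": []}
print(missing_required(partial))
```

An empty advisor list counts as missing here, since without a current advisor the connection-first matching is degraded.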

Two-layer output contract

What the user sees vs what power users / strict-mode auditors get:

Per-candidate cards (rendered by you, the agent) — the human-readable presentation. Format defined in §"How to present results to the user" below.
Full match.json (raw MatchResult JSON) — kept as power-user appendix; never the primary user-facing artifact.

Step 0 — Load the FieldProfile

Before running any deep-research, **load the Field
