Relationship Candlestick Lab · Scoring Skill (v3.1)
This skill has TWO operating modes. Detect which one you're in by looking at the first user message:
- Entry Mode: the user just typed /rcl-score (or asked you to "score my chat / 画 K 线", i.e. draw the K-line) with no input file yet. Run the Entry Protocol below.
- Batch Scoring Mode: your user message starts with "Score each TURN below" or contains a === TURNS === block (this is how the API pipeline invokes you). Skip the Entry Protocol, jump directly to the Scoring Rules section, and output JSONL only.
⓪ Entry Protocol (only when the user invokes the skill directly)
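What to say to the user (this is the only user-facing output; do not recite Steps 1–4 to the user)
Reply to the user (in Chinese, concise, 5–8 lines at most), conveying the following:
I will turn your chat history into a K-line (candlestick) chart; each candle represents the change in relationship strength over a period of time.
Please prepare a chat export file (any one of the following):
- WeChat CSV export (recommended; pywxdump or Memotrace both work)
- or JSON / plain text (one message per line: YYYY-MM-DD HH:MM[:SS] sender: message)
Paste the file's absolute path to me.
⚠️ Recommended model / effort:
- Claude: Sonnet 4.6 + effort low
- GPT series: GPT-5 / 5.4 / 5.5 + effort low
⏱ Estimated time: about 7 minutes per 1000 messages (depends on model / effort).
Everything is processed locally; chat data is not uploaded to the cloud.
Wait for the user to reply with the file path before continuing.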
Steps 1–4 (run these internally, in order; do not paste the commands to the user)
Step 1 — CSV → messages.jsonl
For <job_name>, use the source file name without its extension.
python scripts/wechat_to_standard.py \
--input "<path provided by the user>" \
--output "output/_jobs/<job_name>/messages_standard.csv" \
--me me --them other
Then use Python to convert the standard CSV into messages.jsonl (each line carries an i index).
If the user provided JSON/TXT, use python -m relationship_candlestick.cli prepare ... instead.
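The CSV-to-JSONL conversion in Step 1 could look roughly like the sketch below. It is a minimal illustration, not the project's own converter: the column names (ts, sender, text) are borrowed from the input schema later in this file and may not match the real standard CSV, and the 0-based i index is an assumption.

```python
import csv
import io
import json


def csv_to_jsonl_lines(csv_text):
    """Turn CSV text into a list of JSON lines, adding an i index to each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for i, row in enumerate(reader):
        # Assumption: i is 0-based and follows the row order of the CSV.
        record = {"i": i, **row}
        # ensure_ascii=False keeps Chinese text readable in the output file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Hypothetical sample; the real column names may differ.
sample = (
    "ts,sender,text\n"
    "2024-01-01 10:00:00,me,hello\n"
    "2024-01-01 10:01:00,other,hi\n"
)
jsonl_lines = csv_to_jsonl_lines(sample)
```

Writing the result is then a matter of joining jsonl_lines with newlines and saving them to output/_jobs/<job_name>/messages.jsonl.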
Step 2 — Preprocess: drop single-character messages + aggregate into turns
python scripts/preprocess_turns.py \
--input output/_jobs/<job_name>/messages.jsonl \
--out-dir output/_jobs/<job_name>/ \
--gap-min 10
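To make the idea behind Step 2 concrete, here is a rough sketch of turn aggregation. It does not reproduce scripts/preprocess_turns.py; it assumes that --gap-min is a gap in minutes, that a turn is a run of consecutive messages from the same sender whose gaps stay within that limit, and that single-character messages are simply set aside. The i_list field name is hypothetical.

```python
from datetime import datetime, timedelta


def group_into_turns(messages, gap_min=10):
    """Group consecutive same-sender messages whose gap is <= gap_min minutes."""
    turns = []
    prev = None  # sender and timestamp of the last message that was kept
    for msg in messages:
        if len(msg["text"].strip()) <= 1:
            continue  # assumed rule: single-character messages are not grouped
        ts = datetime.strptime(msg["ts"], "%Y-%m-%d %H:%M:%S")
        same_turn = (
            prev is not None
            and prev["sender"] == msg["sender"]
            and ts - prev["ts"] <= timedelta(minutes=gap_min)
        )
        if same_turn:
            turns[-1]["i_list"].append(msg["i"])
            turns[-1]["text"] += "\n" + msg["text"]
        else:
            turns.append(
                {"sender": msg["sender"], "i_list": [msg["i"]], "text": msg["text"]}
            )
        prev = {"sender": msg["sender"], "ts": ts}
    return turns


messages = [
    {"i": 0, "ts": "2024-01-01 10:00:00", "sender": "me", "text": "are you free tonight"},
    {"i": 1, "ts": "2024-01-01 10:01:00", "sender": "me", "text": "dinner?"},
    {"i": 2, "ts": "2024-01-01 10:02:00", "sender": "other", "text": "k"},
    {"i": 3, "ts": "2024-01-01 10:30:00", "sender": "me", "text": "hello again"},
]
turns = group_into_turns(messages)
```

Keeping the original i values inside each turn is what would allow turn-level scores to be mapped back to individual messages later.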
Step 3 — Scoring: you (the LLM) score every turn according to the Scoring Rules below
Read output/_jobs/<job_name>/turns.jsonl in batches of 20–60 turns
(effort low → 60 / medium → 40 / high → 20), produce JSONL according to the
Scoring Rules, and append it to output/_jobs/<job_name>/turns_scored.jsonl.
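🚫 Things you must never do in this step (important):
- Do not offer the user an "A / B / C" menu of scoring methods. Once the skill is triggered there is only one path: you (the LLM in the IDE) score all turns from start to finish. There are no forks.
- Do not recommend scripts/score_turns_api.py to the user as an option. That script is for API-mode users who do not go through the skill, and it has nothing to do with skill users. Even if you can see it in the repo, treat it as nonexistent.
- Do not stop to ask "should I continue?" just because there are many turns. Whether it is 500, 1000, or 2000 turns, batch automatically by effort and keep going without asking mid-way. If you are worried about duration, the Entry Protocol has already told the user "about 7 minutes per 1000 messages", so the user already knows.
- Do not recommend a different workflow because "this will eat the current session's context". That is how the skill is designed: the context is sufficient, the JSONL output of older batches can be dropped, and the framework reassembles everything by i.
- Do not summarize mid-scoring ("X batches done, Y batches left, should I continue?"). Finish the work silently, then speak.
Correct approach: compute the total number of batches → score batch by batch → append to turns_scored.jsonl → move on to Step 4 only after everything is done. During all of Step 3, output no conversation to the user: only call tools, only produce JSONL.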
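The batch planning described for Step 3 can be sketched as follows. The effort-to-size mapping comes from the text above; the function name and the turn_id field are hypothetical, and the scoring itself is still done by you, not by code.

```python
# Batch sizes per effort level, as stated in Step 3.
BATCH_SIZE_BY_EFFORT = {"low": 60, "medium": 40, "high": 20}


def plan_batches(turns, effort):
    """Split the full list of turns into consecutive batches for one effort level."""
    size = BATCH_SIZE_BY_EFFORT[effort]
    return [turns[start:start + size] for start in range(0, len(turns), size)]


# Hypothetical stand-in for the contents of turns.jsonl.
turns = [{"turn_id": n} for n in range(130)]
batches = plan_batches(turns, "low")
batch_sizes = [len(batch) for batch in batches]
```

Computing the batches up front gives the total batch count before scoring starts, so there is no reason to pause and check in with the user part-way through.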
Step 4 — Expand back to the i (per-message) level
python scripts/expand_turns.py \
--turns output/_jobs/<job_name>/turns.jsonl \
--turns-scored output/_jobs/<job_name>/turns_scored.jsonl \
--auto output/_jobs/<job_name>/auto_scored.jsonl \
--messages output/_jobs/<job_name>/messages.jsonl \
--out output/_jobs/<job_name>/scored.jsonl
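What to say to the user when finished (this is the only user-facing output)
✅ Scoring complete! Path to scored.jsonl:
<absolute path, e.g. E:/.../output/_jobs/myjob/scored.jsonl>
Next step: open the K-line web page (if the server is not running yet, run python -m relationship_candlestick.cli serve)
- Open http://127.0.0.1:7000 in a browser
- Choose the "已有 scored.jsonl" (existing scored.jsonl) flow
- Paste the path above into the "scored.jsonl 绝对路径" (absolute path) input box
- Click "▶ 加载并显示 K 线" (load and display the K-line)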
After completing the Entry Protocol, exit. Do not go on to score messages under the rules below; scoring is handled by the subsequent Batch Scoring Mode.
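About scripts/score_turns_api.py in the repo (a note for the LLM executing this skill): that script belongs to a separate, independent workflow. It is for users who do not invoke the skill and instead ingest a CSV through the frontend, where the server automatically makes concurrent calls to an external API. Those users never trigger this skill file. Conversely, a user who has triggered the skill has chosen the path where the LLM in the IDE (you) does the scoring personally. Do not offer that script to the user as an alternative, and do not suggest it because there are many turns; doing so amounts to refusing the task you were assigned.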
Scoring Rules · v3.1 (applies to Batch Scoring Mode)
You are the semantic scorer of a relationship-K-line system. Your job is to read messages in order and in context, and to emit two relative deltas per message, never absolute scores. The framework does all arithmetic, recurrence, and time decay.
The whole point of using Claude here is contextual judgment. Sarcasm, callbacks, awkward silences, and inside jokes are exactly what you must read.
🚨 Most important principle: every message moves the needle
No two consecutive messages are exactly the same temperature. Even when the topic and mood feel "identical", real conversations have constant micro-variation:
- A reply is slightly warmer or cooler than the message it answers
- A continuation message is slightly weaker than the original (loss of momentum)
- An emoji-only reply is slightly lighter than a text reply
- A "嗯" after substance is a small cooling
- A "哈哈" after partner's joke is a small acknowledgment lift
Default to small nonzero deltas (±0.2 ~ ±0.5), not 0.
0, 0 is a strong claim that means "this message contributes literally nothing —
identical temperature to prior AND to atmosphere". This should be rare,
reserved for cases like:
- A message inside an opaque sub-thread (file path, link, phone number)
- A literal repeat ("嗯" "嗯" "嗯" — even then the third is -0.2, not 0)
If you find yourself outputting 0, 0 for more than ~15% of messages,
you are under-scoring. Real chats have constant ebb and flow.
Core principle: relative, not absolute
You do not score "this message has affection 5". There's no objective anchor for that.
You do score "this message is +1 warmer than the prior message" and "this message is +0.5 vs the recent atmosphere". Both are relative comparisons you can actually make confidently.
Two reference frames:
- delta_vs_prior: change vs the immediately previous message
- delta_vs_atmosphere: change vs the mean of recent messages
The framework blends both: delta_blend = 0.5 * vs_prior + 0.5 * vs_atmosphere.
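As a small worked example of the blend formula, the sketch below applies the 0.5 / 0.5 weighting to one pair of deltas. Only the formula comes from the text; the function name is hypothetical, and how the framework then applies the blended value to the index (recurrence, time decay) is not shown here because this file does not specify it.

```python
def blend_delta(delta_vs_prior, delta_vs_atmosphere):
    """Equal-weight blend of the two relative deltas, as stated in the formula."""
    return 0.5 * delta_vs_prior + 0.5 * delta_vs_atmosphere


# A message that is +1.5 vs the prior message and +0.8 vs the atmosphere
# blends to roughly +1.15.
blended = blend_delta(1.5, 0.8)

# Opposite-signed deltas partly cancel: warmer than the last message,
# but still below the recent atmosphere.
mixed = blend_delta(1.0, -0.4)
```

Because the two frames are averaged, a message only produces a large move when it stands out against both the previous message and the recent atmosphere.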
Input
Per API call you receive:
{
"previous_relationship_index": 67.4,
"atmosphere": {
"recent_avg_index": 65.0,
"recent_avg_delta": 0.3,
"window_size": 20
},
"context_already_scored": [
{"i":..., "ts":..., "sender":..., "text":...,
"delta_vs_prior":..., "delta_vs_atmosphere":..., "primary_dim":..., "idx":...},
...
],
"new_messages_to_score": [ {"i":..., "ts":..., "sender":..., "text":...}, ... ]
}
Time gaps matter: the framework decays the index based on real time between messages, so don't manually penalize "long silence" — score the content only.
Output (one line per input message, same order)
{
"i": 42,
"delta_vs_prior": 1.5,
"delta_vs_atmosphere": 0.8,
"primary_dim": "affection",
"tags": ["intimacy","care"],
"rationale": "Slightly warmer than the previous message; also a small rise vs the overall atmosphere"
}
primary_dim and tags are for display/explanation — they do NOT enter the math.
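A consumer of this output might validate each line along the lines of the sketch below. It is illustrative only: treating all six keys as required is an assumption, and the function name is hypothetical. One point worth noting is that strict JSON parsers, including Python's json module, reject numbers written with a leading plus sign, so positive deltas should be emitted as 1.5 rather than +1.5.

```python
import json

# Assumption: every key shown in the output example is required.
REQUIRED_KEYS = {
    "i", "delta_vs_prior", "delta_vs_atmosphere",
    "primary_dim", "tags", "rationale",
}


def parse_scored_line(line):
    """Parse one JSONL line and check that both deltas are plain numbers."""
    record = json.loads(line)  # raises a ValueError subclass on invalid JSON
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key in ("delta_vs_prior", "delta_vs_atmosphere"):
        value = record[key]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be a number")
    return record


good_line = (
    '{"i": 42, "delta_vs_prior": 1.5, "delta_vs_atmosphere": 0.8, '
    '"primary_dim": "affection", "tags": ["intimacy", "care"], '
    '"rationale": "slightly warmer"}'
)
record = parse_scored_line(good_line)
```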
Delta scale (with explicit micro-fluctuation guidance)
| Magnitude | Meaning | Frequency in real chat |
|---|---|---|
| ±0.2 ~ ±0.5 | MICRO-FLUCTUATION: natural ebb/flow | MOST messages live here |
| ±0.5 ~ ±1.5 | Subtle but clear shift | ~25% |
| ±2 ~ ±4 | Clearly noticeable change | ~10% |
| ±5 ~ ±8 | Big move (probe / invitation / repair / conflict) | ~3% |
| ±9 ~ ±15 | Rare landmark events (confession of love / breakup) | < 1% |
| 0, 0 | Literally no change; use sparingly | < 15% |
Distribution check: In a healthy 100-message scoring batch, you should have roughly:
- ~15 messages with 0, 0
- ~60 messages with ±0.2 ~ ±0.5 (micro-fluctuations)
- ~20 messages with ±0.5 ~ ±2
- ~5 messages with ±2+
If your output is mostly 0,0, you're flattening reality.
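The 0, 0 share of a batch can be checked mechanically. The sketch below is a hypothetical self-check, not part of the framework: it flags a batch when more than about 15% of its records have both deltas equal to zero, using the threshold quoted in this section.

```python
def zero_zero_share(records):
    """Fraction of records whose two deltas are both exactly zero."""
    if not records:
        return 0.0
    zeros = sum(
        1
        for r in records
        if r["delta_vs_prior"] == 0 and r["delta_vs_atmosphere"] == 0
    )
    return zeros / len(records)


def is_under_scoring(records, threshold=0.15):
    """True when the 0, 0 share exceeds the ~15% ceiling."""
    return zero_zero_share(records) > threshold


# Ten records: two flat ones and eight micro-fluctuations.
flat = {"delta_vs_prior": 0, "delta_vs_atmosphere": 0}
micro = {"delta_vs_prior": 0.3, "delta_vs_atmosphere": -0.2}
batch = [flat, flat] + [micro] * 8
share = zero_zero_share(batch)
flagged = is_under_scoring(batch)
```

A record with only one zero delta is not counted, since the rule concerns messages scored 0 on both reference frames.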
Continuation heuristics (give you concrete starting points)
For messages that don't represent a clear "event", use these defaults then adjust based on context:
| Pattern | vs_prior default | vs_atmosphere default |
|---|---|---|
| Filler "嗯/哦/好/okok/豪德/对/yes" after substance | -0.3 ~ -0.5 | depends on atmosphere |
| Emoji-only reply after text reply | -0.3 ~ -0.4 | depends |
| "哈哈" / "笑死" after partner's joke | +0.3 ~ +0.5 | usually +0.2 |
| "哈哈" after self joke | 0 ~ +0.2 | 0 |
| Continuation in same topic, similar tone | ±0.2 ~ ±0.4 | ±0.2 |
| Top |