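feat(ingest): implement the T0.5–T5 pure-feeder pipeline (issue #2)

Full ingest pipeline (harvest first, extract as fallback, cross-repo weaving, POST envelope):
- T0.5 skeleton: Hono + zod-openapi, with no D1/Vectorize/AI bindings (the "never touch storage" iron rule)
- T1 SourceAdapter: pull via the GitHub runtime API + per-file sha256 content-hash + the /refresh receiving endpoint
- T2 harvest (path A, preferred): harvest template 1.8.0+ cards (gloss / entities / typed-edge)
- T3 extract (path B, fallback): LlmCaller with a selectable model + JSON-fail escalation gate + hard endpoint-alignment self-check guard; the first version does not embed (it only tags)
- T4 cross-repo weaving (primary job): aggregate multiple repos, then detect cross-repo bridges and dissent; bridge_score is not computed (graph's domain)
- T5 output: strict buildEnvelope + explicit forbidden-field self-check; graph-client is POST-only (cherry-picked from _kbdb_client.py and changed so it does not touch base); thin ops CLI (no query MCP)

The envelope is aligned with the full contract (embed/id/aliases/predicate_embed); the contract's vectorization fields are promoted in the same commit.

Gate: vitest 28 passed / tsc clean / wrangler dry-run clean (env-var bindings only).

End-to-end ingest→graph: the graph receiver has been aligned; the ingest deployment and GRAPH_BASE_URL are still pending, so this awaits post-deployment verification and is not reported as green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>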
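@@ -60,7 +60,7 @@
 },
 "nodes": {
 "type": "array",
-"description": "Node-level supplementary info (optional). entity_type and gloss are [node] attributes, not [edge] attributes, so they go here and not in triplets. graph embeds from gloss (one sentence per node, not a bare word) and types from entity_type.",
+"description": "Node-level supplementary info. [Vectorization division of labor (leo 2026-06-26, promoted to contract by ingest#1)] ingest [tags] here which tokens are to be vectorized and what to embed; the base/KBDB embed module [reads the tags and performs] the actual embedding; ingest computes no vectors itself. Both kinds of node (entity entries / wikilink cards) go into nodes[]; for predicate vectors see triplets[].predicate_vector.",
 "items": {
 "type": "object",
 "required": ["name"],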
@@ -69,11 +69,26 @@
 "name": {
 "type": "string",
 "minLength": 1,
-"description": "Node name (must match the literal subject or object of some triplet)."
+"description": "Node name (must match the literal subject/object of some triplet). Entity entry = canonical name; wikilink card = card title."
 },
+"id": {
+"type": "string",
+"description": "Dedup key. A wikilink card uses its [filename], so one card is one node: even when many edges point to it, it is embedded only once, not once per occurrence. An entity entry uses its canonical name. Optional (when absent, dedup falls back to name)."
+},
 "gloss": {
 "type": "string",
-"description": "One-sentence description, used for embedding. Example: 'Graph RAG: a RAG variant that retrieves by relation traversal and preserves dissent'. Optional (recommended output of the deep tier)."
+"description": "One-sentence description. base embed embeds [name + gloss together] (entity synonyms differ too much in surface form, so the description pulls them closer). Optional (recommended output of the deep tier)."
 },
+"aliases": {
+"type": "array",
+"items": { "type": "string" },
+"description": "Synonyms (e.g. '黃仁勳' / 'Jensen Huang'). base collapses them into a single node. Optional."
+},
+"embed": {
+"type": "boolean",
+"default": true,
+"description": "[Vectorization tag] Whether this node goes into the vector store. true = base reads the tag and embeds (name + gloss); false = base ignores it (structural symbols and prose should not be in nodes[] at all; if one does get in, tag it false). Default true (both entity entries and wikilink cards need it).",
+"$comment": "ingest tags; base reads the tag and executes. The embed action belongs to the base embed module; ingest computes no vectors."
+},
 "entity_type": {
 "type": "string",
@@ -86,7 +101,7 @@
 "triplets": {
 "type": "array",
 "minItems": 1,
-"description": "Edges (relations). ingest produces only raw (s,p,o) + confidence.",
+"description": "Edges (relations). ingest produces only raw (s,p,o) + confidence + the predicate-vectorization tag. Endpoints (s/o) match nodes[].name literally.",
 "items": {
 "type": "object",
 "required": ["subject", "predicate", "object"],
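@@ -95,10 +110,16 @@
 "subject": { "type": "string", "minLength": 1, "description": "Subject (entity name; must match nodes[].name when nodes are provided)" },
 "predicate": { "type": "string", "minLength": 1, "description": "Predicate (relation)" },
 "object": { "type": "string", "minLength": 1, "description": "Object (target entity or value)" },
+"predicate_embed": {
+"type": "boolean",
+"default": true,
+"description": "[Predicate-vectorization tag] Whether the predicate is embedded. base reads the tag and embeds [the bare predicate word, with no description] (predicate synonyms are already close in surface form, e.g. '參考' / '參照', so embedding the bare word clusters them automatically) and stores the result in the edge's predicate_vector. To support 'relation filter' queries (a query for '參考' must not miss '參照'), the default is true. The embed action belongs to base; ingest only tags.",
+"$comment": "ingest tags; base reads the tag and performs the embed."
+},
 "confidence":{ "type": "number", "minimum": 0, "maximum": 1, "default": 1.0, "description": "Extraction confidence. Shallow extraction may attach a self-assessment; graph does not filter on it, only records it." }
 }
 }
 }
 },
-"$comment": "Forbidden fields (graph's domain; ingest must never send them): id / clusters / bridge_score / created_at / updated_at / and, on a triplet, subject_entity_type|object_entity_type (types travel only via nodes[]). Sending them violates the ingest = pure feeder boundary; graph should reject or ignore them."
+"$comment": "Forbidden fields (graph's domain; ingest must never send them): id (the node dedup-key id is the exception: it is a dedup key supplied by ingest, not a record id) / clusters / bridge_score / created_at / updated_at / and, on a triplet, subject_entity_type|object_entity_type (types travel only via nodes[]). [Vectorization division of labor] ingest tags (embed/predicate_embed, plus gloss/aliases); the base/KBDB embed module reads the tags and performs the embedding; ingest computes no vectors. Structural symbols (>>/←) and prose meant for human readers (## 摘要, the summary section) do not enter the envelope."
 }
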
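To make the contract concrete, the following is a minimal sketch of an envelope that follows the field descriptions in this schema, plus a hand-rolled forbidden-field check. The type names (`SketchNode`, `SketchEdge`, `SketchEnvelope`), the helper `findForbidden`, the sample values, and the placeholder hash are all hypothetical; the repository itself validates with a strict zod `EnvelopeSchema`, which is not reproduced here.

```typescript
// Hypothetical types mirroring the fields described in the contract.
interface SketchNode { name: string; id?: string; gloss?: string; aliases?: string[]; embed?: boolean; entity_type?: string }
interface SketchEdge { subject: string; predicate: string; object: string; predicate_embed?: boolean; confidence?: number }
interface SketchEnvelope {
  source: { uri: string; content_hash: string };
  extractor: { model: string; tier: 'shallow' | 'deep' };
  nodes?: SketchNode[];
  triplets: SketchEdge[];
}

// Graph-domain fields named in the contract's $comment. A node-level `id` is the
// one exception (it is a dedup key), so `id` is only rejected at the top level here.
const FORBIDDEN_TOP = ['id', 'clusters', 'bridge_score', 'created_at', 'updated_at'];
const FORBIDDEN_NODE = ['clusters', 'bridge_score', 'created_at', 'updated_at'];
const FORBIDDEN_EDGE = ['subject_entity_type', 'object_entity_type'];

// Return human-readable violations; an empty list means "safe to send".
function findForbidden(env: SketchEnvelope): string[] {
  const issues: string[] = [];
  const has = (o: object, k: string): boolean => Object.prototype.hasOwnProperty.call(o, k);
  for (const k of FORBIDDEN_TOP) if (has(env, k)) issues.push(`envelope.${k}`);
  (env.nodes ?? []).forEach((n, i) => {
    for (const k of FORBIDDEN_NODE) if (has(n, k)) issues.push(`nodes[${i}].${k}`);
  });
  env.triplets.forEach((t, i) => {
    for (const k of FORBIDDEN_EDGE) if (has(t, k)) issues.push(`triplets[${i}].${k}`);
  });
  return issues;
}

const sampleEnvelope: SketchEnvelope = {
  source: { uri: 'github:owner/repo@cards/graph-rag.md', content_hash: 'placeholder-hash' },
  extractor: { model: 'local-harvest', tier: 'shallow' },
  nodes: [
    { name: 'Graph RAG', id: 'graph-rag.md', gloss: 'A RAG variant that retrieves by relation traversal', embed: true },
    { name: 'RAG', embed: true },
  ],
  triplets: [{ subject: 'Graph RAG', predicate: 'extends', object: 'RAG', predicate_embed: true, confidence: 0.9 }],
};

// A payload that smuggles in graph-domain fields (cast, because the sketch types reject it).
const badEnvelope = {
  ...sampleEnvelope,
  bridge_score: 0.7,
  triplets: [{ ...sampleEnvelope.triplets[0], subject_entity_type: 'concept' }],
} as SketchEnvelope;

console.log(findForbidden(sampleEnvelope));
console.log(findForbidden(badEnvelope));
```

The node-level `id` in the sample is allowed because the contract treats it as an ingest-supplied dedup key rather than a record id.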
@@ -2,44 +2,47 @@

 > Sole source of progress. Status: [ ] not started  [🔄] in progress  [x] done  [⏸] blocked
 > Cross-project blueprint: InkStoneCo `docs/3-specs/mira-dissolve/`.
+> Implementation branch: `claude/ingest-t1-t5-implementation` (vitest 28 passed / tsc clean / dry-run clean).

-## T0 repo skeleton (this round)
+## T0 repo skeleton

 - [x] 0.1 Create public repo `uncle6me-web/kbdb-ingest-plugin`
 - [x] 0.2 CLAUDE.md (upstream pointers + ingest iron rules) + README + .gitignore
 - [x] 0.3 `contracts/ingest-candidate.json` (copied from the top-level SDD; frozen contract)
-- [x] 0.4 SDD three-piece skeleton
-- [ ] 0.5 package.json / tsconfig / wrangler.toml (modeled on kbdb-graph-plugin)
+- [x] 0.4 SDD three-piece skeleton (`docs/3-specs/ingest-pipeline/`)
+- [x] 0.5 package.json / tsconfig / wrangler.toml / vitest.config (modeled on kbdb-graph-plugin: Hono + zod-openapi, no D1/Vectorize/AI bindings)

-## T1 SourceAdapter (R1)
+## T1 SourceAdapter (R1): `src/lib/source-adapter.ts`

-- [ ] 1.1 Pull the repo from GitHub (runtime API/clone, not Actions)
-- [ ] 1.2 content-hash (per-file, source.uri = github:owner/repo@path)
-- [ ] 1.3 Interface triggered when the KBDB MCP `refresh` forwards a request
+- [x] 1.1 Pull the repo from GitHub (runtime git/trees + contents API, not Actions); GitHubFetcher interface (tests use a mock)
+- [x] 1.2 content-hash (per-file sha256; source.uri = github:owner/repo@path, with makeSourceUri/parseSourceUri round-trip)
+- [x] 1.3 Receiving endpoint triggered when graph's `POST /graph/refresh` forwards a request: `POST /refresh` (`src/index.ts`; passive forwarding, no scheduling)

-## T2 Harvest (R2, path A preferred)
+## T2 Harvest (R2, path A preferred): `src/lib/harvest.ts`

-- [ ] 2.1 Pull the triplets + gloss that local CC has already built (repos that use system-dev-template)
-- [ ] 2.2 cherry-pick `polaris/mira/tools/_kbdb_client.py` → convert to a pure feeder (POST envelope, no writes to KBDB)
+- [x] 2.1 Harvest the triplets + gloss that local CC has already built (template 1.8.0+ format: frontmatter gloss, `## 實體`, `## 關聯` typed-edge; card-to-card and in-body endpoints are handled separately)
+- [x] 2.2 cherry-pick `_kbdb_client.py` → convert to the pure feeder `src/lib/graph-client.ts` (POST envelope, **no writes to KBDB/base**)

-## T3 extract (R3, path B fallback)
+## T3 extract (R3, path B fallback): `src/lib/extract.ts`

-- [ ] 3.1 cherry-pick `wiki_synthesis.yaml` classify / the two skill blocks
-- [ ] 3.2 User-selectable model + quality-threshold whitelist (default Haiku; deep extraction uses Claude via CC)
-- [ ] 3.3 Model test set (Chinese + human-implication samples, turned into regression tests): deferred, run the defaults first
-- [ ] 3.4 JSON-fail escalation gate (a failed shallow extraction escalates to deep)
-- [ ] 3.5 First version does not embed (embed waits for base vectorize, InkStoneCo T2.4)
+- [x] 3.1 cherry-pick the `wiki_synthesis.yaml` classify mode → extract prompt (JSON nodes[] + triplets[])
+- [x] 3.2 User-selectable model (by intent, not model name; LlmCaller interface; defaults: shallow/Haiku, deep/Claude via CC)
+- [ ] 3.3 Model test set (Chinese + human-implication samples, turned into regression tests): **deferred** (run the defaults first; the guard and the parser already have unit tests)
+- [x] 3.4 JSON-fail escalation gate (shallow fails or is too sparse → escalate to deep once)
+- [x] 3.5 First version does not embed (it still [tags] embed/predicate_embed for base to read later; the embed action waits for Arcrun #7)
+- [x] 3.x Hard endpoint-alignment self-check guard (`src/lib/endpoint-check.ts`; leo's stress test 14→0; self-check + autoAlign fill-in)

-## T4 Cross-repo weaving (R4, primary job)
+## T4 Cross-repo weaving (R4, primary job): `src/lib/weave.ts`

-- [ ] 4.1 Aggregate triplets from multiple repos
+- [x] 4.1 Aggregate triplets from multiple repos → detect cross-repo bridges (same-name node across ≥2 repos) + dissent (same s/o pair, different predicates); **bridge_score is not computed** (graph's domain, forbidden to send)

 ## T5 Output + CLI (R5/R6)

-- [ ] 5.1 POST envelope to graph's `POST /triplets/ingest` (strictly contract-conformant) ⏸ waiting on the graph write endpoint (InkStoneCo T3.3)
-- [ ] 5.2 Thin ops CLI (manual re-extraction); no query MCP
+- [x] 5.1 POST envelope to graph's `POST /triplets/ingest` (strictly contract-conformant; strict buildEnvelope + explicit forbidden-field self-check catches violations early). Aligned with the [full contract] (including embed/id/aliases/predicate_embed; the chief ruled that ingest does not scale back)
+- [x] 5.2 Thin ops CLI (`scripts/ingest-cli.mjs`: refresh via the Worker / pull dry-run); **no query MCP**

-## Blockers
+## Blockers / honest flags

-1. ⏸ T5.1 depends on graph's `POST /triplets/ingest` (InkStoneCo T3, pending implementation in the graph repo).
-2. ⏸ embed depends on base vectorize (InkStoneCo T2.4). The no-embed first version can proceed first.
+1. ⏸ **End-to-end ingest→graph run**: the graph receiver has been aligned with the full contract → remaining: ingest deployment + `GRAPH_BASE_URL` configuration → **pending post-deployment verification**; not reported as green.
+2. ⏸ embed depends on base vectorize (Arcrun #7). The no-embed (tag-only) first version is already underway.
+3. T3.3 model test set deferred; on the refresh side, extract (Workers AI) is harvest-only in the first version, with deep extraction left to CLI/CC.

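Item 4.1 names two detections: cross-repo bridges (a same-name node in at least two repos) and dissent (the same subject/object pair with different predicates). `src/lib/weave.ts` is not shown in this diff, so the following is only a hypothetical sketch of those two rules; `RepoGraph`, `WovenEdge`, `findBridges`, `findDissent`, and the sample data are assumed names, not the repository's API. In line with the task note, no bridge_score is computed.

```typescript
interface WovenEdge { subject: string; predicate: string; object: string }
interface RepoGraph { repo: string; nodes: string[]; triplets: WovenEdge[] }

// Group values by key, keeping each distinct value once, in first-seen order.
function groupDistinct(pairs: Array<[string, string]>): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const [key, value] of pairs) {
    const list = groups.get(key) ?? [];
    if (list.indexOf(value) === -1) list.push(value);
    groups.set(key, list);
  }
  return groups;
}

// Keep only the groups that have at least two distinct values.
function keepMulti(groups: Map<string, string[]>): Map<string, string[]> {
  const out = new Map<string, string[]>();
  groups.forEach((values, key) => {
    if (values.length >= 2) out.set(key, values);
  });
  return out;
}

// Bridge: a node name that appears in at least two repos. Result: name -> repos.
function findBridges(graphs: RepoGraph[]): Map<string, string[]> {
  const pairs: Array<[string, string]> = [];
  for (const g of graphs) for (const n of g.nodes) pairs.push([n, g.repo]);
  return keepMulti(groupDistinct(pairs));
}

// Dissent: one (subject, object) pair linked by different predicates.
// Result: JSON-encoded [subject, object] -> predicates.
function findDissent(graphs: RepoGraph[]): Map<string, string[]> {
  const pairs: Array<[string, string]> = [];
  for (const g of graphs) {
    for (const t of g.triplets) pairs.push([JSON.stringify([t.subject, t.object]), t.predicate]);
  }
  return keepMulti(groupDistinct(pairs));
}

const wovenInput: RepoGraph[] = [
  { repo: 'repoA', nodes: ['Graph RAG', 'RAG'], triplets: [{ subject: 'Graph RAG', predicate: 'extends', object: 'RAG' }] },
  { repo: 'repoB', nodes: ['Graph RAG', 'RAG', 'Vector DB'], triplets: [{ subject: 'Graph RAG', predicate: 'replaces', object: 'RAG' }] },
];

findBridges(wovenInput).forEach((repos, node) => console.log('bridge', node, repos));
findDissent(wovenInput).forEach((preds, pair) => console.log('dissent', pair, preds));
```

The sketch only reports which names and pairs qualify; scoring them is left to graph, as the task list requires.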
Generated file, +2673 lines: diff suppressed because it is too large.
@@ -0,0 +1,25 @@
|
||||
{
|
||||
"name": "kbdb-ingest-plugin",
|
||||
"version": "0.1.0",
|
||||
"private": true,
"description": "KBDB-ingest plugin: a pure feeder. Pulls from GitHub, harvests/extracts triplet candidates, weaves across repos, then POSTs the envelope to kbdb-graph-plugin. Never touches storage.",
|
||||
"type": "module",
|
||||
"scripts": {
|
||||
"dev": "wrangler dev",
|
||||
"deploy": "wrangler deploy",
|
||||
"test": "vitest run",
|
||||
"test:watch": "vitest",
|
||||
"ingest": "node scripts/ingest-cli.mjs"
|
||||
},
|
||||
"dependencies": {
|
||||
"@hono/zod-openapi": "^1.2.4",
|
||||
"hono": "^4.7.0",
|
||||
"zod": "^4.3.6"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@cloudflare/workers-types": "^4.20250219.0",
|
||||
"typescript": "^5.7.0",
|
||||
"vitest": "^3.1.0",
|
||||
"wrangler": "^4.0.0"
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,117 @@
|
||||
#!/usr/bin/env node
// Thin ops CLI (T5.2): a human manually triggers re-extraction. No query MCP (nobody "asks" an ambient feeder anything).
//
// Two modes:
//   ingest refresh <github:owner/repo@path>   re-extract a single source via the deployed Worker's /refresh
//   ingest pull <owner/repo> [root]           local dry-run: pull + list the envelopes that would be sent (no POST)
//
// Configuration comes from env:
//   KBDB_INGEST_URL   base URL of the deployed ingest Worker (used by refresh mode)
//   GRAPH_BASE_URL    graph write endpoint (used by pull --post)
//   GITHUB_TOKEN      for pulling private repos (may be empty for public repos)
//
// Iron rule: the CLI never touches storage; refresh goes through the Worker, pull --post through the graph write endpoint. Trigger = manual, by a human (no scheduling).
|
||||
|
||||
import process from 'node:process';
|
||||
|
||||
const [, , cmd, arg, arg2] = process.argv;
|
||||
|
||||
async function sha256hex(text) {
|
||||
const data = new TextEncoder().encode(text);
|
||||
const digest = await crypto.subtle.digest('SHA-256', data);
|
||||
return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
|
||||
}
|
||||
|
||||
function ghHeaders() {
|
||||
const h = { Accept: 'application/vnd.github+json', 'User-Agent': 'kbdb-ingest-cli' };
|
||||
if (process.env.GITHUB_TOKEN) h.Authorization = `Bearer ${process.env.GITHUB_TOKEN}`;
|
||||
return h;
|
||||
}
|
||||
|
||||
async function ghGetFile(owner, repo, path) {
|
||||
const url = `https://api.github.com/repos/${owner}/${repo}/contents/${path}`;
|
||||
const res = await fetch(url, { headers: ghHeaders() });
|
||||
if (!res.ok) throw new Error(`github ${owner}/${repo}@${path}: ${res.status}`);
|
||||
const body = await res.json();
|
||||
const text = body.encoding === 'base64' ? Buffer.from(body.content, 'base64').toString('utf-8') : body.content;
|
||||
// Note: the contents API's `sha` field is the file's blob SHA, not a commit SHA.
return { text, commit: body.sha };
|
||||
}
|
||||
|
||||
async function ghListMarkdown(owner, repo, root = '') {
|
||||
const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/git/trees/HEAD?recursive=1`, { headers: ghHeaders() });
|
||||
if (!res.ok) throw new Error(`github list ${owner}/${repo}: ${res.status}`);
|
||||
const body = await res.json();
|
||||
const prefix = root.replace(/^\/+|\/+$/g, '');
|
||||
return (body.tree || [])
|
||||
.filter((e) => e.type === 'blob' && e.path.endsWith('.md'))
|
||||
.map((e) => e.path)
|
||||
.filter((p) => (prefix ? p === prefix || p.startsWith(prefix + '/') : true));
|
||||
}
|
||||
|
||||
// Minimal harvest (mirrors src/lib/harvest.ts; used by the CLI dry-run, without importing TS).
function harvest(md) {
  const fm = /^---\n([\s\S]*?)\n---\n?([\s\S]*)$/.exec(md);
  const body = fm ? fm[2] : md;
  const gloss = fm && /^gloss:\s*(.+)$/m.exec(fm[1]) ? /^gloss:\s*(.+)$/m.exec(fm[1])[1].trim() : undefined;
  const title = /^#\s+(.+)$/m.exec(body)?.[1]?.trim();
  // No `m` flag: with it, `$` in the lookahead matches at every line end, so the lazy
  // capture would stop after the section's first line (see section() in src/lib/harvest.ts).
  const sec = (h) => new RegExp(`(?:^|\\n)##\\s+${h}[^\\n]*\\n([\\s\\S]*?)(?=\\n##\\s|$)`).exec(body)?.[1] || '';
  const nodes = [];
  if (title) nodes.push({ name: title, gloss, embed: true });
  for (const line of sec('實體').split('\n')) {
    // `- **name**（aliases）— gloss`; groups: 1 = name, 2 = aliases (full-width parentheses), 3 = gloss.
    const m = /^-\s*\*\*(.+?)\*\*\s*(?:（(.+?)）)?\s*(?:[—-]\s*(.+))?$/.exec(line.trim());
    if (m) nodes.push({ name: m[1].trim(), gloss: m[3]?.trim() || undefined, embed: true });
  }
  const triplets = [];
  for (const line of sec('關聯').split('\n')) {
    const m = /^(.+?)\s*>>\s*(.+?)\s*>>\s*(.+?)$/.exec(line.replace(/^-\s*/, '').trim());
    if (m) {
      const clean = (s) => s.replace(/\[\[|\]\]|\*\*/g, '').trim();
      triplets.push({ subject: clean(m[1]), predicate: m[2].trim(), object: clean(m[3]), predicate_embed: true });
    }
  }
  return { nodes, triplets };
}
|
||||
|
||||
async function doRefresh(uri) {
|
||||
const base = process.env.KBDB_INGEST_URL;
|
||||
if (!base) throw new Error('KBDB_INGEST_URL 未設(指向已部署的 ingest Worker)');
|
||||
const res = await fetch(base.replace(/\/$/, '') + '/refresh', {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify({ uri }),
|
||||
});
|
||||
console.log(JSON.stringify(await res.json(), null, 2));
|
||||
}
|
||||
|
||||
async function doPull(ownerRepo, root) {
|
||||
const [owner, repo] = ownerRepo.split('/');
|
||||
if (!owner || !repo) throw new Error('用法:ingest pull <owner/repo> [root]');
|
||||
const paths = await ghListMarkdown(owner, repo, root || '');
|
||||
console.error(`[ingest] ${owner}/${repo}: ${paths.length} 個 MD`);
|
||||
const envelopes = [];
|
||||
for (const path of paths) {
|
||||
const { text, commit } = await ghGetFile(owner, repo, path);
|
||||
const { nodes, triplets } = harvest(text);
|
||||
if (!triplets.length) continue; // nothing harvested (not a template card) → the dry-run skips it (the CLI does no extract)
|
||||
envelopes.push({
|
||||
source: { uri: `github:${owner}/${repo}@${path}`, content_hash: await sha256hex(text), commit },
|
||||
extractor: { model: 'local-harvest', tier: 'shallow' },
|
||||
nodes,
|
||||
triplets,
|
||||
});
|
||||
}
|
||||
console.error(`[ingest] 採取出 ${envelopes.length} 個 envelope(共 ${envelopes.reduce((n, e) => n + e.triplets.length, 0)} 三元組)`);
|
||||
console.log(JSON.stringify(envelopes, null, 2));
|
||||
}
|
||||
|
||||
try {
|
||||
if (cmd === 'refresh' && arg) await doRefresh(arg);
|
||||
else if (cmd === 'pull' && arg) await doPull(arg, arg2);
|
||||
else {
|
||||
console.error('用法:\n ingest refresh <github:owner/repo@path>\n ingest pull <owner/repo> [root]');
|
||||
process.exit(2);
|
||||
}
|
||||
} catch (e) {
|
||||
console.error('[ingest] 錯誤:', e.message);
|
||||
process.exit(1);
|
||||
}
|
||||
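The CLI builds `source.uri` as `github:owner/repo@path`, and the task list mentions a `makeSourceUri`/`parseSourceUri` round-trip. Those helpers live in `src/lib/source-adapter.ts`, which is not part of this diff, so the bodies below are assumptions: a minimal sketch of what such a round-trip could look like, with a hypothetical `SourceRef` type.

```typescript
interface SourceRef { owner: string; repo: string; path: string }

// Build the `github:owner/repo@path` form used in source.uri.
function makeSourceUri(owner: string, repo: string, path: string): string {
  return `github:${owner}/${repo}@${path}`;
}

// Parse it back. Returns null when the string does not have that shape.
// Assumption: owner and repo contain no '/', '@', or whitespace; the path is everything after the first '@'.
function parseSourceUri(uri: string): SourceRef | null {
  const m = /^github:([^/@\s]+)\/([^/@\s]+)@(.+)$/.exec(uri);
  return m ? { owner: m[1], repo: m[2], path: m[3] } : null;
}

const builtUri = makeSourceUri('uncle6me-web', 'kbdb-ingest-plugin', 'docs/cards/graph-rag.md');
console.log(builtUri);
console.log(parseSourceUri(builtUri));
console.log(parseSourceUri('github:owner/repo')); // no @path part
```

Returning `null` instead of throwing matches how the `/refresh` handler in this commit treats a bad `uri`: it checks for a falsy parse result and answers 400.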
@@ -0,0 +1,87 @@
|
||||
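// KBDB-ingest plugin Worker entry point: a pure feeder.
//
// Iron rule: never touch storage (no D1/Vectorize/AI bindings). Only POST envelopes to the graph write endpoint.
// Endpoint: /refresh is the receiving end for requests forwarded from graph's POST /graph/refresh (human-initiated, not automatic fan-out).
//   refresh receives {uri, owner_id} → pulls that source → harvests/extracts → POSTs the envelope to graph.
// No query MCP (ambient feeder); ops go through the thin CLI (scripts/ingest-cli.mjs).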
|
||||
|
||||
import { OpenAPIHono, createRoute, z } from '@hono/zod-openapi';
|
||||
import { cors } from 'hono/cors';
|
||||
import type { Bindings, Variables } from './types';
|
||||
import { makeGitHubFetcher, parseSourceUri, contentHash, makeSourceUri } from './lib/source-adapter';
|
||||
import { processSource } from './lib/pipeline';
|
||||
import { makeGraphClient } from './lib/graph-client';
|
||||
|
||||
const app = new OpenAPIHono<{ Bindings: Bindings; Variables: Variables }>();
|
||||
|
||||
app.onError((err, c) => {
|
||||
console.error(err);
|
||||
return c.json({ error: 'Internal Server Error', message: err.message }, 500);
|
||||
});
|
||||
|
||||
app.use('*', cors({ origin: '*', allowHeaders: ['Content-Type', 'Authorization'], allowMethods: ['GET', 'POST', 'OPTIONS'] }));
|
||||
|
||||
app.get('/', (c) => c.json({ service: 'kbdb-ingest', tier: 'plugin', role: 'feeder', status: 'ok' }));
|
||||
app.get('/health', (c) =>
|
||||
c.json({ service: 'kbdb-ingest', status: 'ok', graph_url_set: Boolean(c.env.GRAPH_BASE_URL) }),
|
||||
);
|
||||
|
||||
// POST /refresh: graph forwards a request to re-extract a source. Passive: one call received → one run (no scheduling/webhook).
|
||||
const refreshRoute = createRoute({
|
||||
method: 'post',
|
||||
path: '/refresh',
|
||||
request: {
|
||||
body: {
|
||||
content: {
|
||||
'application/json': {
|
||||
schema: z.object({
|
||||
uri: z.string().min(1).describe("github:owner/repo@path"),
|
||||
owner_id: z.string().optional(),
|
||||
}),
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
responses: {
|
||||
200: { description: 'Refreshed: pulled, harvested/extracted, posted envelope to graph' },
|
||||
400: { description: 'Bad uri' },
|
||||
},
|
||||
tags: ['Ingest'],
|
||||
});
|
||||
|
||||
app.openapi(refreshRoute, async (c) => {
|
||||
const { uri } = c.req.valid('json');
|
||||
const parsed = parseSourceUri(uri);
|
||||
if (!parsed) return c.json({ error: 'uri 須為 github:owner/repo@path' }, 400);
|
||||
|
||||
const fetcher = makeGitHubFetcher(c.env.GITHUB_TOKEN);
|
||||
const { text, commit } = await fetcher.getFile(parsed.owner, parsed.repo, parsed.path);
|
||||
const file = {
|
||||
uri: makeSourceUri(parsed.owner, parsed.repo, parsed.path),
|
||||
path: parsed.path,
|
||||
text,
|
||||
content_hash: await contentHash(text),
|
||||
commit,
|
||||
};
|
||||
|
||||
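// First-version refresh only takes the harvest path (path A); wiring an extract model to Workers AI in the Worker runtime comes later
// (the CLI side can use deep via CC). No triplets harvested → honestly report it as skipped; no fake extraction.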
|
||||
const result = await processSource(file);
|
||||
if (!result.envelope) {
|
||||
return c.json({ refreshed: false, path: result.path, note: result.note }, 200);
|
||||
}
|
||||
|
||||
const graph = makeGraphClient(c.env.GRAPH_BASE_URL, c.env.GRAPH_INTERNAL_TOKEN);
|
||||
const post = await graph.postEnvelope(result.envelope);
|
||||
return c.json(
|
||||
{
|
||||
refreshed: post.ok,
|
||||
path: result.path,
|
||||
triplets: result.envelope.triplets.length,
|
||||
graph: post.ok ? post.body : { status: post.status, error: post.error, issues: (post.body as any)?.issues },
|
||||
},
|
||||
200,
|
||||
);
|
||||
});
|
||||
|
||||
export default app;
|
||||
@@ -0,0 +1,49 @@
|
||||
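// Hard endpoint-alignment self-check guard (evidence from leo's stress test on a real vault: with the rule
// only written down, Haiku skips it and 14 endpoints fail to align; once it became a self-check action, 14→0).
//
// Rule: the subject/object of every in-body triplet must match some node name (character for character).
// A misaligned endpoint means a broken link in the downstream graph (the endpoint matches no node). This guard
// checks mechanically before the envelope leaves, collects the misaligned endpoints, and lets the caller repair / drop / warn.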
|
||||
|
||||
import type { EnvelopeEdge, EnvelopeNode } from '../types';
|
||||
|
||||
export interface AlignmentReport {
|
||||
aligned: boolean;
|
||||
/** Descriptions of misaligned endpoints (for humans / logs). */
|
||||
unaligned: string[];
|
||||
}
|
||||
|
||||
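/**
 * Check that every triplet endpoint matches some nodes[].name.
 * Card-to-card endpoints (`[[card]]` in the source) already had their brackets stripped in harvest, so everything is compared by bare name.
 */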
|
||||
export function checkEndpointAlignment(nodes: EnvelopeNode[], triplets: EnvelopeEdge[]): AlignmentReport {
|
||||
const names = new Set(nodes.map((n) => n.name));
|
||||
const unaligned: string[] = [];
|
||||
for (const t of triplets) {
|
||||
for (const [role, ep] of [['subject', t.subject], ['object', t.object]] as const) {
|
||||
if (!names.has(ep)) {
|
||||
unaligned.push(`${role}「${ep}」對不齊(${t.subject} >> ${t.predicate} >> ${t.object})`);
|
||||
}
|
||||
}
|
||||
}
|
||||
return { aligned: unaligned.length === 0, unaligned };
|
||||
}
|
||||
|
||||
/**
 * Auto-align: add each misaligned endpoint to nodes[] as a new node (embed: true, no gloss).
 * More conservative than dropping the triplet: the edge is kept and downstream can still normalize. Returns the patched nodes.
 */
|
||||
export function autoAlignEndpoints(nodes: EnvelopeNode[], triplets: EnvelopeEdge[]): EnvelopeNode[] {
|
||||
const names = new Set(nodes.map((n) => n.name));
|
||||
const out = [...nodes];
|
||||
for (const t of triplets) {
|
||||
for (const ep of [t.subject, t.object]) {
|
||||
if (!names.has(ep)) {
|
||||
names.add(ep);
|
||||
out.push({ name: ep, embed: true });
|
||||
}
|
||||
}
|
||||
}
|
||||
return out;
|
||||
}
|
||||
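As a usage illustration, here is a condensed, self-contained restatement of the check-then-fill idea in this file. `GNode` and `GEdge` are reduced stand-ins for `EnvelopeNode`/`EnvelopeEdge` from `../types` (not shown in this diff), and `unalignedEndpoints`/`autoAlign` are hypothetical names for the sketch, not the exported functions.

```typescript
interface GNode { name: string; embed?: boolean; gloss?: string }
interface GEdge { subject: string; predicate: string; object: string }

// List each endpoint (once) that matches no node name exactly.
function unalignedEndpoints(nodes: GNode[], triplets: GEdge[]): string[] {
  const names = new Set(nodes.map((n) => n.name));
  const missing: string[] = [];
  for (const t of triplets) {
    for (const ep of [t.subject, t.object]) {
      if (!names.has(ep) && missing.indexOf(ep) === -1) missing.push(ep);
    }
  }
  return missing;
}

// Keep every edge: each missing endpoint becomes a bare node (embed: true, no gloss).
function autoAlign(nodes: GNode[], triplets: GEdge[]): GNode[] {
  const out = [...nodes];
  for (const ep of unalignedEndpoints(nodes, triplets)) out.push({ name: ep, embed: true });
  return out;
}

const demoNodes: GNode[] = [{ name: 'A', gloss: 'the only declared node', embed: true }];
const demoEdges: GEdge[] = [
  { subject: 'A', predicate: 'cites', object: 'B' },
  { subject: 'B', predicate: 'refutes', object: 'C' },
];

console.log(unalignedEndpoints(demoNodes, demoEdges)); // endpoints with no matching node
const patchedNodes = autoAlign(demoNodes, demoEdges);
console.log(patchedNodes.map((n) => n.name));
console.log(unalignedEndpoints(patchedNodes, demoEdges)); // empty after the fill-in
```

Filling in bare nodes trades completeness of metadata (no gloss) for keeping every edge connected, which is the conservative choice the file's comments describe.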
@@ -0,0 +1,51 @@
|
||||
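// Envelope assembly + forbidden-field self-check before it leaves.
//
// One envelope = the product of one extraction of one source file (per the contract). After assembly it runs through EnvelopeSchema validation
// (strict → extra forbidden fields throw, caught early on the ingest side rather than waiting for a 422 from graph).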
|
||||
|
||||
import {
|
||||
EnvelopeSchema,
|
||||
FORBIDDEN_EDGE_KEYS,
|
||||
FORBIDDEN_TOP_KEYS,
|
||||
type Envelope,
|
||||
type EnvelopeEdge,
|
||||
type EnvelopeNode,
|
||||
} from '../types';
|
||||
|
||||
export interface BuildEnvelopeInput {
|
||||
source: { uri: string; content_hash: string; anchor?: string; commit?: string; block_id?: string };
|
||||
extractor: { model: string; tier: 'shallow' | 'deep'; extracted_at?: number };
|
||||
nodes?: EnvelopeNode[];
|
||||
triplets: EnvelopeEdge[];
|
||||
}
|
||||
|
||||
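/**
 * Build the envelope and validate it (strict).
 * - Structural symbols/prose must not get in; nodes/triplets are already filtered upstream (harvest/extract).
 * - Validation failure (extra forbidden fields, wrong shape) → throws ZodError for the caller to catch (earlier than a 422 from graph after sending).
 */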
|
||||
export function buildEnvelope(input: BuildEnvelopeInput): Envelope {
|
||||
// Explicit forbidden-field self-check (one more explicit barrier on top of the strict schema: if upstream slips in graph-domain fields, fail early).
|
||||
for (const n of input.nodes ?? []) {
|
||||
for (const k of [...FORBIDDEN_TOP_KEYS, 'clusters']) {
|
||||
if (k !== 'id' && k in (n as Record<string, unknown>)) {
|
||||
throw new Error(`envelope: node「${n.name}」帶禁送欄位 ${k}(graph 領域,ingest 不可送)`);
|
||||
}
|
||||
}
|
||||
}
|
||||
for (const t of input.triplets) {
|
||||
for (const k of FORBIDDEN_EDGE_KEYS) {
|
||||
if (k in (t as Record<string, unknown>)) {
|
||||
throw new Error(`envelope: 邊「${t.subject}>>${t.object}」帶禁送欄位 ${k}(類型只走 nodes[])`);
|
||||
}
|
||||
}
|
||||
}
|
||||
const candidate: Envelope = {
|
||||
source: input.source,
|
||||
extractor: input.extractor,
|
||||
triplets: input.triplets,
|
||||
...(input.nodes && input.nodes.length ? { nodes: input.nodes } : {}),
|
||||
};
|
||||
// Strict validation: the local equivalent of "forbidden field → block". Throws to the caller.
|
||||
return EnvelopeSchema.parse(candidate);
|
||||
}
|
||||
@@ -0,0 +1,110 @@
|
||||
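// T3 extract (path B, fallback): when the raw source has no locally built triplets, ingest extracts (s,p,o) + gloss itself.
//
// The model is user-selectable (by intent, not model name): shallow = Haiku/Workers AI (default, cheap); deep = Claude via CC (deep extraction, on the monthly plan).
// JSON-fail escalation gate: shallow parse failure / too-sparse extraction → escalate to deep and re-extract once.
// The first version does not embed (embed waits for base vectorize / Arcrun #7), but it still [tags] embed/predicate_embed for later reading.
// Endpoint-alignment guard: after extraction, self-check with endpoint-check + auto-align (leo's stress test 14→0).
//
// LLM calls are abstracted behind the LlmCaller interface → tests use a mock: no network, no cost.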
|
||||
|
||||
import type { EnvelopeEdge, EnvelopeNode } from '../types';
|
||||
import { autoAlignEndpoints, checkEndpointAlignment } from './endpoint-check';
|
||||
|
||||
export type ExtractTier = 'shallow' | 'deep';
|
||||
|
||||
export interface ExtractedGraph {
|
||||
nodes: EnvelopeNode[];
|
||||
triplets: EnvelopeEdge[];
|
||||
}
|
||||
|
||||
/** One LLM extraction call. Returns the model's [raw text] (expected to be JSON); this module is responsible for parsing it. */
|
||||
export interface LlmCaller {
|
||||
/** model = the resolved concrete model-name string (recorded in extractor.model). */
|
||||
readonly model: string;
|
||||
call(prompt: string, text: string): Promise<string>;
|
||||
}
|
||||
|
||||
export interface ExtractResult extends ExtractedGraph {
|
||||
tier: ExtractTier;
|
||||
model: string;
|
||||
/** Whether the run escalated to deep because shallow hit a JSON failure or was too sparse. */
|
||||
escalated: boolean;
|
||||
}
|
||||
|
||||
const EXTRACT_PROMPT = `你是知識圖譜萃取器。讀下面的原文,萃出三元組與實體。嚴格輸出 JSON(繁體中文內容),格式:
|
||||
{
|
||||
"nodes": [{"name": "正規名", "gloss": "一句話定義(這個實體是什麼)", "aliases": ["同義詞"]}],
|
||||
"triplets": [{"subject": "主詞", "predicate": "動詞短語", "object": "受詞", "confidence": 0.0-1.0}]
|
||||
}
|
||||
規則:
|
||||
- 謂詞用動詞/動詞短語(如「奠基於」「反駁」),禁名詞當謂詞。
|
||||
- triplet 的 subject/object 必須對得上某個 nodes[].name(一字不差)。
|
||||
- 抓深層暗示,不只表面陳述。只輸出 JSON,不要其他文字。`;
|
||||
|
||||
/** Parse the model's JSON output (tolerates a ```json fenced block). Throws on failure. */
|
||||
export function parseExtractJson(raw: string): ExtractedGraph {
|
||||
const fenced = /```(?:json)?\s*([\s\S]*?)```/.exec(raw);
|
||||
const jsonText = (fenced ? fenced[1] : raw).trim();
|
||||
const parsed = JSON.parse(jsonText) as Partial<ExtractedGraph>;
|
||||
if (!Array.isArray(parsed.triplets) || parsed.triplets.length === 0) {
|
||||
throw new Error('extract: no triplets in model output');
|
||||
}
|
||||
const nodes: EnvelopeNode[] = (parsed.nodes ?? []).map((n) => ({
|
||||
name: String(n.name),
|
||||
gloss: n.gloss ? String(n.gloss) : undefined,
|
||||
aliases: Array.isArray(n.aliases) ? n.aliases.map(String) : undefined,
|
||||
embed: true, // tag true (base reads the tag and executes; base is not wired up in the first version, but the tag still conforms to the contract)
|
||||
}));
|
||||
const triplets: EnvelopeEdge[] = parsed.triplets.map((t) => ({
|
||||
subject: String(t.subject),
|
||||
predicate: String(t.predicate),
|
||||
object: String(t.object),
|
||||
confidence: typeof t.confidence === 'number' ? t.confidence : undefined,
|
||||
predicate_embed: true,
|
||||
}));
|
||||
return { nodes, triplets };
|
||||
}
|
||||
|
||||
/** Too sparse (threshold) → treated as a failure; triggers escalation. */
|
||||
function tooSparse(g: ExtractedGraph): boolean {
|
||||
return g.triplets.length < 1;
|
||||
}
|
||||
|
||||
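/**
 * extract: shallow-extract with shallowCaller first; on JSON failure or too-sparse output, escalate once to deepCaller if there is one.
 * After extraction, run the endpoint-alignment guard and auto-align. Omitting deepCaller = no escalation (shallow only).
 */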
|
||||
export async function extract(
|
||||
text: string,
|
||||
shallowCaller: LlmCaller,
|
||||
deepCaller?: LlmCaller,
|
||||
): Promise<ExtractResult> {
|
||||
let tier: ExtractTier = 'shallow';
|
||||
let model = shallowCaller.model;
|
||||
let graph: ExtractedGraph | null = null;
|
||||
let escalated = false;
|
||||
|
||||
try {
|
||||
graph = parseExtractJson(await shallowCaller.call(EXTRACT_PROMPT, text));
|
||||
if (tooSparse(graph)) throw new Error('extract: shallow too sparse');
|
||||
} catch {
|
||||
graph = null;
|
||||
}
|
||||
|
||||
if (!graph && deepCaller) {
|
||||
escalated = true;
|
||||
tier = 'deep';
|
||||
model = deepCaller.model;
|
||||
graph = parseExtractJson(await deepCaller.call(EXTRACT_PROMPT, text)); // a deep failure throws to the caller
|
||||
}
|
||||
|
||||
if (!graph) throw new Error('extract: shallow failed and no deep caller to escalate');
|
||||
|
||||
// Endpoint-alignment guard (required by leo's stress test): self-check + auto-align (keep the edge, never drop it).
|
||||
const aligned = autoAlignEndpoints(graph.nodes, graph.triplets);
|
||||
const report = checkEndpointAlignment(aligned, graph.triplets);
|
||||
// After auto-align everything should be aligned; anything left (in theory impossible) is left to the caller, but does not block.
|
||||
void report;
|
||||
|
||||
return { nodes: aligned, triplets: graph.triplets, tier, model, escalated };
|
||||
}
|
||||
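The escalation gate in this file can be exercised with mock callers, which is the reason the LLM call sits behind an interface. The sketch below is a condensed stand-in, not the module itself: `Caller`, `GateResult`, `countTriplets`, and `extractWithGate` are hypothetical names, the prompt is a placeholder string, and the endpoint-alignment step is left out to keep the gate logic visible.

```typescript
interface Caller { readonly model: string; call(prompt: string, text: string): Promise<string> }
interface GateResult { tripletCount: number; tier: 'shallow' | 'deep'; model: string; escalated: boolean }

// Count triplets in a model reply; throws on invalid JSON or an empty list.
function countTriplets(raw: string): number {
  const parsed = JSON.parse(raw) as { triplets?: unknown };
  if (!Array.isArray(parsed.triplets) || parsed.triplets.length === 0) {
    throw new Error('no triplets in model output');
  }
  return parsed.triplets.length;
}

async function extractWithGate(text: string, shallow: Caller, deep?: Caller): Promise<GateResult> {
  let shallowCount: number | null = null;
  try {
    shallowCount = countTriplets(await shallow.call('PLACEHOLDER PROMPT', text));
  } catch {
    shallowCount = null; // JSON failure or empty output: fall through to the gate
  }
  if (shallowCount !== null) {
    return { tripletCount: shallowCount, tier: 'shallow', model: shallow.model, escalated: false };
  }
  if (!deep) throw new Error('shallow failed and no deep caller to escalate');
  // Exactly one escalation: if deep also fails, the error propagates to the caller.
  const deepCount = countTriplets(await deep.call('PLACEHOLDER PROMPT', text));
  return { tripletCount: deepCount, tier: 'deep', model: deep.model, escalated: true };
}

// Mock callers: no network, no cost.
const brokenShallow: Caller = { model: 'mock-shallow', call: async () => 'not json at all' };
const workingDeep: Caller = {
  model: 'mock-deep',
  call: async () => JSON.stringify({ triplets: [{ subject: 'A', predicate: 'cites', object: 'B' }] }),
};

extractWithGate('some source text', brokenShallow, workingDeep).then((r) => console.log(r));
```

Recording `tier` and `model` from whichever caller actually produced the result keeps the envelope's `extractor` block honest about where the triplets came from.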
@@ -0,0 +1,58 @@
|
||||
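// T5 graph client: cherry-picks the HTTP-helper pattern from polaris/mira/tools/_kbdb_client.py,
// but [converted into a pure feeder]: it only POSTs envelopes to the graph write endpoint and **never writes to base or touches D1/Vectorize/tables**.
//
// The original _kbdb_client.py hit base's /kbdb/entries directly (touching storage), which is exactly what the ingest iron rule forbids.
// This file keeps its "unified http wrapper + headers + fault-tolerant return value" skeleton and retargets it at graph's
// POST /triplets/ingest (API-as-Wall: ingest feeds candidates only through graph's HTTP write endpoint).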
|
||||
|
||||
import type { Envelope } from '../types';
|
||||
|
||||
export interface PostResult {
|
||||
ok: boolean;
|
||||
/** graph's {skipped,ingested,deprecated} response (200); on a 422 or an unset base URL, ok=false. */
|
||||
status: number;
|
||||
body?: unknown;
|
||||
error?: string;
|
||||
}
|
||||
|
||||
export interface GraphClient {
|
||||
postEnvelope(env: Envelope): Promise<PostResult>;
|
||||
}
|
||||
|
||||
/**
 * The real graph client. Empty baseUrl → honestly returns {ok:false, error:'GRAPH_BASE_URL 未設'} ("not set"), no fake green
 * (in line with the honesty principle on the graph side, where refresh honestly returns forwarded:false when the ingest URL is unset).
 */
|
||||
export function makeGraphClient(
|
||||
baseUrl: string | undefined,
|
||||
token?: string,
|
||||
fetchImpl: typeof fetch = fetch,
|
||||
): GraphClient {
|
||||
return {
|
||||
async postEnvelope(env) {
|
||||
if (!baseUrl) {
|
||||
return { ok: false, status: 0, error: 'GRAPH_BASE_URL 未設:graph 寫入端尚未就緒/未部署,envelope 無對象可送。' };
|
||||
}
|
||||
const headers: Record<string, string> = { 'Content-Type': 'application/json' };
|
||||
if (token) headers.Authorization = `Bearer ${token}`;
|
||||
const url = baseUrl.replace(/\/$/, '') + '/triplets/ingest';
|
||||
let res: Response;
|
||||
try {
|
||||
res = await fetchImpl(url, { method: 'POST', headers, body: JSON.stringify(env) });
|
||||
} catch (e) {
|
||||
return { ok: false, status: 0, error: `[graph] POST ${url}: ${(e as Error).message}` };
|
||||
}
|
||||
let body: unknown;
|
||||
try {
|
||||
body = await res.json();
|
||||
} catch {
|
||||
body = undefined;
|
||||
}
|
||||
// 422 = the envelope violated the contract (forbidden field / shape) → not ok; graph's issues are passed along for fixing.
|
||||
if (!res.ok) {
|
||||
return { ok: false, status: res.status, body, error: `graph ${res.status} ${res.statusText}` };
|
||||
}
|
||||
return { ok: true, status: res.status, body };
|
||||
},
|
||||
};
|
||||
}
|
||||
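The `fetchImpl` parameter makes this client testable without a network. Below is a condensed sketch of that idea under stated assumptions: `FetchLike`, `HttpReply`, `SendResult`, and `postEnvelopeSketch` are hypothetical reduced types and names (the real code uses the platform `fetch` type), and `https://graph.example` is a made-up base URL.

```typescript
interface HttpReply { ok: boolean; status: number; json(): Promise<unknown> }
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<HttpReply>;
interface SendResult { ok: boolean; status: number; body?: unknown; error?: string }

async function postEnvelopeSketch(
  baseUrl: string | undefined,
  envelope: unknown,
  fetchImpl: FetchLike,
): Promise<SendResult> {
  // No target configured: report it instead of pretending the send succeeded.
  if (!baseUrl) return { ok: false, status: 0, error: 'base URL not set' };
  const url = baseUrl.replace(/\/$/, '') + '/triplets/ingest';
  try {
    const res = await fetchImpl(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(envelope),
    });
    const body = await res.json().catch(() => undefined); // tolerate a non-JSON reply
    return res.ok
      ? { ok: true, status: res.status, body }
      : { ok: false, status: res.status, body, error: `graph ${res.status}` };
  } catch (e) {
    return { ok: false, status: 0, error: (e as Error).message };
  }
}

// A fake fetch that records the URL and answers 422, as a receiver might for a contract violation.
const recordedUrls: string[] = [];
const fake422: FetchLike = async (url) => {
  recordedUrls.push(url);
  return { ok: false, status: 422, json: async () => ({ issues: ['forbidden field'] }) };
};

postEnvelopeSketch('https://graph.example/', { triplets: [] }, fake422).then((r) =>
  console.log(r.status, r.error, recordedUrls[0]),
);
```

Every failure path returns a value rather than throwing, so a caller such as the `/refresh` handler can relay the receiver's status and issues without a try/catch.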
@@ -0,0 +1,146 @@
|
||||
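// T2 harvest (path A, preferred): harvest the already-built triplets + gloss from system-dev-template 1.8.0+ wiki cards.
//
// Local extraction works better (knowledge links grow at production time, guided by the LLM Wiki), so ingest harvests first and does not re-extract.
// Card format parsed (same origin as this repo's system-dev/wiki/cards):
//   frontmatter: gloss: (description of the card-title node)
//   ## 實體 (entities): one per line, `- **canonical name**（aliases…）— description` (in-body node + gloss)
//   ## 關聯 (relations): typed-edge `A >> predicate >> B` (in-body bare-text endpoints) / `[[card]] >> predicate >> [[card]]` (card-to-card)
//
// Iron rule: structural symbols (>>/←) and prose (## 摘要) do not enter the envelope. Tag embed/predicate_embed (default true).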
|
||||
|
||||
import type { EnvelopeEdge, EnvelopeNode } from '../types';
|
||||
|
||||
export interface HarvestResult {
|
||||
nodes: EnvelopeNode[];
|
||||
triplets: EnvelopeEdge[];
|
||||
/** Triplets whose endpoints do not align with `## 實體` (self-check guard; endpoint-check.ts uses this to warn). */
|
||||
unalignedEndpoints: string[];
|
||||
}
|
||||
|
||||
interface Frontmatter {
|
||||
gloss?: string;
|
||||
tags?: string[];
|
||||
}
|
||||
|
||||
/** Extract the frontmatter (--- … ---). Simple YAML; only gloss / tags are read. */
|
||||
export function parseFrontmatter(md: string): { fm: Frontmatter; body: string } {
|
||||
const m = /^---\n([\s\S]*?)\n---\n?([\s\S]*)$/.exec(md);
|
||||
if (!m) return { fm: {}, body: md };
|
||||
const fm: Frontmatter = {};
|
||||
for (const line of m[1].split('\n')) {
|
||||
const g = /^gloss:\s*(.+)$/.exec(line.trim());
|
||||
if (g) fm.gloss = g[1].replace(/^["']|["']$/g, '').trim();
|
||||
}
|
||||
return { fm, body: m[2] };
|
||||
}
|
||||
|
||||
/** Get the card title (the first # H1). */
|
||||
export function parseTitle(body: string): string | null {
|
||||
const m = /^#\s+(.+)$/m.exec(body);
|
||||
return m ? m[1].trim() : null;
|
||||
}
|
||||
|
||||
/** Extract the body of an H2 section (up to the next H2 or end of file). H3 subsections (### …) still count as inside the section. */
|
||||
function section(body: string, heading: string): string | null {
|
||||
// No m flag (so that $ does not match at every line end); terminator = the next `\n## ` (H2, not H3) or end of string.
|
||||
const re = new RegExp(`(?:^|\\n)##\\s+${heading}[^\\n]*\\n([\\s\\S]*?)(?=\\n##\\s|$)`);
|
||||
const m = re.exec(body);
|
||||
return m ? m[1] : null;
|
||||
}
|
||||
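The note about avoiding the `m` flag is easy to verify in isolation. The sketch below runs the same section pattern with and without the flag on a small made-up card; the card text and variable names are illustrative only.

```typescript
// A tiny card with two H2 sections (made-up content).
const demoCard = ['# Card', '', '## 實體', '- **A**', '- **B**', '', '## 關聯', '- A >> cites >> B'].join('\n');

// Same pattern twice; the only difference is the `m` flag.
const sectionPattern = '(?:^|\\n)##\\s+實體[^\\n]*\\n([\\s\\S]*?)(?=\\n##\\s|$)';
const matchWithoutM = new RegExp(sectionPattern).exec(demoCard);
const matchWithM = new RegExp(sectionPattern, 'm').exec(demoCard);

// Without `m`, `$` only matches at the end of the string, so the lazy capture
// runs up to the next H2. With `m`, `$` matches at the first line end and the
// capture stops after one line.
const sectionWithoutM = matchWithoutM ? matchWithoutM[1] : '';
const sectionWithM = matchWithM ? matchWithM[1] : '';

console.log(JSON.stringify(sectionWithoutM));
console.log(JSON.stringify(sectionWithM));
```

The lazy quantifier is what makes the flag matter: it stops at the first position where the lookahead succeeds, and under `m` that position is the end of the section's first line.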
|
||||
/** Parse `## 實體` lines: `- **canonical name**（alias1／alias2）— description`. */
export function parseEntities(body: string): EnvelopeNode[] {
  const sec = section(body, '實體');
  if (!sec) return [];
  const out: EnvelopeNode[] = [];
  for (const raw of sec.split('\n')) {
    const line = raw.trim();
    if (!line.startsWith('-')) continue;
    // `- **name**（aliases）— gloss`, or `- **name** — gloss`, or `- **name**`
    // Groups: 1 = name, 2 = aliases (inside full-width parentheses), 3 = gloss.
    const m = /^-\s*\*\*(.+?)\*\*\s*(?:（(.+?)）)?\s*(?:[—-]\s*(.+))?$/.exec(line);
    if (!m) continue;
    const name = m[1].trim();
    // Aliases are separated by full-width "／" or "、" (template convention); ASCII '/' does not split (e.g. arcrun/kbdb is a single alias).
    const aliases = m[2]
      ? m[2].split(/[／、]/).map((s) => s.trim()).filter(Boolean)
      : undefined;
    const gloss = m[3]?.trim() || undefined;
    const node: EnvelopeNode = { name, embed: true };
    if (gloss) node.gloss = gloss;
    if (aliases && aliases.length) node.aliases = aliases;
    out.push(node);
  }
  return out;
}
|
||||
|
||||
/** A parsed edge, plus whether each of its two ends is card-to-card (the source had [[ ]]). */
|
||||
export interface ParsedEdge extends EnvelopeEdge {
|
||||
/** The subject end is a [[wikilink]] in the source (card-to-card; not required to align with ## 實體). */
|
||||
subjectIsCard: boolean;
|
||||
objectIsCard: boolean;
|
||||
}
|
||||
|
||||
/** Parse typed-edge lines `A >> predicate >> B` (sep is configurable, default >>). Strips `[[ ]]` and `**` from endpoints. */
|
||||
export function parseEdges(body: string, sep = '>>'): ParsedEdge[] {
|
||||
const sec = section(body, '關聯');
|
||||
if (!sec) return [];
|
||||
const out: ParsedEdge[] = [];
|
||||
const escSep = sep.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
|
||||
const re = new RegExp(`^(.+?)\\s*${escSep}\\s*(.+?)\\s*${escSep}\\s*(.+?)$`);
|
||||
for (const raw of sec.split('\n')) {
|
||||
const line = raw.trim();
|
||||
if (!line.startsWith('-')) continue;
|
||||
const m = re.exec(line.replace(/^-\s*/, ''));
|
||||
if (!m) continue;
|
||||
const clean = (s: string) => s.replace(/\[\[|\]\]/g, '').replace(/\*\*/g, '').trim();
|
||||
out.push({
|
||||
subject: clean(m[1]),
|
||||
predicate: m[2].trim(),
|
||||
object: clean(m[3]),
|
||||
predicate_embed: true,
|
||||
subjectIsCard: /\[\[.+?\]\]/.test(m[1]),
|
||||
objectIsCard: /\[\[.+?\]\]/.test(m[3]),
|
||||
});
|
||||
}
|
||||
return out;
|
||||
}
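As an illustration of the typed-edge format, here is a small self-contained sketch of the same idea for a single line. `parseEdgeLine`, `TypedEdge`, and `escapeRegExp` are hypothetical names chosen for this example; the logic follows the approach above (escape the separator, match three parts, strip wikilink and bold markers, remember whether each endpoint was a `[[card]]`).

```typescript
// Hypothetical single-line version of the typed-edge parser.
interface TypedEdge {
  subject: string;
  predicate: string;
  object: string;
  subjectIsCard: boolean;
  objectIsCard: boolean;
}

// Escape regex metacharacters so any separator (even '|') is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function parseEdgeLine(raw: string, sep = '>>'): TypedEdge | null {
  const esc = escapeRegExp(sep);
  const re = new RegExp(`^(.+?)\\s*${esc}\\s*(.+?)\\s*${esc}\\s*(.+?)$`);
  const m = re.exec(raw.trim().replace(/^-\s*/, ''));
  if (!m) return null;
  const isCard = (s: string): boolean => /\[\[.+?\]\]/.test(s);
  const clean = (s: string): string =>
    s.replace(/\[\[|\]\]/g, '').replace(/\*\*/g, '').trim();
  return {
    subject: clean(m[1]),
    predicate: m[2].trim(),
    object: clean(m[3]),
    subjectIsCard: isCard(m[1]),
    objectIsCard: isCard(m[3]),
  };
}

const cardEdge = parseEdgeLine('- [[掛載架構]] >> 受約束於 >> [[envelope-契約]]');
const pipeEdge = parseEdgeLine('- A | uses | B', '|');
```

Escaping the separator matters once it contains regex metacharacters: with `'|'` unescaped, the pattern would turn into an alternation instead of matching a literal pipe.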
|
||||
|
||||
/** Harvests a single card into nodes + triplets (the card-title node takes its gloss from the frontmatter). */
|
||||
export function harvestCard(md: string): HarvestResult {
|
||||
const { fm, body } = parseFrontmatter(md);
|
||||
const title = parseTitle(body);
|
||||
const nodes = parseEntities(body);
|
||||
|
||||
// 卡標題本身是個 node(wikilink 卡)。frontmatter gloss 描述它。
|
||||
if (title && !nodes.some((n) => n.name === title)) {
|
||||
const cardNode: EnvelopeNode = { name: title, embed: true };
|
||||
if (fm.gloss) cardNode.gloss = fm.gloss;
|
||||
nodes.unshift(cardNode);
|
||||
}
|
||||
|
||||
const parsed = parseEdges(body);
|
||||
|
||||
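// Card-to-card endpoints (written as [[card]] in the source) are graph nodes too (the cards being linked to), so add them to nodes (embed: true, no gloss).
// That keeps them aligned and tells downstream that these card nodes exist.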
|
||||
const nodeNames = new Set(nodes.map((n) => n.name));
|
||||
for (const e of parsed) {
|
||||
if (e.subjectIsCard && !nodeNames.has(e.subject)) { nodeNames.add(e.subject); nodes.push({ name: e.subject, embed: true }); }
|
||||
if (e.objectIsCard && !nodeNames.has(e.object)) { nodeNames.add(e.object); nodes.push({ name: e.object, embed: true }); }
|
||||
}
|
||||
|
||||
// Endpoint-alignment self-check (guardrail from leo's stress test): body-triplet endpoints that are not card-to-card must match some node name.
|
||||
const unalignedEndpoints: string[] = [];
|
||||
for (const e of parsed) {
|
||||
if (!e.subjectIsCard && !nodeNames.has(e.subject))
|
||||
unalignedEndpoints.push(`${e.subject}(在「${e.subject} >> ${e.predicate} >> ${e.object}」)`);
|
||||
if (!e.objectIsCard && !nodeNames.has(e.object))
|
||||
unalignedEndpoints.push(`${e.object}(在「${e.subject} >> ${e.predicate} >> ${e.object}」)`);
|
||||
}
|
||||
|
||||
// 去掉 ParsedEdge 的 isCard 標記 → 純 EnvelopeEdge。
|
||||
const triplets: EnvelopeEdge[] = parsed.map(({ subject, predicate, object, predicate_embed, confidence }) => ({
|
||||
subject, predicate, object, predicate_embed, ...(confidence !== undefined ? { confidence } : {}),
|
||||
}));
|
||||
|
||||
return { nodes, triplets, unalignedEndpoints };
|
||||
}
|
||||
@@ -0,0 +1,59 @@
|
||||
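// Orchestration: source → harvest (path A, preferred) / extract (path B, fallback) → envelope.
//
// Each SourceFile yields one envelope (contract: one file, one envelope). Harvest comes first: if the card already carries triplets, take them;
// only when nothing can be harvested (no `## 關聯` section, or not a template card) does extract run. Cross-repo weaving happens one layer up (weave), which aggregates the envelopes.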
|
||||
|
||||
import { harvestCard } from './harvest';
|
||||
import { extract, type LlmCaller } from './extract';
|
||||
import { buildEnvelope } from './envelope';
|
||||
import type { SourceFile } from './source-adapter';
|
||||
import type { Envelope } from '../types';
|
||||
|
||||
export interface ProcessOptions {
|
||||
shallowCaller?: LlmCaller;
|
||||
deepCaller?: LlmCaller;
|
||||
/** 採取(路徑 A)模型標記,記進 extractor.model。預設 'local-harvest'。 */
|
||||
harvestModel?: string;
|
||||
}
|
||||
|
||||
export interface ProcessResult {
|
||||
envelope: Envelope | null;
|
||||
path: 'harvest' | 'extract' | 'skipped';
|
||||
note?: string;
|
||||
}
|
||||
|
||||
/** 採取結果是否「夠」(有三元組)→ 不必 fallback 到 extract。 */
|
||||
function harvestSufficient(triplets: unknown[]): boolean {
|
||||
return triplets.length > 0;
|
||||
}
|
||||
|
||||
/** 處理單一來源檔 → envelope(採取優先,採不到 fallback extract)。 */
|
||||
export async function processSource(file: SourceFile, opts: ProcessOptions = {}): Promise<ProcessResult> {
|
||||
// 路徑 A:採取本地已建三元組+gloss。
|
||||
const harvested = harvestCard(file.text);
|
||||
if (harvestSufficient(harvested.triplets)) {
|
||||
const envelope = buildEnvelope({
|
||||
source: { uri: file.uri, content_hash: file.content_hash, commit: file.commit },
|
||||
extractor: { model: opts.harvestModel ?? 'local-harvest', tier: 'shallow' },
|
||||
nodes: harvested.nodes,
|
||||
triplets: harvested.triplets,
|
||||
});
|
||||
const note = harvested.unalignedEndpoints.length
|
||||
? `採取:${harvested.unalignedEndpoints.length} 端點對不齊(已留 node)`
|
||||
: undefined;
|
||||
return { envelope, path: 'harvest', note };
|
||||
}
|
||||
|
||||
// 路徑 B:裸原文 extract(需 shallowCaller)。
|
||||
if (!opts.shallowCaller) {
|
||||
return { envelope: null, path: 'skipped', note: '無本地三元組且未提供萃取模型 → 跳過' };
|
||||
}
|
||||
const ex = await extract(file.text, opts.shallowCaller, opts.deepCaller);
|
||||
const envelope = buildEnvelope({
|
||||
source: { uri: file.uri, content_hash: file.content_hash, commit: file.commit },
|
||||
extractor: { model: ex.model, tier: ex.tier },
|
||||
nodes: ex.nodes,
|
||||
triplets: ex.triplets,
|
||||
});
|
||||
return { envelope, path: 'extract', note: ex.escalated ? '淺萃失敗 → 升 deep' : undefined };
|
||||
}
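The branch order in `processSource` reduces to a small decision table. The sketch below is only an illustration: `choosePath` is a hypothetical helper that takes the harvested triplet count and whether a shallow caller was supplied, and it leaves out the async calls and envelope building.

```typescript
type IngestPath = 'harvest' | 'extract' | 'skipped';

// Hypothetical helper: mirrors only the branch order, not the real async pipeline.
function choosePath(harvestedTriplets: number, hasShallowCaller: boolean): IngestPath {
  // Path A wins whenever the card already carries at least one triplet.
  if (harvestedTriplets > 0) return 'harvest';
  // Nothing harvested and no model available: skip rather than fake an extraction.
  if (!hasShallowCaller) return 'skipped';
  // Path B: fall back to LLM extraction on the raw text.
  return 'extract';
}

const decisions: IngestPath[] = [
  choosePath(2, false),
  choosePath(2, true),
  choosePath(0, false),
  choosePath(0, true),
];
```

A supplied caller never overrides a successful harvest, so the second entry is still `'harvest'`.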
|
||||
@@ -0,0 +1,108 @@
|
||||
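// T1 SourceAdapter: pull a repo's Markdown files from GitHub and compute a per-file content hash.
//
// Hard rule: pull the repo at runtime through the GitHub API (no Actions, no webhook-driven auto-sync).
// A pull is a runtime action (one call started by a person or by /refresh), so it does not conflict with the flag red line.
// source.uri = 'github:<owner>/<repo>@<path>' (stable identifier = snapshot key + get_source pointer).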
|
||||
|
||||
export interface SourceFile {
|
||||
/** github:owner/repo@path */
|
||||
uri: string;
|
||||
/** Path relative to the repo root (the part after owner/repo). */
|
||||
path: string;
|
||||
/** 原始檔內容(UTF-8)。 */
|
||||
text: string;
|
||||
/** content_hash(sha256 hex,快照鍵)。 */
|
||||
content_hash: string;
|
||||
/** git commit sha(可追溯,選填)。 */
|
||||
commit?: string;
|
||||
}
|
||||
|
||||
/** SHA-256 as a hex string. Workers and Node 20+ both expose a global `crypto.subtle`. */
|
||||
export async function contentHash(text: string): Promise<string> {
|
||||
const data = new TextEncoder().encode(text);
|
||||
const digest = await crypto.subtle.digest('SHA-256', data);
|
||||
return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
|
||||
}
|
||||
|
||||
/** 組 source.uri(單一真相格式,全程經此函式產,避免拼錯)。 */
|
||||
export function makeSourceUri(owner: string, repo: string, path: string): string {
|
||||
return `github:${owner}/${repo}@${path}`;
|
||||
}
|
||||
|
||||
/** 解析 source.uri 回 {owner, repo, path}。null = 格式不符。 */
|
||||
export function parseSourceUri(uri: string): { owner: string; repo: string; path: string } | null {
|
||||
const m = /^github:([^/]+)\/([^@]+)@(.+)$/.exec(uri);
|
||||
if (!m) return null;
|
||||
return { owner: m[1], repo: m[2], path: m[3] };
|
||||
}
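A short round-trip sketch of the `source.uri` format follows. The two functions restate the template and regex shown above under the hypothetical names `makeUri` and `parseUri`, so the block runs on its own; the sample owner, repo, and paths are made up.

```typescript
interface UriParts {
  owner: string;
  repo: string;
  path: string;
}

function makeUri(owner: string, repo: string, path: string): string {
  return `github:${owner}/${repo}@${path}`;
}

function parseUri(uri: string): UriParts | null {
  // owner = up to the first '/', repo = up to the first '@', path = the rest.
  const m = /^github:([^/]+)\/([^@]+)@(.+)$/.exec(uri);
  return m ? { owner: m[1], repo: m[2], path: m[3] } : null;
}

// A path may itself contain '@' and '/': the repo part stops at the first '@'.
const roundTrip = parseUri(makeUri('o', 'r', 'docs/v1@draft/卡A.md'));
const rejected = parseUri('https://github.com/o/r/blob/main/a.md');
```

Because the repo group is `[^@]+`, a repo name containing `@` would not survive the round trip; the path has no such restriction.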
|
||||
|
||||
export interface GitHubFetcher {
|
||||
/** 列出 repo 內某路徑下的 MD 檔(遞迴)。回傳檔路徑 list。 */
|
||||
listMarkdown(owner: string, repo: string, root?: string): Promise<string[]>;
|
||||
/** 取單檔原文 + commit sha。 */
|
||||
getFile(owner: string, repo: string, path: string): Promise<{ text: string; commit?: string }>;
|
||||
}
|
||||
|
||||
/**
|
||||
* 真實 GitHub API fetcher(runtime 拉,非 Actions)。
|
||||
* token 選填:公庫可不帶;私庫帶 GITHUB_TOKEN。測試走 mock,不打網路。
|
||||
*/
|
||||
export function makeGitHubFetcher(token?: string, fetchImpl: typeof fetch = fetch): GitHubFetcher {
|
||||
const headers: Record<string, string> = {
|
||||
Accept: 'application/vnd.github+json',
|
||||
'User-Agent': 'kbdb-ingest-plugin',
|
||||
};
|
||||
if (token) headers.Authorization = `Bearer ${token}`;
|
||||
|
||||
const api = 'https://api.github.com';
|
||||
|
||||
return {
|
||||
async listMarkdown(owner, repo, root = '') {
|
||||
// git/trees 遞迴:一次 API call 拿整棵樹(避免逐目錄 fan-out 流量)。
|
||||
const res = await fetchImpl(`${api}/repos/${owner}/${repo}/git/trees/HEAD?recursive=1`, { headers });
|
||||
if (!res.ok) throw new Error(`[github] list ${owner}/${repo}: ${res.status} ${res.statusText}`);
|
||||
const body = (await res.json()) as { tree?: Array<{ path: string; type: string }> };
|
||||
const prefix = root.replace(/^\/+|\/+$/g, '');
|
||||
return (body.tree ?? [])
|
||||
.filter((e) => e.type === 'blob' && e.path.endsWith('.md'))
|
||||
.map((e) => e.path)
|
||||
.filter((p) => (prefix ? p === prefix || p.startsWith(prefix + '/') : true));
|
||||
},
|
||||
async getFile(owner, repo, path) {
|
||||
const res = await fetchImpl(`${api}/repos/${owner}/${repo}/contents/${encodeURIComponent(path).replace(/%2F/g, '/')}`, { headers });
|
||||
if (!res.ok) throw new Error(`[github] get ${owner}/${repo}@${path}: ${res.status} ${res.statusText}`);
|
||||
const body = (await res.json()) as { content?: string; encoding?: string; sha?: string };
|
||||
const text = body.encoding === 'base64' && body.content ? decodeBase64Utf8(body.content) : (body.content ?? '');
|
||||
// Note: the contents API returns the file's blob SHA in `sha`, not a commit SHA.
return { text, commit: body.sha };
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
function decodeBase64Utf8(b64: string): string {
|
||||
const clean = b64.replace(/\n/g, '');
|
||||
const bin = atob(clean);
|
||||
const bytes = Uint8Array.from(bin, (c) => c.charCodeAt(0));
|
||||
return new TextDecoder('utf-8').decode(bytes);
|
||||
}
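The reason for decoding through bytes instead of using `atob` alone can be shown with a small round trip. This is a sketch under assumptions: `encodeBase64Utf8` is a hypothetical helper that exists only to build sample input, and the line wrapping imitates base64 content that arrives with embedded newlines.

```typescript
// Same approach as above: strip newlines, atob to a binary string, then decode the bytes as UTF-8.
function decodeBase64Utf8Sketch(b64: string): string {
  const bin = atob(b64.replace(/\n/g, ''));
  const bytes = Uint8Array.from(bin, (c) => c.charCodeAt(0));
  return new TextDecoder('utf-8').decode(bytes);
}

// Hypothetical helper used only to produce sample input.
function encodeBase64Utf8(text: string): string {
  const bytes = new TextEncoder().encode(text);
  let bin = '';
  for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
  return btoa(bin);
}

const sampleText = '掛載架構 ingest';
// Insert a newline every 8 characters to imitate wrapped base64 content.
const wrappedB64 = encodeBase64Utf8(sampleText).replace(/(.{8})/g, '$1\n');
const decodedText = decodeBase64Utf8Sketch(wrappedB64);
// atob alone yields one char per byte, so multi-byte characters come out garbled.
const naiveText = atob(wrappedB64.replace(/\n/g, ''));
```

`decodedText` equals the original string, while `naiveText` does not, because each CJK character occupies three UTF-8 bytes.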
|
||||
|
||||
/** 拉一個 repo 路徑下所有 MD → SourceFile[](含 content_hash)。 */
|
||||
export async function pullRepoMarkdown(
|
||||
fetcher: GitHubFetcher,
|
||||
owner: string,
|
||||
repo: string,
|
||||
root = '',
|
||||
): Promise<SourceFile[]> {
|
||||
const paths = await fetcher.listMarkdown(owner, repo, root);
|
||||
const out: SourceFile[] = [];
|
||||
for (const path of paths) {
|
||||
const { text, commit } = await fetcher.getFile(owner, repo, path);
|
||||
out.push({
|
||||
uri: makeSourceUri(owner, repo, path),
|
||||
path,
|
||||
text,
|
||||
content_hash: await contentHash(text),
|
||||
commit,
|
||||
});
|
||||
}
|
||||
return out;
|
||||
}
|
||||
Binary file not shown.
@@ -0,0 +1,85 @@
|
||||
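// Shared types + a mirror of the envelope contract (contracts/ingest-candidate.json, full version with vectorization tagging).
//
// Hard rule: ingest is a pure feeder. It only tags embed/predicate_embed and carries gloss/aliases;
// the actual embedding is done by the base/KBDB embed module, which reads the tags. ingest never computes vectors itself.
// The envelope is the only coupling surface between ingest and graph (three rules: frozen contract).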
|
||||
|
||||
import { z } from '@hono/zod-openapi';
|
||||
|
||||
export interface Bindings {
|
||||
ENVIRONMENT?: string;
|
||||
/** graph 寫入端 base URL;空 = 未部署,POST 時誠實報 not-configured,不假綠。 */
|
||||
GRAPH_BASE_URL?: string;
|
||||
/** 萃取預設 tier 意圖(shallow=Haiku;deep=Claude via CC)。 */
|
||||
DEFAULT_EXTRACT_TIER?: 'shallow' | 'deep';
|
||||
/** 拉 GitHub 私庫用(公庫可空)。走 secret put。 */
|
||||
GITHUB_TOKEN?: string;
|
||||
/** graph 寫入端 bearer(對應 graph 的 KBDB_INTERNAL_TOKEN)。走 secret put。 */
|
||||
GRAPH_INTERNAL_TOKEN?: string;
|
||||
}
|
||||
|
||||
export interface Variables {
|
||||
partner_id: string;
|
||||
}
|
||||
|
||||
// ── envelope 契約(full:含 ingest#1 升格的向量化打標欄位)──────────────
|
||||
// graph 收件端 .strict() 追上 contract(graph#1 補對齊任務)後即收得下這些欄位。
|
||||
|
||||
export const EnvelopeNodeSchema = z
|
||||
.object({
|
||||
name: z.string().min(1),
|
||||
/** 去重鍵:wikilink 卡用檔名(一卡一 node,不以出現次數重複 embed);實體用正規名。 */
|
||||
id: z.string().optional(),
|
||||
/** 一句話描述。base embed【名+gloss 一起】拉近同義詞。建議 deep tier 產。 */
|
||||
gloss: z.string().optional(),
|
||||
/** 同義詞(黃仁勳/Jensen Huang)。base 歸一成同一 node。 */
|
||||
aliases: z.array(z.string()).optional(),
|
||||
/** 向量化打標:此 node 要不要進向量庫。預設 true。ingest 打標,base 讀標執行。 */
|
||||
embed: z.boolean().optional(),
|
||||
entity_type: z.enum(['person', 'event', 'product', 'market', 'org']).optional(),
|
||||
})
|
||||
.strict();
|
||||
|
||||
export const EnvelopeEdgeSchema = z
|
||||
.object({
|
||||
subject: z.string().min(1),
|
||||
predicate: z.string().min(1),
|
||||
object: z.string().min(1),
|
||||
/** 謂詞向量化打標(裸詞 embed,無描述)→ predicate_vector,支援關係過濾。預設 true。 */
|
||||
predicate_embed: z.boolean().optional(),
|
||||
confidence: z.number().min(0).max(1).optional(),
|
||||
})
|
||||
.strict();
|
||||
|
||||
export const EnvelopeSchema = z
|
||||
.object({
|
||||
source: z
|
||||
.object({
|
||||
/** 'github:<owner>/<repo>@<path>',= 快照鍵 + get_source 指標。 */
|
||||
uri: z.string().min(1),
|
||||
/** 來源檔內容 hash(快照鍵)。graph 比對同 hash → no-op。 */
|
||||
content_hash: z.string().min(1),
|
||||
anchor: z.string().optional(),
|
||||
commit: z.string().optional(),
|
||||
block_id: z.string().optional(),
|
||||
})
|
||||
.strict(),
|
||||
extractor: z
|
||||
.object({
|
||||
model: z.string().min(1),
|
||||
tier: z.enum(['shallow', 'deep']),
|
||||
extracted_at: z.number().int().optional(),
|
||||
})
|
||||
.strict(),
|
||||
nodes: z.array(EnvelopeNodeSchema).optional(),
|
||||
triplets: z.array(EnvelopeEdgeSchema).min(1),
|
||||
})
|
||||
.strict();
|
||||
|
||||
export type EnvelopeNode = z.infer<typeof EnvelopeNodeSchema>;
|
||||
export type EnvelopeEdge = z.infer<typeof EnvelopeEdgeSchema>;
|
||||
export type Envelope = z.infer<typeof EnvelopeSchema>;
|
||||
|
||||
/** graph 領域欄位 — ingest 絕不可送(送了被 graph 422)。用於本地自檢,提早攔。 */
|
||||
export const FORBIDDEN_TOP_KEYS = ['id', 'clusters', 'bridge_score', 'created_at', 'updated_at'] as const;
|
||||
export const FORBIDDEN_EDGE_KEYS = ['subject_entity_type', 'object_entity_type'] as const;
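One way the two forbidden-key lists could drive a local self-check before POSTing is sketched below. How `buildEnvelope` actually performs its check is not shown in this diff, so `findForbiddenKeys` is a hypothetical function; the key lists are copied from the constants above.

```typescript
const FORBIDDEN_TOP = ['id', 'clusters', 'bridge_score', 'created_at', 'updated_at'] as const;
const FORBIDDEN_EDGE = ['subject_entity_type', 'object_entity_type'] as const;

// Hypothetical self-check: returns the paths of every graph-domain key found.
function findForbiddenKeys(candidate: Record<string, unknown>): string[] {
  const hits: string[] = [];
  for (const k of FORBIDDEN_TOP) {
    if (k in candidate) hits.push(k);
  }
  const triplets = Array.isArray(candidate.triplets) ? candidate.triplets : [];
  triplets.forEach((t: unknown, i: number) => {
    if (typeof t !== 'object' || t === null) return;
    for (const k of FORBIDDEN_EDGE) {
      if (k in t) hits.push(`triplets[${i}].${k}`);
    }
  });
  return hits;
}

const offending = findForbiddenKeys({
  source: { uri: 'github:o/r@a.md', content_hash: 'abc' },
  bridge_score: 0.5,
  triplets: [{ subject: 'A', predicate: 'p', object: 'B', subject_entity_type: 'person' }],
});
const clean = findForbiddenKeys({
  source: { uri: 'github:o/r@a.md', content_hash: 'abc' },
  triplets: [{ subject: 'A', predicate: 'p', object: 'B' }],
});
```

Returning paths rather than throwing lets a caller report every offending field at once; a strict schema would still reject the same input, this only produces a more targeted message.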
|
||||
@@ -0,0 +1,47 @@
|
||||
import { describe, it, expect } from 'vitest';
|
||||
import { buildEnvelope } from '../src/lib/envelope';
|
||||
|
||||
const base = {
|
||||
source: { uri: 'github:o/r@a.md', content_hash: 'abc' },
|
||||
extractor: { model: 'local-harvest', tier: 'shallow' as const },
|
||||
triplets: [{ subject: 'A', predicate: 'p', object: 'B', predicate_embed: true }],
|
||||
};
|
||||
|
||||
describe('buildEnvelope', () => {
|
||||
it('組合法 envelope(含向量化打標欄位)', () => {
|
||||
const env = buildEnvelope({
|
||||
...base,
|
||||
nodes: [{ name: 'A', gloss: 'a', aliases: ['a2'], embed: true, id: 'A' }],
|
||||
});
|
||||
expect(env.source.uri).toBe('github:o/r@a.md');
|
||||
expect(env.nodes?.[0].embed).toBe(true);
|
||||
expect(env.nodes?.[0].id).toBe('A');
|
||||
expect(env.triplets[0].predicate_embed).toBe(true);
|
||||
});
|
||||
|
||||
it('node 帶禁送欄位(bridge_score)→ strict throw(本地提早攔,不等 graph 422)', () => {
|
||||
expect(() => buildEnvelope({ ...base, nodes: [{ name: 'A', embed: true }] })).not.toThrow();
|
||||
expect(() =>
|
||||
buildEnvelope({ ...base, nodes: [{ name: 'A', bridge_score: 0.5 } as any] }),
|
||||
).toThrow();
|
||||
});
|
||||
|
||||
it('nodes[].id (dedup key) is allowed; graph-domain key (clusters) → strict throw', () => {
// The contract allows nodes[].id (the dedup key), but clusters belongs to the graph domain, so strict rejects it.
|
||||
expect(() => buildEnvelope({ ...base, nodes: [{ name: 'A', id: 'A', embed: true }] })).not.toThrow();
|
||||
expect(() => buildEnvelope({ ...base, nodes: [{ name: 'A', clusters: ['c'] } as any] })).toThrow();
|
||||
});
|
||||
|
||||
it('禁送邊上 entity_type → strict throw', () => {
|
||||
expect(() =>
|
||||
buildEnvelope({
|
||||
...base,
|
||||
triplets: [{ subject: 'A', predicate: 'p', object: 'B', subject_entity_type: 'person' } as any],
|
||||
}),
|
||||
).toThrow();
|
||||
});
|
||||
|
||||
it('無 triplets → throw(契約 min 1)', () => {
|
||||
expect(() => buildEnvelope({ ...base, triplets: [] })).toThrow();
|
||||
});
|
||||
});
|
||||
@@ -0,0 +1,58 @@
|
||||
import { describe, it, expect } from 'vitest';
|
||||
import { extract, parseExtractJson, type LlmCaller } from '../src/lib/extract';
|
||||
|
||||
const GOOD_JSON = JSON.stringify({
|
||||
nodes: [
|
||||
{ name: '原子筆記', gloss: '一個不可再分論點的記錄單元' },
|
||||
{ name: '傳統筆記', gloss: '多主題混雜的記錄' },
|
||||
],
|
||||
triplets: [{ subject: '原子筆記', predicate: '對立於', object: '傳統筆記', confidence: 0.9 }],
|
||||
});
|
||||
|
||||
function caller(model: string, out: string | (() => Promise<string>)): LlmCaller {
|
||||
return { model, call: typeof out === 'string' ? async () => out : out };
|
||||
}
|
||||
|
||||
describe('parseExtractJson', () => {
|
||||
it('解析 fenced JSON + 打標 embed/predicate_embed', () => {
|
||||
const g = parseExtractJson('```json\n' + GOOD_JSON + '\n```');
|
||||
expect(g.triplets[0].predicate_embed).toBe(true);
|
||||
expect(g.nodes[0].embed).toBe(true);
|
||||
expect(g.triplets[0].confidence).toBe(0.9);
|
||||
});
|
||||
|
||||
it('無 triplets → throw', () => {
|
||||
expect(() => parseExtractJson(JSON.stringify({ nodes: [], triplets: [] }))).toThrow();
|
||||
});
|
||||
});
|
||||
|
||||
describe('extract', () => {
|
||||
it('淺萃成功不升級', async () => {
|
||||
const r = await extract('原文', caller('haiku', GOOD_JSON));
|
||||
expect(r.tier).toBe('shallow');
|
||||
expect(r.escalated).toBe(false);
|
||||
expect(r.model).toBe('haiku');
|
||||
});
|
||||
|
||||
it('淺萃 JSON-fail → 升 deep(升級閘)', async () => {
|
||||
const r = await extract('原文', caller('haiku', 'not json at all'), caller('claude', GOOD_JSON));
|
||||
expect(r.escalated).toBe(true);
|
||||
expect(r.tier).toBe('deep');
|
||||
expect(r.model).toBe('claude');
|
||||
expect(r.triplets.length).toBe(1);
|
||||
});
|
||||
|
||||
it('淺萃失敗且無 deep caller → throw', async () => {
|
||||
await expect(extract('原文', caller('haiku', 'garbage'))).rejects.toThrow();
|
||||
});
|
||||
|
||||
it('端點對齊護欄:模型吐對不齊端點 → 自動補進 nodes', async () => {
|
||||
const skewed = JSON.stringify({
|
||||
nodes: [{ name: 'A' }],
|
||||
triplets: [{ subject: 'A', predicate: '連到', object: 'B(沒在 nodes)' }],
|
||||
});
|
||||
const r = await extract('原文', caller('haiku', skewed));
|
||||
// B 被自動補成 node → 端點全對齊
|
||||
expect(r.nodes.some((n) => n.name === 'B(沒在 nodes)')).toBe(true);
|
||||
});
|
||||
});
|
||||
@@ -0,0 +1,43 @@
|
||||
import { describe, it, expect } from 'vitest';
|
||||
import { makeGraphClient } from '../src/lib/graph-client';
|
||||
import type { Envelope } from '../src/types';
|
||||
|
||||
const env: Envelope = {
|
||||
source: { uri: 'github:o/r@a.md', content_hash: 'abc' },
|
||||
extractor: { model: 'local-harvest', tier: 'shallow' },
|
||||
triplets: [{ subject: 'A', predicate: 'p', object: 'B' }],
|
||||
};
|
||||
|
||||
function mockFetch(status: number, body: unknown): typeof fetch {
|
||||
return (async () =>
|
||||
new Response(JSON.stringify(body), { status, headers: { 'Content-Type': 'application/json' } })) as any;
|
||||
}
|
||||
|
||||
describe('makeGraphClient', () => {
|
||||
it('GRAPH_BASE_URL 未設 → 誠實回 ok:false,不假綠、不打網路', async () => {
|
||||
let called = false;
|
||||
const client = makeGraphClient(undefined, undefined, (async () => {
|
||||
called = true;
|
||||
return new Response('{}');
|
||||
}) as any);
|
||||
const r = await client.postEnvelope(env);
|
||||
expect(r.ok).toBe(false);
|
||||
expect(r.error).toContain('未設');
|
||||
expect(called).toBe(false);
|
||||
});
|
||||
|
||||
it('200 → ok + 帶 graph 回的 {skipped,ingested,deprecated}', async () => {
|
||||
const client = makeGraphClient('https://graph.example', 'tok', mockFetch(200, { skipped: false, ingested: 1, deprecated: 0 }));
|
||||
const r = await client.postEnvelope(env);
|
||||
expect(r.ok).toBe(true);
|
||||
expect((r.body as any).ingested).toBe(1);
|
||||
});
|
||||
|
||||
it('422 → ok:false 帶 issues(供修禁送欄位)', async () => {
|
||||
const client = makeGraphClient('https://graph.example', undefined, mockFetch(422, { error: 'invalid envelope', issues: [{ path: ['bridge_score'] }] }));
|
||||
const r = await client.postEnvelope(env);
|
||||
expect(r.ok).toBe(false);
|
||||
expect(r.status).toBe(422);
|
||||
expect((r.body as any).issues).toBeDefined();
|
||||
});
|
||||
});
|
||||
@@ -0,0 +1,68 @@
|
||||
import { describe, it, expect } from 'vitest';
|
||||
import { harvestCard, parseEntities, parseEdges, parseFrontmatter } from '../src/lib/harvest';
|
||||
|
||||
const CARD = `---
|
||||
tags: [掛載架構, 架構設計]
|
||||
gloss: ingest 在 KBDB 堆疊裡的位置。
|
||||
---
|
||||
# 掛載架構
|
||||
|
||||
← [[ingest/00-INDEX]]
|
||||
|
||||
## 摘要
|
||||
KBDB 是三層堆疊。
|
||||
|
||||
## 實體
|
||||
- **kbdb-ingest-plugin**(餵食器) — 最薄一層,純 POST 候選。
|
||||
- **base KBDB**(arcrun/kbdb/基本盤) — 最底儲存層。
|
||||
|
||||
## 關聯
|
||||
### 內文知識關係
|
||||
- kbdb-ingest-plugin >> 掛載於 >> base KBDB
|
||||
### 卡片關係
|
||||
- [[掛載架構]] >> 受約束於 >> [[envelope-契約]]
|
||||
`;
|
||||
|
||||
describe('parseFrontmatter', () => {
|
||||
it('抽出 gloss', () => {
|
||||
const { fm, body } = parseFrontmatter(CARD);
|
||||
expect(fm.gloss).toBe('ingest 在 KBDB 堆疊裡的位置。');
|
||||
expect(body).toContain('# 掛載架構');
|
||||
});
|
||||
});
|
||||
|
||||
describe('parseEntities', () => {
|
||||
it('解析正規名 + aliases + gloss', () => {
|
||||
const { body } = parseFrontmatter(CARD);
|
||||
const nodes = parseEntities(body);
|
||||
expect(nodes.map((n) => n.name)).toEqual(['kbdb-ingest-plugin', 'base KBDB']);
|
||||
expect(nodes[1].aliases).toEqual(['arcrun/kbdb', '基本盤']);
|
||||
expect(nodes[0].gloss).toBe('最薄一層,純 POST 候選。');
|
||||
expect(nodes[0].embed).toBe(true);
|
||||
});
|
||||
});
|
||||
|
||||
describe('parseEdges', () => {
|
||||
it('解析 typed-edge、去 [[ ]]、標記卡對卡', () => {
|
||||
const { body } = parseFrontmatter(CARD);
|
||||
const edges = parseEdges(body);
|
||||
expect(edges).toContainEqual({ subject: 'kbdb-ingest-plugin', predicate: '掛載於', object: 'base KBDB', predicate_embed: true, subjectIsCard: false, objectIsCard: false });
|
||||
expect(edges).toContainEqual({ subject: '掛載架構', predicate: '受約束於', object: 'envelope-契約', predicate_embed: true, subjectIsCard: true, objectIsCard: true });
|
||||
});
|
||||
});
|
||||
|
||||
describe('harvestCard', () => {
|
||||
it('卡標題 node 帶 frontmatter gloss、含內文 node', () => {
|
||||
const r = harvestCard(CARD);
|
||||
const titleNode = r.nodes.find((n) => n.name === '掛載架構');
|
||||
expect(titleNode?.gloss).toBe('ingest 在 KBDB 堆疊裡的位置。');
|
||||
expect(r.nodes.some((n) => n.name === 'base KBDB')).toBe(true);
|
||||
expect(r.triplets.length).toBe(2);
|
||||
});
|
||||
|
||||
it('內文端點對齊(無對不齊)', () => {
|
||||
const r = harvestCard(CARD);
|
||||
// kbdb-ingest-plugin / base KBDB 都在 ## 實體;卡對卡端點不要求
|
||||
expect(r.unalignedEndpoints).toEqual([]);
|
||||
});
|
||||
});
|
||||
@@ -0,0 +1,73 @@
|
||||
import { describe, it, expect } from 'vitest';
|
||||
import { makeSourceUri, parseSourceUri, contentHash, pullRepoMarkdown, type GitHubFetcher } from '../src/lib/source-adapter';
|
||||
import { processSource } from '../src/lib/pipeline';
|
||||
import type { LlmCaller } from '../src/lib/extract';
|
||||
|
||||
describe('source-adapter uri', () => {
|
||||
it('makeSourceUri / parseSourceUri round-trip', () => {
|
||||
const uri = makeSourceUri('uncle6me-web', 'kbdb-ingest-plugin', 'system-dev/wiki/cards/ingest/掛載架構.md');
|
||||
expect(uri).toBe('github:uncle6me-web/kbdb-ingest-plugin@system-dev/wiki/cards/ingest/掛載架構.md');
|
||||
expect(parseSourceUri(uri)).toEqual({
|
||||
owner: 'uncle6me-web',
|
||||
repo: 'kbdb-ingest-plugin',
|
||||
path: 'system-dev/wiki/cards/ingest/掛載架構.md',
|
||||
});
|
||||
});
|
||||
|
||||
it('content-hash 穩定且隨內容變', async () => {
|
||||
const a = await contentHash('hello');
|
||||
expect(a).toBe(await contentHash('hello'));
|
||||
expect(a).not.toBe(await contentHash('world'));
|
||||
});
|
||||
});
|
||||
|
||||
const HARVEST_CARD = `---
|
||||
gloss: 卡標題定義。
|
||||
---
|
||||
# 卡A
|
||||
## 實體
|
||||
- **甲** — 甲的定義。
|
||||
- **乙** — 乙的定義。
|
||||
## 關聯
|
||||
- 甲 >> 連到 >> 乙
|
||||
`;
|
||||
|
||||
function mockFetcher(files: Record<string, string>): GitHubFetcher {
|
||||
return {
|
||||
async listMarkdown() {
|
||||
return Object.keys(files);
|
||||
},
|
||||
async getFile(_o, _r, path) {
|
||||
return { text: files[path], commit: 'sha1' };
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
describe('pullRepoMarkdown + processSource', () => {
|
||||
it('採取路徑 A:拉檔 → harvest → envelope(不 extract)', async () => {
|
||||
const sources = await pullRepoMarkdown(mockFetcher({ 'cards/a.md': HARVEST_CARD }), 'o', 'r');
|
||||
expect(sources.length).toBe(1);
|
||||
const result = await processSource(sources[0]);
|
||||
expect(result.path).toBe('harvest');
|
||||
expect(result.envelope?.triplets).toEqual([{ subject: '甲', predicate: '連到', object: '乙', predicate_embed: true }]);
|
||||
expect(result.envelope?.extractor.model).toBe('local-harvest');
|
||||
});
|
||||
|
||||
it('採不到三元組 + 無萃取模型 → skipped(不假萃)', async () => {
|
||||
const sources = await pullRepoMarkdown(mockFetcher({ 'plain.md': '# 純文字\n沒有三元組。' }), 'o', 'r');
|
||||
const result = await processSource(sources[0]);
|
||||
expect(result.path).toBe('skipped');
|
||||
expect(result.envelope).toBeNull();
|
||||
});
|
||||
|
||||
it('採不到 → fallback extract(路徑 B)', async () => {
|
||||
const caller: LlmCaller = {
|
||||
model: 'haiku',
|
||||
call: async () => JSON.stringify({ nodes: [{ name: '甲' }], triplets: [{ subject: '甲', predicate: '是', object: '乙' }] }),
|
||||
};
|
||||
const sources = await pullRepoMarkdown(mockFetcher({ 'plain.md': '# 純文字\n甲是乙。' }), 'o', 'r');
|
||||
const result = await processSource(sources[0], { shallowCaller: caller });
|
||||
expect(result.path).toBe('extract');
|
||||
expect(result.envelope?.extractor.model).toBe('haiku');
|
||||
});
|
||||
});
|
||||
@@ -0,0 +1,45 @@
|
||||
import { describe, it, expect } from 'vitest';
|
||||
import { weave, flattenForPost, type RepoEnvelopes } from '../src/lib/weave';
|
||||
import type { Envelope } from '../src/types';
|
||||
|
||||
function env(uri: string, nodes: string[], triplets: Array<[string, string, string]>): Envelope {
|
||||
return {
|
||||
source: { uri, content_hash: uri },
|
||||
extractor: { model: 'local-harvest', tier: 'shallow' },
|
||||
nodes: nodes.map((n) => ({ name: n, embed: true })),
|
||||
triplets: triplets.map(([s, p, o]) => ({ subject: s, predicate: p, object: o })),
|
||||
};
|
||||
}
|
||||
|
||||
const repos: RepoEnvelopes[] = [
|
||||
{ repo: 'o/repoA', envelopes: [env('github:o/repoA@x.md', ['Arcrun', '餵食器'], [['Arcrun', '包含', '餵食器']])] },
|
||||
{ repo: 'o/repoB', envelopes: [env('github:o/repoB@y.md', ['Arcrun', '圖層'], [['Arcrun', '依賴', '圖層']])] },
|
||||
];
|
||||
|
||||
describe('weave', () => {
|
||||
it('偵測跨庫橋(同名節點跨 ≥2 repo)', () => {
|
||||
const r = weave(repos);
|
||||
const bridge = r.bridges.find((b) => b.node === 'Arcrun');
|
||||
expect(bridge?.repos).toEqual(['o/repoA', 'o/repoB']);
|
||||
expect(r.totalTriplets).toBe(2);
|
||||
});
|
||||
|
||||
it('偵測跨庫異見(同 s/o 對、不同謂詞跨 repo)', () => {
|
||||
const diverge: RepoEnvelopes[] = [
|
||||
{ repo: 'o/repoA', envelopes: [env('github:o/repoA@x.md', ['X', 'Y'], [['X', '支持', 'Y']])] },
|
||||
{ repo: 'o/repoB', envelopes: [env('github:o/repoB@y.md', ['X', 'Y'], [['X', '反對', 'Y']])] },
|
||||
];
|
||||
const r = weave(diverge);
|
||||
expect(r.divergences.length).toBe(1);
|
||||
expect(r.divergences[0].predicatesByRepo.map((p) => p.predicate).sort()).toEqual(['反對', '支持']);
|
||||
});
|
||||
|
||||
it('flattenForPost 攤平所有 envelope(順序穩定)', () => {
|
||||
expect(flattenForPost(repos).length).toBe(2);
|
||||
});
|
||||
|
||||
it('ingest 不算 bridge_score(橋只標 repos,無分數欄位)', () => {
|
||||
const r = weave(repos);
|
||||
expect(r.bridges[0]).not.toHaveProperty('bridge_score');
|
||||
});
|
||||
});
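The bridge test above expects a node that appears in two or more repos to be reported together with the repos that contain it, and without any score. Since `weave.ts` is not part of this section, the following is only a sketch of that detection rule; `detectBridges`, `RepoNodes`, and `Bridge` are hypothetical names.

```typescript
interface RepoNodes {
  repo: string;
  nodes: string[];
}

interface Bridge {
  node: string;
  repos: string[];
}

// Hypothetical sketch: a bridge is a node name seen in at least two repos.
function detectBridges(input: RepoNodes[]): Bridge[] {
  const seen = new Map<string, Set<string>>();
  for (const { repo, nodes } of input) {
    for (const name of nodes) {
      const repos = seen.get(name) ?? new Set<string>();
      repos.add(repo);
      seen.set(name, repos);
    }
  }
  return Array.from(seen.entries())
    .filter(([, repos]) => repos.size >= 2)
    .map(([node, repos]) => ({ node, repos: Array.from(repos).sort() }));
}

const bridges = detectBridges([
  { repo: 'o/repoA', nodes: ['Arcrun', '餵食器'] },
  { repo: 'o/repoB', nodes: ['Arcrun', '圖層'] },
]);
```

The result lists only which repos share the node. No `bridge_score` is computed, in line with the rule that scoring belongs to the graph side.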
|
||||
@@ -0,0 +1,16 @@
|
||||
{
|
||||
"compilerOptions": {
|
||||
"target": "ESNext",
|
||||
"module": "ESNext",
|
||||
"moduleResolution": "bundler",
|
||||
"strict": true,
|
||||
"esModuleInterop": true,
|
||||
"skipLibCheck": true,
|
||||
"forceConsistentCasingInFileNames": true,
|
||||
"outDir": "dist",
|
||||
"rootDir": "src",
|
||||
"types": ["@cloudflare/workers-types"]
|
||||
},
|
||||
"include": ["src/**/*.ts"],
|
||||
"exclude": ["node_modules", "dist", "tests"]
|
||||
}
|
||||
@@ -0,0 +1,9 @@
|
||||
import { defineConfig } from 'vitest/config';
|
||||
|
||||
// ingest 純餵食器:不綁 D1/Vectorize/AI。測試走純 node + mock(fetch / graph client)。
|
||||
export default defineConfig({
|
||||
test: {
|
||||
environment: 'node',
|
||||
include: ['tests/**/*.test.ts'],
|
||||
},
|
||||
});
|
||||
@@ -0,0 +1,24 @@
|
||||
name = "kbdb-ingest-plugin"
|
||||
main = "src/index.ts"
|
||||
compatibility_date = "2025-02-19"
|
||||
compatibility_flags = ["nodejs_compat"]
|
||||
workers_dev = true
|
||||
|
||||
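# The KBDB-ingest plugin is a pure feeder: pull from GitHub + harvest/extract + cross-repo weaving → POST envelopes to graph.
# Hard rule: never touch storage (no D1/Vectorize/AI bindings; those belong to base/graph, and ingest never connects to them directly).
# Deploy with wrangler, bypassing GitHub Actions (a lesson learned from being flagged).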
|
||||
|
||||
[vars]
|
||||
ENVIRONMENT = "development"
|
||||
# Base URL of the graph plugin's write endpoint (POST {GRAPH_BASE_URL}/triplets/ingest).
# Set it before deploying, either with `wrangler secret put` or here, e.g. https://kbdb-graph.<acct>.workers.dev
|
||||
GRAPH_BASE_URL = ""
|
||||
# 萃取(路徑 B)預設模型意圖。"shallow"=Haiku/Workers AI;"deep"=Claude via CC。
|
||||
DEFAULT_EXTRACT_TIER = "shallow"
|
||||
|
||||
[alias]
|
||||
"zod/v3" = "zod"
|
||||
"zod/v4" = "zod"
|
||||
"zod/v4-mini" = "zod"
|
||||
|
||||
# GITHUB_TOKEN / GRAPH_INTERNAL_TOKEN / ANTHROPIC 等機敏值走 `wrangler secret put`,不寫這裡。
|
||||