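feat(ingest): T0.5–T5 pure-feeder pipeline implementation (issue #2)

Full ingest pipeline (harvest first, extract as fallback, cross-repo weaving, POST envelope):
- T0.5 skeleton: Hono + zod-openapi, no D1/Vectorize/AI bindings (the never-touch-storage iron rule)
- T1 SourceAdapter: pull through the GitHub runtime API + per-file sha256 content-hash + the /refresh receiving endpoint
- T2 harvest (path A, preferred): harvest template 1.8.0+ cards (gloss / entities / typed-edge)
- T3 extract (path B, fallback): LlmCaller with a selectable model + JSON-fail escalation gate + hard endpoint-alignment self-check guardrail; v1 does not embed (it only tags)
- T4 cross-repo weaving (the main job): aggregate multiple repos → detect cross-repo bridges and dissent; does not compute bridge_score (graph's domain)
- T5 output: buildEnvelope strict + explicit forbidden-field self-check; graph-client is POST-only (cherry-picked from _kbdb_client.py and changed so it never touches base); thin ops CLI (no query MCP)

The envelope is aligned with the full contract (embed/id/aliases/predicate_embed); the contract's vectorization fields are promoted in the same commit.

gate: vitest 28 passed / tsc clean / wrangler dry-run clean (env-var bindings only).
End-to-end ingest→graph: the graph receiver is already aligned → pending ingest deployment + GRAPH_BASE_URL → pending post-deployment verification; not reported as a false green.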

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:40:53 +08:00
parent dffefdcdc2
commit 16ad1cb208
24 changed files with 4003 additions and 28 deletions
+26 -5
@@ -60,7 +60,7 @@
     },
     "nodes": {
       "type": "array",
-      "description": "Node-level auxiliary information (optional). entity_type and gloss are NODE attributes, not EDGE attributes → they go here, not in triplets. graph uses gloss for embedding (one sentence per node, not a bare word) and entity_type for typing.",
+      "description": "Node-level auxiliary information. [Vectorization division of labor (leo 2026-06-26, ingest#1 promoted to contract)] Here ingest TAGS which tokens to vectorize and what to embed; the base/KBDB embed module READS THE TAGS and performs the actual embedding; ingest itself computes no vectors. Both kinds of node (entity entries / wikilink cards) go into nodes[]; for predicate vectors see triplets[].predicate_vector.",
       "items": {
         "type": "object",
         "required": ["name"],
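@@ -69,11 +69,26 @@
           "name": {
             "type": "string",
             "minLength": 1,
-            "description": "Node name (must match the literal subject/object of some triplet)."
+            "description": "Node name (must match the literal subject/object of some triplet). Entity entry = canonical name; wikilink card = card title."
+          },
+          "id": {
+            "type": "string",
+            "description": "Dedup key. A wikilink card uses its FILE NAME → one card, one node; even when several edges point at it, it is embedded once and not repeated per occurrence. An entity entry uses its canonical name. Optional (falls back to deduping by name)."
           },
           "gloss": {
             "type": "string",
-            "description": "One-sentence description, used for embedding. Example: 'Graph RAG: a RAG variant that retrieves by relation traversal and preserves dissent'. Optional (best produced by the deep tier)."
+            "description": "One-sentence description. base embed embeds NAME + GLOSS TOGETHER (entity synonyms can differ too much on the surface; the description pulls them closer). Optional (best produced by the deep tier)."
+          },
+          "aliases": {
+            "type": "array",
+            "items": { "type": "string" },
+            "description": "Synonyms (e.g. '黃仁勳' / 'Jensen Huang'). base collapses them into a single node. Optional."
+          },
+          "embed": {
+            "type": "boolean",
+            "default": true,
+            "description": "[Vectorization tag] Whether this node goes into the vector store. true = base reads the tag and embeds it (name + gloss); false = base ignores it (structural symbols and prose should not be in nodes[] at all; if one slips in, tag it false). Default true (entity entries and wikilink cards both need it).",
+            "$comment": "ingest tags, base reads the tag and executes. Embedding belongs to the base embed module; ingest computes no vectors."
           },
           "entity_type": {
             "type": "string",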
@@ -86,7 +101,7 @@
     "triplets": {
       "type": "array",
       "minItems": 1,
-      "description": "Edges (relations). ingest produces only the raw (s,p,o) + confidence.",
+      "description": "Edges (relations). ingest produces only the raw (s,p,o) + confidence + the predicate-vector tag. Endpoints (s/o) match nodes[].name literally.",
       "items": {
         "type": "object",
         "required": ["subject", "predicate", "object"],
@@ -95,10 +110,16 @@
           "subject": { "type": "string", "minLength": 1, "description": "Subject (entity name; must match nodes[].name when nodes are provided)" },
           "predicate": { "type": "string", "minLength": 1, "description": "Predicate (relation)" },
           "object": { "type": "string", "minLength": 1, "description": "Object (target entity or value)" },
+          "predicate_embed": {
+            "type": "boolean",
+            "default": true,
+            "description": "[Predicate vectorization tag] Whether the predicate should be embedded. base reads the tag → embeds the BARE PREDICATE, NO DESCRIPTION (predicate synonyms are already close on the surface, e.g. '參考' / '參照', so embedding the bare word clusters them automatically) and stores it as the edge's predicate_vector. To support 'relation filter' queries (a search for '參考' should not miss '參照') → default true. Embedding belongs to base; ingest only tags.",
+            "$comment": "ingest tags; base reads the tag and performs the embed."
+          },
           "confidence":{ "type": "number", "minimum": 0, "maximum": 1, "default": 1.0, "description": "Extraction confidence. A shallow extraction may attach a self-assessment; graph does not filter on it, only records it." }
         }
       }
     }
   },
-  "$comment": "Forbidden fields (graph's domain; ingest must never send them): id / clusters / bridge_score / created_at / updated_at / and subject_entity_type|object_entity_type on a triplet (types travel only via nodes[]). Sending them violates the ingest = pure feeder boundary; graph should reject or ignore them."
+  "$comment": "Forbidden fields (graph's domain; ingest must never send them): id (the node dedup-key id is the exception: it is a dedup key supplied by ingest, not a record id) / clusters / bridge_score / created_at / updated_at / and subject_entity_type|object_entity_type on a triplet (types travel only via nodes[]). [Vectorization division of labor] ingest tags (embed/predicate_embed, plus gloss/aliases); the base/KBDB embed module reads the tags and performs the embedding; ingest computes no vectors. Structural symbols (>>/←) and human-facing prose (## 摘要) do not enter the envelope."
 }
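The dedup rule described for `nodes[].id` (one card, one node; fall back to `name` when `id` is absent) could be sketched roughly as follows. This is only an illustration under assumptions: `SketchNode`, `dedupeNodes`, and the sample file name `graph-rag.md` are hypothetical, and this commit does not show where deduplication actually runs.

```typescript
// Hypothetical sketch: SketchNode and dedupeNodes are illustrative names,
// not part of the contract, ingest, or base.
interface SketchNode {
  name: string;
  id?: string;
  gloss?: string;
  aliases?: string[];
  embed?: boolean;
}

// Keep the first node seen for each dedup key (id when present, otherwise name).
function dedupeNodes(nodes: SketchNode[]): SketchNode[] {
  const seen = new Map<string, SketchNode>();
  for (const node of nodes) {
    const key = node.id ?? node.name;
    if (!seen.has(key)) seen.set(key, node);
  }
  return Array.from(seen.values());
}

const sampleNodes: SketchNode[] = [
  { name: "Graph RAG", id: "graph-rag.md", gloss: "RAG variant that retrieves by relation traversal" },
  { name: "Graph RAG", id: "graph-rag.md" }, // same card, referenced by a second edge
  { name: "Jensen Huang", aliases: ["黃仁勳"] }, // entity entry: no id, deduped by name
];

const uniqueNodes = dedupeNodes(sampleNodes);
console.log(uniqueNodes.map((n) => n.id ?? n.name));
```

Keeping the first occurrence is an arbitrary choice for the sketch; a real implementation might instead merge gloss and aliases across duplicates.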
+26 -23
@@ -2,44 +2,47 @@
 > The single source of progress. Status: [ ] not started  [🔄] in progress  [x] done  [⏸] blocked
 > Cross-project blueprint: InkStoneCo `docs/3-specs/mira-dissolve/`.
+> Implementation branch: `claude/ingest-t1-t5-implementation` (vitest 28 passed / tsc clean / dry-run clean).
-## T0 repo skeleton (this round)
+## T0 repo skeleton
 - [x] 0.1 Create public repo `uncle6me-web/kbdb-ingest-plugin`
 - [x] 0.2 CLAUDE.md (upstream pointers + ingest iron rules) + README + .gitignore
 - [x] 0.3 `contracts/ingest-candidate.json` (copied from the top-level SDD; frozen contract)
-- [x] 0.4 SDD three-part skeleton
+- [x] 0.4 SDD three-part skeleton (`docs/3-specs/ingest-pipeline/`)
-- [ ] 0.5 package.json / tsconfig / wrangler.toml (modeled on kbdb-graph-plugin)
+- [x] 0.5 package.json / tsconfig / wrangler.toml / vitest.config (modeled on kbdb-graph-plugin; Hono + zod-openapi, no D1/Vectorize/AI bindings)
-## T1 SourceAdapter (R1)
+## T1 SourceAdapter (R1): `src/lib/source-adapter.ts`
-- [ ] 1.1 Pull the repo from GitHub (runtime API/clone, not Actions)
+- [x] 1.1 Pull the repo from GitHub (runtime git/trees + contents API, not Actions); GitHubFetcher interface (tests use a mock)
-- [ ] 1.2 content-hash (per-file); source.uri = github:owner/repo@path
+- [x] 1.2 content-hash (per-file sha256); source.uri = github:owner/repo@path; makeSourceUri/parseSourceUri round-trip
-- [ ] 1.3 Interface triggered by a forwarded KBDB MCP `refresh`
+- [x] 1.3 Receiving endpoint triggered when graph forwards `POST /graph/refresh`: `POST /refresh` (`src/index.ts`; passive forwarding, no scheduler)
-## T2 harvest (R2, path A preferred)
+## T2 harvest (R2, path A preferred): `src/lib/harvest.ts`
-- [ ] 2.1 Triplets + gloss already built by local CC (repos that use system-dev-template)
+- [x] 2.1 Harvest triplets + gloss already built by local CC (template 1.8.0+ format: frontmatter gloss, `## 實體`, `## 關聯` typed-edge; card-to-card and in-text endpoints handled separately)
-- [ ] 2.2 cherry-pick `polaris/mira/tools/_kbdb_client.py` → rework into a pure feeder (POST envelope, no writes to KBDB)
+- [x] 2.2 cherry-pick `_kbdb_client.py` → rework into a pure feeder `src/lib/graph-client.ts` (POST envelope, **no writes to KBDB/base**)
-## T3 extract (R3, path B fallback)
+## T3 extract (R3, path B fallback): `src/lib/extract.ts`
-- [ ] 3.1 cherry-pick `wiki_synthesis.yaml` classify / the two skill blocks
+- [x] 3.1 cherry-pick the `wiki_synthesis.yaml` classify pattern → extract prompt (JSON nodes[]+triplets[])
-- [ ] 3.2 User-selectable model + quality-threshold allowlist (default Haiku; deep extraction with Claude via CC)
+- [x] 3.2 User-selectable model (by intent, not model name; LlmCaller interface; default shallow/Haiku, deep/Claude via CC)
-- [ ] 3.3 Model test set (Chinese + human-implication samples, converted to regression tests): deferred, run the defaults first
+- [ ] 3.3 Model test set (Chinese + human-implication samples, converted to regression tests): **deferred** (run the defaults first; the guardrail + parse already have unit tests)
-- [ ] 3.4 JSON-fail escalation gate (shallow failure escalates to deep)
+- [x] 3.4 JSON-fail escalation gate (shallow fail / too sparse → escalate to deep once)
-- [ ] 3.5 v1 does not embed (embedding waits for base vectorize, InkStoneCo T2.4)
+- [x] 3.5 v1 does not embed (it still TAGS embed/predicate_embed for base to read later; the embed action waits for Arcrun #7)
+- [x] 3.x Hard endpoint-alignment self-check guardrail (`src/lib/endpoint-check.ts`; leo stress test 14→0; self-check + autoAlign fill-in)
-## T4 cross-repo weaving (R4, the main job)
+## T4 cross-repo weaving (R4, the main job): `src/lib/weave.ts`
-- [ ] 4.1 Aggregate triplets from multiple repos
+- [x] 4.1 Aggregate triplets from multiple repos → detect cross-repo bridges (same-named node across ≥2 repos) + dissent (same s/o pair, different predicates); **does not compute bridge_score** (graph's domain, forbidden to send)
 ## T5 output + CLI (R5/R6)
-- [ ] 5.1 POST envelope to graph `POST /triplets/ingest` (strictly contract-compliant) ⏸ waiting on the graph write endpoint (InkStoneCo T3.3)
+- [x] 5.1 POST envelope to graph `POST /triplets/ingest` (strictly contract-compliant; buildEnvelope strict + explicit forbidden-field self-check catches problems early). Aligned with the FULL CONTRACT (including embed/id/aliases/predicate_embed; the coordinator ruled that ingest does not back off)
-- [ ] 5.2 Thin ops CLI (manual re-extraction); no query MCP
+- [x] 5.2 Thin ops CLI (`scripts/ingest-cli.mjs`: refresh through the Worker / pull dry-run); **no query MCP**
-## Blockers
+## Blockers / honest markers
-1. ⏸ T5.1 depends on graph `POST /triplets/ingest` (InkStoneCo T3, pending implementation in the graph repo)
+1. ⏸ **End-to-end ingest→graph working**: the graph receiver is now aligned with the full contract → what remains is ingest deployment + setting `GRAPH_BASE_URL` → **pending post-deployment verification**; not reported as a false green
-2. ⏸ embed depends on base vectorize (InkStoneCo T2.4). v1 without embed can start first.
+2. ⏸ embed depends on base vectorize (Arcrun #7). v1 without embed (tags only) is already under way.
+3. T3.3 model test set deferred; refresh-side extract (Workers AI) only takes the harvest path in v1, deep extraction is left to CLI/CC.
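Task 4.1 above describes two detections: cross-repo bridges (the same node name in at least two repos) and dissent (the same subject/object pair with different predicates). A minimal sketch of those two rules is below. `src/lib/weave.ts` is not shown in this view, so every name here (`RepoTriplet`, `findBridges`, `findDissent`) is hypothetical, and triplet endpoints stand in for node names as a simplifying assumption.

```typescript
// Hypothetical sketch; the real src/lib/weave.ts is not shown and may differ.
interface RepoTriplet {
  repo: string; // e.g. "owner/repo"
  subject: string;
  predicate: string;
  object: string;
}

interface Dissent {
  subject: string;
  object: string;
  predicates: string[];
}

// Bridge: an endpoint name that appears in at least two different repos.
function findBridges(triplets: RepoTriplet[]): string[] {
  const reposByName = new Map<string, Set<string>>();
  for (const t of triplets) {
    for (const endpoint of [t.subject, t.object]) {
      const repos = reposByName.get(endpoint) ?? new Set<string>();
      repos.add(t.repo);
      reposByName.set(endpoint, repos);
    }
  }
  const bridges: string[] = [];
  reposByName.forEach((repos, endpoint) => {
    if (repos.size >= 2) bridges.push(endpoint);
  });
  return bridges.sort();
}

// Dissent: the same (subject, object) pair carrying more than one distinct predicate.
function findDissent(triplets: RepoTriplet[]): Dissent[] {
  const byPair = new Map<string, { subject: string; object: string; predicates: Set<string> }>();
  for (const t of triplets) {
    const key = JSON.stringify([t.subject, t.object]);
    const entry = byPair.get(key) ?? { subject: t.subject, object: t.object, predicates: new Set<string>() };
    entry.predicates.add(t.predicate);
    byPair.set(key, entry);
  }
  const found: Dissent[] = [];
  byPair.forEach((entry) => {
    if (entry.predicates.size > 1) {
      found.push({ subject: entry.subject, object: entry.object, predicates: Array.from(entry.predicates).sort() });
    }
  });
  return found;
}

const pooled: RepoTriplet[] = [
  { repo: "a/notes", subject: "Graph RAG", predicate: "builds on", object: "RAG" },
  { repo: "b/wiki", subject: "Graph RAG", predicate: "refutes", object: "RAG" },
  { repo: "b/wiki", subject: "RAG", predicate: "uses", object: "embeddings" },
];

const bridges = findBridges(pooled);
const dissent = findDissent(pooled);
console.log(bridges, dissent);
```

Note that the sketch only detects; consistent with the task list, it computes no `bridge_score`, which stays in graph's domain.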
+2673
File diff suppressed because it is too large
+25
@@ -0,0 +1,25 @@
{
"name": "kbdb-ingest-plugin",
"version": "0.1.0",
"private": true,
"description": "KBDB-ingest plugin: a pure feeder. Pulls from GitHub, harvests/extracts triplet candidates, weaves across repos, then POSTs envelopes to kbdb-graph-plugin. Never touches storage.",
"type": "module",
"scripts": {
"dev": "wrangler dev",
"deploy": "wrangler deploy",
"test": "vitest run",
"test:watch": "vitest",
"ingest": "node scripts/ingest-cli.mjs"
},
"dependencies": {
"@hono/zod-openapi": "^1.2.4",
"hono": "^4.7.0",
"zod": "^4.3.6"
},
"devDependencies": {
"@cloudflare/workers-types": "^4.20250219.0",
"typescript": "^5.7.0",
"vitest": "^3.1.0",
"wrangler": "^4.0.0"
}
}
+117
@@ -0,0 +1,117 @@
#!/usr/bin/env node
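// Thin ops CLI (T5.2): a human manually triggers re-extraction. No query MCP (nobody "asks" an ambient feeder).
//
// Two modes:
//   ingest refresh <github:owner/repo@path>   re-extract a single source through the deployed Worker's /refresh
//   ingest pull <owner/repo> [root]           local dry-run: pull and list the envelopes that would be sent (no POST)
//
// Configuration comes from env:
//   KBDB_INGEST_URL   base URL of the deployed ingest Worker (used by refresh mode)
//   GRAPH_BASE_URL    graph write endpoint (used by pull --post)
//   GITHUB_TOKEN      for pulling private repos (may be empty for public repos)
//
// Iron rule: the CLI never touches storage; refresh goes through the Worker, pull --post through the graph write endpoint. Trigger = a human, manually (no scheduler).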
import process from 'node:process';
const [, , cmd, arg, arg2] = process.argv;
async function sha256hex(text) {
const data = new TextEncoder().encode(text);
const digest = await crypto.subtle.digest('SHA-256', data);
return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}
function ghHeaders() {
const h = { Accept: 'application/vnd.github+json', 'User-Agent': 'kbdb-ingest-cli' };
if (process.env.GITHUB_TOKEN) h.Authorization = `Bearer ${process.env.GITHUB_TOKEN}`;
return h;
}
async function ghGetFile(owner, repo, path) {
const url = `https://api.github.com/repos/${owner}/${repo}/contents/${path}`;
const res = await fetch(url, { headers: ghHeaders() });
if (!res.ok) throw new Error(`github ${owner}/${repo}@${path}: ${res.status}`);
const body = await res.json();
const text = body.encoding === 'base64' ? Buffer.from(body.content, 'base64').toString('utf-8') : body.content;
return { text, commit: body.sha };
}
async function ghListMarkdown(owner, repo, root = '') {
const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/git/trees/HEAD?recursive=1`, { headers: ghHeaders() });
if (!res.ok) throw new Error(`github list ${owner}/${repo}: ${res.status}`);
const body = await res.json();
const prefix = root.replace(/^\/+|\/+$/g, '');
return (body.tree || [])
.filter((e) => e.type === 'blob' && e.path.endsWith('.md'))
.map((e) => e.path)
.filter((p) => (prefix ? p === prefix || p.startsWith(prefix + '/') : true));
}
// Minimal harvest (mirrors src/lib/harvest.ts; used by the CLI dry-run, without importing the TS).
function harvest(md) {
const fm = /^---\n([\s\S]*?)\n---\n?([\s\S]*)$/.exec(md);
const body = fm ? fm[2] : md;
const gloss = fm && /^gloss:\s*(.+)$/m.exec(fm[1]) ? /^gloss:\s*(.+)$/m.exec(fm[1])[1].trim() : undefined;
const title = /^#\s+(.+)$/m.exec(body)?.[1]?.trim();
const sec = (h) => new RegExp(`^##\\s+${h}[^\\n]*\\n([\\s\\S]*?)(?=\\n##\\s|$)`, 'm').exec(body)?.[1] || '';
const nodes = [];
if (title) nodes.push({ name: title, gloss, embed: true });
for (const line of sec('實體').split('\n')) {
const m = /^-\s*\*\*(.+?)\*\*\s*(?:((.+?)))?\s*(?:[—-]\s*(.+))?$/.exec(line.trim());
if (m) nodes.push({ name: m[1].trim(), gloss: m[3]?.trim() || undefined, embed: true });
}
const triplets = [];
for (const line of sec('關聯').split('\n')) {
const m = /^(.+?)\s*>>\s*(.+?)\s*>>\s*(.+?)$/.exec(line.replace(/^-\s*/, '').trim());
if (m) {
const clean = (s) => s.replace(/\[\[|\]\]|\*\*/g, '').trim();
triplets.push({ subject: clean(m[1]), predicate: m[2].trim(), object: clean(m[3]), predicate_embed: true });
}
}
return { nodes, triplets };
}
async function doRefresh(uri) {
const base = process.env.KBDB_INGEST_URL;
if (!base) throw new Error('KBDB_INGEST_URL 未設(指向已部署的 ingest Worker)');
const res = await fetch(base.replace(/\/$/, '') + '/refresh', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ uri }),
});
console.log(JSON.stringify(await res.json(), null, 2));
}
async function doPull(ownerRepo, root) {
const [owner, repo] = ownerRepo.split('/');
if (!owner || !repo) throw new Error('用法:ingest pull <owner/repo> [root]');
const paths = await ghListMarkdown(owner, repo, root || '');
console.error(`[ingest] ${owner}/${repo}: ${paths.length} 個 MD`);
const envelopes = [];
for (const path of paths) {
const { text, commit } = await ghGetFile(owner, repo, path);
const { nodes, triplets } = harvest(text);
if (!triplets.length) continue; // nothing harvested (not a template card) → skipped in the dry-run (the CLI does no extract)
envelopes.push({
source: { uri: `github:${owner}/${repo}@${path}`, content_hash: await sha256hex(text), commit },
extractor: { model: 'local-harvest', tier: 'shallow' },
nodes,
triplets,
});
}
console.error(`[ingest] 採取出 ${envelopes.length} 個 envelope(共 ${envelopes.reduce((n, e) => n + e.triplets.length, 0)} 三元組)`);
console.log(JSON.stringify(envelopes, null, 2));
}
try {
if (cmd === 'refresh' && arg) await doRefresh(arg);
else if (cmd === 'pull' && arg) await doPull(arg, arg2);
else {
console.error('用法:\n ingest refresh <github:owner/repo@path>\n ingest pull <owner/repo> [root]');
process.exit(2);
}
} catch (e) {
console.error('[ingest] 錯誤:', e.message);
process.exit(1);
}
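The `root` filtering inside `ghListMarkdown` is easy to misread, so here it is lifted into a pure function as a sketch. `filterByRoot` and the sample paths are hypothetical; the filter expressions themselves are the ones used in the CLI above.

```typescript
// Hypothetical helper: the same filter logic as ghListMarkdown, without the network call.
function filterByRoot(paths: string[], root = ""): string[] {
  // Trim leading/trailing slashes so "/docs/" and "docs" behave the same.
  const prefix = root.replace(/^\/+|\/+$/g, "");
  return paths
    .filter((p) => p.endsWith(".md"))
    // Match the root itself or anything below "root/"; "docsx/..." must not match "docs".
    .filter((p) => (prefix ? p === prefix || p.startsWith(prefix + "/") : true));
}

const samplePaths = ["README.md", "docs/a.md", "docs/sub/b.md", "docsx/c.md", "docs/logo.png"];

const underDocs = filterByRoot(samplePaths, "/docs/");
const everything = filterByRoot(samplePaths);
console.log(underDocs, everything);
```

Appending `/` before the prefix comparison is what keeps a sibling directory such as `docsx/` out of a `docs` pull.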
+87
@@ -0,0 +1,87 @@
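// KBDB-ingest plugin Worker entry point: a pure feeder.
//
// Iron rule: never touch storage (no D1/Vectorize/AI bindings). Only POST envelopes to the graph write endpoint.
// Endpoint: /refresh = the receiving end for requests forwarded from graph's POST /graph/refresh (human-initiated, not automatic fan-out).
// refresh receives {uri, owner_id} → pulls that source → harvests/extracts → POSTs the envelope to graph.
// No query MCP (ambient feeder); ops go through the thin CLI (scripts/ingest-cli.mjs).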
import { OpenAPIHono, createRoute, z } from '@hono/zod-openapi';
import { cors } from 'hono/cors';
import type { Bindings, Variables } from './types';
import { makeGitHubFetcher, parseSourceUri, contentHash, makeSourceUri } from './lib/source-adapter';
import { processSource } from './lib/pipeline';
import { makeGraphClient } from './lib/graph-client';
const app = new OpenAPIHono<{ Bindings: Bindings; Variables: Variables }>();
app.onError((err, c) => {
console.error(err);
return c.json({ error: 'Internal Server Error', message: err.message }, 500);
});
app.use('*', cors({ origin: '*', allowHeaders: ['Content-Type', 'Authorization'], allowMethods: ['GET', 'POST', 'OPTIONS'] }));
app.get('/', (c) => c.json({ service: 'kbdb-ingest', tier: 'plugin', role: 'feeder', status: 'ok' }));
app.get('/health', (c) =>
c.json({ service: 'kbdb-ingest', status: 'ok', graph_url_set: Boolean(c.env.GRAPH_BASE_URL) }),
);
// POST /refresh: graph forwards a re-extraction request for one source. Passive: one call in → one run (no scheduler/webhook).
const refreshRoute = createRoute({
method: 'post',
path: '/refresh',
request: {
body: {
content: {
'application/json': {
schema: z.object({
uri: z.string().min(1).describe("github:owner/repo@path"),
owner_id: z.string().optional(),
}),
},
},
},
},
responses: {
200: { description: 'Refreshed: pulled, harvested/extracted, posted envelope to graph' },
400: { description: 'Bad uri' },
},
tags: ['Ingest'],
});
app.openapi(refreshRoute, async (c) => {
const { uri } = c.req.valid('json');
const parsed = parseSourceUri(uri);
if (!parsed) return c.json({ error: 'uri 須為 github:owner/repo@path' }, 400);
const fetcher = makeGitHubFetcher(c.env.GITHUB_TOKEN);
const { text, commit } = await fetcher.getFile(parsed.owner, parsed.repo, parsed.path);
const file = {
uri: makeSourceUri(parsed.owner, parsed.repo, parsed.path),
path: parsed.path,
text,
content_hash: await contentHash(text),
commit,
};
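// In v1, refresh only takes the harvest path (path A); wiring an extract model to Workers AI in the Worker runtime comes later
// (the CLI side can go deep via CC). No triplets harvested → honestly report skipped, no fake extraction.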
const result = await processSource(file);
if (!result.envelope) {
return c.json({ refreshed: false, path: result.path, note: result.note }, 200);
}
const graph = makeGraphClient(c.env.GRAPH_BASE_URL, c.env.GRAPH_INTERNAL_TOKEN);
const post = await graph.postEnvelope(result.envelope);
return c.json(
{
refreshed: post.ok,
path: result.path,
triplets: result.envelope.triplets.length,
graph: post.ok ? post.body : { status: post.status, error: post.error, issues: (post.body as any)?.issues },
},
200,
);
});
export default app;
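The `/refresh` handler returns 400 for any `uri` that `parseSourceUri` cannot parse. The format round-trip can be sketched as below, with the two helpers re-declared locally (copied from `src/lib/source-adapter.ts` in this commit) so the snippet stands alone. The card path `docs/cards/graph-rag.md` is a made-up example.

```typescript
// Local copies of the helpers from src/lib/source-adapter.ts, for illustration only.
function makeSourceUri(owner: string, repo: string, path: string): string {
  return `github:${owner}/${repo}@${path}`;
}

function parseSourceUri(uri: string): { owner: string; repo: string; path: string } | null {
  const m = /^github:([^/]+)\/([^@]+)@(.+)$/.exec(uri);
  if (!m) return null;
  return { owner: m[1], repo: m[2], path: m[3] };
}

// Round-trip: build a uri, parse it back, and confirm nothing was lost.
const sampleUri = makeSourceUri("uncle6me-web", "kbdb-ingest-plugin", "docs/cards/graph-rag.md");
const roundTrip = parseSourceUri(sampleUri);

// A plain https URL is not in the github:owner/repo@path format, so it is rejected.
const rejected = parseSourceUri("https://github.com/uncle6me-web/kbdb-ingest-plugin");

console.log(sampleUri, roundTrip, rejected);
```

One thing worth noticing: the repo group `[^@]+` also accepts slashes, so a uri like `github:a/b/c@p` parses with repo `b/c`. Whether that should be rejected is a design decision the sketch does not take.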
+49
@@ -0,0 +1,49 @@
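// Hard endpoint-alignment self-check guardrail (leo's stress test on a real vault showed that when the rule is only written down, Haiku skips it and 14 endpoints fail to align;
// turning it into an explicit self-check action took that from 14 to 0).
//
// Rule: the subject/object of every in-text triplet must match some node name (character for character).
// A misaligned endpoint breaks the downstream graph (the endpoint cannot be matched to a node). This guardrail checks mechanically before the envelope leaves,
// collects the misaligned endpoints, and lets the caller choose to repair / drop / warn.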
import type { EnvelopeEdge, EnvelopeNode } from '../types';
export interface AlignmentReport {
aligned: boolean;
/** Descriptions of the misaligned endpoints (for humans / logs). */
unaligned: string[];
}
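/**
 * Check that every triplet endpoint matches a nodes[].name.
 * Card-to-card endpoints (`[[card]]` in the source) already had their brackets stripped in harvest → always compared by bare name.
 */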
export function checkEndpointAlignment(nodes: EnvelopeNode[], triplets: EnvelopeEdge[]): AlignmentReport {
const names = new Set(nodes.map((n) => n.name));
const unaligned: string[] = [];
for (const t of triplets) {
for (const [role, ep] of [['subject', t.subject], ['object', t.object]] as const) {
if (!names.has(ep)) {
unaligned.push(`${role}「${ep}」對不齊(${t.subject} >> ${t.predicate} >> ${t.object})`);
}
}
}
return { aligned: unaligned.length === 0, unaligned };
}
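/**
 * Auto-fill: add each misaligned endpoint to nodes[] as a new node (embed:true, no gloss).
 * More conservative than dropping the triplet: the edge is kept and downstream can still normalize. Returns the filled-in nodes.
 */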
export function autoAlignEndpoints(nodes: EnvelopeNode[], triplets: EnvelopeEdge[]): EnvelopeNode[] {
const names = new Set(nodes.map((n) => n.name));
const out = [...nodes];
for (const t of triplets) {
for (const ep of [t.subject, t.object]) {
if (!names.has(ep)) {
names.add(ep);
out.push({ name: ep, embed: true });
}
}
}
return out;
}
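To see the check-then-fill behaviour of the guardrail on concrete data, here is a self-contained sketch that mirrors the two functions above with trimmed-down types. `DemoNode`, `DemoEdge`, `misalignedEndpoints`, and `fillMissingNodes` are hypothetical stand-ins for `EnvelopeNode`, `EnvelopeEdge`, `checkEndpointAlignment`, and `autoAlignEndpoints`.

```typescript
// Hypothetical trimmed-down types; the real ones live in src/types.
interface DemoNode { name: string; embed?: boolean }
interface DemoEdge { subject: string; predicate: string; object: string }

// Collect every endpoint that does not literally match a node name.
function misalignedEndpoints(nodes: DemoNode[], triplets: DemoEdge[]): string[] {
  const names = new Set(nodes.map((n) => n.name));
  const missing: string[] = [];
  for (const t of triplets) {
    for (const endpoint of [t.subject, t.object]) {
      if (!names.has(endpoint)) missing.push(endpoint);
    }
  }
  return missing;
}

// Add each missing endpoint once as a bare node (embed:true, no gloss), keeping all edges.
function fillMissingNodes(nodes: DemoNode[], triplets: DemoEdge[]): DemoNode[] {
  const names = new Set(nodes.map((n) => n.name));
  const filled = nodes.slice();
  for (const t of triplets) {
    for (const endpoint of [t.subject, t.object]) {
      if (!names.has(endpoint)) {
        names.add(endpoint);
        filled.push({ name: endpoint, embed: true });
      }
    }
  }
  return filled;
}

const demoNodes: DemoNode[] = [{ name: "Graph RAG", embed: true }];
const demoEdges: DemoEdge[] = [
  { subject: "Graph RAG", predicate: "builds on", object: "RAG" },
  { subject: "RAG", predicate: "uses", object: "embeddings" },
];

const before = misalignedEndpoints(demoNodes, demoEdges);
const filledNodes = fillMissingNodes(demoNodes, demoEdges);
const after = misalignedEndpoints(filledNodes, demoEdges);
console.log(before, filledNodes.map((n) => n.name), after);
```

The fill step trades precision for connectivity: a misspelled endpoint becomes its own node rather than being repaired, which is why the original keeps the report available to the caller.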
+51
@@ -0,0 +1,51 @@
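// Envelope assembly + forbidden-field self-check before it leaves.
//
// One envelope = the product of one extraction of one source file (as the contract defines). After assembly, EnvelopeSchema validation runs
// (strict → extra forbidden fields throw, so they are caught early on the ingest side rather than waiting for a 422 from graph).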
import {
EnvelopeSchema,
FORBIDDEN_EDGE_KEYS,
FORBIDDEN_TOP_KEYS,
type Envelope,
type EnvelopeEdge,
type EnvelopeNode,
} from '../types';
export interface BuildEnvelopeInput {
source: { uri: string; content_hash: string; anchor?: string; commit?: string; block_id?: string };
extractor: { model: string; tier: 'shallow' | 'deep'; extracted_at?: number };
nodes?: EnvelopeNode[];
triplets: EnvelopeEdge[];
}
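/**
 * Assemble the envelope and validate it (strict).
 * - Structural symbols / prose should not get in; nodes/triplets are already filtered upstream (harvest/extract).
 * - Validation failure (extra forbidden fields, wrong shape) → throws ZodError for the caller to catch (earlier than a 422 from graph after sending).
 */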
export function buildEnvelope(input: BuildEnvelopeInput): Envelope {
// Explicit forbidden-field self-check (an extra, explicit gate on top of the strict schema: if upstream slips in a graph-domain field, fail early).
for (const n of input.nodes ?? []) {
for (const k of [...FORBIDDEN_TOP_KEYS, 'clusters']) {
if (k !== 'id' && k in (n as Record<string, unknown>)) {
throw new Error(`envelope: node「${n.name}」帶禁送欄位 ${k}(graph 領域,ingest 不可送)`);
}
}
}
for (const t of input.triplets) {
for (const k of FORBIDDEN_EDGE_KEYS) {
if (k in (t as Record<string, unknown>)) {
throw new Error(`envelope: 邊「${t.subject}>>${t.object}」帶禁送欄位 ${k}(類型只走 nodes[])`);
}
}
}
const candidate: Envelope = {
source: input.source,
extractor: input.extractor,
triplets: input.triplets,
...(input.nodes && input.nodes.length ? { nodes: input.nodes } : {}),
};
// Strict validation: the local equivalent of "forbidden field → block". Throws to the caller.
return EnvelopeSchema.parse(candidate);
}
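The explicit forbidden-field gate in `buildEnvelope` boils down to a key-membership test. A minimal sketch follows; note the assumption that the key lists match the contract's `$comment`, because the real `FORBIDDEN_TOP_KEYS` and `FORBIDDEN_EDGE_KEYS` live in `src/types`, which is not shown here. All names in the snippet are hypothetical.

```typescript
// Assumed values, taken from the contract's $comment; the real constants are not shown.
// The node-level "id" is deliberately absent: it is ingest's dedup key, not a record id.
const ASSUMED_FORBIDDEN_NODE_KEYS = ["clusters", "bridge_score", "created_at", "updated_at"];
const ASSUMED_FORBIDDEN_EDGE_KEYS = ["subject_entity_type", "object_entity_type"];

// Return the forbidden keys actually present on a record.
function forbiddenKeysIn(record: Record<string, unknown>, forbidden: string[]): string[] {
  return forbidden.filter((key) => key in record);
}

const cleanNode = { name: "Graph RAG", id: "graph-rag.md", embed: true };
const leakyNode = { name: "Graph RAG", bridge_score: 0.8 };
const leakyEdge = { subject: "Graph RAG", predicate: "builds on", object: "RAG", subject_entity_type: "concept" };

const cleanHits = forbiddenKeysIn(cleanNode, ASSUMED_FORBIDDEN_NODE_KEYS);
const nodeHits = forbiddenKeysIn(leakyNode, ASSUMED_FORBIDDEN_NODE_KEYS);
const edgeHits = forbiddenKeysIn(leakyEdge, ASSUMED_FORBIDDEN_EDGE_KEYS);
console.log(cleanHits, nodeHits, edgeHits);
```

In the original the hit throws immediately; returning the list instead, as here, would suit a caller that wants to log every violation before failing.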
+110
@@ -0,0 +1,110 @@
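// T3 extract (path B, fallback): when the raw source has no locally built triplets, ingest extracts (s,p,o) + gloss itself.
//
// The model is user-selectable (by intent, not model name): shallow = Haiku/Workers AI (default, cheap); deep = Claude via CC (deep extraction, billed to the monthly plan).
// JSON-fail escalation gate: shallow parse failure / too-sparse extraction → escalate to deep and re-extract once.
// v1 does not embed (embedding waits for base vectorize / Arcrun #7), but it still TAGS embed/predicate_embed for a future reader.
// Endpoint-alignment guardrail: after extraction, self-check with endpoint-check and auto-fill (leo stress test 14→0).
//
// The LLM call is abstracted behind the LlmCaller interface → tests use a mock: no network, no cost.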
import type { EnvelopeEdge, EnvelopeNode } from '../types';
import { autoAlignEndpoints, checkEndpointAlignment } from './endpoint-check';
export type ExtractTier = 'shallow' | 'deep';
export interface ExtractedGraph {
nodes: EnvelopeNode[];
triplets: EnvelopeEdge[];
}
/** One LLM extraction call. Returns the model's RAW TEXT (expected to be JSON); this module does the parsing. */
export interface LlmCaller {
/** model = the resolved concrete model string (recorded in extractor.model). */
readonly model: string;
call(prompt: string, text: string): Promise<string>;
}
export interface ExtractResult extends ExtractedGraph {
tier: ExtractTier;
model: string;
/** Whether the run escalated to deep because shallow failed JSON parsing or was too sparse. */
escalated: boolean;
}
const EXTRACT_PROMPT = `你是知識圖譜萃取器。讀下面的原文,萃出三元組與實體。嚴格輸出 JSON(繁體中文內容),格式:
{
"nodes": [{"name": "正規名", "gloss": "一句話定義(這個實體是什麼)", "aliases": ["同義詞"]}],
"triplets": [{"subject": "主詞", "predicate": "動詞短語", "object": "受詞", "confidence": 0.0-1.0}]
}
規則:
- 謂詞用動詞/動詞短語(如「奠基於」「反駁」),禁名詞當謂詞。
- triplet 的 subject/object 必須對得上某個 nodes[].name(一字不差)。
- 抓深層暗示,不只表面陳述。只輸出 JSON,不要其他文字。`;
/** Parse the model's JSON output (tolerates a ```json fenced block). Throws on failure. */
export function parseExtractJson(raw: string): ExtractedGraph {
const fenced = /```(?:json)?\s*([\s\S]*?)```/.exec(raw);
const jsonText = (fenced ? fenced[1] : raw).trim();
const parsed = JSON.parse(jsonText) as Partial<ExtractedGraph>;
if (!Array.isArray(parsed.triplets) || parsed.triplets.length === 0) {
throw new Error('extract: no triplets in model output');
}
const nodes: EnvelopeNode[] = (parsed.nodes ?? []).map((n) => ({
name: String(n.name),
gloss: n.gloss ? String(n.gloss) : undefined,
aliases: Array.isArray(n.aliases) ? n.aliases.map(String) : undefined,
embed: true, // tag true (base reads the tag and executes; base is not wired up in v1, but the tag still satisfies the contract)
}));
const triplets: EnvelopeEdge[] = parsed.triplets.map((t) => ({
subject: String(t.subject),
predicate: String(t.predicate),
object: String(t.object),
confidence: typeof t.confidence === 'number' ? t.confidence : undefined,
predicate_embed: true,
}));
return { nodes, triplets };
}
/** Too sparse (threshold) → treated as a failure, which triggers escalation. */
function tooSparse(g: ExtractedGraph): boolean {
return g.triplets.length < 1;
}
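/**
 * extract: shallow-extract with shallowCaller first; on JSON failure or a too-sparse result → escalate once to deepCaller if one is given.
 * After extraction, run the endpoint-alignment guardrail and auto-fill. Omitting deepCaller = no escalation (shallow only).
 */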
export async function extract(
text: string,
shallowCaller: LlmCaller,
deepCaller?: LlmCaller,
): Promise<ExtractResult> {
let tier: ExtractTier = 'shallow';
let model = shallowCaller.model;
let graph: ExtractedGraph | null = null;
let escalated = false;
try {
graph = parseExtractJson(await shallowCaller.call(EXTRACT_PROMPT, text));
if (tooSparse(graph)) throw new Error('extract: shallow too sparse');
} catch {
graph = null;
}
if (!graph && deepCaller) {
escalated = true;
tier = 'deep';
model = deepCaller.model;
graph = parseExtractJson(await deepCaller.call(EXTRACT_PROMPT, text)); // if deep fails too, it throws to the caller
}
if (!graph) throw new Error('extract: shallow failed and no deep caller to escalate');
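// Endpoint-alignment guardrail (required after leo's stress test): self-check + auto-fill (keep the edges, drop nothing).
const aligned = autoAlignEndpoints(graph.nodes, graph.triplets);
const report = checkEndpointAlignment(aligned, graph.triplets);
// After auto-fill everything should align; anything left (in theory nothing) is left to the caller without blocking.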
void report;
return { nodes: aligned, triplets: graph.triplets, tier, model, escalated };
}
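The JSON-fail escalation gate can be exercised without any model by passing stub callers. The sketch below is a simplified, synchronous stand-in for `extract`: `StubCaller`, `parseTriplets`, and `extractWithEscalation` are hypothetical names, and the real function is async, also parses nodes, and runs the alignment guardrail afterwards.

```typescript
// Hypothetical synchronous stand-in for LlmCaller; the real interface returns a Promise.
interface StubCaller {
  model: string;
  call(text: string): string;
}

interface StubTriplet { subject: string; predicate: string; object: string }

// Parse model output; throw when it is not JSON or carries no triplets.
function parseTriplets(rawOutput: string): StubTriplet[] {
  const parsed = JSON.parse(rawOutput) as { triplets?: unknown };
  if (!Array.isArray(parsed.triplets) || parsed.triplets.length === 0) {
    throw new Error("no triplets in model output");
  }
  return parsed.triplets as StubTriplet[];
}

// Try shallow first; on any failure escalate to deep exactly once, if a deep caller exists.
function extractWithEscalation(text: string, shallow: StubCaller, deep?: StubCaller) {
  try {
    return { model: shallow.model, escalated: false, triplets: parseTriplets(shallow.call(text)) };
  } catch {
    if (!deep) throw new Error("shallow failed and no deep caller to escalate");
    // A deep failure is not caught here: it propagates to the caller.
    return { model: deep.model, escalated: true, triplets: parseTriplets(deep.call(text)) };
  }
}

const shallowStub: StubCaller = { model: "shallow-stub", call: () => "sorry, not json" };
const deepStub: StubCaller = {
  model: "deep-stub",
  call: () => JSON.stringify({ triplets: [{ subject: "Graph RAG", predicate: "builds on", object: "RAG" }] }),
};

const outcome = extractWithEscalation("raw source text", shallowStub, deepStub);
console.log(outcome.model, outcome.escalated, outcome.triplets.length);
```

Capping escalation at one retry keeps the cost bounded: a source that defeats both tiers fails loudly instead of looping.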
+58
@@ -0,0 +1,58 @@
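// T5 graph client: cherry-picks the HTTP-helper pattern from polaris/mira/tools/_kbdb_client.py,
// but REWORKED INTO A PURE FEEDER: it only POSTs envelopes to the graph write endpoint and **never writes base or touches D1/Vectorize/tables**.
//
// The original _kbdb_client.py hits base /kbdb/entries directly (touching storage), which is exactly what the ingest iron rule forbids.
// This file keeps its skeleton (one http wrapper + headers + fault-tolerant return value) and retargets it at graph's
// POST /triplets/ingest (API-as-Wall: ingest feeds candidates only through graph's HTTP write endpoint).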
import type { Envelope } from '../types';
export interface PostResult {
ok: boolean;
/** graph's reply is {skipped,ingested,deprecated} (on 200); ok=false on 422 or when the URL is unset. */
status: number;
body?: unknown;
error?: string;
}
export interface GraphClient {
postEnvelope(env: Envelope): Promise<PostResult>;
}
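/**
 * Build the graph client. When baseUrl is unset, postEnvelope returns {ok:false, error:'GRAPH_BASE_URL 未設'} instead of throwing.
 * This mirrors graph's refresh, which reports forwarded:false when the ingest URL is unset.
 */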
export function makeGraphClient(
baseUrl: string | undefined,
token?: string,
fetchImpl: typeof fetch = fetch,
): GraphClient {
return {
async postEnvelope(env) {
if (!baseUrl) {
return { ok: false, status: 0, error: 'GRAPH_BASE_URL 未設:graph 寫入端尚未就緒/未部署,envelope 無對象可送。' };
}
const headers: Record<string, string> = { 'Content-Type': 'application/json' };
if (token) headers.Authorization = `Bearer ${token}`;
const url = baseUrl.replace(/\/$/, '') + '/triplets/ingest';
let res: Response;
try {
res = await fetchImpl(url, { method: 'POST', headers, body: JSON.stringify(env) });
} catch (e) {
return { ok: false, status: 0, error: `[graph] POST ${url}: ${(e as Error).message}` };
}
let body: unknown;
try {
body = await res.json();
} catch {
body = undefined;
}
// 422 = the envelope violates the contract (forbidden field / wrong shape) → not ok; graph's issues are passed along so it can be fixed.
if (!res.ok) {
return { ok: false, status: res.status, body, error: `graph ${res.status} ${res.statusText}` };
}
return { ok: true, status: res.status, body };
},
};
}
+146
@@ -0,0 +1,146 @@
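// T2 harvest (path A, preferred): harvest the ready-made triplets + gloss from system-dev-template 1.8.0+ wiki cards.
//
// Local extraction works better (knowledge links form at production time, guided by the LLM Wiki), so ingest harvests first and does not re-extract.
// Card format parsed here (same origin as this repo's system-dev/wiki/cards):
//   frontmatter: gloss: (description of the card-title node)
//   ## 實體: one per line, `- **canonical name**(aliases…)— description` (in-text node + gloss)
//   ## 關聯: typed-edge `A >> predicate >> B` (bare in-text endpoints) / `[[card]] >> predicate >> [[card]]` (card to card)
//
// Iron rule: structural symbols (>>/←) and prose (## 摘要) do not enter the envelope. Tags embed/predicate_embed (default true).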
import type { EnvelopeEdge, EnvelopeNode } from '../types';
export interface HarvestResult {
nodes: EnvelopeNode[];
triplets: EnvelopeEdge[];
/** Triplet endpoints that do not align with `## 實體` (self-check guardrail; endpoint-check.ts uses this to warn). */
unalignedEndpoints: string[];
}
interface Frontmatter {
gloss?: string;
tags?: string[];
}
/** Extract the frontmatter (--- … ---). Simple YAML; only gloss / tags are read. */
export function parseFrontmatter(md: string): { fm: Frontmatter; body: string } {
const m = /^---\n([\s\S]*?)\n---\n?([\s\S]*)$/.exec(md);
if (!m) return { fm: {}, body: md };
const fm: Frontmatter = {};
for (const line of m[1].split('\n')) {
const g = /^gloss:\s*(.+)$/.exec(line.trim());
if (g) fm.gloss = g[1].replace(/^["']|["']$/g, '').trim();
}
return { fm, body: m[2] };
}
/** Get the card title (the first # H1). */
export function parseTitle(body: string): string | null {
const m = /^#\s+(.+)$/m.exec(body);
return m ? m[1].trim() : null;
}
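/** Extract the body of an H2 section (up to the next H2 or end of file). H3 subsections (### …) still count as part of the section. */
function section(body: string, heading: string): string | null {
// No m flag (so $ does not match at every line end); terminator = the next `\n## ` (H2, not H3) or end of string.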
const re = new RegExp(`(?:^|\\n)##\\s+${heading}[^\\n]*\\n([\\s\\S]*?)(?=\\n##\\s|$)`);
const m = re.exec(body);
return m ? m[1] : null;
}
/** Parse a `## 實體` line: `- **canonical name**(alias1/alias2)— description`. */
export function parseEntities(body: string): EnvelopeNode[] {
const sec = section(body, '實體');
if (!sec) return [];
const out: EnvelopeNode[] = [];
for (const raw of sec.split('\n')) {
const line = raw.trim();
if (!line.startsWith('-')) continue;
// - **name**(aliases)— gloss   or   - **name** — gloss   or   - **name**
const m = /^-\s*\*\*(.+?)\*\*\s*(?:((.+?)))?\s*(?:[—-]\s*(.+))?$/.exec(line);
if (!m) continue;
const name = m[1].trim();
// Aliases are separated by full-width「/」or「、」(template convention); ASCII '/' does not split (e.g. arcrun/kbdb is one alias).
const aliases = m[2]
? m[2].split(/[/、]/).map((s) => s.trim()).filter(Boolean)
: undefined;
const gloss = m[3]?.trim() || undefined;
const node: EnvelopeNode = { name, embed: true };
if (gloss) node.gloss = gloss;
if (aliases && aliases.length) node.aliases = aliases;
out.push(node);
}
return out;
}
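/** One parsed edge plus whether each end is card-to-card (written with [[ ]] in the source). */
export interface ParsedEdge extends EnvelopeEdge {
/** The subject end is a [[wikilink]] in the source (card-to-card; not required to align with ## 實體). */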
subjectIsCard: boolean;
objectIsCard: boolean;
}
/** Parse typed-edge lines `A >> predicate >> B` (sep is configurable, default >>). Endpoints have `[[ ]]` and `**` stripped. */
export function parseEdges(body: string, sep = '>>'): ParsedEdge[] {
const sec = section(body, '關聯');
if (!sec) return [];
const out: ParsedEdge[] = [];
const escSep = sep.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const re = new RegExp(`^(.+?)\\s*${escSep}\\s*(.+?)\\s*${escSep}\\s*(.+?)$`);
for (const raw of sec.split('\n')) {
const line = raw.trim();
if (!line.startsWith('-')) continue;
const m = re.exec(line.replace(/^-\s*/, ''));
if (!m) continue;
const clean = (s: string) => s.replace(/\[\[|\]\]/g, '').replace(/\*\*/g, '').trim();
out.push({
subject: clean(m[1]),
predicate: m[2].trim(),
object: clean(m[3]),
predicate_embed: true,
subjectIsCard: /\[\[.+?\]\]/.test(m[1]),
objectIsCard: /\[\[.+?\]\]/.test(m[3]),
});
}
return out;
}
/** Harvest a single card → nodes + triplets (including the frontmatter gloss for the card-title node). */
export function harvestCard(md: string): HarvestResult {
const { fm, body } = parseFrontmatter(md);
const title = parseTitle(body);
const nodes = parseEntities(body);
// The card title is itself a node (a wikilink card). The frontmatter gloss describes it.
if (title && !nodes.some((n) => n.name === title)) {
const cardNode: EnvelopeNode = { name: title, embed: true };
if (fm.gloss) cardNode.gloss = fm.gloss;
nodes.unshift(cardNode);
}
const parsed = parseEdges(body);
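// Card-to-card endpoints ([[card]] in the source) are graph nodes too (the cards being linked to) → add them to nodes (embed:true, no gloss).
// That way they align, and downstream knows these card nodes exist.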
const nodeNames = new Set(nodes.map((n) => n.name));
for (const e of parsed) {
if (e.subjectIsCard && !nodeNames.has(e.subject)) { nodeNames.add(e.subject); nodes.push({ name: e.subject, embed: true }); }
if (e.objectIsCard && !nodeNames.has(e.object)) { nodeNames.add(e.object); nodes.push({ name: e.object, embed: true }); }
}
// Endpoint-alignment self-check (leo's stress-test guardrail): in-text triplet endpoints (not card-to-card) must match some node name.
const unalignedEndpoints: string[] = [];
for (const e of parsed) {
if (!e.subjectIsCard && !nodeNames.has(e.subject))
unalignedEndpoints.push(`${e.subject}(在「${e.subject} >> ${e.predicate} >> ${e.object}」)`);
if (!e.objectIsCard && !nodeNames.has(e.object))
unalignedEndpoints.push(`${e.object}(在「${e.subject} >> ${e.predicate} >> ${e.object}」)`);
}
// Strip ParsedEdge's isCard flags → plain EnvelopeEdge.
const triplets: EnvelopeEdge[] = parsed.map(({ subject, predicate, object, predicate_embed, confidence }) => ({
subject, predicate, object, predicate_embed, ...(confidence !== undefined ? { confidence } : {}),
}));
return { nodes, triplets, unalignedEndpoints };
}
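A worked example makes the typed-edge parsing concrete. The sketch below applies the same regexes as `parseEdges` to one line at a time; `SketchEdge`, `parseEdgeLine`, and the sample lines are hypothetical, and the separator is fixed to `>>` rather than configurable.

```typescript
// Hypothetical single-line version of parseEdges, with the separator fixed to ">>".
interface SketchEdge {
  subject: string;
  predicate: string;
  object: string;
  subjectIsCard: boolean;
  objectIsCard: boolean;
}

function parseEdgeLine(line: string): SketchEdge | null {
  // Drop the list bullet, then split "A >> predicate >> B" into three lazy groups.
  const m = /^(.+?)\s*>>\s*(.+?)\s*>>\s*(.+?)$/.exec(line.trim().replace(/^-\s*/, ""));
  if (!m) return null;
  // Endpoints lose their [[ ]] wikilink brackets and ** bold markers.
  const clean = (s: string) => s.replace(/\[\[|\]\]/g, "").replace(/\*\*/g, "").trim();
  return {
    subject: clean(m[1]),
    predicate: m[2].trim(),
    object: clean(m[3]),
    // Card-to-card detection looks at the raw text, before the brackets are stripped.
    subjectIsCard: /\[\[.+?\]\]/.test(m[1]),
    objectIsCard: /\[\[.+?\]\]/.test(m[3]),
  };
}

const cardEdge = parseEdgeLine("- [[Graph RAG]] >> builds on >> **RAG**");
const notAnEdge = parseEdgeLine("- just a sentence of prose");
console.log(cardEdge, notAnEdge);
```

Recording `subjectIsCard` / `objectIsCard` before cleaning is what lets `harvestCard` exempt card-to-card endpoints from the `## 實體` alignment check.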
+59
@@ -0,0 +1,59 @@
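// Orchestration: source → harvest (path A, preferred) / extract (path B, fallback) → envelope.
//
// Each SourceFile yields one envelope (contract: one file, one envelope). Harvest first: if the card has triplets, harvest them;
// only when nothing can be harvested (no ## 關聯 / not a template card) does it fall through to extract. Cross-repo weaving aggregates at a higher layer (weave).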
import { harvestCard } from './harvest';
import { extract, type LlmCaller } from './extract';
import { buildEnvelope } from './envelope';
import type { SourceFile } from './source-adapter';
import type { Envelope } from '../types';
export interface ProcessOptions {
shallowCaller?: LlmCaller;
deepCaller?: LlmCaller;
/** Model label for harvest (path A), recorded in extractor.model. Default 'local-harvest'. */
harvestModel?: string;
}
export interface ProcessResult {
envelope: Envelope | null;
path: 'harvest' | 'extract' | 'skipped';
note?: string;
}
/** Whether the harvest result is "enough" (has triplets) → no need to fall back to extract. */
function harvestSufficient(triplets: unknown[]): boolean {
return triplets.length > 0;
}
/** Process a single source file → envelope (harvest first; fall back to extract if nothing is harvested). */
export async function processSource(file: SourceFile, opts: ProcessOptions = {}): Promise<ProcessResult> {
// Path A: harvest locally built triplets + gloss.
const harvested = harvestCard(file.text);
if (harvestSufficient(harvested.triplets)) {
const envelope = buildEnvelope({
source: { uri: file.uri, content_hash: file.content_hash, commit: file.commit },
extractor: { model: opts.harvestModel ?? 'local-harvest', tier: 'shallow' },
nodes: harvested.nodes,
triplets: harvested.triplets,
});
const note = harvested.unalignedEndpoints.length
? `採取:${harvested.unalignedEndpoints.length} 端點對不齊(已留 node)`
: undefined;
return { envelope, path: 'harvest', note };
}
// Path B: extract from the raw source (requires shallowCaller).
if (!opts.shallowCaller) {
return { envelope: null, path: 'skipped', note: '無本地三元組且未提供萃取模型 → 跳過' };
}
const ex = await extract(file.text, opts.shallowCaller, opts.deepCaller);
const envelope = buildEnvelope({
source: { uri: file.uri, content_hash: file.content_hash, commit: file.commit },
extractor: { model: ex.model, tier: ex.tier },
nodes: ex.nodes,
triplets: ex.triplets,
});
return { envelope, path: 'extract', note: ex.escalated ? '淺萃失敗 → 升 deep' : undefined };
}
+108
@@ -0,0 +1,108 @@
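// T1 SourceAdapter: pull a repo's Markdown files from GitHub and compute a per-file content hash.
//
// Hard rule: pull the repo through the GitHub API at runtime (no Actions, no webhook-driven auto-sync).
// A pull is a runtime action (a single call started by a person or by /refresh), so it does not cross the flag red line.
// source.uri = 'github:<owner>/<repo>@<path>' (stable identifier: snapshot key plus get_source pointer).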
export interface SourceFile {
/** github:owner/repo@path */
uri: string;
/** Path within the repo (the part after owner/repo). */
path: string;
/** Raw file content (UTF-8). */
text: string;
/** content_hash (sha256 hex, the snapshot key). */
content_hash: string;
/** git commit sha (for traceability, optional). */
commit?: string;
}
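/** sha256 hex. Workers expose crypto.subtle as a global; Node does so without a flag from v19 (on Node 18, use --experimental-global-webcrypto or the webcrypto export of node:crypto). */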
export async function contentHash(text: string): Promise<string> {
const data = new TextEncoder().encode(text);
const digest = await crypto.subtle.digest('SHA-256', data);
return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}
/** Build source.uri (single source of truth for the format; always produce it through this function to avoid typos). */
export function makeSourceUri(owner: string, repo: string, path: string): string {
return `github:${owner}/${repo}@${path}`;
}
/** Parse a source.uri back into {owner, repo, path}. null = malformed. */
export function parseSourceUri(uri: string): { owner: string; repo: string; path: string } | null {
const m = /^github:([^/]+)\/([^@]+)@(.+)$/.exec(uri);
if (!m) return null;
return { owner: m[1], repo: m[2], path: m[3] };
}
export interface GitHubFetcher {
/** List the MD files under a path in the repo (recursive). Returns a list of file paths. */
listMarkdown(owner: string, repo: string, root?: string): Promise<string[]>;
/** Fetch a single file's raw text plus its commit sha. */
getFile(owner: string, repo: string, path: string): Promise<{ text: string; commit?: string }>;
}
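/**
 * GitHub API fetcher (pulls at runtime, no Actions).
 * `token` is optional (pass GITHUB_TOKEN for private repos); `fetchImpl` can be swapped for a mock in tests.
 */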
export function makeGitHubFetcher(token?: string, fetchImpl: typeof fetch = fetch): GitHubFetcher {
const headers: Record<string, string> = {
Accept: 'application/vnd.github+json',
'User-Agent': 'kbdb-ingest-plugin',
};
if (token) headers.Authorization = `Bearer ${token}`;
const api = 'https://api.github.com';
return {
async listMarkdown(owner, repo, root = '') {
// git/trees with recursive=1: one API call returns the whole tree (avoids per-directory fan-out traffic).
const res = await fetchImpl(`${api}/repos/${owner}/${repo}/git/trees/HEAD?recursive=1`, { headers });
if (!res.ok) throw new Error(`[github] list ${owner}/${repo}: ${res.status} ${res.statusText}`);
const body = (await res.json()) as { tree?: Array<{ path: string; type: string }> };
const prefix = root.replace(/^\/+|\/+$/g, '');
return (body.tree ?? [])
.filter((e) => e.type === 'blob' && e.path.endsWith('.md'))
.map((e) => e.path)
.filter((p) => (prefix ? p === prefix || p.startsWith(prefix + '/') : true));
},
async getFile(owner, repo, path) {
const res = await fetchImpl(`${api}/repos/${owner}/${repo}/contents/${encodeURIComponent(path).replace(/%2F/g, '/')}`, { headers });
if (!res.ok) throw new Error(`[github] get ${owner}/${repo}@${path}: ${res.status} ${res.statusText}`);
const body = (await res.json()) as { content?: string; encoding?: string; sha?: string };
const text = body.encoding === 'base64' && body.content ? decodeBase64Utf8(body.content) : (body.content ?? '');
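// Note: the contents API's `sha` is the file's blob SHA, not a commit SHA; it is passed through as `commit` here.
return { text, commit: body.sha };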
},
};
}
function decodeBase64Utf8(b64: string): string {
const clean = b64.replace(/\n/g, '');
const bin = atob(clean);
const bytes = Uint8Array.from(bin, (c) => c.charCodeAt(0));
return new TextDecoder('utf-8').decode(bytes);
}
/** Pull every MD file under a repo path → SourceFile[] (with content_hash). */
export async function pullRepoMarkdown(
fetcher: GitHubFetcher,
owner: string,
repo: string,
root = '',
): Promise<SourceFile[]> {
const paths = await fetcher.listMarkdown(owner, repo, root);
const out: SourceFile[] = [];
for (const path of paths) {
const { text, commit } = await fetcher.getFile(owner, repo, path);
out.push({
uri: makeSourceUri(owner, repo, path),
path,
text,
content_hash: await contentHash(text),
commit,
});
}
return out;
}
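A minimal usage sketch of the `source.uri` format defined above. The two helpers are restated from the diff so the block runs on its own; the sample owner, repo, and paths are made up for illustration and do not come from the commit.

```typescript
// Restated from source-adapter.ts so this sketch is self-contained.
function makeSourceUri(owner: string, repo: string, path: string): string {
  return `github:${owner}/${repo}@${path}`;
}

function parseSourceUri(uri: string): { owner: string; repo: string; path: string } | null {
  const m = /^github:([^/]+)\/([^@]+)@(.+)$/.exec(uri);
  if (!m) return null;
  return { owner: m[1], repo: m[2], path: m[3] };
}

// Hypothetical sample values, for illustration only.
const uri = makeSourceUri('o', 'r', 'cards/a@v2.md');

// The repo group stops at the first '@', so an '@' inside the path survives the round trip.
const parsed = parseSourceUri(uri);

// Malformed inputs come back as null instead of throwing.
const wrongScheme = parseSourceUri('gitlab:o/r@a.md');
const missingPath = parseSourceUri('github:o/r');
```

Returning `null` for malformed input lets a caller such as a /refresh handler reject a bad URI without a try/catch.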
BIN
Binary file not shown.
+85
@@ -0,0 +1,85 @@
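// Shared types + a mirror of the envelope contract (contracts/ingest-candidate.json; the full version includes the vectorization tags).
//
// Hard rule: ingest is a pure feeder. It only tags embed/predicate_embed and carries gloss/aliases;
// the actual embedding is done by the base/KBDB embed module, which reads the tags. ingest never computes vectors itself.
// The envelope is the only coupling surface between ingest and graph (three rules: frozen contract).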
import { z } from '@hono/zod-openapi';
export interface Bindings {
ENVIRONMENT?: string;
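/** Base URL of the graph write endpoint; empty = not deployed, so POST honestly reports not-configured instead of a fake green. */
GRAPH_BASE_URL?: string;
/** Default tier intent for extraction (shallow = Haiku, deep = Claude via CC). */
DEFAULT_EXTRACT_TIER?: 'shallow' | 'deep';
/** For pulling private GitHub repos (may be empty for public ones). Set via secret put. */
GITHUB_TOKEN?: string;
/** Bearer token for the graph write endpoint (matches graph's KBDB_INTERNAL_TOKEN). Set via secret put. */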
GRAPH_INTERNAL_TOKEN?: string;
}
export interface Variables {
partner_id: string;
}
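// ── envelope contract (full: includes the vectorization tag fields promoted in ingest#1) ──────────────
// Once the graph receiver's .strict() schema catches up with the contract (the graph#1 alignment task), it will accept these fields.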
export const EnvelopeNodeSchema = z
.object({
name: z.string().min(1),
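/** Dedup key: a wikilink card uses its file name (one card, one node; not re-embedded per occurrence); an entity uses its canonical name. */
id: z.string().optional(),
/** One-sentence description. base embeds name + gloss together to pull synonyms closer. Best produced by the deep tier. */
gloss: z.string().optional(),
/** Synonyms (黃仁勳 / Jensen Huang). base normalizes them into a single node. */
aliases: z.array(z.string()).optional(),
/** Vectorization tag: whether this node goes into the vector store. Defaults to true. ingest tags; base reads the tag and executes. */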
embed: z.boolean().optional(),
entity_type: z.enum(['person', 'event', 'product', 'market', 'org']).optional(),
})
.strict();
export const EnvelopeEdgeSchema = z
.object({
subject: z.string().min(1),
predicate: z.string().min(1),
object: z.string().min(1),
/** Predicate vectorization tag (the bare word is embedded, no description) → predicate_vector, which supports relation filtering. Defaults to true. */
predicate_embed: z.boolean().optional(),
confidence: z.number().min(0).max(1).optional(),
})
.strict();
export const EnvelopeSchema = z
.object({
source: z
.object({
/** 'github:<owner>/<repo>@<path>' = snapshot key + get_source pointer. */
uri: z.string().min(1),
/** Content hash of the source file (snapshot key). graph treats a matching hash as a no-op. */
content_hash: z.string().min(1),
anchor: z.string().optional(),
commit: z.string().optional(),
block_id: z.string().optional(),
})
.strict(),
extractor: z
.object({
model: z.string().min(1),
tier: z.enum(['shallow', 'deep']),
extracted_at: z.number().int().optional(),
})
.strict(),
nodes: z.array(EnvelopeNodeSchema).optional(),
triplets: z.array(EnvelopeEdgeSchema).min(1),
})
.strict();
export type EnvelopeNode = z.infer<typeof EnvelopeNodeSchema>;
export type EnvelopeEdge = z.infer<typeof EnvelopeEdgeSchema>;
export type Envelope = z.infer<typeof EnvelopeSchema>;
/** Graph-domain fields that ingest must never send (graph answers 422 if it does). Used for the local self-check to catch them early. */
export const FORBIDDEN_TOP_KEYS = ['id', 'clusters', 'bridge_score', 'created_at', 'updated_at'] as const;
export const FORBIDDEN_EDGE_KEYS = ['subject_entity_type', 'object_entity_type'] as const;
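The two constants above are described as feeding a local self-check. The sketch below shows one way such a check could look. `findForbiddenKeys` and the sample objects are hypothetical, and the commit's actual self-check in `buildEnvelope` is not shown in this section. Note that `id` is forbidden only at the top level; the contract still allows `nodes[].id` as a dedup key.

```typescript
// Restated from types.ts so the sketch runs on its own.
const FORBIDDEN_TOP_KEYS = ['id', 'clusters', 'bridge_score', 'created_at', 'updated_at'] as const;
const FORBIDDEN_EDGE_KEYS = ['subject_entity_type', 'object_entity_type'] as const;

// Hypothetical helper (not in the commit): list the forbidden keys present on an object.
function findForbiddenKeys(obj: Record<string, unknown>, forbidden: readonly string[]): string[] {
  return forbidden.filter((k) => Object.prototype.hasOwnProperty.call(obj, k));
}

// Made-up payloads that carry graph-domain fields by mistake.
const badEnvelope = { source: {}, triplets: [], bridge_score: 0.5, created_at: 1 };
const badEdge = { subject: 'A', predicate: 'p', object: 'B', subject_entity_type: 'person' };

const topHits = findForbiddenKeys(badEnvelope, FORBIDDEN_TOP_KEYS);
const edgeHits = findForbiddenKeys(badEdge, FORBIDDEN_EDGE_KEYS);
```

Naming the offending keys locally gives a more actionable error than waiting for the receiver's 422.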
+47
@@ -0,0 +1,47 @@
import { describe, it, expect } from 'vitest';
import { buildEnvelope } from '../src/lib/envelope';
const base = {
source: { uri: 'github:o/r@a.md', content_hash: 'abc' },
extractor: { model: 'local-harvest', tier: 'shallow' as const },
triplets: [{ subject: 'A', predicate: 'p', object: 'B', predicate_embed: true }],
};
describe('buildEnvelope', () => {
it('組合法 envelope(含向量化打標欄位)', () => {
const env = buildEnvelope({
...base,
nodes: [{ name: 'A', gloss: 'a', aliases: ['a2'], embed: true, id: 'A' }],
});
expect(env.source.uri).toBe('github:o/r@a.md');
expect(env.nodes?.[0].embed).toBe(true);
expect(env.nodes?.[0].id).toBe('A');
expect(env.triplets[0].predicate_embed).toBe(true);
});
it('node 帶禁送欄位(bridge_score)→ strict throw(本地提早攔,不等 graph 422)', () => {
expect(() => buildEnvelope({ ...base, nodes: [{ name: 'A', embed: true }] })).not.toThrow();
expect(() =>
buildEnvelope({ ...base, nodes: [{ name: 'A', bridge_score: 0.5 } as any] }),
).toThrow();
});
it('node 帶 graph 領域 record id(非去重 id)以外的禁送鍵 → strict throw', () => {
// The contract allows nodes[].id (the dedup key), but clusters belongs to the graph domain, so strict rejects it.
expect(() => buildEnvelope({ ...base, nodes: [{ name: 'A', id: 'A', embed: true }] })).not.toThrow();
expect(() => buildEnvelope({ ...base, nodes: [{ name: 'A', clusters: ['c'] } as any] })).toThrow();
});
it('禁送邊上 entity_type → strict throw', () => {
expect(() =>
buildEnvelope({
...base,
triplets: [{ subject: 'A', predicate: 'p', object: 'B', subject_entity_type: 'person' } as any],
}),
).toThrow();
});
it('無 triplets → throw(契約 min 1)', () => {
expect(() => buildEnvelope({ ...base, triplets: [] })).toThrow();
});
});
+58
@@ -0,0 +1,58 @@
import { describe, it, expect } from 'vitest';
import { extract, parseExtractJson, type LlmCaller } from '../src/lib/extract';
const GOOD_JSON = JSON.stringify({
nodes: [
{ name: '原子筆記', gloss: '一個不可再分論點的記錄單元' },
{ name: '傳統筆記', gloss: '多主題混雜的記錄' },
],
triplets: [{ subject: '原子筆記', predicate: '對立於', object: '傳統筆記', confidence: 0.9 }],
});
function caller(model: string, out: string | (() => Promise<string>)): LlmCaller {
return { model, call: typeof out === 'string' ? async () => out : out };
}
describe('parseExtractJson', () => {
it('解析 fenced JSON + 打標 embed/predicate_embed', () => {
const g = parseExtractJson('```json\n' + GOOD_JSON + '\n```');
expect(g.triplets[0].predicate_embed).toBe(true);
expect(g.nodes[0].embed).toBe(true);
expect(g.triplets[0].confidence).toBe(0.9);
});
it('無 triplets → throw', () => {
expect(() => parseExtractJson(JSON.stringify({ nodes: [], triplets: [] }))).toThrow();
});
});
describe('extract', () => {
it('淺萃成功不升級', async () => {
const r = await extract('原文', caller('haiku', GOOD_JSON));
expect(r.tier).toBe('shallow');
expect(r.escalated).toBe(false);
expect(r.model).toBe('haiku');
});
it('淺萃 JSON-fail → 升 deep(升級閘)', async () => {
const r = await extract('原文', caller('haiku', 'not json at all'), caller('claude', GOOD_JSON));
expect(r.escalated).toBe(true);
expect(r.tier).toBe('deep');
expect(r.model).toBe('claude');
expect(r.triplets.length).toBe(1);
});
it('淺萃失敗且無 deep caller → throw', async () => {
await expect(extract('原文', caller('haiku', 'garbage'))).rejects.toThrow();
});
it('端點對齊護欄:模型吐對不齊端點 → 自動補進 nodes', async () => {
const skewed = JSON.stringify({
nodes: [{ name: 'A' }],
triplets: [{ subject: 'A', predicate: '連到', object: 'B(沒在 nodes)' }],
});
const r = await extract('原文', caller('haiku', skewed));
// B was auto-added as a node → all endpoints are aligned
expect(r.nodes.some((n) => n.name === 'B(沒在 nodes)')).toBe(true);
});
});
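The escalation gate these tests exercise (shallow first, deep only when the shallow output fails to parse) can be sketched roughly as follows. This is an illustration under assumptions, not the commit's `extract`: the `call(prompt)` signature is inferred from the test stubs, and `extractWithGate`, `GateResult`, and `stub` are hypothetical names.

```typescript
// Assumed shape, inferred from the tests; the real call() signature is not shown in this section.
interface LlmCaller {
  model: string;
  call: (prompt: string) => Promise<string>;
}

interface GateResult {
  model: string;
  tier: 'shallow' | 'deep';
  escalated: boolean;
  parsed: unknown;
}

// Hypothetical gate: try the shallow caller; on any failure (including bad JSON) retry once with deep.
async function extractWithGate(text: string, shallow: LlmCaller, deep?: LlmCaller): Promise<GateResult> {
  try {
    const parsed: unknown = JSON.parse(await shallow.call(text));
    return { model: shallow.model, tier: 'shallow', escalated: false, parsed };
  } catch (err) {
    if (!deep) throw err; // no deep caller configured: surface the failure instead of faking a result
    const parsed: unknown = JSON.parse(await deep.call(text));
    return { model: deep.model, tier: 'deep', escalated: true, parsed };
  }
}

// Stub callers for illustration; they return canned strings instead of calling a model.
const stub = (model: string, out: string): LlmCaller => ({ model, call: async () => out });
```

One design point worth noting: this sketch also escalates when the shallow call itself throws (for example on a network error), which may or may not match the real gate.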
+43
@@ -0,0 +1,43 @@
import { describe, it, expect } from 'vitest';
import { makeGraphClient } from '../src/lib/graph-client';
import type { Envelope } from '../src/types';
const env: Envelope = {
source: { uri: 'github:o/r@a.md', content_hash: 'abc' },
extractor: { model: 'local-harvest', tier: 'shallow' },
triplets: [{ subject: 'A', predicate: 'p', object: 'B' }],
};
function mockFetch(status: number, body: unknown): typeof fetch {
return (async () =>
new Response(JSON.stringify(body), { status, headers: { 'Content-Type': 'application/json' } })) as any;
}
describe('makeGraphClient', () => {
it('GRAPH_BASE_URL 未設 → 誠實回 ok:false,不假綠、不打網路', async () => {
let called = false;
const client = makeGraphClient(undefined, undefined, (async () => {
called = true;
return new Response('{}');
}) as any);
const r = await client.postEnvelope(env);
expect(r.ok).toBe(false);
expect(r.error).toContain('未設');
expect(called).toBe(false);
});
it('200 → ok + 帶 graph 回的 {skipped,ingested,deprecated}', async () => {
const client = makeGraphClient('https://graph.example', 'tok', mockFetch(200, { skipped: false, ingested: 1, deprecated: 0 }));
const r = await client.postEnvelope(env);
expect(r.ok).toBe(true);
expect((r.body as any).ingested).toBe(1);
});
it('422 → ok:false 帶 issues(供修禁送欄位)', async () => {
const client = makeGraphClient('https://graph.example', undefined, mockFetch(422, { error: 'invalid envelope', issues: [{ path: ['bridge_score'] }] }));
const r = await client.postEnvelope(env);
expect(r.ok).toBe(false);
expect(r.status).toBe(422);
expect((r.body as any).issues).toBeDefined();
});
});
+68
@@ -0,0 +1,68 @@
import { describe, it, expect } from 'vitest';
import { harvestCard, parseEntities, parseEdges, parseFrontmatter } from '../src/lib/harvest';
const CARD = `---
tags: [, ]
gloss: ingest 在 KBDB 堆疊裡的位置。
---
# 掛載架構
 [[ingest/00-INDEX]]
## 
 KBDB 
## 實體
- **kbdb-ingest-plugin** POST 
- **base KBDB**arcrun/kbdb
## 關聯
### 
- kbdb-ingest-plugin >> 掛載於 >> base KBDB
### 
- [[掛載架構]] >> 受約束於 >> [[envelope-契約]]
`;
describe('parseFrontmatter', () => {
it('抽出 gloss', () => {
const { fm, body } = parseFrontmatter(CARD);
expect(fm.gloss).toBe('ingest 在 KBDB 堆疊裡的位置。');
expect(body).toContain('# 掛載架構');
});
});
describe('parseEntities', () => {
it('解析正規名 + aliases + gloss', () => {
const { body } = parseFrontmatter(CARD);
const nodes = parseEntities(body);
expect(nodes.map((n) => n.name)).toEqual(['kbdb-ingest-plugin', 'base KBDB']);
expect(nodes[1].aliases).toEqual(['arcrun/kbdb', '基本盤']);
expect(nodes[0].gloss).toBe('最薄一層,純 POST 候選。');
expect(nodes[0].embed).toBe(true);
});
});
describe('parseEdges', () => {
it('解析 typed-edge、去 [[ ]]、標記卡對卡', () => {
const { body } = parseFrontmatter(CARD);
const edges = parseEdges(body);
expect(edges).toContainEqual({ subject: 'kbdb-ingest-plugin', predicate: '掛載於', object: 'base KBDB', predicate_embed: true, subjectIsCard: false, objectIsCard: false });
expect(edges).toContainEqual({ subject: '掛載架構', predicate: '受約束於', object: 'envelope-契約', predicate_embed: true, subjectIsCard: true, objectIsCard: true });
});
});
describe('harvestCard', () => {
it('卡標題 node 帶 frontmatter gloss、含內文 node', () => {
const r = harvestCard(CARD);
const titleNode = r.nodes.find((n) => n.name === '掛載架構');
expect(titleNode?.gloss).toBe('ingest 在 KBDB 堆疊裡的位置。');
expect(r.nodes.some((n) => n.name === 'base KBDB')).toBe(true);
expect(r.triplets.length).toBe(2);
});
it('內文端點對齊(無對不齊)', () => {
const r = harvestCard(CARD);
// kbdb-ingest-plugin and base KBDB are both listed under "## 實體"; card-to-card endpoints are exempt from alignment.
expect(r.unalignedEndpoints).toEqual([]);
});
});
+73
@@ -0,0 +1,73 @@
import { describe, it, expect } from 'vitest';
import { makeSourceUri, parseSourceUri, contentHash, pullRepoMarkdown, type GitHubFetcher } from '../src/lib/source-adapter';
import { processSource } from '../src/lib/pipeline';
import type { LlmCaller } from '../src/lib/extract';
describe('source-adapter uri', () => {
it('makeSourceUri / parseSourceUri round-trip', () => {
const uri = makeSourceUri('uncle6me-web', 'kbdb-ingest-plugin', 'system-dev/wiki/cards/ingest/掛載架構.md');
expect(uri).toBe('github:uncle6me-web/kbdb-ingest-plugin@system-dev/wiki/cards/ingest/掛載架構.md');
expect(parseSourceUri(uri)).toEqual({
owner: 'uncle6me-web',
repo: 'kbdb-ingest-plugin',
path: 'system-dev/wiki/cards/ingest/掛載架構.md',
});
});
it('content-hash 穩定且隨內容變', async () => {
const a = await contentHash('hello');
expect(a).toBe(await contentHash('hello'));
expect(a).not.toBe(await contentHash('world'));
});
});
const HARVEST_CARD = `---
gloss: 卡標題定義
---
# A
## 實體
- ****
- ****
## 關聯
- 甲 >> 連到 >> 乙
`;
function mockFetcher(files: Record<string, string>): GitHubFetcher {
return {
async listMarkdown() {
return Object.keys(files);
},
async getFile(_o, _r, path) {
return { text: files[path], commit: 'sha1' };
},
};
}
describe('pullRepoMarkdown + processSource', () => {
it('採取路徑 A:拉檔 → harvest → envelope(不 extract)', async () => {
const sources = await pullRepoMarkdown(mockFetcher({ 'cards/a.md': HARVEST_CARD }), 'o', 'r');
expect(sources.length).toBe(1);
const result = await processSource(sources[0]);
expect(result.path).toBe('harvest');
expect(result.envelope?.triplets).toEqual([{ subject: '甲', predicate: '連到', object: '乙', predicate_embed: true }]);
expect(result.envelope?.extractor.model).toBe('local-harvest');
});
it('採不到三元組 + 無萃取模型 → skipped(不假萃)', async () => {
const sources = await pullRepoMarkdown(mockFetcher({ 'plain.md': '# 純文字\n沒有三元組。' }), 'o', 'r');
const result = await processSource(sources[0]);
expect(result.path).toBe('skipped');
expect(result.envelope).toBeNull();
});
it('採不到 → fallback extract(路徑 B)', async () => {
const caller: LlmCaller = {
model: 'haiku',
call: async () => JSON.stringify({ nodes: [{ name: '甲' }], triplets: [{ subject: '甲', predicate: '是', object: '乙' }] }),
};
const sources = await pullRepoMarkdown(mockFetcher({ 'plain.md': '# 純文字\n甲是乙。' }), 'o', 'r');
const result = await processSource(sources[0], { shallowCaller: caller });
expect(result.path).toBe('extract');
expect(result.envelope?.extractor.model).toBe('haiku');
});
});
+45
@@ -0,0 +1,45 @@
import { describe, it, expect } from 'vitest';
import { weave, flattenForPost, type RepoEnvelopes } from '../src/lib/weave';
import type { Envelope } from '../src/types';
function env(uri: string, nodes: string[], triplets: Array<[string, string, string]>): Envelope {
return {
source: { uri, content_hash: uri },
extractor: { model: 'local-harvest', tier: 'shallow' },
nodes: nodes.map((n) => ({ name: n, embed: true })),
triplets: triplets.map(([s, p, o]) => ({ subject: s, predicate: p, object: o })),
};
}
const repos: RepoEnvelopes[] = [
{ repo: 'o/repoA', envelopes: [env('github:o/repoA@x.md', ['Arcrun', '餵食器'], [['Arcrun', '包含', '餵食器']])] },
{ repo: 'o/repoB', envelopes: [env('github:o/repoB@y.md', ['Arcrun', '圖層'], [['Arcrun', '依賴', '圖層']])] },
];
describe('weave', () => {
it('偵測跨庫橋(同名節點跨 ≥2 repo)', () => {
const r = weave(repos);
const bridge = r.bridges.find((b) => b.node === 'Arcrun');
expect(bridge?.repos).toEqual(['o/repoA', 'o/repoB']);
expect(r.totalTriplets).toBe(2);
});
it('偵測跨庫異見(同 s/o 對、不同謂詞跨 repo)', () => {
const diverge: RepoEnvelopes[] = [
{ repo: 'o/repoA', envelopes: [env('github:o/repoA@x.md', ['X', 'Y'], [['X', '支持', 'Y']])] },
{ repo: 'o/repoB', envelopes: [env('github:o/repoB@y.md', ['X', 'Y'], [['X', '反對', 'Y']])] },
];
const r = weave(diverge);
expect(r.divergences.length).toBe(1);
expect(r.divergences[0].predicatesByRepo.map((p) => p.predicate).sort()).toEqual(['反對', '支持']);
});
it('flattenForPost 攤平所有 envelope(順序穩定)', () => {
expect(flattenForPost(repos).length).toBe(2);
});
it('ingest 不算 bridge_score(橋只標 repos,無分數欄位)', () => {
const r = weave(repos);
expect(r.bridges[0]).not.toHaveProperty('bridge_score');
});
});
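The bridge rule tested above (a same-name node seen in two or more repos) can be illustrated with a small standalone sketch. `detectBridges` and the `RepoNodes` shape are hypothetical simplifications; the real `weave` takes full envelopes, and matching purely on exact names is an assumption made here.

```typescript
// Minimal shapes for the sketch; the real Envelope carries more fields.
interface RepoNodes {
  repo: string;
  nodeNames: string[];
}

interface Bridge {
  node: string;
  repos: string[];
}

// Hypothetical helper: a node name that appears in two or more repos counts as a cross-repo bridge.
function detectBridges(input: RepoNodes[]): Bridge[] {
  const seen = new Map<string, Set<string>>();
  for (const { repo, nodeNames } of input) {
    for (const name of nodeNames) {
      const repos = seen.get(name) ?? new Set<string>();
      repos.add(repo);
      seen.set(name, repos);
    }
  }
  return Array.from(seen.entries())
    .filter(([, repos]) => repos.size >= 2)
    .map(([node, repos]) => ({ node, repos: Array.from(repos).sort() }));
}

// Same sample data as the test fixture above.
const bridges = detectBridges([
  { repo: 'o/repoA', nodeNames: ['Arcrun', '餵食器'] },
  { repo: 'o/repoB', nodeNames: ['Arcrun', '圖層'] },
]);
```

The result deliberately carries only the repo list and no score field, in line with the rule that ingest does not compute `bridge_score`.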
+16
@@ -0,0 +1,16 @@
{
"compilerOptions": {
"target": "ESNext",
"module": "ESNext",
"moduleResolution": "bundler",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true,
"forceConsistentCasingInFileNames": true,
"outDir": "dist",
"rootDir": "src",
"types": ["@cloudflare/workers-types"]
},
"include": ["src/**/*.ts"],
"exclude": ["node_modules", "dist", "tests"]
}
+9
@@ -0,0 +1,9 @@
import { defineConfig } from 'vitest/config';
// ingest is a pure feeder: no D1/Vectorize/AI bindings. Tests run in plain node with mocks (fetch / graph client).
export default defineConfig({
test: {
environment: 'node',
include: ['tests/**/*.test.ts'],
},
});
+24
@@ -0,0 +1,24 @@
name = "kbdb-ingest-plugin"
main = "src/index.ts"
compatibility_date = "2025-02-19"
compatibility_flags = ["nodejs_compat"]
workers_dev = true
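# The KBDB-ingest plugin is a pure feeder: pull from GitHub + harvest/extract + cross-repo weaving → POST envelopes to graph.
# Hard rule: never touch storage (no D1/Vectorize/AI bindings; those belong to base/graph, and ingest does not connect to them directly).
# Deploy with wrangler, bypassing GitHub Actions (lesson learned from being flagged).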
[vars]
ENVIRONMENT = "development"
# Base URL of the graph plugin's write endpoint (POST {GRAPH_BASE_URL}/triplets/ingest).
# Before deploying, set it with `wrangler secret put` or fill it in here, e.g. https://kbdb-graph.<acct>.workers.dev
GRAPH_BASE_URL = ""
# Default model intent for extraction (path B). "shallow" = Haiku/Workers AI; "deep" = Claude via CC.
DEFAULT_EXTRACT_TIER = "shallow"
[alias]
"zod/v3" = "zod"
"zod/v4" = "zod"
"zod/v4-mini" = "zod"
# Sensitive values such as GITHUB_TOKEN / GRAPH_INTERNAL_TOKEN / ANTHROPIC go through `wrangler secret put`; do not write them here.