feat(ingest): T0.5–T5 純餵食器管線實作(issue #2)
ingest 全管線(採取優先、extract fallback、跨庫織網、POST envelope): - T0.5 骨架:Hono + zod-openapi,無 D1/Vectorize/AI 綁定(不碰儲存鐵律) - T1 SourceAdapter:GitHub runtime API 拉 + per-file sha256 content-hash + /refresh 受理端 - T2 採取(路徑 A 優先):harvest template 1.8.0+ 卡(gloss/實體/typed-edge) - T3 extract(路徑 B fallback):LlmCaller 可選模型 + JSON-fail 升級閘 + 端點對齊硬自檢護欄;第一版不 embed(只打標) - T4 跨庫織網(主職):匯總多 repo → 偵測跨庫橋/異見,不算 bridge_score(graph 領域) - T5 輸出:buildEnvelope strict + 顯式禁送欄位自檢;graph-client 純 POST(cherry-pick _kbdb_client.py 改不碰 base);薄 ops CLI(不帶查詢 MCP) envelope 對齊 full contract(embed/id/aliases/predicate_embed);同步 contract 向量化欄位升格。 gate:vitest 28 passed / tsc clean / wrangler dry-run 乾淨(只 env-var 綁定)。 端到端 ingest→graph:graph receiver 已補對齊 → 待 ingest 部署 + GRAPH_BASE_URL → 待部署驗,未假綠。 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -60,7 +60,7 @@
|
||||
},
|
||||
"nodes": {
|
||||
"type": "array",
|
||||
"description": "節點層附帶資訊(選填)。entity_type 與 gloss 是【節點】屬性,不是【邊】屬性 → 放這裡,不放 triplets。graph 用 gloss 去 embed(每節點一句,不是裸詞)、用 entity_type 去 typing。",
|
||||
"description": "節點層附帶資訊。【向量化分工(leo 2026-06-26,ingest#1 升格成契約)】ingest 在此【打標】哪些 token 要向量化 + embed 什麼;base/KBDB embed 模組【讀標執行】實際 embedding;ingest 自己不算向量。兩類節點(實體詞條 / wikilink 卡)都進 nodes[],謂詞向量見 triplets[].predicate_vector。",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"required": ["name"],
|
||||
@@ -69,11 +69,26 @@
|
||||
"name": {
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
"description": "節點名(須對應某 triplet 的 subject 或 object 原字面)。"
|
||||
"description": "節點名(須對應某 triplet 的 subject/object 原字面)。實體詞條=正規名;wikilink 卡=卡標題。"
|
||||
},
|
||||
"id": {
|
||||
"type": "string",
|
||||
"description": "去重鍵。wikilink 卡用【檔名】→ 一卡一 node,被多條邊指到也只 embed 一次,不以出現次數重複。實體詞條用正規名。選填(無則以 name 去重)。"
|
||||
},
|
||||
"gloss": {
|
||||
"type": "string",
|
||||
"description": "一句話描述,供 embedding。例如 'Graph RAG — 用關係遍歷檢索、保住異見的 RAG 變體'。選填(建議 deep tier 產出)。"
|
||||
"description": "一句話描述。base embed 對【名 + gloss 一起】embedding(實體同義詞字面差太遠,靠描述拉近)。選填(建議 deep tier 產)。"
|
||||
},
|
||||
"aliases": {
|
||||
"type": "array",
|
||||
"items": { "type": "string" },
|
||||
"description": "同義詞(如『黃仁勳』/『Jensen Huang』)。base 歸一(collapse)成同一 node。選填。"
|
||||
},
|
||||
"embed": {
|
||||
"type": "boolean",
|
||||
"default": true,
|
||||
"description": "【向量化打標】此節點要不要進向量庫。true=base 讀標去 embed(名+gloss);false=base 看到就不理(如結構符號/散文不該進 nodes[],真進了標 false)。預設 true(實體詞條與 wikilink 卡都要)。",
|
||||
"$comment": "ingest 打標,base 讀標執行。embed 動作歸 base embed 模組,ingest 不算向量。"
|
||||
},
|
||||
"entity_type": {
|
||||
"type": "string",
|
||||
@@ -86,7 +101,7 @@
|
||||
"triplets": {
|
||||
"type": "array",
|
||||
"minItems": 1,
|
||||
"description": "邊(關係)。ingest 只產原始 (s,p,o) + confidence。",
|
||||
"description": "邊(關係)。ingest 只產原始 (s,p,o) + confidence + 謂詞向量打標。端點(s/o)以字面 match nodes[].name。",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"required": ["subject", "predicate", "object"],
|
||||
@@ -95,10 +110,16 @@
|
||||
"subject": { "type": "string", "minLength": 1, "description": "主詞(實體名,須與 nodes[].name 對得上若有提供)" },
|
||||
"predicate": { "type": "string", "minLength": 1, "description": "謂詞(關係)" },
|
||||
"object": { "type": "string", "minLength": 1, "description": "受詞(目標實體或值)" },
|
||||
"predicate_embed": {
|
||||
"type": "boolean",
|
||||
"default": true,
|
||||
"description": "【謂詞向量化打標】謂詞要不要 embed。base 讀標 → embed【謂詞裸詞,無描述】(謂詞同義詞字面本就近,如『參考』/『參照』,裸詞 embed 即自動聚類),存 edge 的 predicate_vector。為支援『關係過濾』查詢(查『參考』不漏『參照』)→ 預設 true。embed 動作歸 base,ingest 只打標。",
|
||||
"$comment": "ingest 打標,base 讀標執行 embed。"
|
||||
},
|
||||
"confidence":{ "type": "number", "minimum": 0, "maximum": 1, "default": 1.0, "description": "萃取可信度。淺萃可附自評;graph 不據此過濾,只記錄。" }
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"$comment": "禁止欄位(graph 領域,ingest 絕不可送): id / clusters / bridge_score / created_at / updated_at / 以及 triplet 上的 subject_entity_type|object_entity_type(類型只走 nodes[])。送了即違反 ingest=純餵食器的邊界,graph 應拒收或忽略。"
|
||||
"$comment": "禁止欄位(graph 領域,ingest 絕不可送): id(節點去重鍵的 id 例外,那是 ingest 提供的去重鍵非 record id) / clusters / bridge_score / created_at / updated_at / 以及 triplet 上的 subject_entity_type|object_entity_type(類型只走 nodes[])。【向量化分工】ingest 打標(embed/predicate_embed + 帶 gloss/aliases),base/KBDB embed 模組讀標執行 embedding,ingest 不算向量。結構符號(>>/←)與給人讀的散文(## 摘要)不進 envelope。"
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user