arcrun — AI workflow execution engine (clean history)

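Self-hosted open source: WASM components + recipes + cypher-executor, running on your own Cloudflare account. This is the starting point of a rebuilt clean history (the GCP SA key that was once committed by mistake has been removed; the old history is kept in richblack/arcrun and in a local backup branch). Includes:

- acr init --self-hosted installer (creates KV/R2, pulls precompiled wasm via codeload, runs wrangler deploy, seeds recipes)
- recipe push gatekeeping (data-exfiltration warning + connectivity check)
- 19 approved components as precompiled wasm (claude_api/km_writer/kbdb_upsert_block excluded: they violate DECISIONS §1)
- CLI / cypher-executor / registry / full SDD

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>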
2026-06-03 15:52:38 +08:00
commit 922a57fe34
485 changed files with 89356 additions and 0 deletions
@@ -0,0 +1,285 @@
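+# SDD: Resumable Workflow (wake-up via webhook callback)
+
+> Created 2026-05-07. While dogfooding the wiki synthesis workflow, the Mira daemon switched to async mode for a long draft (>2KB) and returned `{pending, task_id, poll_url}`; cypher-executor did not handle it and passed the object straight downstream.
+> This SDD addresses that layer: **when a workflow hits a pending task midway → pause + persist state → resume when the external callback arrives**.
+> Scope: webhook push between two in-house services (Mira daemon ↔ cypher-executor). Scenarios where an external service has no webhook go on the wishlist, to be solved with polling.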
+
+---
+
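+## 1. Problem
+
+### 1.1 Where it broke
+
+The first node of the wiki synthesis workflow is `claude_api(recipe:wiki_synthesis)`:
+- Short draft (< 2KB) → the daemon responds synchronously with `{success, data: {text}}`, and the recipe output parser parses the JSON successfully
+- Long draft (> 2KB) → the daemon estimates 75s and switches to async mode, returning: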
+
+```json
+{
+  "success": true,
+  "pending": true,
+  "task_id": "task_14_1778133152480",
+  "poll_url": "https://mira.uncle6.me/mira/execute/task_14_1778133152480",
+  "estimated_seconds": 75
+}
+```
+
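+cypher-executor takes this object as the node result, but it has no `data.text`, so the downstream recipe output parser finds nothing to parse. The whole workflow is reported as "success" even though the wiki page has not been generated yet.
+
+### 1.2 The existing toolkit falls short
+
+- `wait` component: fixed sleep of N ms, no retry or conditional logic
+- `http_request` component: generic HTTP, unaware of the daemon's polling protocol
+- cypher-executor `visited` Set: blocks node revisits, so loop-style polling is impossible
+- Worker CPU 30s limit: synchronously polling a 75s task is not feasible
+
+### 1.3 Push vs Pull decision (finalized 2026-05-07)
+
+| | Webhook push | Poll pull |
+|---|---|---|
+| Applies when | both sides are in-house | the other side has no callback capability |
+| Worker time consumed | close to 0 | occupied for the full duration |
+| Duration limit | none | Worker CPU 30s |
+| Where the work lives | runtime capability (cypher-executor) | component (poll_task) |
+
+**Go with webhook push** (in-house services first; poll_task goes on the wishlist).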
+
+---
+
+## 2. Design
+
+### 2.1 Three layers of change
+
+**A. Mira daemon side (infra/cloud-cto)**
+- `/mira/execute` accepts a new field `callback_url: string` (optional)
+- When the task completes, POST to `callback_url` with body:
+  ```json
+  {
+    "task_id": "task_14_xxx",
+    "success": true,
+    "data": { "text": "..." }
+  }
+  ```
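+- Failures also trigger a callback; the body includes an `error` field
+- Retry policy: 3 retries with backoff (1s / 5s / 30s); give up after the final failure (task status is stored in the daemon's own KV)
+
+**B. cypher-executor side (resumable runtime)**
+
+New concept: **a workflow run can be paused**.
+
+Design:
+1. A new KV namespace (or the existing `EXEC_CONTEXT`) stores the state of paused runs:
+   - key: `paused_run:{task_id}` or `paused_run:{run_id}`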
+   - value: `{ run_id, graph, paused_node_id, paused_node_pending_result, context, trace_so_far, kv_store_ref, expires_at }`
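+2. graph-executor detects a node result containing `pending: true` + `task_id` → pause + write to KV + return `{paused: true, task_id, run_id}`
+3. New endpoint `POST /workflows/resume`:
+   - body: `{ task_id, result }` (result is the full payload from the daemon callback)
+   - Load the paused state from KV → merge result into the paused node's output → continue from the next node
+4. The claude_api container automatically includes `callback_url` when calling the daemon:
+   - `https://cypher.arcrun.dev/workflows/resume?task_id={pre-assigned task_id}`
+   - But task_id is assigned by the daemon, so cypher-executor does not know it in advance. The URL can only be built after the daemon has assigned the task_id
+   - Solution: the daemon returns task_id first, then starts the actual work and calls back on completion (a two-phase handshake)
+
+Actual flow (two-phase):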
+
+```
+cypher-executor                 Mira daemon
+       │                              │
+       │ POST /mira/execute           │
+       │ { prompt,                    │
+       │   callback_url: "?run_id=R1" }
+       ├─────────────────────────────>│
+       │                              │ returns task_id immediately (chooses async)
+       │<─────────────────────────────┤ { pending, task_id: T9 }
+       │                              │
+       ├─ sees pending → writes KV    │ starts the actual LLM task
+       │  paused_run:T9 = {run R1,    │
+       │  paused_node, ctx, ...}      │
+       │                              │
+       │ replies to client (MCP):     │
+       │ { paused, task_id: T9 }      │
+       │                              │
+       ⋯⋯⋯⋯⋯ 75s later ⋯⋯⋯⋯⋯⋯⋯⋯⋯⋯⋯⋯⋯⋯
+       │                              │ task done
+       │ POST /workflows/resume       │
+       │ { task_id: T9, result: {...} }
+       │<─────────────────────────────┤
+       │                              │
+       │ load paused_run:T9 from KV   │
+       │ → merge result (paused node) │
+       │ → continue from next node    │
+       │                              │
+       │ run finishes → write trace   │
+       │ → notify client (?)          │
+       │                              │
+```
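
The pause step in the flow above amounts to writing one KV record per pending task. The following is a minimal sketch of how that record could be built, assuming the `paused_run:{task_id}` key convention and the 24h TTL described in this design. The `PausedRun` field types and the `buildPausedEntry` helper are illustrative assumptions, not the actual cypher-executor code.

```typescript
// Hypothetical shape of the state persisted at pause time
// (field names follow section 2.1 B; the types are assumptions).
interface PausedRun {
  run_id: string;
  graph: unknown;
  paused_node_id: string;
  context: Record<string, unknown>;
  trace_so_far: unknown[];
  expires_at: number; // epoch milliseconds
}

const PAUSED_TTL_SECONDS = 24 * 60 * 60; // 24h TTL

// Key convention from the design: one entry per daemon-issued task_id.
function pausedRunKey(taskId: string): string {
  return `paused_run:${taskId}`;
}

// Build the entry that would be written to KV; `now` is injectable for testing.
function buildPausedEntry(
  taskId: string,
  run: Omit<PausedRun, "expires_at">,
  now: number = Date.now(),
): { key: string; value: string; expirationTtl: number } {
  const state: PausedRun = { ...run, expires_at: now + PAUSED_TTL_SECONDS * 1000 };
  return {
    key: pausedRunKey(taskId),
    value: JSON.stringify(state),
    expirationTtl: PAUSED_TTL_SECONDS,
  };
}
```

Keeping both `expires_at` in the value and a TTL on the key is redundant on purpose in this sketch: the TTL lets the store drop stale entries, while `expires_at` lets a reader reject a record that has outlived its window.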
+
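+### 2.2 Scope boundaries
+
+**v1 includes:**
+- ✅ Single-node pending → resume (the most common case: claude_api receives pending from the daemon)
+- ✅ callback_url support in the daemon
+- ✅ cypher-executor `/workflows/resume` endpoint
+- ✅ Run state written to EXEC_CONTEXT KV with a 24h TTL (to avoid KV buildup)
+- ✅ Integration test: run wiki synthesis with a long draft and verify the run continues once the callback arrives
+
+**v1 excludes:**
+- ❌ Nested scenarios where multiple nodes are pending (e.g. claude_api → another claude_api): v2
+- ❌ Pending inside foreach (item-level resume): v2
+- ❌ Frontend UI showing "progress" while pending: the trace carries a paused marker, so the frontend can poll on its own
+- ❌ Retry / DLQ when the pending callback fails: v2, log only for now
+
+**Prerequisites:**
+- ✅ recipe-system is deployed (cypher-executor already resolves recipes)
+- ✅ Mira daemon runs on Hetzner and its code can be modified
+
+### 2.3 Why not Cloudflare Queues / Durable Objects
+
+- **CF Queues**: suited to large fan-out; this is a point-to-point callback, and KV is sufficient
+- **Durable Objects**: stronger than KV for long-lived state, but higher cost and complexity
+- **EXEC_CONTEXT KV**: an existing binding, least engineering effort
+
+Upgrade only if we actually hit KV limits in the future (per-partner write frequency cap).
+
+---
+
+## 3. Detailed design
+
+### 3.1 Callback mechanism on the daemon side
+
+`infra/cloud-cto/index.js` (Mira daemon):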
+
+```js
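+// /mira/execute handler input: the existing fields plus one new optional field.
+/**
+ * @typedef {Object} ExecuteInput
+ * @property {string} [callback_url] - optional; POST target when the task finishes
+ */
+
+// Handling logic:
+// 1. Start the task (existing logic)
+// 2. Estimated time > 30s → switch to async mode:
+//    - immediately return { success: true, pending: true, task_id, poll_url, estimated_seconds }
+//    - when the background task finishes:
+//      if (callback_url) POST callback_url with { task_id, success, data, error? }
+//      (the callback is always sent, whether or not the client polls)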
+```
+
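+Callback failure policy:
+- 3 retries (1s / 5s / 30s)
+- All attempts fail: the task stays in completed state and waits for the client to poll (poll_url remains valid)
+- Tasks not consumed within 24h: garbage-collected by the daemon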
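
The retry policy above can be sketched as one immediate attempt followed by three delayed retries. This is an illustrative TypeScript sketch, not the daemon's actual `fireCallback` code: the `Send` type, `deliverCallback`, and the injectable `sleep` are assumptions introduced so the schedule can be exercised without real network calls or real waiting.

```typescript
// Hypothetical transport: resolves true when the receiver answered with a 2xx status.
type Send = (url: string, body: unknown) => Promise<boolean>;

const RETRY_DELAYS_MS = [1000, 5000, 30000]; // backoff schedule from the design

// One immediate attempt, then one retry after each delay; give up after the last retry.
async function deliverCallback(
  url: string,
  body: unknown,
  send: Send,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<{ delivered: boolean; attempts: number }> {
  for (let attempt = 0; attempt <= RETRY_DELAYS_MS.length; attempt++) {
    if (attempt > 0) await sleep(RETRY_DELAYS_MS[attempt - 1] ?? 0);
    try {
      if (await send(url, body)) return { delivered: true, attempts: attempt + 1 };
    } catch {
      // Network error: treat it like a failed attempt and move on to the next retry.
    }
  }
  return { delivered: false, attempts: RETRY_DELAYS_MS.length + 1 };
}
```

When every attempt fails, the caller only learns `delivered: false`; per the policy above the task result stays available through `poll_url` until the 24h GC.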
+
+### 3.2 Resumable runtime on the cypher-executor side
+
+#### 3.2.1 Detecting pending (graph-executor)
+
+In the Component case, after the runner returns:
+
+```ts
+result = await runner(mergedContext);
+
+// Detect the pending pattern (the response shape agreed with the daemon)
+if (isResumablePending(result)) {
+  const task_id = taskIdFromResult(result);
+  await persistPausedRun(this.env.EXEC_CONTEXT, task_id, {
+    run_id, graph, paused_node_id: node.id, paused_context: context,
+    paused_result: result, trace_so_far: trace, expires_at: Date.now() + 24*60*60*1000
+  });
+  // End this run early and return the paused status
+  return { paused: true, task_id, run_id };
+}
+
+// ... existing recipe output parsing / kvSetNodeOutput / etc.
+```
+
+`isResumablePending(result)` = `result?.pending === true && typeof result?.task_id === 'string'`
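
Written out as a TypeScript type guard, the predicate above might look like the sketch below. The `PendingResult` interface is an assumption based on the daemon response shown in section 1.1; the real `paused-runs.ts` may differ.

```typescript
// Assumed shape of the daemon's async response (fields taken from section 1.1).
interface PendingResult {
  pending: true;
  task_id: string;
  poll_url?: string;
  estimated_seconds?: number;
}

// True only for objects with pending === true and a string task_id.
function isResumablePending(result: unknown): result is PendingResult {
  if (typeof result !== "object" || result === null) return false;
  const r = result as Record<string, unknown>;
  return r["pending"] === true && typeof r["task_id"] === "string";
}
```

Using a type guard means the caller gets a typed `task_id` after the check, so a separate cast is not needed when building the KV key.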
+
+#### 3.2.2 callback URL injection (the layer in front of claude_api)
+
+Problem: the claude_api container must include `callback_url` when it sends the request to the daemon. But task_id is assigned by the daemon, so the URL can only carry run_id, and the daemon fills in task_id when it calls back:
+
+`callback_url = https://cypher.arcrun.dev/workflows/resume?run_id={current_run_id}`
+
+However, cypher-executor looks up paused state by task_id (one run may have several pending tasks), so the callback URL should be:
+
+`callback_url = https://cypher.arcrun.dev/workflows/resume` (no query string; task_id goes in the body)
+
+**Where to implement**: before graph-executor calls claude_api, automatically inject `callback_url` into mergedContext:
+
+```ts
+if (node.componentId === 'claude_api' && this.env?.PUBLIC_BASE_URL) {
+  mergedContext.callback_url = `${this.env.PUBLIC_BASE_URL}/workflows/resume`;
+}
+```
+
+> Matching on a hard-coded componentId is hacky for now; once the component contract gains a `supports_async_callback: true` flag, this becomes generic.
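
As a rough illustration of that generic variant, injection could key off the contract flag instead of the component ID. This is a sketch of a possible future shape only: `ComponentContract` and `injectCallbackUrl` are hypothetical names, and `supports_async_callback` does not exist in the component contract yet.

```typescript
// Hypothetical component contract carrying the proposed flag.
interface ComponentContract {
  id: string;
  supports_async_callback?: boolean;
}

// Returns a new context with callback_url added when the component opts in
// and a public base URL is configured; otherwise returns the context unchanged.
function injectCallbackUrl(
  context: Record<string, unknown>,
  contract: ComponentContract,
  publicBaseUrl: string | undefined,
): Record<string, unknown> {
  if (!contract.supports_async_callback || !publicBaseUrl) return context;
  const base = publicBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return { ...context, callback_url: `${base}/workflows/resume` };
}
```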
+
+#### 3.2.3 resume endpoint
+
+`POST /workflows/resume`:
+
+```ts
+interface ResumeRequest {
+  task_id: string;          // assigned by the daemon
+  success: boolean;
+  data?: { text: string };  // same shape as the synchronous response
+  error?: string;
+}
+```
+
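+Handling:
+1. Load state from EXEC_CONTEXT KV at `paused_run:{task_id}`
+2. Not found (expired / duplicate callback) → return 200 + log
+3. Use the result from the callback as the paused node's output
+4. Rebuild GraphExecutor and continue from the next node
+5. Write the full trace once the run finishes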
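
The first three steps can be sketched with an in-memory `Map` standing in for the EXEC_CONTEXT KV namespace. Everything here is a simplified assumption for illustration: `PausedState`, `CallbackBody`, `ResumeOutcome`, and `handleResume` are hypothetical, and the real handler would rebuild GraphExecutor where the comment indicates.

```typescript
interface PausedState {
  run_id: string;
  paused_node_id: string;
  context: Record<string, unknown>;
}

interface CallbackBody {
  task_id: string;
  success: boolean;
  data?: { text: string };
  error?: string;
}

type ResumeOutcome =
  | { noop: true; reason: string }
  | { noop: false; run_id: string; paused_node_id: string; node_output: CallbackBody };

// In-memory stand-in for the EXEC_CONTEXT KV namespace (assumption for this sketch).
const pausedRuns = new Map<string, PausedState>();

// Read the state and delete it right away, so a duplicate callback finds nothing.
function consumePausedRun(taskId: string): PausedState | undefined {
  const key = `paused_run:${taskId}`;
  const state = pausedRuns.get(key);
  if (state) pausedRuns.delete(key);
  return state;
}

function handleResume(body: CallbackBody): ResumeOutcome {
  const state = consumePausedRun(body.task_id);
  if (!state) return { noop: true, reason: "state missing or expired" };
  // A real implementation would rebuild GraphExecutor here and continue
  // from the node after state.paused_node_id, using `body` as its output.
  return { noop: false, run_id: state.run_id, paused_node_id: state.paused_node_id, node_output: body };
}
```

One caveat on the design choice: with a `Map`, read-then-delete is effectively atomic, but Workers KV is eventually consistent, so in production this pattern narrows the duplicate-callback window rather than closing it completely.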
+
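+**Problem: after resume, there is no way to reply to the original client.** The user originally called `/cypher/execute` (synchronous), and the connection ended once `{paused, task_id}` was returned; when the resumed run finishes, the result has nowhere to go.
+
+**v1 solution**: after resume, write the result to `analytics_kv` or D1; **the user must query it actively**. Simple, but poor UX.
+**v2 idea**: after resume, send another webhook to the original client (the client supplies final_callback_url when triggering).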
+
+---
+
+## 4. Scope
+
+**In scope for this SDD:**
+- 4.1 callback_url support in daemon `/mira/execute`
+- 4.2 cypher-executor detects pending + persists paused state
+- 4.3 cypher-executor `/workflows/resume` endpoint
+- 4.4 Automatic callback_url injection (claude_api scenario)
+- 4.5 End-to-end test of the wiki synthesis workflow with a long draft
+
+**Out of scope for this SDD:**
+- nested pending (v2)
+- pending inside foreach (v2)
+- final_callback to the original client (v2)
+- poll_task component (wishlist)
+
+---
+
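+## 5. Acceptance criteria
+
+1. Feed the wiki synthesis workflow a 5KB+ draft; after it finishes, the wiki page is written to KBDB (the trace no longer shows a `pending` false success)
+2. The trace has a `paused` record showing the task_id
+3. cypher-executor picks up the paused state and continues within 5s of the daemon firing the callback
+4. Paused state in KV with no callback for 24h expires automatically (check the KV TTL listing)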
+
+---
+
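+## 6. Risks
+
+| Risk | Mitigation |
+|---|---|
+| cypher-executor restarts when the daemon callback arrives → state is still in KV, OK | KV persistence |
+| Duplicate callback for the same task_id (network retry) → downstream runs twice | resume endpoint is idempotent: it deletes the KV entry immediately after reading the state, so a duplicate callback finds no state |
+| daemon callback fails (network) | 3 retries + 24h GC on the daemon side; beyond that, manual intervention is required (accepted for v1) |
+| Paused state contains sensitive data (partner key) | KV has a 24h TTL; no plaintext secrets are written (the existing credential injection resolves secrets just before execution, and the paused state stores the pre-execution context, so secrets are not yet resolved) |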
+
+---
+
+## 7. Changelog
+
+| Version | Date | Changes |
+|---|---|---|
+| v1.0 | 2026-05-07 | Initial version. Dogfooding wiki synthesis hit the daemon's async mode → added a resumable workflow runtime. v1 covers only single-node pending + claude_api callback injection. |
@@ -0,0 +1,61 @@
+# Tasks — Resumable Workflow
+
+> Corresponding SDD: [design.md](design.md)
+> Last updated: 2026-05-07
+
+**Status legend**: `[ ]` to do / `[🔄]` in progress / `[x]` done
+
+---
+
+## Phase 1: callback support on the Mira daemon side
+
+- [x] 1.1 Modify `/opt/mira/mira-daemon.js` (Hetzner mira container) so `/execute` accepts `params.callback_url`
+- [x] 1.2 fireCallback function: POST to callback_url when the task is done/failed, body = `{task_id, success, data?, error?}`
+- [x] 1.3 callback retry: 4 attempts (immediate + 1s/5s/30s backoff), log if all fail
+- [x] 1.4 Patch script written at `/tmp/patch-mira-daemon.py` and docker cp'd into the container (note: rebuilding the image loses the patch; re-patch or commit it properly into the Dockerfile/git repo)
+- [x] 1.5 Real end-to-end verification: daemon log shows `[Mira callback] task=task_2_... POST https://cypher.arcrun.dev/workflows/resume OK 200` (2026-05-07 07:24:04 + task_3 short test)
+
+## Phase 2: cypher-executor resumable runtime
+
+- [x] 2.1 Write `paused-runs.ts` (81 lines): persistPausedRun / loadPausedRun / consumePausedRun + the isResumablePending detector, 24h TTL
+- [x] 2.2 Modify the Component case in `graph-executor.ts`: detect pending → write KV + throw WorkflowPaused
+- [x] 2.3 Modify `cypher-handlers.ts`: catch WorkflowPaused → return `{success:true, paused:true, task_id, run_id, paused_node_id, trace, graph}`
+- [x] 2.4 Automatic callback_url injection: when componentId==='claude_api', mergedContext.callback_url = PUBLIC_BASE_URL or the default cypher.arcrun.dev/workflows/resume
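
Tasks 2.2 and 2.3 describe a throw-and-catch pattern: the executor signals the pause with an exception and the handler turns it into a normal response. A minimal sketch of that pattern follows, assuming a `WorkflowPaused` error that carries the three IDs. The class fields and the `runToResponse` helper are illustrative, not the contents of `graph-executor.ts` or `cypher-handlers.ts`, and the sketch omits `trace` and `graph` from the response.

```typescript
// Hypothetical error type used to unwind the executor when a node is pending.
class WorkflowPaused extends Error {
  readonly taskId: string;
  readonly runId: string;
  readonly pausedNodeId: string;

  constructor(taskId: string, runId: string, pausedNodeId: string) {
    super(`workflow paused on task ${taskId}`);
    Object.setPrototypeOf(this, WorkflowPaused.prototype); // keeps instanceof working on ES5 targets
    this.name = "WorkflowPaused";
    this.taskId = taskId;
    this.runId = runId;
    this.pausedNodeId = pausedNodeId;
  }
}

// Handler side: a pause is reported as a successful, paused response;
// any other error propagates unchanged.
function runToResponse(run: () => unknown): Record<string, unknown> {
  try {
    return { success: true, result: run() };
  } catch (err) {
    if (err instanceof WorkflowPaused) {
      return {
        success: true,
        paused: true,
        task_id: err.taskId,
        run_id: err.runId,
        paused_node_id: err.pausedNodeId,
      };
    }
    throw err;
  }
}
```

Throwing lets the pause unwind through nested executor calls without each layer having to check a return value, at the cost of using an exception for a non-error outcome.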
+
+## Phase 3: resume endpoint
+
+- [x] 3.1 Write `routes/resume.ts`: POST /workflows/resume, consumePausedRun → resumeFromPaused
+- [x] 3.2 Add a `resumeFromPaused()` method to graph-executor: treat callback_result as the paused node's output + spread it into ctx + continue from the downstream nodes
+- [x] 3.3 Idempotency verified: the second callback returns `{noop:true, reason:"state 不存在或過期"}` (the reason reads "state does not exist or has expired")
+- [x] 3.4 Deployed cypher-executor v0580980b
+- [x] 3.5 Mount /workflows/resume in index.ts
+
+## Phase 4: claude_api container passes callback_url through
+
+- [x] 4.1 Modify `claude_api/main.go`: add CallbackURL to Input; change the default timeout to 120s
+- [x] 4.2 Rebuild wasm + redeploy claude-api.arcrun.dev (v f926e3dd)
+- [x] 4.3 Real end-to-end verification: daemon receives callback_url → POSTs to cypher-executor/workflows/resume after the task is done → 200 OK
+
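+## Phase 5: end-to-end integration test
+
+- [ ] 5.1 Use MCP `u6u_execute_workflow` to run wiki synthesis with a 5KB+ draft
+- [ ] 5.2 The first response should be `{paused, task_id, run_id}`
+- [ ] 5.3 Wait for the daemon callback to arrive (log shows a hit on /workflows/resume)
+- [ ] 5.4 Confirm the wiki page is actually written to KBDB (even though the original MCP call has disconnected)
+- [ ] 5.5 The trace contains complete node records (paused → resumed)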
+
+---
+
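+## Risk tracking
+
+- Risk 1: when the daemon callback arrives, cypher.arcrun.dev is not yet awake (CF Worker cold start) → the first retry catches it (covered by the daemon retry policy)
+- Risk 2: v1 has no final_callback to the original client → the user has to check status actively
+  - Accepted: the Mira timeline (河道) UI can periodically refetch the wiki page, or use the existing KBDB trigger mechanism
+  - v2 adds final_callback to handle this uniformly
+
+## Recorded for v2
+
+- nested pending (multiple paused nodes in one run)
+- pending inside foreach (item-level resume)
+- final_callback to the original client (final_callback_url supplied at trigger time)
+- poll_task component (for external APIs without webhooks)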