{
  "base_url": "https://talk.nervos.org",
  "generated_at": "2026-04-29T02:53:08.560619+00:00",
  "since": "2026-04-28T02:53:04.351278+00:00",
  "until": "2026-04-29T02:53:04.351278+00:00",
  "window_hours": 24,
  "topics": [
    {
      "topic_id": 8572,
      "title": "TeamCKB Dev Log (Updated: Apr 29, 2026)",
      "slug": "teamckb-dev-log-updated-apr-29-2026",
      "url": "https://talk.nervos.org/t/teamckb-dev-log-updated-apr-29-2026/8572",
      "created_at": "2024-12-26T07:32:39.609000+00:00",
      "last_posted_at": "2026-04-29T02:01:11.417000+00:00",
      "category_id": 32,
      "tags": [
        "CKB",
        "CKB-VM"
      ],
      "posters": [
        "Original Poster, Most Recent Poster",
        "Frequent Poster",
        "Frequent Poster",
        "Frequent Poster",
        "Frequent Poster"
      ],
      "recent_posts": [
        {
          "post_id": 24063,
          "post_number": 35,
          "topic_id": 8572,
          "topic_title": "TeamCKB Dev Log (Updated: Apr 29, 2026)",
          "topic_slug": "teamckb-dev-log-updated-apr-29-2026",
          "author": "CKBdev",
          "created_at": "2026-04-29T02:01:11.417000+00:00",
          "updated_at": "2026-04-29T02:01:11.417000+00:00",
          "reply_to_post_number": 34,
          "url": "https://talk.nervos.org/t/teamckb-dev-log-updated-apr-29-2026/8572/35",
          "content_text": "Updates\nFeatures\nDifferential Testing Framework for ckb-vm\nImplemented a differential test framework for ckb-vm. This provides a stronger foundation for validating optimizations and future architecture changes: GitHub - yuqiliu617/ckb-vm-contrib at differential-test · GitHub\nCKB DAO Treasury Design & Voting Research\nresearch on the CKB DAO Treasury design and voting settlement, with multiple directions under active discussion:\nOption 1: DAO-Bound Voting with Off-Chain State\nExplores a governance model where voting power is bound to Nervos DAO deposits, while proposal state, voting records, and tally data are maintained off-chain and committed on-chain through verifiable state roots.\nThis direction focuses on improving scalability while keeping governance state verifiable.\nOption 2: Experimental MVP Implementation\nBuilt an experimental MVP to validate core DAO Treasury workflows and voting-related transaction construction. Details see: ckb/dao-treasury at dao-treasury · chenyukang/ckb · GitHub\nOption 3: Rollup-inspired Voting Settlement\nExplores a rollup-inspired approach for voting settlement, where voting data is batched and compressed into verifiable state updates.\nThis direction focuses on improving settlement efficiency while preserving on-chain verifiability under CKB’s UTXO model.\nKey Question Identified & Follow-Up Research\nHow to handle on-chain settlement for the voting process. 
Continued follow-up research based on the design discussions:\nEvaluation of the CKB partial transaction design against the newly proposed voting architecture.\nA zkVM-based solution for voting.\nImprovements & Fixes\nCryptography & Performance\nckb-vm ARM64 optimization\nA series of optimizations were merged into ckb-vm to improve performance on ARM64-based hardware:\nSHxADD / ADDUW Instruction Optimization on AArch64 ckb-vm#504\nMULHSU Instruction Optimization on AArch64 ckb-vm#505\nDivision and Remainder Instruction Optimization on AArch64 ckb-vm#506\nAdd Fuzz Tests for RVM Instructions ckb-vm#507\nFinished optimization for the Module-Lattice-Based Digital Signature Algorithm (ML-DSA).\nThis further prepares the network for post-quantum cryptographic standards: GitHub - XuJiandong/signatures at use-opt-shake128 · GitHub\nInfra & Tooling\nUpgraded the CKB toolchain to 1.95.0: [rust-toolchain] Upgrade Rust toolchain to 1.95.0 #5175\nAdded SKILL.md for AI agents in ckb-debugger, making it easier for agents to assist devs in debugging CKB scripts: Add SKILL.md for AI agents ckb-standalone-debugger#202\nFixed a rich-indexer prefix-search upper-bound issue with leading zero bytes: Incorrect prefix search results in Rich Indexer due to get_binary_upper_boundary() dropping leading zero bytes. #5165\nRe-organized molecule’s Cargo workspace structure: Organize Rust crates into a Cargo workspace molecule#115\nSynced the ckb musl fork (GitHub - nervosnetwork/musl: A fork of https://git.musl-libc.org/cgit/musl with Nervos CKB changes · GitHub) with upstream: GitHub - mohanson-fork/musl at newest · GitHub\nNetworking & Connectivity\nContinued QUIC support for Tentacle:\nQUIC / UDP address parsing: quic: support quic/udp address parsing tentacle#430\nCertificate generation and verification, plus a simple QUIC smoke test: quic: cert generating and verifying, simple quic smoke test tentacle#431\nThe underlying P2P network, Tentacle, is moving closer to full QUIC support. QUIC (built on UDP) offers faster handshakes and supports connection migration, unlike standard TCP.\nIn Pipeline\nCore Maintenance & Release Prep\nRocksDB key schema refactor: [BREAKING CHANGE] Refactor rocksdb schema to reduce Read/Write Amplification #5085\nUse the differential test framework to verify CKB-optimized libraries, including sha256, sha512, fip202, and others.\nPrepare for the next CKB release.\nNetworking\nContinue QUIC support for tentacle:\nRustls verifier for the QUIC certificate\nQUIC session implementation\nServiceBuilder integration\nGovernance PoC\nContinue the zkVM-based voting system, including the spec and demo / PoC.\nReview the previous open transaction design and continue investigating the partial transaction approach.",
          "content_html": "<h1><a name=\"p-24063-updates-1\" class=\"anchor\" href=\"#p-24063-updates-1\" aria-label=\"Heading link\"></a>Updates</h1>\n<h2><a name=\"p-24063-features-2\" class=\"anchor\" href=\"#p-24063-features-2\" aria-label=\"Heading link\"></a><em>Features</em></h2>\n<p><strong>Differential Testing Framework for ckb-vm</strong></p>\n<p>Implemented a differential test framework for ckb-vm. This provides a stronger foundation for validating optimizations and future architecture changes: <a href=\"https://github.com/yuqiliu617/ckb-vm-contrib/tree/differential-test\" class=\"inline-onebox\" rel=\"noopener nofollow ugc\">GitHub - yuqiliu617/ckb-vm-contrib at differential-test · GitHub</a></p>\n<p><strong>CKB DAO Treasury Design &amp; Voting Research</strong></p>\n<p>research on the CKB DAO Treasury design and voting settlement, with multiple directions under active discussion:</p>\n<ul>\n<li><strong>Option 1</strong>: <strong>DAO-Bound Voting with Off-Chain State</strong></li>\n</ul>\n<p>Explores a governance model where voting power is bound to Nervos DAO deposits, while proposal state, voting records, and tally data are maintained off-chain and committed on-chain through verifiable state roots.</p>\n<p>This direction focuses on improving scalability while keeping governance state verifiable.</p>\n<ul>\n<li><strong>Option 2: Experimental MVP Implementation</strong></li>\n</ul>\n<p>Built an experimental MVP to validate core DAO Treasury workflows and voting-related transaction construction. 
Details see: <a href=\"https://github.com/chenyukang/ckb/blob/dao-treasury/dao-treasury\" class=\"inline-onebox\" rel=\"noopener nofollow ugc\">ckb/dao-treasury at dao-treasury · chenyukang/ckb · GitHub</a></p>\n<ul>\n<li><strong>Option 3: Rollup-inspired Voting Settlement</strong></li>\n</ul>\n<p>Explores a rollup-inspired approach for voting settlement, where voting data is batched and compressed into verifiable state updates.</p>\n<p>This direction focuses on improving settlement efficiency while preserving on-chain verifiability under CKB’s UTXO model.</p>\n<p><strong>Key Question Identified &amp; Follow-Up Research</strong></p>\n<p>How to handle on-chain settlement for the voting process. Continued follow-up research based on the design discussions:</p>\n<ul>\n<li>Evaluation of CKB partial transaction design based on the newly proposed voting architecture.</li>\n<li>A zkVM-based solution for voting.</li>\n</ul>\n<hr>\n<h2><a name=\"p-24063-improvements-fixes-3\" class=\"anchor\" href=\"#p-24063-improvements-fixes-3\" aria-label=\"Heading link\"></a><em>Improvements &amp; Fixes</em></h2>\n<p><strong>Cryptography &amp; Performance</strong></p>\n<ul>\n<li>ckb-vm ARM64 optimization</li>\n</ul>\n<p>A series of optimizations were merged into ckb-vm to improve performance on ARM64-based hardware:</p>\n<ul>\n<li><a href=\"https://github.com/nervosnetwork/ckb-vm/pull/504\" rel=\"noopener nofollow ugc\">SHxADD / ADDUW Instruction Optimization on AArch64 ckb-vm#504</a></li>\n<li><a href=\"https://github.com/nervosnetwork/ckb-vm/pull/505\" rel=\"noopener nofollow ugc\">MULHSU Instruction Optimization on AArch64 ckb-vm#505</a></li>\n<li><a href=\"https://github.com/nervosnetwork/ckb-vm/pull/506\" rel=\"noopener nofollow ugc\">Division and Remainder Instruction Optimization on AArch64 ckb-vm#506</a></li>\n<li><a href=\"https://github.com/nervosnetwork/ckb-vm/pull/507\" rel=\"noopener nofollow ugc\">Add Fuzz Tests for RVM Instructions ckb-vm#507</a></li>\n<li>Finished 
optimization for Module-Lattice-Based Digital Signature Algorithm (ML-DSA).</li>\n</ul>\n<p>This further prepares the network for post-quantum cryptographic standards: <a href=\"https://github.com/XuJiandong/signatures/tree/use-opt-shake128\" class=\"inline-onebox\" rel=\"noopener nofollow ugc\">GitHub - XuJiandong/signatures at use-opt-shake128 · GitHub</a></p>\n<p><strong>Infra &amp; Tooling</strong></p>\n<ul>\n<li>Upgraded CKB toolchain to 1.95.0: <a href=\"https://github.com/nervosnetwork/ckb/pull/5175\" rel=\"noopener nofollow ugc\">[rust-toolchain] Upgrade Rust toolchain to 1.95.0 #5175</a></li>\n<li>Added <code>SKILL.md</code> for AI agents in ckb-debugger, making it easier for agents to assist devs in debugging CKB scripts: <a href=\"https://github.com/nervosnetwork/ckb-standalone-debugger/pull/202\" rel=\"noopener nofollow ugc\">Add SKILL.md for AI agents ckb-standalone-debugger#202</a></li>\n<li>Fixed rich-indexer prefix search upper bound leading zero bytes issue: <a href=\"https://github.com/nervosnetwork/ckb/issues/5165\" rel=\"noopener nofollow ugc\">Incorrect prefix search results in Rich Indexer due to get_binary_upper_boundary() dropping leading zero bytes. 
#5165</a></li>\n<li>Re-organized molecule’s Cargo workspace structure: <a href=\"https://github.com/nervosnetwork/molecule/pull/115\" rel=\"noopener nofollow ugc\">Organize Rust crates into a Cargo workspace molecule#115</a></li>\n<li>Synced the <a href=\"https://github.com/nervosnetwork/musl\" rel=\"noopener nofollow ugc\">ckb musl</a> fork (<a href=\"https://github.com/nervosnetwork/musl\" class=\"inline-onebox\" rel=\"noopener nofollow ugc\">GitHub - nervosnetwork/musl: A fork of https://git.musl-libc.org/cgit/musl with Nervos CKB changes · GitHub</a>) with upstream: <a href=\"https://github.com/mohanson-fork/musl/tree/newest\" class=\"inline-onebox\" rel=\"noopener nofollow ugc\">GitHub - mohanson-fork/musl at newest · GitHub</a></li>\n</ul>\n<p><strong>Networking &amp; Connectivity</strong></p>\n<ul>\n<li>Continued QUIC support for Tentacle:\n<ul>\n<li>QUIC / UDP address parsing: <a href=\"https://github.com/nervosnetwork/tentacle/pull/430\" rel=\"noopener nofollow ugc\">quic: support quic/udp address parsing tentacle#430</a></li>\n<li>Certificate generation and verification, plus a simple QUIC smoke test: <a href=\"https://github.com/nervosnetwork/tentacle/pull/431\" rel=\"noopener nofollow ugc\">quic: cert generating and verifying, simple quic smoke test tentacle#431</a></li>\n</ul>\n</li>\n</ul>\n<p>The underlying P2P network, Tentacle, is moving closer to full QUIC support. QUIC (built on UDP) offers faster handshakes and supports connection migration, unlike standard TCP.</p>\n<h2><a name=\"p-24063-in-pipeline-4\" class=\"anchor\" href=\"#p-24063-in-pipeline-4\" aria-label=\"Heading link\"></a><em>In Pipeline</em></h2>\n<p><strong>Core Maintenance &amp; Release Prep</strong></p>\n<ul>\n<li>RocksDB key schema refactor: <a href=\"https://github.com/nervosnetwork/ckb/pull/5085\" rel=\"noopener nofollow ugc\">[BREAKING CHANGE] Refactor rocksdb schema to reduce Read/Write Amplification #5085</a></li>\n<li>Use the differential test framework to verify CKB-optimized libraries, including sha256, sha512, fip202, and others.</li>\n<li>Prepare for the next CKB release.</li>\n</ul>\n<p><strong>Networking</strong></p>\n<ul>\n<li>Continue QUIC support for tentacle:\n<ul>\n<li>Rustls verifier for QUIC certificate</li>\n<li>QUIC session implementation</li>\n<li>ServiceBuilder integration</li>\n</ul>\n</li>\n</ul>\n<p><strong>Governance PoC</strong></p>\n<ul>\n<li>Continue the zkVM-based voting system, including spec and demo / PoC.</li>\n<li>Review the previous open transaction design and continue investigating the partial transaction approach.</li>\n</ul>",
          "like_count": 0,
          "quote_count": 0
        }
      ]
    },
    {
      "topic_id": 10214,
      "title": "Spark Program | CKB-VM Sail Formal Verification — Proving CKB-VM RISC-V Instruction Equivalence via Sail Specification and Coq Theorem Prover / CKB-VM Sail 形式化验证 — 基于 Sail 规范与 Coq 定理证明器的 CKB-VM RISC-V 指令等价性证明",
      "slug": "spark-program-ckb-vm-sail-formal-verification-proving-ckb-vm-risc-v-instruction-equivalence-via-sail-specification-and-coq-theorem-prover-ckb-vm-sail-sail-coq-ckb-vm-risc-v",
      "url": "https://talk.nervos.org/t/spark-program-ckb-vm-sail-formal-verification-proving-ckb-vm-risc-v-instruction-equivalence-via-sail-specification-and-coq-theorem-prover-ckb-vm-sail-sail-coq-ckb-vm-risc-v/10214",
      "created_at": "2026-04-27T21:14:40.905000+00:00",
      "last_posted_at": "2026-04-29T02:00:41.314000+00:00",
      "category_id": 49,
      "tags": [
        "Spark-Program"
      ],
      "posters": [
        "Original Poster",
        "Frequent Poster",
        "Frequent Poster",
        "Most Recent Poster"
      ],
      "recent_posts": [
        {
          "post_id": 24054,
          "post_number": 3,
          "topic_id": 10214,
          "topic_title": "Spark Program | CKB-VM Sail Formal Verification — Proving CKB-VM RISC-V Instruction Equivalence via Sail Specification and Coq Theorem Prover / CKB-VM Sail 形式化验证 — 基于 Sail 规范与 Coq 定理证明器的 CKB-VM RISC-V 指令等价性证明",
          "topic_slug": "spark-program-ckb-vm-sail-formal-verification-proving-ckb-vm-risc-v-instruction-equivalence-via-sail-specification-and-coq-theorem-prover-ckb-vm-sail-sail-coq-ckb-vm-risc-v",
          "author": "ArthurZhang",
          "created_at": "2026-04-28T03:13:16.520000+00:00",
          "updated_at": "2026-04-28T03:15:44.546000+00:00",
          "reply_to_post_number": null,
          "url": "https://talk.nervos.org/t/spark-program-ckb-vm-sail-formal-verification-proving-ckb-vm-risc-v-instruction-equivalence-via-sail-specification-and-coq-theorem-prover-ckb-vm-sail-sail-coq-ckb-vm-risc-v/10214/3",
          "content_text": "This looks like a valuable direction. A further verified CKB-VM foundation would strengthen the whole CKB scripting stack. Proving instruction-level equivalence against the Sail RISC-V specification feels like the kind of deep infrastructure work that may not be immediately visible to application developers, but i reckon it compounds over time.\nBest of luck.",
          "content_html": "<p>This looks like a valuable direction. A further verified CKB-VM foundation would strengthen the whole CKB scripting stack. Proving instruction-level equivalence against the Sail RISC-V specification feels like the kind of deep infrastructure work that may not be immediately visible to application developers, but i reckon it compounds over time.</p>\n<p>Best of luck.</p>",
          "like_count": 0,
          "quote_count": 0
        },
        {
          "post_id": 24062,
          "post_number": 4,
          "topic_id": 10214,
          "topic_title": "Spark Program | CKB-VM Sail Formal Verification — Proving CKB-VM RISC-V Instruction Equivalence via Sail Specification and Coq Theorem Prover / CKB-VM Sail 形式化验证 — 基于 Sail 规范与 Coq 定理证明器的 CKB-VM RISC-V 指令等价性证明",
          "topic_slug": "spark-program-ckb-vm-sail-formal-verification-proving-ckb-vm-risc-v-instruction-equivalence-via-sail-specification-and-coq-theorem-prover-ckb-vm-sail-sail-coq-ckb-vm-risc-v",
          "author": "xingtianchunyan",
          "created_at": "2026-04-29T02:00:41.314000+00:00",
          "updated_at": "2026-04-29T02:00:41.314000+00:00",
          "reply_to_post_number": null,
          "url": "https://talk.nervos.org/t/spark-program-ckb-vm-sail-formal-verification-proving-ckb-vm-risc-v-instruction-equivalence-via-sail-specification-and-coq-theorem-prover-ckb-vm-sail-sail-coq-ckb-vm-risc-v/10214/4",
          "content_text": "@TinyuengKwan 你好，感谢提交 ckb-vm-sail-verify 的提案。这个方向非常关键：CKB-VM 是 CKB 的执行层安全基石，而你提出的 “Sail（官方规范）+ Coq（形式化证明）+ 差分测试（工程验证）” 的双轨方案，也能看出你在 Sail/ACT 生态中有相对稀缺的一手经验与工程积累。\n需要说明的是，该项目已明显超出星火计划的支持范围（以 低门槛、快节奏 的方式帮助社区开发者启动小型原型项目）。如果你仍希望向委员会正式递交该项目，那么在提交委员会正式评审前，我个人建议你按 Spark 的预审要求补充/澄清以下几点信息，让评审能更快收敛到“本期资助是什么、怎么验收、验收失败影响什么”。\n1) 验证方案（How to Verify）——写成“可复现步骤 + 通过/失败标准 + 影响面（blast radius）”\n你在主贴里讲清楚了技术架构与实现设想，但仍缺少一段面向评审者/社区的 “低成本、可重复” 的验收章节。建议新增一个独立章节 How to Verify（或在 repo 内提供 VERIFICATION.md），至少包含：\n1.1 可复现步骤（评审者照着做即可）\n建议最少做到以下级别：\n环境一键化：提供 Docker/Nix/脚本，确保在干净 Linux 环境下可以复现（并 pin 关键依赖版本：Sail/Coq/OPAM/Rust 等）。\n生成与构建：从 sail-riscv 生成 Coq 产物的命令（以及预期输出在哪里）。\n证明检查：如何运行 Coq proof check（命令 + 预期输出）。\n差分测试：如何运行 diff-test（命令 + 测试集来源 + mismatch 报告格式）。\n证据发布位置：每个里程碑的证据放在哪里（release tag / CI artifact / 报告 / 日志）。\n1.2 通过/失败标准（必须可判定）\n形式化验证最大的风险，是“最后没有一个可客观判定的结果”。请明确写清：\n本期承诺证明/验证的指令子集是什么（例如 RV64I 的某些指令集合；或只覆盖 CKB-VM 的某个 VERSION 模式——你提到 VERSION2 only 的倾向，需要明确）。\n对 Coq 证明的要求：是否允许 admit？若允许，允许到什么边界（建议明确为“0 admitted”或“admitted 上限”）。\n差分测试的要求：覆盖哪些测试集（riscv-tests / riscv-arch-test / 自建 corpus），以及遇到 mismatch 的处理策略（“发现问题即成果” vs “必须修到一致才算通过”）。\n1.3 影响面（blast radius）与“保证边界”\n由于 CKB-VM 属于执行层安全边界，请你在文档中非常明确地写清：\n保证了什么：例如“对某版本/某子集/某模型假设下的语义等价”。\n没有保证什么：例如 MOP 扩展、cycle 计费、syscall/ECALL 语义差异、flat 4MB memory model 的限制、并发/原子指令的模型化边界等。\n如果外部误解为“全量等价/全链安全已证明”，可能造成的误用风险是什么，以及你准备如何在 README/报告中避免过度解读。\n2) 结项后的维护计划（持续有效性）\n形式化验证的成果很容易因上游演进而失效。建议你补充：\n你承诺维护的窗口（例如覆盖到 ckb-vm 的哪些版本线；以及 sail-riscv 更新后的跟进方式）。\nCI 策略：是否会把 proof check + diff-test 做成持续集成（例如 weekly/nightly），并在 README 放 badge。\n交接方案：如果你后续无法维护，社区如何接手（依赖锁定、脚本、文档、最小复现路径）。\n3) 部署架构与安全性（供应链 / 可信边界 / 结果可审计性）\n该项目不是线上服务，但仍然有“可信边界”与“供应链安全”需要写清：\n工具链与依赖如何 pin 版本（OPAM/Coq/Sail 等），避免“换个时间/换台机器就跑不出来”。\n是否提供可重复构建说明（reproducible build）与产物校验方式（hash / release artifact）。\n对外发布时的“声明口径”：建议在最终报告中用一段固定文本写清适用范围与非目标（避免被当作全量安全证明）。\n请在针对以上建议优化后 @ 我一下，我们再把更新版本提交委员会进入正式评审流程。",
          "content_html": "<p><a class=\"mention\" href=\"/u/tinyuengkwan\">@TinyuengKwan</a> 你好，感谢提交 <strong><code>ckb-vm-sail-verify</code></strong> 的提案。这个方向非常关键：CKB-VM 是 CKB 的执行层安全基石，而你提出的 “Sail（官方规范）+ Coq（形式化证明）+ 差分测试（工程验证）” 的双轨方案，也能看出你在 Sail/ACT 生态中有相对稀缺的一手经验与工程积累。</p>\n<p>需要说明的是，该项目已明显超出星火计划的支持范围（以 <strong>低门槛、快节奏</strong> 的方式帮助社区开发者启动小型原型项目）。如果你仍希望向委员会正式递交该项目，那么在提交委员会正式评审前，我个人建议你按 Spark 的预审要求补充/澄清以下几点信息，让评审能更快收敛到“本期资助是什么、怎么验收、验收失败影响什么”。</p>\n<hr>\n<h3><a name=\"p-24062-h-1-how-to-verify-blast-radius-1\" class=\"anchor\" href=\"#p-24062-h-1-how-to-verify-blast-radius-1\" aria-label=\"Heading link\"></a>1)  验证方案（How to Verify）——写成“可复现步骤 + 通过/失败标准 + 影响面（blast radius）”</h3>\n<p>你在主贴里讲清楚了技术架构与实现设想，但仍缺少一段面向评审者/社区的 <strong>“低成本、可重复”</strong> 的验收章节。建议新增一个独立章节 <code>How to Verify</code>（或在 repo 内提供 <code>VERIFICATION.md</code>），至少包含：</p>\n<h4><a name=\"p-24062-h-11-2\" class=\"anchor\" href=\"#p-24062-h-11-2\" aria-label=\"Heading link\"></a>1.1 可复现步骤（评审者照着做即可）</h4>\n<p>建议最少做到以下级别：</p>\n<ol>\n<li><strong>环境一键化</strong>：提供 Docker/Nix/脚本，确保在干净 Linux 环境下可以复现（并 pin 关键依赖版本：Sail/Coq/OPAM/Rust 等）。</li>\n<li><strong>生成与构建</strong>：从 sail-riscv 生成 Coq 产物的命令（以及预期输出在哪里）。</li>\n<li><strong>证明检查</strong>：如何运行 Coq proof check（命令 + 预期输出）。</li>\n<li><strong>差分测试</strong>：如何运行 diff-test（命令 + 测试集来源 + mismatch 报告格式）。</li>\n<li><strong>证据发布位置</strong>：每个里程碑的证据放在哪里（release tag / CI artifact / 报告 / 日志）。</li>\n</ol>\n<h4><a name=\"p-24062-h-12-3\" class=\"anchor\" href=\"#p-24062-h-12-3\" aria-label=\"Heading link\"></a>1.2 通过/失败标准（必须可判定）</h4>\n<p>形式化验证最大的风险，是“最后没有一个可客观判定的结果”。请明确写清：</p>\n<ul>\n<li>本期承诺证明/验证的<strong>指令子集</strong>是什么（例如 RV64I 的某些指令集合；或只覆盖 CKB-VM 的某个 VERSION 模式——你提到 VERSION2 only 的倾向，需要明确）。</li>\n<li>对 Coq 证明的要求：是否允许 <code>admit</code>？若允许，允许到什么边界（建议明确为“0 admitted”或“admitted 上限”）。</li>\n<li>差分测试的要求：覆盖哪些测试集（riscv-tests / riscv-arch-test / 自建 corpus），以及遇到 mismatch 的处理策略（“发现问题即成果” vs “必须修到一致才算通过”）。</li>\n</ul>\n<h4><a name=\"p-24062-h-13-blast-radius-4\" class=\"anchor\" 
href=\"#p-24062-h-13-blast-radius-4\" aria-label=\"Heading link\"></a>1.3 影响面（blast radius）与“保证边界”</h4>\n<p>由于 CKB-VM 属于执行层安全边界，请你在文档中非常明确地写清：</p>\n<ul>\n<li><strong>保证了什么</strong>：例如“对某版本/某子集/某模型假设下的语义等价”。</li>\n<li><strong>没有保证什么</strong>：例如 MOP 扩展、cycle 计费、syscall/ECALL 语义差异、flat 4MB memory model 的限制、并发/原子指令的模型化边界等。</li>\n<li>如果外部误解为“全量等价/全链安全已证明”，可能造成的误用风险是什么，以及你准备如何在 README/报告中避免过度解读。</li>\n</ul>\n<hr>\n<h3><a name=\"p-24062-h-2-5\" class=\"anchor\" href=\"#p-24062-h-2-5\" aria-label=\"Heading link\"></a>2) 结项后的维护计划（持续有效性）</h3>\n<p>形式化验证的成果很容易因上游演进而失效。建议你补充：</p>\n<ul>\n<li>你承诺维护的窗口（例如覆盖到 <code>ckb-vm</code> 的哪些版本线；以及 sail-riscv 更新后的跟进方式）。</li>\n<li>CI 策略：是否会把 proof check + diff-test 做成持续集成（例如 weekly/nightly），并在 README 放 badge。</li>\n<li>交接方案：如果你后续无法维护，社区如何接手（依赖锁定、脚本、文档、最小复现路径）。</li>\n</ul>\n<hr>\n<h3><a name=\"p-24062-h-3-6\" class=\"anchor\" href=\"#p-24062-h-3-6\" aria-label=\"Heading link\"></a>3) 部署架构与安全性（供应链 / 可信边界 / 结果可审计性）</h3>\n<p>该项目不是线上服务，但仍然有“可信边界”与“供应链安全”需要写清：</p>\n<ul>\n<li>工具链与依赖如何 pin 版本（OPAM/Coq/Sail 等），避免“换个时间/换台机器就跑不出来”。</li>\n<li>是否提供可重复构建说明（reproducible build）与产物校验方式（hash / release artifact）。</li>\n<li>对外发布时的“声明口径”：建议在最终报告中用一段固定文本写清适用范围与非目标（避免被当作全量安全证明）。</li>\n</ul>\n<hr>\n<p>请在针对以上建议优化后 @ 我一下，我们再把更新版本提交委员会进入正式评审流程。</p>",
          "like_count": 0,
          "quote_count": 0
        }
      ]
    },
    {
      "topic_id": 10212,
      "title": "Spark Program | Dular",
      "slug": "spark-program-dular",
      "url": "https://talk.nervos.org/t/spark-program-dular/10212",
      "created_at": "2026-04-26T16:35:20.268000+00:00",
      "last_posted_at": "2026-04-29T01:49:03.597000+00:00",
      "category_id": 49,
      "tags": [
        "Spark-Program",
        "Submitted"
      ],
      "posters": [
        "Original Poster",
        "Frequent Poster",
        "Most Recent Poster"
      ],
      "recent_posts": [
        {
          "post_id": 24061,
          "post_number": 4,
          "topic_id": 10212,
          "topic_title": "Spark Program | Dular",
          "topic_slug": "spark-program-dular",
          "author": "xingtianchunyan",
          "created_at": "2026-04-29T01:49:03.597000+00:00",
          "updated_at": "2026-04-29T01:49:36.043000+00:00",
          "reply_to_post_number": 3,
          "url": "https://talk.nervos.org/t/spark-program-dular/10212/4",
          "content_text": "@duongja 你好，感谢你提交 Dular 项目提案，并根据预审建议补充了 How to Verify、预算拆分与风险说明等内容。整体方向（Fiber + RUSD/UDT + 真实在地试点）很契合 Spark 对“可落地、可验证”的资助导向。\n本项目当前状态暂定为 Pending，并非否定项目价值，而是表示：在进入下次正式评审/投票前，我们还需要你补齐两类“可验证凭据”，并纠正提案中关于 CKB 发放与汇率折算机制的表述，以避免后续沟通成本与验收争议。\n1) 请补充：Daraja 生产环境凭据的可视化证据。\n你已声明持有 STK Push + B2C 生产凭据，请通过私信发我一张 Daraja Developer Portal 的后台截图（API Key 字段可全部打码，但需保留 Production 环境标识、App 名称、创建时间）。这一步用于委员会内部尽调，不会公开。委员会在验证后，会在本帖中发布一条信息，说明已经核验。\n2) 请补充：现有 multi-hop RUSD payment 的可核验证据。\n为了让委员会/社区能低成本持续复核你的技术交付（而不是只在结项时“补材料”），请在主楼直接给出对应的 Fiber payment hash 或相关 CKB testnet tx hash，并简要说明 Fiber payment hash 与 CKB L1 explorer 的对应关系（避免读者误以为每笔 Fiber 支付都能在 CKB explorer 直查）。\n3) 关于 CKB 发放与汇率风险：请修正提案中的流程假设，并按 Spark 口径重写\n你在提案的 “CKB Disbursement & Exchange Rate Risk” 中写到：\n“The CKB amount per milestone will be calculated at the market rate on the day of disbursement, per standard Spark procedure.”\n这里需要更正：Spark 的标准口径不是“每次发放按当日汇率重新折算 CKB”。\nSpark 2026 资金口径：\n发放币种：当前周期 Spark 资助仅支持100% 以 CKB 形式发放。\n折算口径（锁定）：委员会通常以 USD 口径沟通资助额度以便横向对比；但若以 CKB 发放，则会在项目审批通过时点以当时参考汇率折算并确定该项目的 CKB 总额，并在对外决议/公告中公示。后续发放以约定的 CKB 数额为基准执行，不随每次发放当日价格重新折算。\n汇率风险：审批通过后至实际支出期间，CKB 价格波动导致的购买力变化风险由提案申请人/团队自行承担，这是CKB生态相关grants的惯例。对存在法币硬成本的项目，建议在收到款项后及时换汇/锁定硬成本，并在预算中清晰区分“硬成本 vs 人力成本”。\n下一步\n请根据上述内容优化提案后，在本帖回复“已更新”并@xingtianchunyan，并标明你更新了主楼的哪些小节/新增了哪些链接或附件。委员会会在信息齐备后尽快推进正式评审流程。\n祝好，\n行天\n代表星火计划委员会",
          "content_html": "<p><a class=\"mention\" href=\"/u/duongja\">@duongja</a> 你好，感谢你提交 Dular 项目提案，并根据预审建议补充了 How to Verify、预算拆分与风险说明等内容。整体方向（Fiber + RUSD/UDT + 真实在地试点）很契合 Spark 对“可落地、可验证”的资助导向。</p>\n<p>本项目当前状态暂定为 <strong>Pending</strong>，并非否定项目价值，而是表示：在进入下次正式评审/投票前，我们还需要你补齐两类“可验证凭据”，并纠正提案中关于 CKB 发放与汇率折算机制的表述，以避免后续沟通成本与验收争议。</p>\n<hr>\n<h4><a name=\"p-24061-h-1-daraja-1\" class=\"anchor\" href=\"#p-24061-h-1-daraja-1\" aria-label=\"Heading link\"></a>1) 请补充：Daraja 生产环境凭据的可视化证据。</h4>\n<p>你已声明持有 STK Push + B2C 生产凭据，请通过私信发我一张 Daraja Developer Portal 的后台截图（API Key 字段可全部打码，但需保留 Production 环境标识、App 名称、创建时间）。这一步用于委员会内部尽调，不会公开。委员会在验证后，会在本帖中发布一条信息，说明已经核验。</p>\n<hr>\n<h4><a name=\"p-24061-h-2-multi-hop-rusd-payment-2\" class=\"anchor\" href=\"#p-24061-h-2-multi-hop-rusd-payment-2\" aria-label=\"Heading link\"></a>2) 请补充：现有 multi-hop RUSD payment 的可核验证据。</h4>\n<p>为了让委员会/社区能低成本持续复核你的技术交付（而不是只在结项时“补材料”），请在主楼直接给出对应的 Fiber payment hash 或相关 CKB testnet tx hash，并简要说明 Fiber payment hash 与 CKB L1 explorer 的对应关系（避免读者误以为每笔 Fiber 支付都能在 CKB explorer 直查）。</p>\n<hr>\n<h4><a name=\"p-24061-h-3-ckb-spark-3\" class=\"anchor\" href=\"#p-24061-h-3-ckb-spark-3\" aria-label=\"Heading link\"></a>3) 关于 CKB 发放与汇率风险：请修正提案中的流程假设，并按 Spark 口径重写</h4>\n<p>你在提案的 “CKB Disbursement &amp; Exchange Rate Risk” 中写到：</p>\n<blockquote>\n<p>“The CKB amount per milestone will be calculated at the market rate on the day of disbursement, per standard Spark procedure.”</p>\n</blockquote>\n<p>这里需要更正：Spark 的标准口径不是“每次发放按当日汇率重新折算 CKB”。</p>\n<p><strong>Spark 2026 资金口径：</strong></p>\n<ol>\n<li><strong>发放币种</strong>：当前周期 Spark 资助仅支持<strong>100% 以 CKB 形式发放</strong>。</li>\n<li><strong>折算口径（锁定）</strong>：委员会通常以 USD 口径沟通资助额度以便横向对比；但若以 CKB 发放，则会在项目<strong>审批通过时点</strong>以当时参考汇率折算并确定该项目的 <strong>CKB 总额</strong>，并在对外决议/公告中公示。后续发放以约定的 CKB 数额为基准执行，不随每次发放当日价格重新折算。</li>\n<li><strong>汇率风险</strong>：审批通过后至实际支出期间，CKB 价格波动导致的购买力变化风险由提案申请人/团队自行承担，这是CKB生态相关grants的惯例。对存在法币硬成本的项目，建议在收到款项后及时换汇/锁定硬成本，并在预算中清晰区分“硬成本 vs 
人力成本”。</li>\n</ol>\n<hr>\n<h4><a name=\"p-24061-h-4\" class=\"anchor\" href=\"#p-24061-h-4\" aria-label=\"Heading link\"></a>下一步</h4>\n<p>请根据上述内容优化提案后，在本帖回复“已更新”并@xingtianchunyan，并标明你更新了主楼的哪些小节/新增了哪些链接或附件。委员会会在信息齐备后尽快推进正式评审流程。</p>\n<p>祝好，<br>\n行天<br>\n代表星火计划委员会</p>",
          "like_count": 0,
          "quote_count": 0
        }
      ]
    },
    {
      "topic_id": 10130,
      "title": "Introducing CKB Kickstarter: Decentralized All-or-Nothing Crowdfunding on Nervos CKB (Testnet MVP Live)",
      "slug": "introducing-ckb-kickstarter-decentralized-all-or-nothing-crowdfunding-on-nervos-ckb-testnet-mvp-live",
      "url": "https://talk.nervos.org/t/introducing-ckb-kickstarter-decentralized-all-or-nothing-crowdfunding-on-nervos-ckb-testnet-mvp-live/10130",
      "created_at": "2026-03-25T20:44:37.875000+00:00",
      "last_posted_at": "2026-04-28T17:10:24.005000+00:00",
      "category_id": 32,
      "tags": [
        "CKB",
        "dapp"
      ],
      "posters": [
        "Original Poster, Most Recent Poster",
        "Frequent Poster",
        "Frequent Poster",
        "Frequent Poster",
        "Frequent Poster"
      ],
      "recent_posts": [
        {
          "post_id": 24060,
          "post_number": 9,
          "topic_id": 10130,
          "topic_title": "Introducing CKB Kickstarter: Decentralized All-or-Nothing Crowdfunding on Nervos CKB (Testnet MVP Live)",
          "topic_slug": "introducing-ckb-kickstarter-decentralized-all-or-nothing-crowdfunding-on-nervos-ckb-testnet-mvp-live",
          "author": "Ayoub_Lesfer",
          "created_at": "2026-04-28T17:10:24.005000+00:00",
          "updated_at": "2026-04-28T17:10:24.005000+00:00",
          "reply_to_post_number": null,
          "url": "https://talk.nervos.org/t/introducing-ckb-kickstarter-decentralized-all-or-nothing-crowdfunding-on-nervos-ckb-testnet-mvp-live/10130/9",
          "content_text": "Update: Automatic Finalization Bot live on testnet\nFollowing up on the v1.1 update above: the bot is deployed and end-to-end verified on testnet as of yesterday (2026-04-27). The platform is now fully trustless on testnet, campaigns flow create → pledge → deadline → distribution with zero manual intervention from anyone (creator, backer, or platform operator).\nWhat the bot does (each polling cycle, every 10s):\nDetects expired campaigns still in Active status → submits permissionless finalizeCampaign tx (Success if total pledged ≥ goal, Failed otherwise)\nFor finalized Success campaigns with remaining live pledge cells → submits permissionlessRelease tx (funds → creator)\nFor finalized Failed campaigns with remaining live pledge cells → submits permissionlessRefund tx (funds → backer)\nArchitecture:\nSingle FinalizationBot class integrated into the existing indexer process (no separate service)\nRuns on Render free tier inside the same container as the indexer\nBot wallet funded with 100k CKB testnet, fees are negligible (~0.001 CKB per finalize/distribute)\nBot is optional: if BOT_PRIVATE_KEY env var is unset, the indexer runs normally and users can still trigger finalize/release/refund manually from the UI\nBot needs no special permissions, every contract entry point it calls is permissionless on-chain. 
The bot is a convenience, not a trust dependency.\nE2E verification on testnet (2026-04-27):\nPath | Goal | Pledged | Outcome\nSuccess | 200 CKB | 250 CKB | Bot auto-finalized as Success → auto-released to creator (release tx 0x564c6d7a...)\nFailed | 10,000 CKB | 100 CKB | Bot auto-finalized as Failed → auto-refunded to backer (refund tx 0x54fd7e40...)\nTotal time from deadline to full distribution: ~30 seconds.\nTry it yourself: https://decentralized-kickstarter-kappa.vercel.app/. Create a campaign with a short deadline, pledge from a second JoyID account, and watch the bot do its thing.\nWhat’s next:\nExternal code review of v1.1 contracts\nSustainable platform business model (fees + treasury), open to community input on what feels right for an ecosystem-funded project\nMainnet deployment",
          "content_html": "<p><strong>Update: Automatic Finalization Bot live on testnet</strong> <img src=\"https://talk.nervos.org/images/emoji/apple/white_check_mark.png?v=15\" title=\":white_check_mark:\" class=\"emoji\" alt=\":white_check_mark:\" loading=\"lazy\" width=\"20\" height=\"20\"></p>\n<p>Following up on the v1.1 update above: the bot is deployed and end-to-end verified on testnet as of yesterday (2026-04-27). The platform is now <strong>fully trustless</strong> on testnet, campaigns flow <code>create → pledge → deadline → distribution</code> with zero manual intervention from anyone (creator, backer, or platform operator).</p>\n<p><strong>What the bot does</strong> (each polling cycle, every 10s):</p>\n<ul>\n<li>Detects expired campaigns still in <code>Active</code> status → submits permissionless <code>finalizeCampaign</code> tx (Success if total pledged ≥ goal, Failed otherwise)</li>\n<li>For finalized Success campaigns with remaining live pledge cells → submits <code>permissionlessRelease</code> tx (funds → creator)</li>\n<li>For finalized Failed campaigns with remaining live pledge cells → submits <code>permissionlessRefund</code> tx (funds → backer)</li>\n</ul>\n<p><strong>Architecture:</strong></p>\n<ul>\n<li>Single <code>FinalizationBot</code> class integrated into the existing indexer process (no separate service)</li>\n<li>Runs on Render free tier inside the same container as the indexer</li>\n<li>Bot wallet funded with 100k CKB testnet, fees are negligible (~0.001 CKB per finalize/distribute)</li>\n<li>Bot is <strong>optional</strong>: if <code>BOT_PRIVATE_KEY</code> env var is unset, the indexer runs normally and users can still trigger finalize/release/refund manually from the UI</li>\n<li>Bot needs <strong>no special permissions</strong>, every contract entry point it calls is permissionless on-chain. 
The bot is a convenience, not a trust dependency.</li>\n</ul>\n<p><strong>E2E verification on testnet (2026-04-27):</strong></p>\n<div class=\"md-table\">\n<table>\n<thead>\n<tr>\n<th>Path</th>\n<th>Goal</th>\n<th>Pledged</th>\n<th>Outcome</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Success</td>\n<td>200 CKB</td>\n<td>250 CKB</td>\n<td>Bot auto-finalized as Success → auto-released to creator (release tx <code>0x564c6d7a...</code>)</td>\n</tr>\n<tr>\n<td>Failed</td>\n<td>10,000 CKB</td>\n<td>100 CKB</td>\n<td>Bot auto-finalized as Failed → auto-refunded to backer (refund tx <code>0x54fd7e40...</code>)</td>\n</tr>\n</tbody>\n</table>\n</div><p>Total time from deadline to full distribution: <strong>~30 seconds.</strong></p>\n<p><strong>Try it yourself:</strong> <a href=\"https://decentralized-kickstarter-kappa.vercel.app/\" rel=\"noopener nofollow ugc\">https://decentralized-kickstarter-kappa.vercel.app/</a> create a campaign with a short deadline, pledge from a second JoyID account, and watch the bot do its thing.</p>\n<p><strong>What’s next:</strong></p>\n<ul>\n<li>External code review of v1.1 contracts</li>\n<li>Sustainable platform business model (fees + treasury), open to community input on what feels right for an ecosystem-funded project</li>\n<li>Mainnet deployment</li>\n</ul>",
          "like_count": 0,
          "quote_count": 0
        }
      ]
    },
    {
      "topic_id": 9995,
      "title": "Spark Program | Nervos Brain - A Global Developer Onboarding Engine and Cross-Language Hub Powered by Agentic RAG",
      "slug": "spark-program-nervos-brain-a-global-developer-onboarding-engine-and-cross-language-hub-powered-by-agentic-rag",
      "url": "https://talk.nervos.org/t/spark-program-nervos-brain-a-global-developer-onboarding-engine-and-cross-language-hub-powered-by-agentic-rag/9995",
      "created_at": "2026-02-25T09:58:43.726000+00:00",
      "last_posted_at": "2026-04-28T14:03:27.779000+00:00",
      "category_id": 49,
      "tags": [
        "In-Progress",
        "Spark-Program"
      ],
      "posters": [
        "Original Poster",
        "Frequent Poster",
        "Frequent Poster",
        "Most Recent Poster"
      ],
      "recent_posts": [
        {
          "post_id": 24056,
          "post_number": 31,
          "topic_id": 9995,
          "topic_title": "Spark Program | Nervos Brain - A Global Developer Onboarding Engine and Cross-Language Hub Powered by Agentic RAG",
          "topic_slug": "spark-program-nervos-brain-a-global-developer-onboarding-engine-and-cross-language-hub-powered-by-agentic-rag",
          "author": "IrisNeko",
          "created_at": "2026-04-28T11:54:09.506000+00:00",
          "updated_at": "2026-04-28T11:54:09.506000+00:00",
          "reply_to_post_number": null,
          "url": "https://talk.nervos.org/t/spark-program-nervos-brain-a-global-developer-onboarding-engine-and-cross-language-hub-powered-by-agentic-rag/9995/31",
          "content_text": "第七周周报\n一、本周目标（工具闭环与评测基线周）\n本周承接第六周“多轮可持续交互”阶段的工作，重点从“机制已经具备”推进到“关键路径真正闭环、且后续可以被稳定评测”。核心目标有四个：\n继续治理运行时日志噪音，补齐最小可观测性闭环。\n让 discourse_query / github_search 从协议层定义走到图执行主路径可调用。\n建立第一版多轮评测集，为后续 benchmark 和量化回归提供统一输入。\n补齐 Telegram / Discord 两端在长消息与异常路径下的稳定性回归。\n二、本周完成\n日志治理与诊断视图补齐\n已对常见第三方库日志进行了统一降噪处理，补充了 quiet_loggers 与 third_party_level 控制项，降低了测试和运行期的无关日志干扰。同时把工具执行过程的摘要信息接入到图状态与 trace_summary 中，使得一次回答至少可以追踪到“执行了哪些 tool、各自成功/为空/失败”的最小诊断视图，而不再只能看到最终回答文本。\n工具执行闭环推进到运行时主路径\n本周把 discourse_query 与 github_search 从“schema 已存在但主路径未接通”的状态推进到了可被 RetrieverPlanner -> RetrievalExecutor -> ToolRuntime 实际调用的状态。具体包括：\nRetrieverPlanner 归一化阶段已允许这两个 tool 保留，不再强制回退到 qdrant_search。\nToolRuntime 为二者补齐了 handler。\nhandler 采用“transport 优先、本地 archive fallback 兜底”的策略：若存在外部 transport，则优先通过 transport 执行；若 transport 不可用，则回退到本地 archive/BM25 查询，保证评测和离线回归场景下依旧能走通闭环。\ntimeout / idempotency / error normalization 已统一复用 execute_tool，不再为新增 tool 走特殊旁路逻辑。\n回答链路的可解释性继续增强\n在原有“回答生成稳态与兜底能力增强”的基础上，本周补充了 tool 级别的错误码与执行摘要，使回答不仅能在失败时兜底，也能在 trace 中解释失败原因。例如工具执行异常会落到统一错误结构中，而不是静默吞掉。这样后续排查时，可以区分“模型答错”“检索为空”“工具执行异常”“预算截断”等不同问题来源。\n多轮评测集第一版落地\n已新增 evaluation/week7_multiturn_eval.jsonl 作为第一版多轮 benchmark 输入集，并新增配套的 loader / validator，确保每条 case 至少满足：\n有稳定 case_id；\n属于明确任务类别；\n至少包含两轮以上对话；\n带有 success_criteria；\n带有 expected_signals 用于后续自动比对。\n平台稳定性回归增强\nTelegram / Discord 两端都补充了：\ngraph runner 异常时的 fallback 测试；\n超长回答的分段发送测试；\n与 full graph 一致的输出适配回归。\n这部分虽然还没有做到真正的在线压测，但已把“长文本切分”和“异常不崩溃”这两条高风险路径先用自动化测试钉住。\nWeek 7 相关回归验证通过\n本周相关测试，结果为：\n78 passed, 1 warning\n覆盖范围包括 tool runtime、graph executor、logging system、Discord/TG runtime 以及 evaluation dataset 验证。\n三、本周重点：评测流程与 Benchmark 构建思路\n本周最重要的新工作，不是单纯“加了几条样例”，而是把后续多轮评测的基本方法论先定下来。这里单独展开说明。\n3.1 为什么这一周先做评测集，而不是直接做分数面板\n当前项目虽然已经具备多轮补参、反思分流、回答兜底等机制，但如果没有稳定 benchmark，任何关于“效果更好了”的结论都只能靠主观感受。尤其是多轮系统，很容易出现下面几类错觉：\n看起来会追问了，但追问的问题不一定对。\n看起来会恢复上下文了，但恢复后不一定真的沿着原问题推进。\n看起来工具更多了，但新工具可能并没有真正参与主路径决策。\n看起来回答更长了，但未必更可引用、更稳定。\n因此本周没有直接上“评分 dashboard”，而是先做 benchmark 
输入层。原因很简单：没有稳定输入，就没有稳定指标；没有稳定 case，任何分数都不可复现。\n3.2 Benchmark 的目标不是“考模型常识”，而是考系统闭环能力\n本周设计 benchmark 时，刻意没有把重点放在开放式知识问答上，而是围绕 Nervos Brain 当前最关键的系统能力来建样例。也就是说，这份 benchmark 的目标不是测试“模型懂不懂 CKB”，而是测试下面这些系统行为是否发生：\n是否会在缺少必要参数时进行合理追问。\n追问后是否能保留线程上下文并继续执行，而不是重新开始。\n是否会根据任务类型选择合适的 tool，而不是永远只走 qdrant_search。\n面对文档与代码冲突、日志不完整、用户目标不明确等真实场景时，是否会做出保守且可解释的决策。\n最终回答是否带引用、是否避免过度编造、是否体现任务导向。\n换句话说，这份 benchmark 更像是“系统工作流 benchmark”，而不是“百科知识 benchmark”。\n3.3 为什么选这三类任务\n本周评测集按三类任务拆分：\nsolution_recommendation（方案推荐）\ndevelopment_guidance（开发指导）\ntroubleshooting（排障定位）\n这样拆分的原因是，这三类任务对应了系统最典型、也最容易出错的三种工作模式。\nsolution_recommendation 关注的是：\n用户目标还比较模糊；\n系统要先帮助缩小问题空间；\n容易因为用户补充信息变化而改写推荐路径。\ndevelopment_guidance 关注的是：\n用户往往已经有明确目标，但缺技术细节；\n系统要在检索、补参、代码示例、分阶段步骤之间做平衡；\n很适合验证 AskUser → resume → answer 的闭环。\ntroubleshooting 关注的是：\n用户信息往往不完整；\n证据冲突与日志缺失更常见；\n系统必须优先保守，不应过早武断下结论。\n这三类覆盖面并不等于项目全部任务，但已经足够形成一个可复用的最小 benchmark 骨架。\n3.4 每条评测 case 的设计原则\n本周不是只记录“问题文本”，而是给每条 case 设计了完整结构。单条 case 至少包含以下字段：\ncase_id\n用于保证 case 可追踪、可比对、可回归。\ncategory\n明确任务类别，避免把推荐类、开发类、排障类混在一起统计，导致指标失真。\nconversation\n必须是多轮结构，而不是单句问答。因为我们本周的目标就是验证多轮补参与恢复，而不是单轮回答质量。\nexpected_signals\n这是本周 benchmark 设计里最关键的一层。它不直接判断最终答案“好不好”，而是先判断系统行为“有没有发生”。例如：\n是否应该 ask for sdk_language；\n是否应该使用 github_search；\n是否应该同时走 discourse_query 与 qdrant_search；\n是否不应该在缺日志时给出过度确定结论。\n这层信号对后续半自动评测非常重要，因为它让我们可以把“工作流行为正确”从“最终表述优雅”里拆出来单独评估。\nsuccess_criteria\n这里记录的是回答级别的成功标准，例如：\n至少包含两个来源；\n给出 JS/TS 优先的路径；\n不能把文档/代码冲突直接武断归因。\n它和 expected_signals 的区别在于：前者看流程动作，后者看回答结果。\n3.5 为什么 benchmark 里同时保留 expected_signals 和 success_criteria\n如果只有 success_criteria，我们只能看最后回答像不像“还行”，但无法知道它是通过正确流程得到的，还是偶然答对的。\n如果只有 expected_signals，我们又只能知道系统“做了动作”，但不能判断最终产出的答案是否真的可用。\n所以本周采用“双层判定”思路：\n第一层：流程层 benchmark\n检查有没有正确 ask_user、有没有正确选 tool、有没有沿线程继续执行。\n第二层：回答层 benchmark\n检查最终回答是否引用充分、是否尊重约束、是否满足任务目标。\n这两层拆开之后，后面出现 bad case 时就能更快定位：\n是 planner 错了；\n是 executor 没调到正确工具；\n还是 composer 在证据足够时仍然总结失真。\n3.6 当前 benchmark 的样例来源与构造策略\n本周这 6 
条样例不是随机写出来的，而是围绕当前系统最值得验证的机制构造出来的：\n推荐类样例\n用来测试用户目标在第二轮收缩后，推荐路径是否跟着变化。\n例如从“我想做 demo”进一步细化到“前端集成、JS/TS”。\n开发指导类样例\n用来测试系统是否会在缺语言/版本时追问，并在得到补参后输出更贴近实际工程的步骤或示例来源。\n排障类样例\n用来测试系统在日志不充分、版本冲突、文档与代码不一致时，是否会优先保守地继续收集证据，而不是直接“拍脑袋诊断”。\n本质上，这些 case 优先覆盖的是“高频失败模式”，而不是“知识点覆盖率”。这也是第一版 benchmark 更适合工程迭代的原因。\n3.7 当前评测流程是怎么设计的\n虽然本周还没有把完整的自动评分 runner 做出来，但流程已经定型，后续可以直接接上：\n读取 jsonl benchmark。\n校验 case 结构是否合法。\n按 case 重放多轮 conversation。\n记录每轮 graph 输出，尤其是：\nask_user_question\ntrace_summary\n使用过的 tool\n最终 citations\n是否发生 fallback / error\n先对照 expected_signals 做流程级检查。\n再对照 success_criteria 做结果级检查。\n最后按类别输出通过率、失败样例和失败类型分布。\n后续如果继续扩展，这个流程可以自然演进成：\nrule-based first pass；\nLLM-as-a-judge second pass；\nbad case archive for manual review。\n3.8 为什么本周只做 dataset/validator，而没有直接做最终 benchmark runner\n这是一个取舍问题。当前最缺的是“统一输入格式”，不是“再写一个复杂脚本”。\n如果没有先把 case 结构稳定下来，直接写 runner 很容易导致：\ncase 字段每周都变；\n评测逻辑和数据格式强耦合；\n新增 case 时需要频繁改脚本。\n所以本周先完成的是：\ncase schema 的最小约束；\ncase 分类方式；\ncase 编写原则；\nbenchmark 目录约定；\n自动校验入口。\n这样下一周只需要在这个基础上补 runner 和聚合输出，而不需要推倒重来。\n3.9 当前 benchmark 的局限\n本周也明确看到了第一版评测集的边界：\n样例数仍然偏少，只有 6 条，更多是“骨架验证”而不是“统计充分”。\n目前只做了结构校验，还没有产出真实分数面板。\nexpected_signals 还需要进一步细化成更严格、可自动比对的字段。\n还没有接入真实线上对话日志中的 bad case，当前样例仍以人工构造为主。\n因此，本周的 benchmark 工作更准确地说是“评测基线搭建完成”，而不是“评测体系完成”。\n四、阶段性成果\n本周完成后，系统进入了一个新的阶段：\ndiscourse_query / github_search 不再停留在协议层，而是进入了主路径可执行状态。\n回答的诊断信息开始从“只看最终文本”转向“可看到中间工具行为”。\n多轮 benchmark 已经有了稳定输入格式，后续可以在同一套 case 上持续回归。\nTelegram / Discord 两端在长消息与异常路径上的基本稳定性已经有自动化保护。\n五、当前问题\nWeek 7 的稳定性验证还没有做到真正的线上端到端长对话压测，目前仍以 runtime 级自动化回归为主。\nbenchmark 目前完成的是 dataset + validator，尚未形成自动评分 runner 与分类统计面板。\n当前 benchmark 仍以人工设计 case 为主，尚未系统吸收真实线上 bad case 反哺样例池。\ntrace_summary 已能表达最小工具执行信息，但还没有形成更结构化的统一诊断报告。\n六、下周计划（Week 8）\n在现有 benchmark dataset 基础上补一个评测 runner，输出分类通过率与失败原因汇总。\n继续扩充多轮样例，优先补齐真实 bad case 映射和版本冲突类样本。\n为 Telegram / Discord 增加更接近真实流量的多轮长对话端到端测试。\n继续增强 trace 结构化程度，让 planner / executor / composer 的失败原因更容易定位。\n开始为社区内测做交付准备，包括测试环境梳理、种子用户招募话术、使用说明与问题提交流程整理，确保内测不是“把 Bot 
放出去”，而是有明确反馈闭环的可控测试。\n补齐用户评估准备工作，优先推进 CSAT 评分入口、BadCase 自动收集结构、以及一份极简用户问卷草案，保证内测期间不仅能拿到即时星级反馈，也能拿到跨会话的主观体验评价。\n规划第一轮内测观测指标，明确至少要跟踪的问题解决率、平均满意度（CSAT）、响应延迟分布、以及 1​-3​ 对话的复盘优先级，为后续结项报告沉淀真实用户评估数据。\nWeek 7 Report\n1. Weekly Goal (Tool Closure and Evaluation Baseline)\nThis week focused on turning the newly-added multi-turn mechanisms into a more testable and operationally reliable system. The four core goals were:\nReduce runtime log noise and improve minimum observability.\nMove discourse_query / github_search from protocol-only definitions into the real runtime path.\nBuild the first multi-turn evaluation dataset as a benchmark baseline.\nStrengthen Telegram/Discord regression coverage for long-output and failure paths.\n2. Completed Work\nLogging and observability cleanup\nAdded logger quieting controls and surfaced per-tool execution summaries into graph state and final trace summaries.\nRuntime tool-loop coverage expanded\ndiscourse_query and github_search now survive planner normalization, have dedicated runtime handlers, and align with timeout / idempotency / normalized execution behavior.\nBetter answer-path explainability\nTool-level execution failures now map into clearer traceable error states instead of disappearing behind generic failures.\nFirst multi-turn benchmark dataset landed\nAdded a structured jsonl evaluation set plus loader/validator utilities, covering recommendation, development-guidance, and troubleshooting tasks.\nPlatform stability regression improved\nAdded Discord/Telegram fallback-path and long-message segmentation tests to protect key runtime edge cases.\nWeek 7 regression validation\nIn the nervous-brain mamba environment, the Week 7 related suite passed:\n78 passed, 1 warning.\n3. 
Evaluation Flow and Benchmark Design\nThe most important outcome this week was not just “adding several examples”, but defining the first benchmark methodology for multi-turn system evaluation.\nThis benchmark is designed to test system behavior rather than generic knowledge recall. Its purpose is to verify whether the system:\nasks for missing parameters when required;\nresumes correctly after clarification;\nselects tools according to task type;\nremains conservative under missing logs or conflicting evidence;\nproduces traceable, citation-backed answers.\nThe dataset is divided into three task types:\nsolution_recommendation\ndevelopment_guidance\ntroubleshooting\nEach case contains:\ncase_id for stable tracking;\ncategory for split-level reporting;\nconversation with at least two turns;\nexpected_signals for workflow-level expectations;\nsuccess_criteria for answer-level expectations.\nThis two-layer design is deliberate:\nWorkflow-layer evaluation checks whether the system asked the right follow-up, used the right tools, and continued along the correct thread context.\nAnswer-layer evaluation checks whether the final response is well-grounded, appropriately scoped, and useful for the task.\nThis separation matters because it lets us diagnose whether a failure came from planning, retrieval execution, or answer composition instead of treating every bad answer as the same class of issue.\nThe benchmark construction strategy this week prioritized failure modes over topic breadth. 
The six seed cases were manually designed around the most important multi-turn risks:\nrecommendation shifts after clarification;\nimplementation guidance after missing language/version follow-up;\ntroubleshooting under incomplete logs;\nversion conflict between docs and code examples.\nThe intended evaluation flow is now clear:\nload benchmark cases from jsonl;\nvalidate schema;\nreplay the conversation turn by turn;\ncollect graph outputs such as tool usage, trace summary, follow-up question, fallback path, and citations;\ncompare against expected_signals;\ncompare final outputs against success_criteria;\naggregate results by category.\nThis week intentionally stopped at dataset + validator rather than overbuilding a scoring runner too early. The reasoning was simple: without a stable input format, any automated benchmark script would be fragile and constantly changing. By fixing the dataset contract first, future runner and dashboard work can build on a stable base.\n4. Current Gaps\nStability verification is still regression-oriented, not full online end-to-end long-dialogue load testing.\nThe benchmark currently provides dataset + validation, but not a full scoring runner yet.\nThe sample pool is still small and mostly manually curated.\nTrace summaries are more useful now, but not yet a full structured diagnostic report.\n5. Plan for Week 8\nBuild a benchmark runner with category-level pass-rate outputs.\nExpand the dataset, especially with real bad cases and version-conflict samples.\nAdd more realistic end-to-end multi-turn runtime tests for Telegram/Discord.\nContinue improving structured diagnostics across planner / executor / composer stages.",
          "content_html": "<h1><a name=\"p-24056-h-1\" class=\"anchor\" href=\"#p-24056-h-1\" aria-label=\"Heading link\"></a>第七周周报</h1>\n<h2><a name=\"p-24056-h-2\" class=\"anchor\" href=\"#p-24056-h-2\" aria-label=\"Heading link\"></a>一、本周目标（工具闭环与评测基线周）</h2>\n<p>本周承接第六周“多轮可持续交互”阶段的工作，重点从“机制已经具备”推进到“关键路径真正闭环、且后续可以被稳定评测”。核心目标有四个：</p>\n<ol>\n<li>继续治理运行时日志噪音，补齐最小可观测性闭环。</li>\n<li>让 <code>discourse_query</code> / <code>github_search</code> 从协议层定义走到图执行主路径可调用。</li>\n<li>建立第一版多轮评测集，为后续 benchmark 和量化回归提供统一输入。</li>\n<li>补齐 Telegram / Discord 两端在长消息与异常路径下的稳定性回归。</li>\n</ol>\n<h2><a name=\"p-24056-h-3\" class=\"anchor\" href=\"#p-24056-h-3\" aria-label=\"Heading link\"></a>二、本周完成</h2>\n<ol>\n<li>\n<p>日志治理与诊断视图补齐<br>\n已对常见第三方库日志进行了统一降噪处理，补充了 <code>quiet_loggers</code> 与 <code>third_party_level</code> 控制项，降低了测试和运行期的无关日志干扰。同时把工具执行过程的摘要信息接入到图状态与 <code>trace_summary</code> 中，使得一次回答至少可以追踪到“执行了哪些 tool、各自成功/为空/失败”的最小诊断视图，而不再只能看到最终回答文本。</p>\n</li>\n<li>\n<p>工具执行闭环推进到运行时主路径<br>\n本周把 <code>discourse_query</code> 与 <code>github_search</code> 从“schema 已存在但主路径未接通”的状态推进到了可被 <code>RetrieverPlanner -&gt; RetrievalExecutor -&gt; ToolRuntime</code> 实际调用的状态。具体包括：</p>\n<ul>\n<li><code>RetrieverPlanner</code> 归一化阶段已允许这两个 tool 保留，不再强制回退到 <code>qdrant_search</code>。</li>\n<li><code>ToolRuntime</code> 为二者补齐了 handler。</li>\n<li>handler 采用“transport 优先、本地 archive fallback 兜底”的策略：若存在外部 transport，则优先通过 transport 执行；若 transport 不可用，则回退到本地 archive/BM25 查询，保证评测和离线回归场景下依旧能走通闭环。</li>\n<li>timeout / idempotency / error normalization 已统一复用 <code>execute_tool</code>，不再为新增 tool 走特殊旁路逻辑。</li>\n</ul>\n</li>\n<li>\n<p>回答链路的可解释性继续增强<br>\n在原有“回答生成稳态与兜底能力增强”的基础上，本周补充了 tool 级别的错误码与执行摘要，使回答不仅能在失败时兜底，也能在 trace 中解释失败原因。例如工具执行异常会落到统一错误结构中，而不是静默吞掉。这样后续排查时，可以区分“模型答错”“检索为空”“工具执行异常”“预算截断”等不同问题来源。</p>\n</li>\n<li>\n<p>多轮评测集第一版落地<br>\n已新增 <code>evaluation/week7_multiturn_eval.jsonl</code> 作为第一版多轮 benchmark 输入集，并新增配套的 loader / validator，确保每条 case 至少满足：</p>\n<ul>\n<li>有稳定 
<code>case_id</code>；</li>\n<li>属于明确任务类别；</li>\n<li>至少包含两轮以上对话；</li>\n<li>带有 <code>success_criteria</code>；</li>\n<li>带有 <code>expected_signals</code> 用于后续自动比对。</li>\n</ul>\n</li>\n<li>\n<p>平台稳定性回归增强<br>\nTelegram / Discord 两端都补充了：</p>\n<ul>\n<li>graph runner 异常时的 fallback 测试；</li>\n<li>超长回答的分段发送测试；</li>\n<li>与 full graph 一致的输出适配回归。<br>\n这部分虽然还没有做到真正的在线压测，但已把“长文本切分”和“异常不崩溃”这两条高风险路径先用自动化测试钉住。</li>\n</ul>\n</li>\n<li>\n<p>Week 7 相关回归验证通过<br>\n本周相关测试，结果为：<br>\n<code>78 passed, 1 warning</code><br>\n覆盖范围包括 tool runtime、graph executor、logging system、Discord/TG runtime 以及 evaluation dataset 验证。</p>\n</li>\n</ol>\n<h2><a name=\"p-24056-benchmark-4\" class=\"anchor\" href=\"#p-24056-benchmark-4\" aria-label=\"Heading link\"></a>三、本周重点：评测流程与 Benchmark 构建思路</h2>\n<p>本周最重要的新工作，不是单纯“加了几条样例”，而是把后续多轮评测的基本方法论先定下来。这里单独展开说明。</p>\n<h3><a name=\"p-24056-h-31-5\" class=\"anchor\" href=\"#p-24056-h-31-5\" aria-label=\"Heading link\"></a>3.1 为什么这一周先做评测集，而不是直接做分数面板</h3>\n<p>当前项目虽然已经具备多轮补参、反思分流、回答兜底等机制，但如果没有稳定 benchmark，任何关于“效果更好了”的结论都只能靠主观感受。尤其是多轮系统，很容易出现下面几类错觉：</p>\n<ol>\n<li>看起来会追问了，但追问的问题不一定对。</li>\n<li>看起来会恢复上下文了，但恢复后不一定真的沿着原问题推进。</li>\n<li>看起来工具更多了，但新工具可能并没有真正参与主路径决策。</li>\n<li>看起来回答更长了，但未必更可引用、更稳定。</li>\n</ol>\n<p>因此本周没有直接上“评分 dashboard”，而是先做 benchmark 输入层。原因很简单：没有稳定输入，就没有稳定指标；没有稳定 case，任何分数都不可复现。</p>\n<h3><a name=\"p-24056-h-32-benchmark-6\" class=\"anchor\" href=\"#p-24056-h-32-benchmark-6\" aria-label=\"Heading link\"></a>3.2 Benchmark 的目标不是“考模型常识”，而是考系统闭环能力</h3>\n<p>本周设计 benchmark 时，刻意没有把重点放在开放式知识问答上，而是围绕 Nervos Brain 当前最关键的系统能力来建样例。也就是说，这份 benchmark 的目标不是测试“模型懂不懂 CKB”，而是测试下面这些系统行为是否发生：</p>\n<ol>\n<li>是否会在缺少必要参数时进行合理追问。</li>\n<li>追问后是否能保留线程上下文并继续执行，而不是重新开始。</li>\n<li>是否会根据任务类型选择合适的 tool，而不是永远只走 <code>qdrant_search</code>。</li>\n<li>面对文档与代码冲突、日志不完整、用户目标不明确等真实场景时，是否会做出保守且可解释的决策。</li>\n<li>最终回答是否带引用、是否避免过度编造、是否体现任务导向。</li>\n</ol>\n<p>换句话说，这份 benchmark 更像是“系统工作流 benchmark”，而不是“百科知识 benchmark”。</p>\n<h3><a name=\"p-24056-h-33-7\" class=\"anchor\" href=\"#p-24056-h-33-7\" 
aria-label=\"Heading link\"></a>3.3 为什么选这三类任务</h3>\n<p>本周评测集按三类任务拆分：</p>\n<ol>\n<li><code>solution_recommendation</code>（方案推荐）</li>\n<li><code>development_guidance</code>（开发指导）</li>\n<li><code>troubleshooting</code>（排障定位）</li>\n</ol>\n<p>这样拆分的原因是，这三类任务对应了系统最典型、也最容易出错的三种工作模式。</p>\n<p><code>solution_recommendation</code> 关注的是：</p>\n<ul>\n<li>用户目标还比较模糊；</li>\n<li>系统要先帮助缩小问题空间；</li>\n<li>容易因为用户补充信息变化而改写推荐路径。</li>\n</ul>\n<p><code>development_guidance</code> 关注的是：</p>\n<ul>\n<li>用户往往已经有明确目标，但缺技术细节；</li>\n<li>系统要在检索、补参、代码示例、分阶段步骤之间做平衡；</li>\n<li>很适合验证 AskUser → resume → answer 的闭环。</li>\n</ul>\n<p><code>troubleshooting</code> 关注的是：</p>\n<ul>\n<li>用户信息往往不完整；</li>\n<li>证据冲突与日志缺失更常见；</li>\n<li>系统必须优先保守，不应过早武断下结论。</li>\n</ul>\n<p>这三类覆盖面并不等于项目全部任务，但已经足够形成一个可复用的最小 benchmark 骨架。</p>\n<h3><a name=\"p-24056-h-34-case-8\" class=\"anchor\" href=\"#p-24056-h-34-case-8\" aria-label=\"Heading link\"></a>3.4 每条评测 case 的设计原则</h3>\n<p>本周不是只记录“问题文本”，而是给每条 case 设计了完整结构。单条 case 至少包含以下字段：</p>\n<ol>\n<li>\n<p><code>case_id</code><br>\n用于保证 case 可追踪、可比对、可回归。</p>\n</li>\n<li>\n<p><code>category</code><br>\n明确任务类别，避免把推荐类、开发类、排障类混在一起统计，导致指标失真。</p>\n</li>\n<li>\n<p><code>conversation</code><br>\n必须是多轮结构，而不是单句问答。因为我们本周的目标就是验证多轮补参与恢复，而不是单轮回答质量。</p>\n</li>\n<li>\n<p><code>expected_signals</code><br>\n这是本周 benchmark 设计里最关键的一层。它不直接判断最终答案“好不好”，而是先判断系统行为“有没有发生”。例如：</p>\n<ul>\n<li>是否应该 ask for <code>sdk_language</code>；</li>\n<li>是否应该使用 <code>github_search</code>；</li>\n<li>是否应该同时走 <code>discourse_query</code> 与 <code>qdrant_search</code>；</li>\n<li>是否不应该在缺日志时给出过度确定结论。<br>\n这层信号对后续半自动评测非常重要，因为它让我们可以把“工作流行为正确”从“最终表述优雅”里拆出来单独评估。</li>\n</ul>\n</li>\n<li>\n<p><code>success_criteria</code><br>\n这里记录的是回答级别的成功标准，例如：</p>\n<ul>\n<li>至少包含两个来源；</li>\n<li>给出 JS/TS 优先的路径；</li>\n<li>不能把文档/代码冲突直接武断归因。<br>\n它和 <code>expected_signals</code> 的区别在于：前者看流程动作，后者看回答结果。</li>\n</ul>\n</li>\n</ol>\n<h3><a name=\"p-24056-h-35-benchmark-expected_signals-success_criteria-9\" class=\"anchor\" 
href=\"#p-24056-h-35-benchmark-expected_signals-success_criteria-9\" aria-label=\"Heading link\"></a>3.5 为什么 benchmark 里同时保留 <code>expected_signals</code> 和 <code>success_criteria</code></h3>\n<p>如果只有 <code>success_criteria</code>，我们只能看最后回答像不像“还行”，但无法知道它是通过正确流程得到的，还是偶然答对的。<br>\n如果只有 <code>expected_signals</code>，我们又只能知道系统“做了动作”，但不能判断最终产出的答案是否真的可用。</p>\n<p>所以本周采用“双层判定”思路：</p>\n<ol>\n<li>\n<p>第一层：流程层 benchmark<br>\n检查有没有正确 ask_user、有没有正确选 tool、有没有沿线程继续执行。</p>\n</li>\n<li>\n<p>第二层：回答层 benchmark<br>\n检查最终回答是否引用充分、是否尊重约束、是否满足任务目标。</p>\n</li>\n</ol>\n<p>这两层拆开之后，后面出现 bad case 时就能更快定位：</p>\n<ul>\n<li>是 planner 错了；</li>\n<li>是 executor 没调到正确工具；</li>\n<li>还是 composer 在证据足够时仍然总结失真。</li>\n</ul>\n<h3><a name=\"p-24056-h-36-benchmark-10\" class=\"anchor\" href=\"#p-24056-h-36-benchmark-10\" aria-label=\"Heading link\"></a>3.6 当前 benchmark 的样例来源与构造策略</h3>\n<p>本周这 6 条样例不是随机写出来的，而是围绕当前系统最值得验证的机制构造出来的：</p>\n<ol>\n<li>\n<p>推荐类样例<br>\n用来测试用户目标在第二轮收缩后，推荐路径是否跟着变化。<br>\n例如从“我想做 demo”进一步细化到“前端集成、JS/TS”。</p>\n</li>\n<li>\n<p>开发指导类样例<br>\n用来测试系统是否会在缺语言/版本时追问，并在得到补参后输出更贴近实际工程的步骤或示例来源。</p>\n</li>\n<li>\n<p>排障类样例<br>\n用来测试系统在日志不充分、版本冲突、文档与代码不一致时，是否会优先保守地继续收集证据，而不是直接“拍脑袋诊断”。</p>\n</li>\n</ol>\n<p>本质上，这些 case 优先覆盖的是“高频失败模式”，而不是“知识点覆盖率”。这也是第一版 benchmark 更适合工程迭代的原因。</p>\n<h3><a name=\"p-24056-h-37-11\" class=\"anchor\" href=\"#p-24056-h-37-11\" aria-label=\"Heading link\"></a>3.7 当前评测流程是怎么设计的</h3>\n<p>虽然本周还没有把完整的自动评分 runner 做出来，但流程已经定型，后续可以直接接上：</p>\n<ol>\n<li>读取 <code>jsonl</code> benchmark。</li>\n<li>校验 case 结构是否合法。</li>\n<li>按 case 重放多轮 <code>conversation</code>。</li>\n<li>记录每轮 graph 输出，尤其是：\n<ul>\n<li><code>ask_user_question</code></li>\n<li><code>trace_summary</code></li>\n<li>使用过的 tool</li>\n<li>最终 citations</li>\n<li>是否发生 fallback / error</li>\n</ul>\n</li>\n<li>先对照 <code>expected_signals</code> 做流程级检查。</li>\n<li>再对照 <code>success_criteria</code> 做结果级检查。</li>\n<li>最后按类别输出通过率、失败样例和失败类型分布。</li>\n</ol>\n<p>后续如果继续扩展，这个流程可以自然演进成：</p>\n<ul>\n<li>rule-based first pass；</li>\n<li>LLM-as-a-judge 
second pass；</li>\n<li>bad case archive for manual review。</li>\n</ul>\n<h3><a name=\"p-24056-h-38-datasetvalidator-benchmark-runner-12\" class=\"anchor\" href=\"#p-24056-h-38-datasetvalidator-benchmark-runner-12\" aria-label=\"Heading link\"></a>3.8 为什么本周只做 dataset/validator，而没有直接做最终 benchmark runner</h3>\n<p>这是一个取舍问题。当前最缺的是“统一输入格式”，不是“再写一个复杂脚本”。</p>\n<p>如果没有先把 case 结构稳定下来，直接写 runner 很容易导致：</p>\n<ul>\n<li>case 字段每周都变；</li>\n<li>评测逻辑和数据格式强耦合；</li>\n<li>新增 case 时需要频繁改脚本。</li>\n</ul>\n<p>所以本周先完成的是：</p>\n<ul>\n<li>case schema 的最小约束；</li>\n<li>case 分类方式；</li>\n<li>case 编写原则；</li>\n<li>benchmark 目录约定；</li>\n<li>自动校验入口。</li>\n</ul>\n<p>这样下一周只需要在这个基础上补 runner 和聚合输出，而不需要推倒重来。</p>\n<h3><a name=\"p-24056-h-39-benchmark-13\" class=\"anchor\" href=\"#p-24056-h-39-benchmark-13\" aria-label=\"Heading link\"></a>3.9 当前 benchmark 的局限</h3>\n<p>本周也明确看到了第一版评测集的边界：</p>\n<ol>\n<li>样例数仍然偏少，只有 6 条，更多是“骨架验证”而不是“统计充分”。</li>\n<li>目前只做了结构校验，还没有产出真实分数面板。</li>\n<li><code>expected_signals</code> 还需要进一步细化成更严格、可自动比对的字段。</li>\n<li>还没有接入真实线上对话日志中的 bad case，当前样例仍以人工构造为主。</li>\n</ol>\n<p>因此，本周的 benchmark 工作更准确地说是“评测基线搭建完成”，而不是“评测体系完成”。</p>\n<h2><a name=\"p-24056-h-14\" class=\"anchor\" href=\"#p-24056-h-14\" aria-label=\"Heading link\"></a>四、阶段性成果</h2>\n<p>本周完成后，系统进入了一个新的阶段：</p>\n<ol>\n<li><code>discourse_query</code> / <code>github_search</code> 不再停留在协议层，而是进入了主路径可执行状态。</li>\n<li>回答的诊断信息开始从“只看最终文本”转向“可看到中间工具行为”。</li>\n<li>多轮 benchmark 已经有了稳定输入格式，后续可以在同一套 case 上持续回归。</li>\n<li>Telegram / Discord 两端在长消息与异常路径上的基本稳定性已经有自动化保护。</li>\n</ol>\n<h2><a name=\"p-24056-h-15\" class=\"anchor\" href=\"#p-24056-h-15\" aria-label=\"Heading link\"></a>五、当前问题</h2>\n<ol>\n<li>Week 7 的稳定性验证还没有做到真正的线上端到端长对话压测，目前仍以 runtime 级自动化回归为主。</li>\n<li>benchmark 目前完成的是 dataset + validator，尚未形成自动评分 runner 与分类统计面板。</li>\n<li>当前 benchmark 仍以人工设计 case 为主，尚未系统吸收真实线上 bad case 反哺样例池。</li>\n<li><code>trace_summary</code> 已能表达最小工具执行信息，但还没有形成更结构化的统一诊断报告。</li>\n</ol>\n<h2><a name=\"p-24056-week-8-16\" class=\"anchor\" 
href=\"#p-24056-week-8-16\" aria-label=\"Heading link\"></a>六、下周计划（Week 8）</h2>\n<ol>\n<li>在现有 benchmark dataset 基础上补一个评测 runner，输出分类通过率与失败原因汇总。</li>\n<li>继续扩充多轮样例，优先补齐真实 bad case 映射和版本冲突类样本。</li>\n<li>为 Telegram / Discord 增加更接近真实流量的多轮长对话端到端测试。</li>\n<li>继续增强 trace 结构化程度，让 planner / executor / composer 的失败原因更容易定位。</li>\n<li>开始为社区内测做交付准备，包括测试环境梳理、种子用户招募话术、使用说明与问题提交流程整理，确保内测不是“把 Bot 放出去”，而是有明确反馈闭环的可控测试。</li>\n<li>补齐用户评估准备工作，优先推进 CSAT 评分入口、BadCase 自动收集结构、以及一份极简用户问卷草案，保证内测期间不仅能拿到即时星级反馈，也能拿到跨会话的主观体验评价。</li>\n<li>规划第一轮内测观测指标，明确至少要跟踪的问题解决率、平均满意度（CSAT）、响应延迟分布、以及 1​<img src=\"https://talk.nervos.org/images/emoji/apple/star.png?v=15\" title=\":star:\" class=\"emoji\" alt=\":star:\" loading=\"lazy\" width=\"20\" height=\"20\">-3​<img src=\"https://talk.nervos.org/images/emoji/apple/star.png?v=15\" title=\":star:\" class=\"emoji\" alt=\":star:\" loading=\"lazy\" width=\"20\" height=\"20\"> 对话的复盘优先级，为后续结项报告沉淀真实用户评估数据。</li>\n</ol>\n<hr>\n<h1><a name=\"p-24056-week-7-report-17\" class=\"anchor\" href=\"#p-24056-week-7-report-17\" aria-label=\"Heading link\"></a>Week 7 Report</h1>\n<h2><a name=\"p-24056-h-1-weekly-goal-tool-closure-and-evaluation-baseline-18\" class=\"anchor\" href=\"#p-24056-h-1-weekly-goal-tool-closure-and-evaluation-baseline-18\" aria-label=\"Heading link\"></a>1. Weekly Goal (Tool Closure and Evaluation Baseline)</h2>\n<p>This week focused on turning the newly-added multi-turn mechanisms into a more testable and operationally reliable system. 
The four core goals were:</p>\n<ol>\n<li>Reduce runtime log noise and improve minimum observability.</li>\n<li>Move <code>discourse_query</code> / <code>github_search</code> from protocol-only definitions into the real runtime path.</li>\n<li>Build the first multi-turn evaluation dataset as a benchmark baseline.</li>\n<li>Strengthen Telegram/Discord regression coverage for long-output and failure paths.</li>\n</ol>\n<h2><a name=\"p-24056-h-2-completed-work-19\" class=\"anchor\" href=\"#p-24056-h-2-completed-work-19\" aria-label=\"Heading link\"></a>2. Completed Work</h2>\n<ol>\n<li>\n<p>Logging and observability cleanup<br>\nAdded logger quieting controls and surfaced per-tool execution summaries into graph state and final trace summaries.</p>\n</li>\n<li>\n<p>Runtime tool-loop coverage expanded<br>\n<code>discourse_query</code> and <code>github_search</code> now survive planner normalization, have dedicated runtime handlers, and align with timeout / idempotency / normalized execution behavior.</p>\n</li>\n<li>\n<p>Better answer-path explainability<br>\nTool-level execution failures now map into clearer traceable error states instead of disappearing behind generic failures.</p>\n</li>\n<li>\n<p>First multi-turn benchmark dataset landed<br>\nAdded a structured <code>jsonl</code> evaluation set plus loader/validator utilities, covering recommendation, development-guidance, and troubleshooting tasks.</p>\n</li>\n<li>\n<p>Platform stability regression improved<br>\nAdded Discord/Telegram fallback-path and long-message segmentation tests to protect key runtime edge cases.</p>\n</li>\n<li>\n<p>Week 7 regression validation<br>\nIn the <code>nervous-brain</code> mamba environment, the Week 7 related suite passed:<br>\n<code>78 passed, 1 warning</code>.</p>\n</li>\n</ol>\n<h2><a name=\"p-24056-h-3-evaluation-flow-and-benchmark-design-20\" class=\"anchor\" href=\"#p-24056-h-3-evaluation-flow-and-benchmark-design-20\" aria-label=\"Heading link\"></a>3. 
Evaluation Flow and Benchmark Design</h2>\n<p>The most important outcome this week was not just “adding several examples”, but defining the first benchmark methodology for multi-turn system evaluation.</p>\n<p>This benchmark is designed to test system behavior rather than generic knowledge recall. Its purpose is to verify whether the system:</p>\n<ol>\n<li>asks for missing parameters when required;</li>\n<li>resumes correctly after clarification;</li>\n<li>selects tools according to task type;</li>\n<li>remains conservative under missing logs or conflicting evidence;</li>\n<li>produces traceable, citation-backed answers.</li>\n</ol>\n<p>The dataset is divided into three task types:</p>\n<ol>\n<li><code>solution_recommendation</code></li>\n<li><code>development_guidance</code></li>\n<li><code>troubleshooting</code></li>\n</ol>\n<p>Each case contains:</p>\n<ol>\n<li><code>case_id</code> for stable tracking;</li>\n<li><code>category</code> for split-level reporting;</li>\n<li><code>conversation</code> with at least two turns;</li>\n<li><code>expected_signals</code> for workflow-level expectations;</li>\n<li><code>success_criteria</code> for answer-level expectations.</li>\n</ol>\n<p>This two-layer design is deliberate:</p>\n<ol>\n<li>Workflow-layer evaluation checks whether the system asked the right follow-up, used the right tools, and continued along the correct thread context.</li>\n<li>Answer-layer evaluation checks whether the final response is well-grounded, appropriately scoped, and useful for the task.</li>\n</ol>\n<p>This separation matters because it lets us diagnose whether a failure came from planning, retrieval execution, or answer composition instead of treating every bad answer as the same class of issue.</p>\n<p>The benchmark construction strategy this week prioritized failure modes over topic breadth. 
The six seed cases were manually designed around the most important multi-turn risks:</p>\n<ol>\n<li>recommendation shifts after clarification;</li>\n<li>implementation guidance after missing language/version follow-up;</li>\n<li>troubleshooting under incomplete logs;</li>\n<li>version conflict between docs and code examples.</li>\n</ol>\n<p>The intended evaluation flow is now clear:</p>\n<ol>\n<li>load benchmark cases from <code>jsonl</code>;</li>\n<li>validate schema;</li>\n<li>replay the conversation turn by turn;</li>\n<li>collect graph outputs such as tool usage, trace summary, follow-up question, fallback path, and citations;</li>\n<li>compare against <code>expected_signals</code>;</li>\n<li>compare final outputs against <code>success_criteria</code>;</li>\n<li>aggregate results by category.</li>\n</ol>\n<p>This week intentionally stopped at dataset + validator rather than overbuilding a scoring runner too early. The reasoning was simple: without a stable input format, any automated benchmark script would be fragile and constantly changing. By fixing the dataset contract first, future runner and dashboard work can build on a stable base.</p>\n<h2><a name=\"p-24056-h-4-current-gaps-21\" class=\"anchor\" href=\"#p-24056-h-4-current-gaps-21\" aria-label=\"Heading link\"></a>4. Current Gaps</h2>\n<ol>\n<li>Stability verification is still regression-oriented, not full online end-to-end long-dialogue load testing.</li>\n<li>The benchmark currently provides dataset + validation, but not a full scoring runner yet.</li>\n<li>The sample pool is still small and mostly manually curated.</li>\n<li>Trace summaries are more useful now, but not yet a full structured diagnostic report.</li>\n</ol>\n<h2><a name=\"p-24056-h-5-plan-for-week-8-22\" class=\"anchor\" href=\"#p-24056-h-5-plan-for-week-8-22\" aria-label=\"Heading link\"></a>5. 
Plan for Week 8</h2>\n<ol>\n<li>Build a benchmark runner with category-level pass-rate outputs.</li>\n<li>Expand the dataset, especially with real bad cases and version-conflict samples.</li>\n<li>Add more realistic end-to-end multi-turn runtime tests for Telegram/Discord.</li>\n<li>Continue improving structured diagnostics across planner / executor / composer stages.</li>\n</ol>",
          "like_count": 0,
          "quote_count": 0
        },
        {
          "post_id": 24058,
          "post_number": 32,
          "topic_id": 9995,
          "topic_title": "Spark Program | Nervos Brain - A Global Developer Onboarding Engine and Cross-Language Hub Powered by Agentic RAG",
          "topic_slug": "spark-program-nervos-brain-a-global-developer-onboarding-engine-and-cross-language-hub-powered-by-agentic-rag",
          "author": "IrisNeko",
          "created_at": "2026-04-28T12:00:25.555000+00:00",
          "updated_at": "2026-04-28T12:00:25.555000+00:00",
          "reply_to_post_number": 30,
          "url": "https://talk.nervos.org/t/spark-program-nervos-brain-a-global-developer-onboarding-engine-and-cross-language-hub-powered-by-agentic-rag/9995/32",
          "content_text": "Thank you for the suggestion.\nI ran a basic evaluation of the system last week, and this week I plan to set up a Telegram trial group and invite committee members to try it out in advance. Suggestions from that trial are welcome and will help me improve the system.\nBest regards.",
          "content_html": "<p>Thank you for the suggestion.</p>\n<p>I ran a basic evaluation of the system last week, and this week I plan to set up a Telegram trial group and invite committee members to try it out in advance. Suggestions from that trial are welcome and will help me improve the system.</p>\n<p>Best regards.</p>",
          "like_count": 0,
          "quote_count": 0
        },
        {
          "post_id": 24059,
          "post_number": 33,
          "topic_id": 9995,
          "topic_title": "Spark Program | Nervos Brain - A Global Developer Onboarding Engine and Cross-Language Hub Powered by Agentic RAG",
          "topic_slug": "spark-program-nervos-brain-a-global-developer-onboarding-engine-and-cross-language-hub-powered-by-agentic-rag",
          "author": "zz_tovarishch",
          "created_at": "2026-04-28T14:03:27.779000+00:00",
          "updated_at": "2026-04-28T14:03:27.779000+00:00",
          "reply_to_post_number": 32,
          "url": "https://talk.nervos.org/t/spark-program-nervos-brain-a-global-developer-onboarding-engine-and-cross-language-hub-powered-by-agentic-rag/9995/33",
          "content_text": "Hi IrisNeko, the forum has now integrated an AI translation tool, so Spark no longer requires projects to publish bilingual versions of the content they post on Talk.\nLooking forward to the project's continued progress!",
          "content_html": "<p>Hi IrisNeko, the forum has now integrated an AI translation tool, so Spark no longer requires projects to publish bilingual versions of the content they post on Talk.</p>\n<p>Looking forward to the project's continued progress!</p>",
          "like_count": 0,
          "quote_count": 0
        }
      ]
    },
    {
      "topic_id": 10204,
      "title": "Discontinuation of the DAO v1.1 project",
      "slug": "discontinuation-of-the-dao-v1-1-project",
      "url": "https://talk.nervos.org/t/discontinuation-of-the-dao-v1-1-project/10204",
      "created_at": "2026-04-23T05:07:43.422000+00:00",
      "last_posted_at": "2026-04-28T11:56:27.570000+00:00",
      "category_id": 40,
      "tags": [],
      "posters": [
        "Original Poster, Most Recent Poster",
        "Frequent Poster",
        "Frequent Poster",
        "Frequent Poster",
        "Frequent Poster"
      ],
      "recent_posts": [
        {
          "post_id": 24057,
          "post_number": 11,
          "topic_id": 10204,
          "topic_title": "Discontinuation of the DAO v1.1 project",
          "topic_slug": "discontinuation-of-the-dao-v1-1-project",
          "author": "_magicsheep",
          "created_at": "2026-04-28T11:56:27.570000+00:00",
          "updated_at": "2026-04-28T11:56:27.570000+00:00",
          "reply_to_post_number": 7,
          "url": "https://talk.nervos.org/t/discontinuation-of-the-dao-v1-1-project/10204/11",
          "content_text": "In consideration of Terry’s advice, the following updates are provided regarding the closure of DAO v1.1:\nPayment: The proposal team will retain the payment corresponding to the already‑delivered Milestone 1.\nCode: The code will remain open source and is accessible at this repository. A total of eight repositories encompass all code for the DAO v1.1 platform (excluding Web5 services, which are available here). Additionally, the community may access the open‑source vote auditor tool here.\nContract: The voting contract deployed on the mainnet has been terminated, whereas the testnet voting contract continues to operate. The did:ckb contracts remain active on both the mainnet and the testnet.\nServer: Servers supporting the current DAO v1.1 platform will be decommissioned shortly. Following this action, services for both the mainnet and the testnet will be terminated.\nDomain name: Domain resolution for ccfdao.dev and ccfdao.org will be terminated in the near future. Consequently, the documentation website at https://docs.ccfdao.org/ will also be shut down. However, relevant documentation can be found within the source code repository accessible here.\nThis marks the official conclusion of the DAO v1.1 project. The proposal team wishes to once again express its gratitude to all those who provided constructive feedback and support. May the community identify a suitable governance model at a future date. Thank you",
          "content_html": "<p>In consideration of Terry’s advice, the following updates are provided regarding the closure of DAO v1.1:</p>\n<ol>\n<li>\n<p><strong>Payment</strong>: The proposal team will retain the payment corresponding to the already‑delivered Milestone 1.</p>\n</li>\n<li>\n<p><strong>Code</strong>: The code will remain open source and is accessible at <a href=\"https://github.com/CCF-DAO1-1\" rel=\"noopener nofollow ugc\">this repository</a>. A total of eight repositories encompass all code for the DAO v1.1 platform (excluding Web5 services, which are available <a href=\"https://github.com/web5fans\" rel=\"noopener nofollow ugc\">here</a>). Additionally, the community may access the open‑source vote auditor tool <a href=\"https://github.com/CCF-DAO1-1/ccfdao-vote-auditor-rfc\" rel=\"noopener nofollow ugc\">here</a>.</p>\n</li>\n<li>\n<p><strong>Contract</strong>: The voting contract deployed on the mainnet has been terminated, whereas the testnet voting contract continues to operate. The <code>did:ckb</code> contracts remain active on both the mainnet and the testnet.</p>\n</li>\n<li>\n<p><strong>Server</strong>: Servers supporting the current DAO v1.1 platform will be decommissioned shortly. Following this action, services for both the mainnet and the testnet will be terminated.</p>\n</li>\n<li>\n<p><strong>Domain name</strong>: Domain resolution for <code>ccfdao.dev</code> and <code>ccfdao.org</code> will be terminated in the near future. Consequently, the documentation website at <a href=\"https://docs.ccfdao.org/\" rel=\"noopener nofollow ugc\">https://docs.ccfdao.org/</a> will also be shut down. However, relevant documentation can be found within the source code repository accessible <a href=\"https://github.com/CCF-DAO1-1/ccfdao-v1.1-docs\" rel=\"noopener nofollow ugc\">here</a>.</p>\n</li>\n</ol>\n<p>This marks the official conclusion of the DAO v1.1 project. 
The proposal team wishes to once again express its gratitude to all those who provided constructive feedback and support. May the community identify a suitable governance model at a future date. Thank you <img src=\"https://talk.nervos.org/images/emoji/apple/slight_smile.png?v=15\" title=\":slight_smile:\" class=\"emoji\" alt=\":slight_smile:\" loading=\"lazy\" width=\"20\" height=\"20\"></p>",
          "like_count": 0,
          "quote_count": 0
        }
      ]
    },
    {
      "topic_id": 10199,
      "title": "Cellora — designing a production indexing and query service for CKB (feedback welcome)",
      "slug": "cellora-designing-a-production-indexing-and-query-service-for-ckb-feedback-welcome",
      "url": "https://talk.nervos.org/t/cellora-designing-a-production-indexing-and-query-service-for-ckb-feedback-welcome/10199",
      "created_at": "2026-04-22T15:33:26.290000+00:00",
      "last_posted_at": "2026-04-28T08:40:25.015000+00:00",
      "category_id": 32,
      "tags": [
        "CKB",
        "Nervos-项目动态",
        "dapp",
        "testnet"
      ],
      "posters": [
        "Original Poster",
        "Frequent Poster",
        "Most Recent Poster"
      ],
      "recent_posts": [
        {
          "post_id": 24055,
          "post_number": 4,
          "topic_id": 10199,
          "topic_title": "Cellora — designing a production indexing and query service for CKB (feedback welcome)",
          "topic_slug": "cellora-designing-a-production-indexing-and-query-service-for-ckb-feedback-welcome",
          "author": "ArthurZhang",
          "created_at": "2026-04-28T08:40:25.015000+00:00",
          "updated_at": "2026-04-28T08:51:09.950000+00:00",
          "reply_to_post_number": 3,
          "url": "https://talk.nervos.org/t/cellora-designing-a-production-indexing-and-query-service-for-ckb-feedback-welcome/10199/4",
          "content_text": "Just came across this thread and found it interesting, so I’ll try to offer a few suggestions. I think the honest answer is:\nFor tx inclusion proofs, the practical first step is likely not Flyclient, but exposing CKB’s existing get_transaction_proof / verify_transaction_proof path through Cellora. That lets clients verify that a transaction is committed under a particular block header, rather than merely trusting Cellora’s indexed result. This easily moves Cellora from a purely trusted indexer toward an inclusion-verifiable indexer.\nFor full historical / chain-tip trust minimisation, my narrower point is that I’m not sure there is a canonical Rust/TS wallet-side verifier package that app developers can just plug into today. So for Cellora v1, I’d probably keep MMR/Flyclient-style support as a later integration layer, not a hard requirement.",
          "content_html": "<p>Just came across this thread and found it interesting, so I’ll try to offer a few suggestions. I think the honest answer is:</p>\n<p><strong>For tx inclusion proofs</strong>, the practical first step is likely <strong>not Flyclient</strong>, but exposing CKB’s existing <code>get_transaction_proof</code> / <code>verify_transaction_proof</code> path through Cellora. That lets clients verify that a transaction is committed under a particular block header, rather than merely trusting Cellora’s indexed result. This easily moves Cellora from a purely trusted indexer toward an inclusion-verifiable indexer.</p>\n<p><strong>For full historical / chain-tip trust minimisation,</strong> my narrower point is that I’m not sure there is a canonical Rust/TS wallet-side verifier package that app developers can just plug into today. So for Cellora v1, I’d probably keep MMR/Flyclient-style support as a later integration layer, not a hard requirement.</p>",
          "like_count": 0,
          "quote_count": 0
        }
      ]
    }
  ]
}