design_doc · markdown

Completion Verification — Runner-Owned Task Governance

Completion Verification — Runner Owned Task Governance Date: 2026 04 24 Status: Design Backlog: completion verification Problem Workers currently post task.complete themselves via the hedera tools MCP. The HCS topic receives completions before anything can validate them. The Engine advances workflow state based on the claim alone — there is no verification t...

Date: 2026-04-24 Status: Design Backlog: completion-verification

---

Problem

Workers currently post task.complete themselves via the hedera-tools MCP. The HCS topic receives completions before anything can validate them. The Engine advances workflow state based on the claim alone — there is no verification that work was actually done, correctly, or at all.

This creates three failure modes: 1. Worker produces no output (bug, timeout, misunderstood brief) 2. Worker produces output that exists but is wrong (technically valid, semantically broken) 3. Worker produces output that partially addresses the task (missing criteria)

All three advance the workflow as if complete.

---

Core Principle

The worker's job is to produce output. The runner's job is to verify it and own the HCS protocol.

Workers stop knowing about task.complete. Their prompt says: write your output to {output_path} and stop. The runner handles everything from there — validation, retry, and posting to HCS.

This means only verified completions ever reach HCS. The on-chain record is clean.

---

Completion Contract

Each step in a workflow template defines what "done" looks like. This is the completion contract. The runner reads it from the task brief and validates against it.

role: developer

role: reviewer

role: ba

Completion types:

| Type | What runner checks | |---|---| | file | Path exists, non-empty, optionally min_length met | | signal | Worker output JSON contains the specified signal field | | none | Worker exits cleanly — runner posts complete immediately, no validation |

The evaluate field is optional. When present, Haiku reads it alongside the output to assess semantic correctness.

---

Runner Verification Loop

``` 1. Runner assembles brief (includes completion contract) 2. Invoke Claude CLI — worker attempt 1 3. Mechanical check against completion contract FAIL → skip Haiku, go to step 6 with mechanical failure reason

4. Haiku evaluation (if evaluate field present) Input: original brief + completion contract + worker output Output: pass / fail + specific diagnosis PASS → step 5 FAIL → go to step 6 with Haiku's diagnosis

5. Post task.complete to HCS → done

Original task brief
Completion contract ("here is what done looks like")
Specific failure reason from step 3 or 4
Explicit instruction: produce output that satisfies the contract

7. Invoke Claude CLI — revisionist attempt 8. Mechanical check FAIL → post task.blocked (see below) → done

9. Haiku evaluation (if evaluate field present) PASS → post task.complete → done FAIL → post task.blocked (see below) → done ```

One attempt. If the revisionist can't fix it, block. The diagnosis quality is what makes one attempt sufficient — the worker knows exactly what it got wrong.

---

Haiku Evaluator

Haiku is used for semantic evaluation only — it never produces task output. It receives:

The original task brief
The completion contract (type, path/signal, evaluate criteria)
The worker's output (file content or signal value)

It returns a structured verdict:

``json { "pass": false, "diagnosis": "Output addresses acceptance criteria 1 and 3 but omits error handling for the edge case described in criterion 2. No test coverage for the null input path." }``

Haiku is not invoked when evaluate is absent — mechanical check is sufficient for those steps.

Haiku is skipped entirely when type: none — no output, no evaluation.

Cost profile: Haiku evaluation is a small read-only call against an existing output. Negligible cost relative to the Sonnet worker invocation.

---

task.blocked Payload

When both attempts fail, the runner posts task.blocked with full context:

``json { "type": "task.blocked", "task_id": "uuid", "workflow_instance_id": "uuid", "step_num": 3, "agent": "quinn", "completion_spec": { "type": "file", "path": "feature/abc/output.md", "evaluate": "Must address all acceptance criteria..." }, "attempts": [ { "attempt": 1, "stage": "haiku_evaluation", "failure": "Output addresses criteria 1 and 3 but omits error handling for criterion 2. No test coverage for null input." }, { "attempt": 2, "stage": "mechanical", "failure": "Output file missing at feature/abc/output.md" } ], "ts": "ISO8601" }``

The Engine escalates to Sonnet on task.blocked. Sonnet has enough context to decide: re-post to roles.exec for human review, or attempt a different resolution.

The roles.exec item Justin sees is specific and actionable — not "Quinn got blocked" but the exact criteria, the exact failures, and what each attempt produced.

---

Worker Prompt Change

Current worker prompt includes instruction to post task.complete via hedera-tools MCP. This is removed.

Replacement:

``` When your work is complete, write your output to: {output_path}

Do not post any task.complete or task.blocked messages. The runner handles protocol. Your only job is to produce correct output at the specified path. ```

For type: signal tasks, the worker writes a JSON file to output_path containing the signal field:

``json { "signal": "approved", "reasoning": "..." }``

For type: none tasks, no output instruction is needed — runner posts complete on clean exit.

---

What Moves Where

| Responsibility | Before | After | |---|---|---| | Post task.complete | Worker (via MCP) | Runner | | Post task.blocked | Worker (via MCP) | Runner | | Validate output | Nobody | Runner (mechanical + Haiku) | | Retry on failure | Nobody | Runner (one revisionist attempt) | | Failure diagnosis | Nobody | Haiku evaluator |

---

Implementation Notes

Completion contract fields added to workflow_steps schema (migration)
Runner reads completion block from task brief at startup
Mechanical validation is a simple utility in agents/shared/runner.py
Haiku evaluation is a second Claude CLI invocation with a short, structured prompt
Revisionist brief is assembled by the runner — not a separate agent, just a re-invocation with additional context prepended
task.blocked payload extended with completion_spec + attempts array
Worker persona updated to remove HCS posting instructions

---

What This Delivers

Clean on-chain record — only verified completions reach HCS
Self-healing by default — most failures resolved in the runner, no human needed
Actionable blocks — when escalation is needed, Justin sees exactly why
No output cases handled — type: none is a first-class completion type
Semantic correctness — Haiku catches "technically valid, functionally wrong" before the workflow advances
Cheap — Haiku evaluation adds minimal cost; revisionist only fires on failure