{
  "id": "iris-scratch-4d45b45f67",
  "scope": "redkey",
  "source_of_truth": "repo",
  "source_path": "docs/specs/iris-scratch.md",
  "source_kind": "markdown",
  "visibility": "internal",
  "renderer_id": "design_doc.dreamborn-forge.generated.v1",
  "design_system": "dreamborn-design-system:forge",
  "generated_at": "2026-05-09T13:00:55.765Z",
  "artifact_type": "design_doc",
  "schema_version": "design_doc.generated.v1",
  "title": "Iris — Governance Scratch Pad",
  "summary": "Iris — Governance Scratch Pad Agent: iris Topic: 0.0.8718898 Status: defined, not yet implemented Living doc. Log observed failure modes, design notes, and governance scenarios here before formal spec work begins. Failure Modes Observed in Production 1. Zombie Task (observed 2026-04-24) What happened: Quinn claimed task fa1fc353 on roles.developer. Claude se...",
  "format_source": "markdown",
  "sections": [
    {
      "title": "Iris — Governance Scratch Pad",
      "level": 1,
      "body": "Agent: `iris` | Topic: `0.0.8718898` | Status: defined, not yet implemented\n\nLiving doc. Log observed failure modes, design notes, and governance scenarios here before formal spec work begins.\n\n---"
    },
    {
      "title": "1. Zombie Task (observed 2026-04-24)",
      "level": 3,
      "body": "**What happened:** Quinn claimed task `fa1fc353` on `roles.developer`. The Claude session exited with code 1 (crash/max_turns) after ~1 minute. The runner only posts `task.complete` on exit code 0; on failure, it posts nothing. The task remained permanently stuck: a `task.claim` on HCS, with no `task.complete` or `task.blocked` ever following.\n\n**Effect:** The task is unclaimable from anyone else's perspective. The agent can't re-claim it (it's in the completed set). Humans have no visibility without digging through logs.\n\n**Root causes:**\n- Runner doesn't post `task.blocked` on non-zero exit (runner-level fix: patch `base.py`)\n- No external sweep detects \"claimed but unresolved past timeout\"\n\n**Iris's role:** Sweep role topics periodically. For any `task.claim` with no subsequent `task.complete` or `task.blocked` past a timeout (e.g., 30 minutes), post `task.blocked` with reason `\"governance: claim timeout — no completion observed\"` and escalate to `roles.exec`.\n\n---"
    },
    {
      "title": "2. Stale Task Pollution (observed 2026-04-24)",
      "level": 3,
      "body": "**What happened:** Cancelled workflows leave `task.available` messages on role topics. Agents keep claiming them, burning cycles, hitting max_turns, looping forever.\n\n**Effect:** Agents waste resources on tasks that will never produce useful output.\n\n**Iris's role:** On workflow cancellation, sweep role topics for `task.available` messages from that `workflow_instance_id` with no subsequent claim or blocked status. Post `task.blocked` with reason `\"governance: parent workflow cancelled\"`.\n\n---"
    },
    {
      "title": "3. Claim Loop (observed 2026-04-24)",
      "level": 3,
      "body": "**What happened:** Agent hits max_turns and exits without posting `task.complete` or `task.blocked`. Runner restarts, re-claims the same task (winner-is-me verify_claim logic), runs again, and hits max_turns again. Infinite loop.\n\n**Effect:** Burning API spend and HBAR with no progress.\n\n**Iris's role:** Detect the same `task_id` claimed 3+ times by the same agent with no completion between claims. Post `task.blocked` and escalate to `roles.exec`: \"Agent X has claimed task Y 3 times without completing.\"\n\n---"
    },
    {
      "title": "4. Agent Stuck in Working Status",
      "level": 3,
      "body": "**What:** Agent posts `working` status to `agents.<id>` but does not transition to `idle` or `complete` within the expected duration for the task type.\n\n**Iris's role:** Monitor `agent_state` for agents stuck in `working` past the threshold. Alert `roles.exec`.\n\n---"
    },
    {
      "title": "Design Principles",
      "level": 2,
      "body": "- Iris never does work herself — she detects, blocks, and escalates\n- All Iris actions are posted to HCS (`task.blocked`, escalation messages to `roles.exec`) — full audit trail\n- Iris should be idempotent: posting `task.blocked` for an already-blocked task is a no-op\n- Timeout thresholds should be configurable per role/task type (architect tasks take longer than BA tasks)\n- Escalations to `roles.exec` should include: task_id, agent, how long the task has been stuck, and what was last observed"
    },
    {
      "title": "Suggested Timeout Thresholds (draft)",
      "level": 2,
      "body": "| Role | Expected max duration | Iris timeout |\n|---|---|---|\n| `roles.developer` | 20 min | 45 min |\n| `roles.architect` | 15 min | 30 min |\n| `roles.ba` | 10 min | 20 min |\n| `roles.coordinator` | 5 min | 15 min |\n| `roles.exec` | no timeout | no timeout |"
    },
    {
      "title": "Open Questions",
      "level": 2,
      "body": "- Does Iris run on a fixed cron (e.g., every 5 min), or is she event-driven?\n- Should Iris post directly to role topics or only to `roles.exec`?\n- How does Iris know a workflow was cancelled? By reading `workflow_instances.status` from Supabase?\n- Should Iris have her own HCS topic for governance events, or use `agents.iris`?"
    }
  ],
  "html_path": "artifacts/iris-scratch-4d45b45f67.html",
  "json_path": "artifacts/iris-scratch-4d45b45f67.json"
}