ducklm/FOR_AI_REVIEW.md

DuckLM Runtime Architecture Review

🧠 1. System Overview

What is runtime? Runtime is the execution substrate of the system: a multi-layered cognitive execution environment that orchestrates LLMs, tools, memory, and permissions into a unified agentic workflow. Concretely, it is the RuntimeController, which composes RuntimeLoop, ExecutionEngine, ContextBuilder, AsyncRouter, PermissionService, and EventBus.

What is the core loop? The core loop is the RuntimeLoop.run_task() method: it receives a UserTask, applies permission hard-stop checks, creates task state, builds context via ContextBuilder, routes via AsyncRouter to get a directive, executes via ExecutionEngine, applies Critic evaluation, saves via MemoryPolicy, publishes RuntimeEvents through EventBus, and returns streaming output.

Models (Orchestrator / Coder / Critic / Utility)

  • Orchestrator (OrchestratorAdapter/AsyncOrchestratorAdapter): LLM that decides plan vs direct respond vs tool; generates ExecutionDirective of type plan, tool, respond, fail, etc.
  • Coder (CoderAdapter/AsyncCoderAdapter): LLM specialized for code generation and manipulation.
  • Critic (CriticAdapter/AsyncCriticAdapter): Evaluates tool outputs with JSON scoring (correctness, usefulness, safety, memory_store, weight).
  • Utility: The sys_util orchestrator, a fallback orchestration layer for system-level operations.

What is "truth"? (Event Store / State Store)

  • Event Store (SQLiteEventStore): Immutable append-only log of RuntimeEvents per task. Source of truth for "what happened."
  • State Store (SQLiteTaskStateStore): Current mutable task state (status, last_directive, pending requests). "Current truth" of task progress.
  • Checkpoint Store (SQLiteCheckpointStore): Snapshots of task state + context at milestones.
  • Memory Store (MemoryStore + VectorIndex): Long-term knowledge base with weighted entries.

🔁 2. End-to-End Flow

High-Level Flow (as seen in logs)

User Input
→ Router (AsyncRouter.decide)
→ Context Builder (ContextBuilder.build)
→ Orchestrator (decides plan vs direct)
→ Plan / Direct Action
→ Execution Engine
→ Tool Layer (ToolRegistry + ToolSandbox)
→ Critic (AsyncCriticAdapter)
→ Memory Policy (MemoryWritePolicy)
→ Event Bus (SQLiteEventStore)
→ Streaming Output (via WebSocket / SSE)

Conversation Flow

  1. Router decides plan vs respond vs tool vs fail based on orchestrator output or intent parser.
  2. Context Builder enriches task with memory context, tool context, execution context, and safety constraints.
  3. Orchestrator (or direct respond) produces the initial ExecutionDirective.
  4. Execution Engine schedules via ExecutionScheduler, then executes:
    • plan → parse into PlanSteps, build task graph, execute ready steps
    • tool → validate tool existence, check permissions, execute via ToolRegistry
    • respond → direct response
    • fail → immediate failure
  5. Tool Layer (ToolRegistry + ToolSandbox):
    • Plugin discovery via ToolDiscovery
    • Manifest-based tool registration
    • Sandboxed execution with timeout
  6. Critic evaluates tool results (if enabled), outputs CriticScore JSON.
  7. Memory Policy decides whether to insert tool_result, critique, plan, fact, summary, or user_preference into memory.
  8. Event Bus (SQLiteEventStore) publishes each RuntimeEvent with sequence ordering.
  9. Streaming Output replays events via WebSocket and sends incremental responses.
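
Step 4 of the flow above can be sketched as a small dispatcher. This is a minimal illustration under stated assumptions, not the actual ExecutionEngine: the ExecutionDirective fields (type, payload), the run_plan and run_tool callbacks, and the "completed" status string are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ExecutionDirective:
    # Hypothetical shape: the document only names the directive types.
    type: str
    payload: dict[str, Any] = field(default_factory=dict)


Handler = Callable[[dict[str, Any]], dict[str, Any]]


def execute_directive(
    directive: ExecutionDirective, run_plan: Handler, run_tool: Handler
) -> dict[str, Any]:
    """Route a directive to the matching execution path."""
    if directive.type == "plan":
        return run_plan(directive.payload)
    if directive.type == "tool":
        return run_tool(directive.payload)
    if directive.type == "respond":
        # Direct response: nothing to schedule or sandbox.
        text = directive.payload.get("text", "")
        return {"status": "completed", "result": {"text": text}}
    if directive.type == "fail":
        reason = directive.payload.get("reason", "unspecified")
        return {"status": "failed", "result": {"error": reason}}
    # Unknown directive types fail closed rather than being ignored.
    return {
        "status": "failed",
        "result": {"error": f"Unknown directive type: {directive.type}"},
    }
```

Failing closed on an unknown type keeps a malformed orchestrator output from silently turning into a no-op.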

Failure Flow

  • Invalid JSON flow: ExecutionScheduler.parse_plan_steps catches JSONDecodeError / ValueError / TypeError, logs warning, returns empty steps → plan fails with "Failed to parse plan steps."
  • Tool failure flow: Tool execution returns {"status": "failed", "result": {"error": "..."}} → ExecutionEngine returns failed status → task state updated → event TASK_FAILED published → stops further plan steps.
  • Critic failure flow: _evaluate_with_critic catches exception, logs warning, publishes CRITIC_RESULT with error → critic_score is None → execution continues without critique.
  • Orchestrator fallback flow: If the primary orchestrator fails or is missing, AsyncRouter falls back to the sys_util (utility) orchestrator for system-level decisions.
  • Permission denial flow: PermissionService.check_shell_command / check_write_path returns decision: "hard_stop" or decision: "deny" → immediate failure with blocked reason; if decision: "prompt" → TASK_AWAITING_PERMISSION state.

Repair Flow (JSON / Tool-call)

  • Repair is triggered via resolve_permission or resolve_secret endpoints.
  • Permission repair: user provides decision ("allow_once"/"allow_always"/"deny"/"ask_always") → PermissionService.resolve_permission → updates state → retries original directive.
  • Secret repair: user provides secret string → ExecutionEngine.execute with secret_override → continues execution.

⚙️ 3. Component Breakdown

runtime_loop (RuntimeLoop)

  • Responsibility: Central task coordination; state management; event publishing.
  • Input: UserTask
  • Output: {"task_id", "status", "directive", "result", "events"}
  • Must NOT do: Direct LLM calls (delegates to router/execution_engine); bypass state store.

execution_engine (ExecutionEngine)

  • Responsibility: Execute directives (plan/tool/respond/fail); integrate critic; interface with tool registry.
  • Input: UserTask, ExecutionDirective, optional permission_override, secret_override
  • Output: {"status", "result", "step_results"}
  • Must NOT do: Bypass permission checks; skip critic evaluation when enabled; leak secrets in logs.

scheduler (ExecutionScheduler)

  • Responsibility: Parse plan JSON, build task dependency graph, yield ready steps, detect cycles.
  • Input: JSON plan string, task_id
  • Output: list[PlanStep]
  • Must NOT do: Execute anything; modify task state directly.

tool_registry (ToolRegistry)

  • Responsibility: Register/manifest tools; execute via ToolSandbox; provide schema metadata.
  • Input: tool name, args dict
  • Output: ToolResult
  • Must NOT do: Bypass sandbox; execute privileged host commands without sandbox.

event_bus (EventBus → SQLiteEventStore)

  • Responsibility: Append-only event persistence; sequence numbering; per-task query.
  • Input: RuntimeEvent
  • Output: event stream
  • Must NOT do: Modify state store directly (state is separate); delete or mutate events.
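
The append-only, per-task sequencing contract above can be illustrated with a small SQLite-backed log. This is a sketch only: the class name AppendOnlyEventLog, the table schema, and the method names are assumptions for the example and are not taken from SQLiteEventStore.

```python
import json
import sqlite3
import uuid
from typing import Any


class AppendOnlyEventLog:
    """Hypothetical stand-in for an event store: insert and read only."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "event_id TEXT PRIMARY KEY, task_id TEXT NOT NULL, "
            "sequence INTEGER NOT NULL, type TEXT NOT NULL, "
            "payload TEXT NOT NULL, UNIQUE (task_id, sequence))"
        )

    def append(self, task_id: str, event_type: str, payload: dict[str, Any]) -> int:
        # One transaction: compute the next per-task sequence, then insert.
        with self.conn:
            (sequence,) = self.conn.execute(
                "SELECT COALESCE(MAX(sequence), 0) + 1 FROM events WHERE task_id = ?",
                (task_id,),
            ).fetchone()
            self.conn.execute(
                "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
                (str(uuid.uuid4()), task_id, sequence, event_type, json.dumps(payload)),
            )
        return sequence

    def for_task(self, task_id: str) -> list[tuple[int, str, dict[str, Any]]]:
        # Replay in sequence order; there is deliberately no update or delete API.
        rows = self.conn.execute(
            "SELECT sequence, type, payload FROM events "
            "WHERE task_id = ? ORDER BY sequence",
            (task_id,),
        )
        return [(seq, kind, json.loads(raw)) for seq, kind, raw in rows]
```

The UNIQUE (task_id, sequence) constraint is one way to make a duplicate sequence number fail loudly instead of corrupting replay order.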

memory (MemoryInterface → MemoryStore + VectorIndex)

  • Responsibility: Store/retrieve weighted memory entries; vector similarity search; integrate with context builder.
  • Input: text, kind, source, weight, metadata
  • Output: search results or insertion confirmation
  • Must NOT do: Expose raw embeddings without access control; store secrets.

🧩 4. Data Contracts

PlanStep

id: str
kind: Literal["tool", "coder", "memory", "respond"]
tool: str | None
args: dict[str, Any]
description: str
requires_confirmation: bool
depends_on: list[str]

Real example (from router prompt engineering): {"id":"step-0","kind":"tool","tool":"shell_exec","args":{"command":"ls -la"},"description":"List directory","requires_confirmation":false,"depends_on":[]}

ToolCall

tool: str
args: dict[str, Any]
task_id: str
step_id: str

Real log: TOOL_CALLED event with {"tool":"shell_exec","args":{"command":"pwd"},"task_id":"xyz","step_id":"step-0"}

ToolResult

tool: str
ok: bool
output: Any
error: str | None
metadata: dict[str, Any]

Real output: {"tool":"shell_exec","ok":true,"output":"/app","error":null,"metadata":{}}

RuntimeEvent

event_id: str
task_id: str
session_id: str
sequence: int
type: str  # e.g. TASK_RECEIVED, TOOL_CALLED, TASK_COMPLETED
payload: dict[str, Any]
causation_id: str | None
correlation_id: str

Real event stream: TASK_RECEIVED → CONTEXT_BUILT → PLAN_STARTED → TOOL_CALLED → TOOL_COMPLETED → TASK_COMPLETED

MemoryEntry

id: str
text: str
kind: Literal["tool_result","plan","critique","fact","summary","user_preference"]
source: Literal["tool","critic","user","system"]
weight: float
task_id: str | None
session_id: str | None
metadata: dict[str, Any]
embedding_model: str
embedding_dim: int

Real insertion: After critic evaluation, kind="critique", source="critic", weight=0.85, metadata includes scores.
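
A sketch of how a critic score might become a critique entry, assuming the CriticScore JSON carries the correctness, usefulness, safety, memory_store, and weight fields listed in the system overview. The function name critique_to_memory_entry and the choice to store the serialized scores as text are hypothetical; the embedding fields are omitted because the document does not say how they are populated.

```python
from __future__ import annotations

import json
from typing import Any


def critique_to_memory_entry(
    score: dict[str, Any] | None, task_id: str
) -> dict[str, Any] | None:
    """Build a critique memory entry, or None when nothing should be stored."""
    # No score (for example, the critic failed) or the critic opted out.
    if not score or not score.get("memory_store"):
        return None
    scores = {key: score[key] for key in ("correctness", "usefulness", "safety")}
    return {
        "text": json.dumps(scores, sort_keys=True),  # assumption: text form of scores
        "kind": "critique",
        "source": "critic",
        "weight": float(score["weight"]),
        "task_id": task_id,
        "metadata": {"scores": scores},
    }
```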


🔥 5. Failure Modes

Invalid JSON Flow

  • Trigger: Malformed plan JSON (e.g., missing braces, non-JSON string).
  • Detection: parse_plan_steps catches JSONDecodeError / ValueError / TypeError.
  • Result: Warning logged, empty steps returned → PLAN_FAILED with "Failed to parse plan steps from directive".
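
The detection step can be sketched as follows. This is a minimal approximation of the behavior described, assuming the plan is a JSON array of PlanStep objects; the dataclass defaults and the logger name are assumptions, and the real parse_plan_steps may validate more than this.

```python
from __future__ import annotations

import json
import logging
from dataclasses import dataclass, field
from typing import Any

logger = logging.getLogger("ducklm.sketch")  # hypothetical logger name


@dataclass
class PlanStep:
    id: str
    kind: str
    tool: str | None = None
    args: dict[str, Any] = field(default_factory=dict)
    description: str = ""
    requires_confirmation: bool = False
    depends_on: list[str] = field(default_factory=list)


def parse_plan_steps(raw: str) -> list[PlanStep]:
    """Return parsed steps, or an empty list when the plan is malformed."""
    try:
        data = json.loads(raw)
        if not isinstance(data, list):
            raise ValueError("plan must be a JSON array of steps")
        # Missing or unexpected keys, or non-object items, raise TypeError.
        return [PlanStep(**item) for item in data]
    except (json.JSONDecodeError, ValueError, TypeError) as exc:
        logger.warning("Failed to parse plan steps: %s", exc)
        return []
```

Returning an empty list instead of raising lets the caller treat "no steps" as the single signal for failing the plan.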

Tool Failure Flow

  • Trigger: Tool returns ok=False or raises exception in sandbox.
  • Detection: _execute_tool checks tool_result.ok.
  • Result: Status "failed", result contains {"error": "...", "failed_step": step.id, "step_results": [...]} → TASK_FAILED event; further plan steps skipped.
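
A sketch of the stop-on-first-failure behavior, assuming steps run sequentially. The run_steps function, the (step_id, call) tuple shape, and the "completed" status are hypothetical; only the ToolResult fields and the failed result keys come from this document.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolResult:
    tool: str
    ok: bool
    output: Any = None
    error: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


def run_steps(
    steps: list[tuple[str, dict[str, Any]]],
    execute: Callable[[dict[str, Any]], ToolResult],
) -> dict[str, Any]:
    """Run steps in order and stop at the first failed tool result."""
    step_results: list[dict[str, Any]] = []
    for step_id, call in steps:
        try:
            result = execute(call)
        except Exception as exc:  # an exception counts as a failed result
            result = ToolResult(tool=call.get("tool", "?"), ok=False, error=str(exc))
        step_results.append(
            {"step_id": step_id, "ok": result.ok, "output": result.output}
        )
        if not result.ok:
            # Remaining steps are skipped; the caller publishes TASK_FAILED.
            return {
                "status": "failed",
                "result": {
                    "error": result.error or "tool failed",
                    "failed_step": step_id,
                    "step_results": step_results,
                },
            }
    return {"status": "completed", "result": {"step_results": step_results}}
```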

Critic Failure Flow

  • Trigger: Critic adapter raises exception or returns non-JSON output.
  • Detection: _evaluate_with_critic catches exception, logs warning.
  • Result: Event CRITIC_RESULT with error payload → critic_score = None → execution continues without critique; memory write skipped.
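
The non-fatal critic path might look like the sketch below. The function signature, the publish callback, and the payload keys are assumptions; the document only states that exceptions are caught, a warning is logged, CRITIC_RESULT is published with an error, and the score becomes None.

```python
from __future__ import annotations

import json
import logging
from typing import Any, Callable

logger = logging.getLogger("ducklm.sketch")  # hypothetical logger name


def evaluate_with_critic(
    critic: Callable[[Any], str],
    tool_output: Any,
    publish: Callable[[str, dict[str, Any]], None],
) -> dict[str, Any] | None:
    """Return the parsed critic score, or None if the critic fails."""
    try:
        score = json.loads(critic(tool_output))
        if not isinstance(score, dict):
            raise ValueError("critic output must be a JSON object")
    except Exception as exc:
        # Critic problems never fail the task: log, record the error, move on.
        logger.warning("Critic evaluation failed: %s", exc)
        publish("CRITIC_RESULT", {"error": str(exc)})
        return None
    publish("CRITIC_RESULT", {"score": score})
    return score
```

Because the return value is None on failure, a caller can skip the memory write with a single check.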

Orchestrator Fallback Flow

  • Trigger: Primary orchestrator model unavailable or returns invalid directive.
  • Detection: _ensure_orchestrator returns None; router falls back to sys_util orchestrator.
  • Result: Utility orchestrator handles system-level decisions (e.g., file operations, environment queries).

Permission Denial Flow

  • Trigger: PermissionService returns decision: "hard_stop" or "deny".
  • Detection: _execute_tool checks permission_result.
  • Result: Immediate failure with "Command blocked: ..." → TASK_FAILED; no tool execution.
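
The decision handling can be summarized in a small mapping function. This is a sketch under assumptions: the "allow" decision, the status strings, and the run_tool flag are hypothetical; the "hard_stop", "deny", and "prompt" decisions and the two event names come from this document.

```python
from typing import Any


def apply_permission_decision(decision: str, reason: str) -> dict[str, Any]:
    """Translate a permission decision into the next task outcome."""
    blocked = {
        "status": "failed",
        "event": "TASK_FAILED",
        "result": {"error": f"Command blocked: {reason}"},
        "run_tool": False,
    }
    if decision in ("hard_stop", "deny"):
        return blocked
    if decision == "prompt":
        # Park the task until the user resolves the permission request.
        return {
            "status": "awaiting_permission",
            "event": "TASK_AWAITING_PERMISSION",
            "result": {"reason": reason},
            "run_tool": False,
        }
    if decision == "allow":  # assumption: name of the permitting decision
        return {"status": "running", "event": None, "result": {}, "run_tool": True}
    # Unknown decisions fail closed.
    return blocked
```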

🧠 6. "Decision Logic Map"

Orchestrator vs Direct Respond

  • Use orchestrator when: task requires planning, multi-step tool usage, or unknown intent. Orchestrator decides to emit plan or tool directive.
  • Direct respond when: intent parser classifies as simple query (TASK_RECEIVED → router.intent_parser → respond directive) or respond directive explicitly set.

Utility Model Call

  • Invoked when sys_util orchestrator is loaded (configurable). Used for system-level operations: environment inspection, file system queries, or when primary orchestrator fails and fallback is needed.

Retry Logic

  • Planner retry: ExecutionScheduler has retry_limit=2; on a parse or validation failure it retries up to that limit before failing the plan.
  • Tool retry: Not implemented natively; retry must be encoded in plan steps (depends_on, manual replan).
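
A sketch of the planner retry, assuming retry_limit counts retries after the first attempt (so up to three attempts in total with retry_limit=2); the document does not state which way the limit is counted. The generate and parse callbacks are hypothetical stand-ins for the orchestrator call and parse_plan_steps.

```python
from typing import Any, Callable


def plan_with_retry(
    generate: Callable[[], str],
    parse: Callable[[str], list[Any]],
    retry_limit: int = 2,
) -> list[Any]:
    """Ask for a plan until it parses, giving up after retry_limit retries."""
    for _attempt in range(retry_limit + 1):  # first try plus retry_limit retries
        steps = parse(generate())
        if steps:
            return steps
    # Every attempt produced an unusable plan: the caller fails the plan.
    return []
```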

Plan Creation

  • Trigger: Orchestrator output contains {type: "plan", ...} or explicit plan directive.
  • Process: parse_plan_steps → validate_no_cycles → build_task_graph → ready steps execution.
  • No plan: Orchestrator outputs respond or tool → direct execution.
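
Cycle validation and ready-step selection over depends_on can be illustrated with the standard library's graphlib. This is a stand-in technique for the example, not necessarily how validate_no_cycles or build_task_graph are implemented; the ready_batches function is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def ready_batches(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group step ids into batches whose dependencies are already done.

    depends_on maps each step id to the ids it depends on.
    Raises CycleError if the dependencies contain a cycle.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # cycle detection happens here
    batches: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps with all dependencies met
        batches.append(ready)
        sorter.done(*ready)  # mark them finished to unlock their dependents
    return batches


if __name__ == "__main__":
    plan = {"step-0": [], "step-1": ["step-0"], "step-2": ["step-0"]}
    print(ready_batches(plan))  # [['step-0'], ['step-1', 'step-2']]
    try:
        ready_batches({"a": ["b"], "b": ["a"]})
    except CycleError:
        print("cycle detected")
```

Steps in the same batch have no dependencies on each other, so an executor could run them concurrently.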

🧰 7. Tool System Architecture

Plugin Discovery

  • ToolDiscovery scans app/tools/plugins/ for modules exporting Tool classes.
  • Discovers: shell_exec, file_read, file_write, memory (search/insert/list).

Manifest-Based Tools

  • Each plugin has a manifest.json with:
    • description: human-readable docstring.
    • args_schema: JSON schema for validation.
    • requires_permission: boolean for privileged tools (shell_exec, file_write).
  • On discovery, registry registers tool and stores schema for permission/routing.
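
A sketch of manifest loading and registration, checking only the three fields listed above. The load_manifest and register helpers and the plain-dict registry are assumptions for illustration; the real ToolRegistry may validate args_schema more strictly.

```python
import json
from typing import Any

# Field names come from the manifest description; the type checks are assumptions.
REQUIRED_FIELDS = {
    "description": str,
    "args_schema": dict,
    "requires_permission": bool,
}


def load_manifest(text: str) -> dict[str, Any]:
    """Parse manifest.json content and verify the required fields."""
    manifest = json.loads(text)
    if not isinstance(manifest, dict):
        raise ValueError("manifest must be a JSON object")
    for key, expected in REQUIRED_FIELDS.items():
        if not isinstance(manifest.get(key), expected):
            raise ValueError(
                f"manifest field {key!r} must be of type {expected.__name__}"
            )
    return manifest


def register(registry: dict[str, dict[str, Any]], name: str, text: str) -> None:
    """Store the validated manifest so permission and routing can consult it."""
    registry[name] = load_manifest(text)
```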

Registry Bootstrap

  • RuntimeController._create_tool_registry() initializes discovery, loads plugins, registers with init mapping (sandbox, permissions).
  • Tools are initialized once at startup; tool_registry is shared across executions.

Execution Isolation

  • ToolSandbox:
    • Restricts filesystem to allowed_root (project base dir).
    • Timeout per execution (step_timeout_ms).
    • Blocks sudo without secret override; requires secret injection for sudo commands.
  • Permission gating: shell_exec and file_write require explicit permission decision before execution.
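
Two of the isolation rules above, the allowed_root restriction and the per-step timeout, can be sketched with the standard library. The helper names resolve_inside and run_with_timeout are hypothetical, and this is not the ToolSandbox implementation; a path check like this does not by itself provide process-level isolation.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path
from typing import Any


def resolve_inside(allowed_root: Path, candidate: str) -> Path:
    """Resolve candidate under allowed_root, rejecting paths that escape it."""
    root = allowed_root.resolve()
    target = (root / candidate).resolve()  # collapses ".." and follows symlinks
    if not target.is_relative_to(root):  # requires Python 3.9+
        raise PermissionError(f"{candidate!r} escapes allowed_root")
    return target


def run_with_timeout(
    argv: list[str], cwd: str, step_timeout_ms: int
) -> dict[str, Any]:
    """Run a command and report a timeout as a failed result."""
    try:
        proc = subprocess.run(
            argv,
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=step_timeout_ms / 1000,  # subprocess expects seconds
        )
    except subprocess.TimeoutExpired:
        return {
            "ok": False,
            "output": None,
            "error": f"timed out after {step_timeout_ms} ms",
        }
    return {
        "ok": proc.returncode == 0,
        "output": proc.stdout,
        "error": proc.stderr or None,
    }
```

Passing the command as an argument list rather than a shell string avoids shell interpolation of untrusted input.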