# DuckLM Runtime Architecture Review
## 🧠 1. System Overview
**What is runtime?**
Runtime is the execution substrate of the system — a multi-layered cognitive execution environment that orchestrates LLMs, tools, memory, and permissions into a unified agentic workflow. It's the `RuntimeController` that composes `RuntimeLoop`, `ExecutionEngine`, `ContextBuilder`, `AsyncRouter`, `PermissionService`, and `EventBus`.
**What is the core loop?**
The core loop is the `RuntimeLoop.run_task()` method: it receives a `UserTask`, applies permission hard-stop checks, creates task state, builds context via `ContextBuilder`, routes via `AsyncRouter` to get a `directive`, executes via `ExecutionEngine`, applies `Critic` evaluation, saves via `MemoryPolicy`, publishes `RuntimeEvent`s through `EventBus`, and returns streaming output.
**Models (Orchestrator / Coder / Critic / Utility)**
- **Orchestrator** (`OrchestratorAdapter`/`AsyncOrchestratorAdapter`): LLM that decides plan vs direct respond vs tool; generates `ExecutionDirective` of type `plan`, `tool`, `respond`, `fail`, etc.
- **Coder** (`CoderAdapter`/`AsyncCoderAdapter`): LLM specialized for code generation and manipulation.
- **Critic** (`CriticAdapter`/`AsyncCriticAdapter`): Evaluates tool outputs with JSON scoring (correctness, usefulness, safety, memory_store, weight).
- **Utility**: The `sys_util` orchestrator — a fallback/orchestration layer for system-level operations.
**What is "truth"? (Event Store / State Store)**
- **Event Store** (`SQLiteEventStore`): Immutable append-only log of `RuntimeEvent`s per task. Source of truth for "what happened."
- **State Store** (`SQLiteTaskStateStore`): Current mutable task state (status, last_directive, pending requests). "Current truth" of task progress.
- **Checkpoint Store** (`SQLiteCheckpointStore`): Snapshots of task state + context at milestones.
- **Memory Store** (`MemoryStore` + `VectorIndex`): Long-term knowledge base with weighted entries.
---
## 🔁 2. End-to-End Flow
### High-Level Flow (as seen in logs)
```
User Input
→ Router (AsyncRouter.decide)
→ Context Builder (ContextBuilder.build)
→ Orchestrator (decides plan vs direct)
→ Plan / Direct Action
→ Execution Engine
→ Tool Layer (ToolRegistry + ToolSandbox)
→ Critic (AsyncCriticAdapter)
→ Memory Policy (MemoryWritePolicy)
→ Event Bus (SQLiteEventStore)
→ Streaming Output (via WebSocket / SSE)
```
### Conversation Flow
1. **Router** decides `plan` vs `respond` vs `tool` vs `fail` based on the orchestrator's output or the intent parser.
2. **Context Builder** enriches task with memory context, tool context, execution context, and safety constraints.
3. **Orchestrator** (or direct respond) produces the initial `ExecutionDirective`.
4. **Execution Engine** schedules via `ExecutionScheduler`, then executes:
   - `plan` → parse into `PlanStep`s, build task graph, execute ready steps
   - `tool` → validate tool existence, check permissions, execute via `ToolRegistry`
   - `respond` → direct response
   - `fail` → immediate failure
5. **Tool Layer** (`ToolRegistry` + `ToolSandbox`):
   - Plugin discovery via `ToolDiscovery`
   - Manifest-based tool registration
   - Sandboxed execution with timeout
6. **Critic** evaluates tool results (if enabled), outputs `CriticScore` JSON.
7. **Memory Policy** decides whether to insert `tool_result`, `critique`, `plan`, `fact`, `summary`, or `user_preference` into memory.
8. **Event Bus** (`EventBus`, persisted by `SQLiteEventStore`) publishes `RuntimeEvent`s with sequence ordering.
9. **Streaming Output** replays events via WebSocket and sends incremental responses.
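The dispatch in step 4 can be pictured as a lookup table keyed by directive type. This is a hypothetical sketch: `Directive`, the `run_*` handlers, and the result dicts are illustrative stand-ins loosely shaped like the `{"status", "result", ...}` output listed for `ExecutionEngine`, not the engine's real code. Unknown directive types fail rather than being guessed at.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Result = dict[str, Any]


@dataclass
class Directive:
    """Hypothetical stand-in for ExecutionDirective."""
    type: str  # "plan" | "tool" | "respond" | "fail"
    payload: dict[str, Any] = field(default_factory=dict)


def run_plan(d: Directive) -> Result:
    # Real engine: parse PlanSteps, build the task graph, execute ready steps.
    if not d.payload.get("steps"):
        return {"status": "failed", "result": {"error": "Failed to parse plan steps"}}
    return {"status": "completed", "result": {"step_count": len(d.payload["steps"])}}


def run_tool(d: Directive) -> Result:
    # Real engine: validate the tool exists, check permissions, call ToolRegistry.
    return {"status": "completed", "result": {"tool": d.payload.get("tool")}}


def run_respond(d: Directive) -> Result:
    return {"status": "completed", "result": {"text": d.payload.get("text", "")}}


def run_fail(d: Directive) -> Result:
    return {"status": "failed", "result": {"error": d.payload.get("reason", "directive failed")}}


HANDLERS: dict[str, Callable[[Directive], Result]] = {
    "plan": run_plan,
    "tool": run_tool,
    "respond": run_respond,
    "fail": run_fail,
}


def execute(d: Directive) -> Result:
    handler = HANDLERS.get(d.type)
    if handler is None:  # unknown directive types fail instead of guessing
        return {"status": "failed", "result": {"error": f"unknown directive type: {d.type}"}}
    return handler(d)
```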
### Failure Flow
- **Invalid JSON flow**: `ExecutionScheduler.parse_plan_steps` catches `JSONDecodeError` / `ValueError` / `TypeError`, logs warning, returns empty steps → `plan` fails with "Failed to parse plan steps."
- **Tool failure flow**: Tool execution returns `{"status": "failed", "result": {"error": "..."}}` → ExecutionEngine returns failed status → task state updated → event `TASK_FAILED` published → stops further plan steps.
- **Critic failure flow**: `_evaluate_with_critic` catches exception, logs warning, publishes `CRITIC_RESULT` with error → critic_score is `None` → execution continues without critique.
- **Orchestrator fallback flow**: If the primary orchestrator fails or is missing, `AsyncRouter` falls back to the `sys_util` (utility) orchestrator for system-level decisions.
- **Permission denial flow**: `PermissionService.check_shell_command` / `check_write_path` returns `decision: "hard_stop"` or `decision: "deny"` → immediate failure with blocked reason; if `decision: "prompt"` → task enters the `TASK_AWAITING_PERMISSION` state.
### Repair Flow (JSON / Tool-call)
- Repair is triggered via `resolve_permission` or `resolve_secret` endpoints.
- Permission repair: user provides `decision` ("allow_once"/"allow_always"/"deny"/"ask_always") → `PermissionService.resolve_permission` → updates state → retries original directive.
- Secret repair: user provides secret string → `ExecutionEngine.execute` with `secret_override` → continues execution.
---
## ⚙️ 3. Component Breakdown
### `runtime_loop` (`RuntimeLoop`)
- **Responsibility**: Central task coordination; state management; event publishing.
- **Input**: `UserTask`
- **Output**: `{"task_id", "status", "directive", "result", "events"}`
- **Must NOT do**: Direct LLM calls (delegates to router/execution_engine); bypass state store.
### `execution_engine` (`ExecutionEngine`)
- **Responsibility**: Execute directives (plan/tool/respond/fail); integrate critic; interface with tool registry.
- **Input**: `UserTask`, `ExecutionDirective`, optional `permission_override`, `secret_override`
- **Output**: `{"status", "result", "step_results"}`
- **Must NOT do**: Bypass permission checks; skip critic evaluation when enabled; leak secrets in logs.
### `scheduler` (`ExecutionScheduler`)
- **Responsibility**: Parse plan JSON, build task dependency graph, yield ready steps, detect cycles.
- **Input**: JSON plan string, `task_id`
- **Output**: `list[PlanStep]`
- **Must NOT do**: Execute anything; modify task state directly.
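One way to cover the scheduler's graph duties (dependency graph, ready steps, cycle detection) is the standard-library `graphlib.TopologicalSorter`. This is a swapped-in technique for illustration, not necessarily what `ExecutionScheduler` does internally. Steps are plain dicts carrying the `id` and `depends_on` fields from the `PlanStep` contract; `ready_batches` is a hypothetical helper that, like the scheduler, executes nothing.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Any


def ready_batches(steps: list[dict[str, Any]]) -> list[list[str]]:
    """Group step ids into batches whose dependencies are all satisfied.

    Raises ValueError on a dependency cycle or a reference to an unknown step.
    """
    ids = {s["id"] for s in steps}
    graph: dict[str, set[str]] = {}
    for s in steps:
        deps = set(s.get("depends_on", []))
        missing = deps - ids
        if missing:
            raise ValueError(f"step {s['id']} depends on unknown steps: {sorted(missing)}")
        graph[s["id"]] = deps  # node -> its predecessors

    sorter = TopologicalSorter(graph)
    try:
        sorter.prepare()  # raises CycleError if the graph has a cycle
    except CycleError as exc:
        raise ValueError(f"plan contains a dependency cycle: {exc.args[1]}") from exc

    batches: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose deps are done
        batches.append(ready)
        sorter.done(*ready)  # a live scheduler would call this as each step finishes
    return batches
```

Steps in the same batch have no dependency on each other, so a runtime could execute them concurrently.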
### `tool_registry` (`ToolRegistry`)
- **Responsibility**: Register/manifest tools; execute via `ToolSandbox`; provide schema metadata.
- **Input**: tool name, args dict
- **Output**: `ToolResult`
- **Must NOT do**: Bypass sandbox; execute privileged host commands without sandbox.
### `event_bus` (`EventBus` → `SQLiteEventStore`)
- **Responsibility**: Append-only event persistence; sequence numbering; per-task query.
- **Input**: `RuntimeEvent`
- **Output**: event stream
- **Must NOT do**: Modify state store directly (state is separate); delete or mutate events.
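A minimal append-only log with per-task sequence numbers can be sketched with the standard `sqlite3` module. The table layout, column names, and the `EventStoreSketch` class are assumptions for illustration, not the actual `SQLiteEventStore` schema. The class only ever inserts rows, and a `UNIQUE (task_id, sequence)` constraint rejects duplicate sequence numbers.

```python
import json
import sqlite3
import uuid
from typing import Any


class EventStoreSketch:
    """Hypothetical append-only event log; rows are only ever INSERTed."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS events (
                   event_id TEXT PRIMARY KEY,
                   task_id  TEXT NOT NULL,
                   sequence INTEGER NOT NULL,
                   type     TEXT NOT NULL,
                   payload  TEXT NOT NULL,
                   UNIQUE (task_id, sequence))"""
        )

    def append(self, task_id: str, type_: str, payload: dict[str, Any]) -> int:
        with self.conn:  # commits on success, rolls back on error
            (seq,) = self.conn.execute(
                "SELECT COALESCE(MAX(sequence), 0) + 1 FROM events WHERE task_id = ?",
                (task_id,),
            ).fetchone()
            self.conn.execute(
                "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
                (str(uuid.uuid4()), task_id, seq, type_, json.dumps(payload)),
            )
        return seq

    def for_task(self, task_id: str) -> list[tuple[int, str, dict[str, Any]]]:
        rows = self.conn.execute(
            "SELECT sequence, type, payload FROM events WHERE task_id = ? ORDER BY sequence",
            (task_id,),
        ).fetchall()
        return [(seq, type_, json.loads(p)) for seq, type_, p in rows]
```

With a single connection the read-then-insert is safe; multiple writers would need an explicit `BEGIN IMMEDIATE` or equivalent locking.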
### `memory` (`MemoryInterface` → `MemoryStore` + `VectorIndex`)
- **Responsibility**: Store/retrieve weighted memory entries; vector similarity search; integrate with context builder.
- **Input**: text, kind, source, weight, metadata
- **Output**: search results or insertion confirmation
- **Must NOT do**: Expose raw embeddings without access control; store secrets.
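Weighted similarity search can be illustrated with plain cosine similarity. Multiplying the similarity by the entry's `weight` is an assumed ranking rule chosen for this sketch; the real `MemoryStore`/`VectorIndex` scoring is not described here, and the tiny hand-made vectors stand in for real embeddings. Results carry only `id`, `text`, and `score`, in keeping with the rule against exposing raw embeddings.

```python
import math
from typing import Any


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: list[float], entries: list[dict[str, Any]], k: int = 3) -> list[dict[str, Any]]:
    """Top-k entries by cosine similarity times weight (an assumed ranking rule)."""
    scored = sorted(
        ((cosine(query, e["embedding"]) * e["weight"], e) for e in entries),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Return id, text, and score only, so raw embeddings never leave the store.
    return [{"id": e["id"], "text": e["text"], "score": s} for s, e in scored[:k]]
```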
---
## 🧩 4. Data Contracts
### `PlanStep`
```python
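from typing import Any, Literal, TypedDict


class PlanStep(TypedDict):  # field shape only; the concrete model class may differ
    id: str
    kind: Literal["tool", "coder", "memory", "respond"]
    tool: str | None
    args: dict[str, Any]
    description: str
    requires_confirmation: bool
    depends_on: list[str]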
```
**Real example** (from `router` prompt engineering):
`{"id":"step-0","kind":"tool","tool":"shell_exec","args":{"command":"ls -la"},"description":"List directory","requires_confirmation":false,"depends_on":[]}`
### `ToolCall`
```python
tool: str
args: dict[str, Any]
task_id: str
step_id: str
```
**Real log**: `TOOL_CALLED` event with `{"tool":"shell_exec","args":{"command":"pwd"},"task_id":"xyz","step_id":"step-0"}`
### `ToolResult`
```python
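from typing import Any, TypedDict


class ToolResult(TypedDict):  # field shape only; the concrete model class may differ
    tool: str
    ok: bool
    output: Any
    error: str | None
    metadata: dict[str, Any]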
```
**Real output**: `{"tool":"shell_exec","ok":true,"output":"/app","error":null,"metadata":{}}`
### `RuntimeEvent`
```python
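from typing import Any, TypedDict


class RuntimeEvent(TypedDict):  # field shape only; the concrete model class may differ
    event_id: str
    task_id: str
    session_id: str
    sequence: int
    type: str  # e.g. TASK_RECEIVED, TOOL_CALLED, TASK_COMPLETED
    payload: dict[str, Any]
    causation_id: str | None
    correlation_id: str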
```
**Real event stream**: `TASK_RECEIVED → CONTEXT_BUILT → PLAN_STARTED → TOOL_CALLED → TOOL_COMPLETED → TASK_COMPLETED`
### `MemoryEntry`
```python
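from typing import Any, Literal, TypedDict


class MemoryEntry(TypedDict):  # field shape only; the concrete model class may differ
    id: str
    text: str
    kind: Literal["tool_result", "plan", "critique", "fact", "summary", "user_preference"]
    source: Literal["tool", "critic", "user", "system"]
    weight: float
    task_id: str | None
    session_id: str | None
    metadata: dict[str, Any]
    embedding_model: str
    embedding_dim: int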
```
**Real insertion**: After critic evaluation, `kind="critique"`, `source="critic"`, `weight=0.85`, metadata includes scores.
---
## 🔥 5. Failure Modes
### Invalid JSON Flow
- **Trigger**: Malformed plan JSON (e.g., missing braces, non-JSON string).
- **Detection**: `parse_plan_steps` catches `JSONDecodeError` / `ValueError` / `TypeError`.
- **Result**: Warning logged, empty steps returned → `PLAN_FAILED` with `"Failed to parse plan steps from directive"`.
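The parse-or-empty behavior can be sketched as follows. `parse_plan_steps_sketch` and the assumption that a plan is a top-level JSON array of step objects are illustrative; the caught exception types match the ones named above. Since `json.JSONDecodeError` subclasses `ValueError`, listing both is explicit rather than strictly necessary.

```python
import json
import logging
from typing import Any

log = logging.getLogger("plan_parse_sketch")

VALID_KINDS = {"tool", "coder", "memory", "respond"}  # from the PlanStep contract


def parse_plan_steps_sketch(raw: str) -> list[dict[str, Any]]:
    """Parse a JSON array of plan steps; log and return [] on malformed input."""
    try:
        data = json.loads(raw)
        if not isinstance(data, list):
            raise ValueError("plan must be a JSON array of steps")
        steps: list[dict[str, Any]] = []
        for item in data:
            if not isinstance(item, dict):
                raise TypeError(f"step must be an object, got {type(item).__name__}")
            if "id" not in item or item.get("kind") not in VALID_KINDS:
                raise ValueError(f"step lacks an id or has an invalid kind: {item!r}")
            steps.append({
                "id": str(item["id"]),
                "kind": item["kind"],
                "tool": item.get("tool"),
                "args": dict(item.get("args") or {}),
                "description": item.get("description", ""),
                "requires_confirmation": bool(item.get("requires_confirmation", False)),
                "depends_on": list(item.get("depends_on") or []),
            })
        return steps
    except (json.JSONDecodeError, ValueError, TypeError) as exc:
        log.warning("Failed to parse plan steps: %s", exc)
        return []
```

An empty return is the signal the caller turns into `PLAN_FAILED`.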
### Tool Failure Flow
- **Trigger**: Tool returns `ok=False` or raises exception in sandbox.
- **Detection**: `_execute_tool` checks `tool_result.ok`.
- **Result**: Status `"failed"`, result contains `{"error": "...", "failed_step": step.id, "step_results": [...]}` → `TASK_FAILED` event; further plan steps skipped.
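The stop-on-first-failure rule can be sketched using the result keys quoted above (`error`, `failed_step`, `step_results`). `run_steps` is hypothetical, and the `run_tool` callable stands in for a sandboxed `ToolRegistry` call returning a `ToolResult`-like dict; a tool that raises is treated the same as one returning `ok=False`.

```python
from typing import Any, Callable

ToolFn = Callable[[dict[str, Any]], dict[str, Any]]  # returns {"ok", "output", "error"}


def run_steps(steps: list[dict[str, Any]], run_tool: ToolFn) -> dict[str, Any]:
    step_results: list[dict[str, Any]] = []
    for step in steps:
        try:
            res = run_tool(step)
        except Exception as exc:  # a raising tool counts as a failed tool
            res = {"ok": False, "output": None, "error": str(exc)}
        step_results.append({"step_id": step["id"], **res})
        if not res["ok"]:
            # Stop here: later steps are skipped, mirroring TASK_FAILED.
            return {
                "status": "failed",
                "result": {"error": res["error"], "failed_step": step["id"], "step_results": step_results},
            }
    return {"status": "completed", "result": {"step_results": step_results}}
```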
### Critic Failure Flow
- **Trigger**: Critic adapter raises exception or returns non-JSON output.
- **Detection**: `_evaluate_with_critic` catches exception, logs warning.
- **Result**: Event `CRITIC_RESULT` with error payload → `critic_score = None` → execution continues without critique; memory write skipped.
### Orchestrator Fallback Flow
- **Trigger**: Primary orchestrator model unavailable or returns invalid directive.
- **Detection**: `_ensure_orchestrator` returns `None`; router falls back to `sys_util` orchestrator.
- **Result**: Utility orchestrator handles system-level decisions (e.g., file operations, environment queries).
### Permission Denial Flow
- **Trigger**: `PermissionService` returns `decision: "hard_stop"` or `"deny"`.
- **Detection**: `_execute_tool` checks `permission_result`.
- **Result**: Immediate failure with `"Command blocked: ..."` → `TASK_FAILED`; no tool execution.
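The decision handling shared by the denial and prompt paths can be expressed as a small mapping. The `hard_stop`, `deny`, and `prompt` decisions and the `TASK_AWAITING_PERMISSION` state come from this document; `gate`, its return shape, and `"allow"` as the name of an approving decision are assumptions. Unrecognized decisions fail closed rather than running the tool.

```python
from typing import Any


def gate(decision: str, reason: str = "") -> dict[str, Any]:
    """Map a PermissionService-style decision to the next runtime action."""
    if decision in ("hard_stop", "deny"):
        return {"action": "fail", "status": "failed", "error": f"Command blocked: {reason}"}
    if decision == "prompt":
        return {"action": "wait", "status": "TASK_AWAITING_PERMISSION"}
    if decision == "allow":  # assumed name for an approving decision
        return {"action": "execute"}
    # Unknown decisions fail closed rather than running the tool.
    return {"action": "fail", "status": "failed", "error": f"unknown permission decision: {decision}"}
```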
---
## 🧠 6. Decision Logic Map
### Orchestrator vs Direct Respond
- **Use orchestrator** when: task requires planning, multi-step tool usage, or unknown intent. Orchestrator decides to emit `plan` or `tool` directive.
- **Direct respond** when: the intent parser classifies the task as a simple query (`TASK_RECEIVED` → `router.intent_parser` → `respond` directive) or a `respond` directive is explicitly set.
### Utility Model Call
- Invoked when `sys_util` orchestrator is loaded (configurable). Used for system-level operations: environment inspection, file system queries, or when primary orchestrator fails and fallback is needed.
### Retry Logic
- **Planner retry**: `ExecutionScheduler` has `retry_limit=2`; on a parse or validation failure it retries up to that limit before failing the plan.
- **Tool retry**: Not implemented natively; retry must be encoded in plan steps (`depends_on`, manual replan).
### Plan Creation
- **Trigger**: Orchestrator output contains `{"type": "plan", ...}` or an explicit `plan` directive.
- **Process**: `parse_plan_steps` → `validate_no_cycles` → `build_task_graph` → ready steps execution.
- **No plan**: Orchestrator outputs `respond` or `tool` → direct execution.
---
## 🧰 7. Tool System Architecture
### Plugin Discovery
- `ToolDiscovery` scans `app/tools/plugins/` for modules exporting `Tool` classes.
- Discovers: `shell_exec`, `file_read`, `file_write`, `memory` (search/insert/list).
### Manifest-Based Tools
- Each plugin has a `manifest.json` with:
  - `description`: human-readable docstring.
  - `args_schema`: JSON schema for validation.
  - `requires_permission`: boolean for privileged tools (`shell_exec`, `file_write`).
- On discovery, registry registers tool and stores schema for permission/routing.
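Manifest-driven registration can be sketched around the three manifest fields listed above. The `name` key, the module-level `REGISTRY`, and both helpers are assumptions for illustration; `check_args` only checks the schema's `required` list, a stand-in for full JSON Schema validation (which a library such as `jsonschema` provides).

```python
import json
from typing import Any

REGISTRY: dict[str, dict[str, Any]] = {}


def register_from_manifest(manifest_text: str) -> str:
    """Parse a manifest.json document and record its schema and permission flag."""
    m = json.loads(manifest_text)
    for key in ("name", "description", "args_schema"):
        if key not in m:
            raise ValueError(f"manifest missing {key!r}")
    REGISTRY[m["name"]] = {
        "description": m["description"],
        "args_schema": m["args_schema"],
        "requires_permission": bool(m.get("requires_permission", False)),
    }
    return m["name"]


def check_args(tool: str, args: dict[str, Any]) -> list[str]:
    """Return required argument names that are missing (a partial schema check)."""
    schema = REGISTRY[tool]["args_schema"]
    return [k for k in schema.get("required", []) if k not in args]
```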
### Registry Bootstrap
- `RuntimeController._create_tool_registry()` initializes discovery, loads plugins, registers with init mapping (sandbox, permissions).
- Tools are initialized once at startup; `tool_registry` is shared across executions.
### Execution Isolation
- **ToolSandbox** (`ToolSandbox`):
  - Restricts filesystem access to `allowed_root` (the project base directory).
  - Enforces a per-execution timeout (`step_timeout_ms`).
  - Blocks `sudo` commands unless a secret is injected via the secret override.
- **Permission gating**: `shell_exec` and `file_write` require an explicit permission decision before execution.
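The two sandbox checks that are easiest to show in isolation, keeping paths under `allowed_root` and bounding run time, can be sketched with `pathlib` and `subprocess`. `resolve_inside` and `run_with_timeout` are hypothetical helpers, not `ToolSandbox`'s API; `Path.is_relative_to` needs Python 3.9+, and the `sudo` check is a naive first-token test rather than the secret-injection flow.

```python
import shlex
import subprocess
from pathlib import Path


def resolve_inside(allowed_root: str, user_path: str) -> Path:
    """Resolve user_path against allowed_root; reject anything that escapes it."""
    root = Path(allowed_root).resolve()
    target = (root / user_path).resolve()  # collapses ".." and follows symlinks
    if not target.is_relative_to(root):
        raise PermissionError(f"path escapes allowed_root: {user_path}")
    return target


def run_with_timeout(command: str, step_timeout_ms: int) -> subprocess.CompletedProcess:
    argv = shlex.split(command)
    if argv and argv[0] == "sudo":
        raise PermissionError("sudo requires an injected secret")
    # subprocess.run kills the child and raises TimeoutExpired when the limit is hit.
    return subprocess.run(argv, capture_output=True, text=True, timeout=step_timeout_ms / 1000)
```

Resolving before comparing matters: a naive string-prefix check would accept `../` tricks and symlinks that point outside the root.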