[
{
"model": "Qwen3.6-35B nonMTP GPU baseline",
"quality": 0.971,
"avg_latency_seconds": 17.94,
"avg_tokens_per_second": 4.51,
"cases": [
{
"role": "action",
"case": "direct_answer_no_tools",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 15.307,
"completion_tokens": 45,
"tokens_per_second": 2.94,
"content_preview": "{\n \"kind\": \"action_directive\",\n \"intent\": \"direct_answer\",\n \"risk_level\": \"low\",\n \"actions\": []\n}\n"
},
{
"role": "action",
"case": "read_specific_file",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 19.612,
"completion_tokens": 81,
"tokens_per_second": 4.13,
"content_preview": "{\n \"kind\": \"action_directive\",\n \"intent\": \"read_file\",\n \"risk_level\": \"low\",\n \"actions\": [\n {\n \"tool\": \"file_read\",\n \"args\": {\n \"path\": \"CURRENT_STATE.md\"\n }\n }\n ]\n}\n"
},
{
"role": "memory_policy",
"case": "store_user_preference",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 18.533,
"completion_tokens": 88,
"tokens_per_second": 4.75,
"content_preview": "{\n \"should_store\": true,\n \"memory_type\": \"preference\",\n \"summary\": \"User prefers responses in Russian and requires explicit confirmation before executing sudo commands.\",\n \"importance\": 0.9,\n \"scope\": \"global\",\n \"metadata\": {\n \"language\": \"ru\",\n \"security_policy\": \"sudo_confirmation_required\"\n }\n}"
},
{
"role": "memory_policy",
"case": "ignore_trivial_tool_call",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 15.0,
"completion_tokens": 61,
"tokens_per_second": 4.07,
"content_preview": "{\n \"should_store\": false,\n \"memory_type\": \"note\",\n \"summary\": \"Routine execution of pwd command returning /tmp/project.\",\n \"importance\": 0.1,\n \"scope\": \"conversation\",\n \"metadata\": {}\n}"
},
{
"role": "recall",
"case": "select_relevant_memory",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 15.085,
"completion_tokens": 66,
"tokens_per_second": 4.38,
"content_preview": "{\n \"relevant_ids\": [\n \"m1\"\n ],\n \"reasoning\": \"Memory m1 directly addresses the user's preference regarding the execution of sudo commands, which is the core of the query. Memories m2 and m3 are unrelated to sudo or command execution preferences.\"\n}"
},
{
"role": "summary",
"case": "preserve_decisions",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 9.954,
"completion_tokens": 44,
"tokens_per_second": 4.42,
"content_preview": "1. WebChat работает на 8000, llama-server на 8081.\n2. Для внешних путей нужен approval.\n3. allow_forever хранится по normalized action hash."
},
{
"role": "critic",
"case": "reflection_quality",
"score": 0.8,
"note": "missing=['lesson']",
"elapsed_seconds": 32.088,
"completion_tokens": 220,
"tokens_per_second": 6.86,
"content_preview": "**Critic Reflection: WebChat API Incident**\n\n**1. Risk Assessment**\n\n* **Operational Fragility (High):** The incident revealed a critical dependency on manual intervention. If the API crashes or the server reboots, WebChat will silently fail or return errors until an engineer notices and manually restarts the service. This creates a \"single point of failure\" in the operational process.\n* **Meaning Time to Recovery (MTTR) Variance:** Recovery time is currently dependent on human availability "
}
]
},
{
"model": "Menlo_Lucy-Q4_K_M CPU",
"quality": 0.771,
"avg_latency_seconds": 4.406,
"avg_tokens_per_second": 16.21,
"cases": [
{
"role": "action",
"case": "direct_answer_no_tools",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 3.228,
"completion_tokens": 31,
"tokens_per_second": 9.6,
"content_preview": "{\"kind\": \"action_directive\", \"intent\": \"answer_question\", \"risk_level\": \"low\", \"actions\": []}"
},
{
"role": "action",
"case": "read_specific_file",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 3.031,
"completion_tokens": 48,
"tokens_per_second": 15.84,
"content_preview": "{\"kind\": \"action_directive\", \"intent\": \"file_read\", \"risk_level\": \"low\", \"actions\": [{\"tool\": \"file_read\", \"args\": {\"path\": \"CURRENT_STATE.md\"}}]}\n\n"
},
{
"role": "memory_policy",
"case": "store_user_preference",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 3.62,
"completion_tokens": 54,
"tokens_per_second": 14.92,
"content_preview": "{\n \"should_store\": true,\n \"memory_type\": \"preference\",\n \"summary\": \"User prefers Russian and requires sudo confirmation.\",\n \"importance\": 0.7,\n \"scope\": \"global\",\n \"metadata\": {}\n}"
},
{
"role": "memory_policy",
"case": "ignore_trivial_tool_call",
"score": 0.3,
"note": "stored_trivial={'should_store': True, 'memory_type': 'fact', 'summary': 'Password was successfully launched and user was informed.', 'importance': 0.7, 'scope': 'global', 'metadata': {}}",
"elapsed_seconds": 3.192,
"completion_tokens": 58,
"tokens_per_second": 18.17,
"content_preview": "{\n \"should_store\": true,\n \"memory_type\": \"fact\",\n \"summary\": \"Password was successfully launched and user was informed.\",\n \"importance\": 0.7,\n \"scope\": \"global\",\n \"metadata\": {}\n}\n"
},
{
"role": "recall",
"case": "select_relevant_memory",
"score": 0.3,
"note": "wrong_ids=[]",
"elapsed_seconds": 3.737,
"completion_tokens": 60,
"tokens_per_second": 16.05,
"content_preview": "{\n \"relevant_ids\": [],\n \"reasoning\": \"The query is about how the user wants to run sudo, but none of the provided memories are related to sudo or user preferences for running commands. The memories are about SQLite and weather answers, which are unrelated to the query.\"\n}"
},
{
"role": "summary",
"case": "preserve_decisions",
"score": 0.8,
"note": "missing=['approval']",
"elapsed_seconds": 3.334,
"completion_tokens": 61,
"tokens_per_second": 18.29,
"content_preview": "1. WebChat работает на портах 8000, а LLaMA-Server на 8081. \n2. Для внешних путей требуется подтверждение. \n3. allow_forever хранится по hash-коду действий."
},
{
"role": "critic",
"case": "reflection_quality",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 10.697,
"completion_tokens": 220,
"tokens_per_second": 20.57,
"content_preview": "Okay, the user wants me to reflect on the risk and reusable lessons from fixing the WebChat issue. Let me break this down.\n\nFirst, the root cause was the API not running. They manually started the API but didn't have a unified service script. So the main risk here is that without a unified script, there's a chance the API might not be running consistently or could be misconfigured.\n\nReusability is key here. Maybe we can create a script that starts the API and monitors its status. That way, if th"
}
]
},
{
"model": "Qwen3.5-9B-GLM5.1-Distill-v1-Q4_K_M CPU",
"quality": 0.4,
"avg_latency_seconds": 61.939,
"avg_tokens_per_second": 2.56,
"cases": [
{
"role": "action",
"case": "direct_answer_no_tools",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 68.075,
"completion_tokens": 72,
"tokens_per_second": 1.06,
"content_preview": "{\"kind\": \"action_directive\", \"intent\": \"answer\", \"risk_level\": \"high\", \"actions\": []}"
},
{
"role": "action",
"case": "read_specific_file",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 72.149,
"completion_tokens": 86,
"tokens_per_second": 1.19,
"content_preview": "{\"kind\": \"action_directive\", \"intent\": \"read file\", \"risk_level\": \"low\", \"actions\": [{\"tool\": \"file_read\", \"args\": {\"path\": \"CURRENT_STATE.md\"}}]}\n"
},
{
"role": "memory_policy",
"case": "store_user_preference",
"score": 0.0,
"note": "invalid_json: Expecting value: line 1 column 1 (char 0)",
"elapsed_seconds": 67.764,
"completion_tokens": 180,
"tokens_per_second": 2.66,
"content_preview": ""
},
{
"role": "memory_policy",
"case": "ignore_trivial_tool_call",
"score": 0.0,
"note": "invalid_json: Expecting value: line 1 column 1 (char 0)",
"elapsed_seconds": 64.653,
"completion_tokens": 160,
"tokens_per_second": 2.47,
"content_preview": ""
},
{
"role": "recall",
"case": "select_relevant_memory",
"score": 0.0,
"note": "invalid_json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)",
"elapsed_seconds": 59.451,
"completion_tokens": 160,
"tokens_per_second": 2.69,
"content_preview": "{\n"
},
{
"role": "summary",
"case": "preserve_decisions",
"score": 0.2,
"note": "missing=['8000', '8081', 'approval', 'allow_forever']",
"elapsed_seconds": 47.05,
"completion_tokens": 180,
"tokens_per_second": 3.83,
"content_preview": ""
},
{
"role": "critic",
"case": "reflection_quality",
"score": 0.6,
"note": "missing=['risk', 'lesson']",
"elapsed_seconds": 54.433,
"completion_tokens": 220,
"tokens_per_second": 4.04,
"content_preview": ""
}
]
},
{
"model": "X-Coder-SFT-Qwen3-8B.Q6_K CPU",
"quality": 0.757,
"avg_latency_seconds": 60.117,
"avg_tokens_per_second": 2.51,
"cases": [
{
"role": "action",
"case": "direct_answer_no_tools",
"score": 0.0,
"note": "invalid_json: Expecting ',' delimiter: line 13 column 6 (char 632)",
"elapsed_seconds": 121.049,
"completion_tokens": 180,
"tokens_per_second": 1.49,
"content_preview": "{\n \"kind\": \"action_directive\",\n \"intent\": \"answer\",\n \"risk_level\": \"low\",\n \"actions\": [\n {\n \"tool\": \"coder\",\n \"args\": {\n \"task_description\": \"Объясните коротко, что такое DuckLM\",\n \"language\": \"python\",\n \"context\": \"Пользователь запросил краткое описание DuckLM. DuckLM - это проект, который развивается сообществом, направленный на создание открытой и этичной языковой модели. Мы можем сказать, что это модель на основе Llama, созданная Meta, но адаптированная "
},
{
"role": "action",
"case": "read_specific_file",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 37.562,
"completion_tokens": 134,
"tokens_per_second": 3.57,
"content_preview": "{\n \"kind\": \"action_directive\",\n \"intent\": \"read_current_state\",\n \"risk_level\": \"low\",\n \"actions\": [\n {\n \"tool\": \"file_read\",\n \"args\": {\n \"path\": \"CURRENT_STATE.md\"\n },\n \"reason\": \"We need to read the content of CURRENT_STATE.md to provide a summary of the project status.\"\n }\n ],\n \"expected_observations\": [\n \"file_content\"\n ],\n \"stop_reason\": \"After reading the file, we can directly generate the response without further tools.\"\n}\n"
},
{
"role": "memory_policy",
"case": "store_user_preference",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 66.979,
"completion_tokens": 80,
"tokens_per_second": 1.19,
"content_preview": "{\n \"should_store\": true,\n \"memory_type\": \"preference\",\n \"summary\": \"User prefers responses in Russian and explicit sudo confirmation.\",\n \"importance\": 0.9,\n \"scope\": \"global\",\n \"metadata\": {\n \"language\": \"Russian\",\n \"sudo_policy\": \"explicit_confirmation_required\"\n }\n}\n"
},
{
"role": "memory_policy",
"case": "ignore_trivial_tool_call",
"score": 0.3,
"note": "stored_trivial={'should_store': True, 'memory_type': 'event', 'summary': 'User executed pwd command and received /tmp/project as output.', 'importance': 0.8, 'scope': 'conversation', 'metadata': {}}",
"elapsed_seconds": 21.772,
"completion_tokens": 62,
"tokens_per_second": 2.85,
"content_preview": "{\n \"should_store\": true,\n \"memory_type\": \"event\",\n \"summary\": \"User executed pwd command and received /tmp/project as output.\",\n \"importance\": 0.8,\n \"scope\": \"conversation\",\n \"metadata\": {}\n}\n"
},
{
"role": "recall",
"case": "select_relevant_memory",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 58.656,
"completion_tokens": 90,
"tokens_per_second": 1.53,
"content_preview": "{\n \"relevant_ids\": [\"m1\"],\n \"reasoning\": \"The query is about how the user wants to run 'sudo'. The only memory that directly relates to this is memory m1, which states that the user does not want to run sudo without explicit confirmation. The other memories (m2 and m3) are about unrelated topics (database and weather). Therefore, only m1 is relevant.\"\n}\n"
},
{
"role": "summary",
"case": "preserve_decisions",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 53.245,
"completion_tokens": 180,
"tokens_per_second": 3.38,
"content_preview": "We are going to compress the task context into 3 key points while preserving the decisions and outcomes.\n\nKey Points:\n1. **Service Endpoints**: \n - WebChat is running on port 8000.\n - llama-server is running on port 8081.\n\n2. **External Paths Approval**:\n - Any external path that requires access must be approved. This implies that we need to have a mechanism (like a firewall or access control list) to check and approve external requests.\n\n3. **Forever Allowance Storage**:\n - The `allow_f"
},
{
"role": "critic",
"case": "reflection_quality",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 61.555,
"completion_tokens": 220,
"tokens_per_second": 3.57,
"content_preview": "We are reflecting on the task: fixing the WebChat which had an API that wasn't running.\n The root cause was that the API was not running. We started it manually but there was no unified service script.\n\n Reflection:\n - Risk: The manual start might be error-prone and not scalable. If multiple components, hard to manage.\n - Waste: We ended up having to start manually, which is not efficient and might lead to repeated issues.\n - Reusable Lesson: We should create a unified service script that starts"
}
]
},
{
"model": "gemma-4-E4B-it-Q4_K_M CPU",
"quality": 0.971,
"avg_latency_seconds": 21.235,
"avg_tokens_per_second": 5.36,
"cases": [
{
"role": "action",
"case": "direct_answer_no_tools",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 35.717,
"completion_tokens": 53,
"tokens_per_second": 1.48,
"content_preview": "{\n \"kind\": \"action_directive\",\n \"intent\": \"answer_directly\"\n ,\n \"risk_level\": \"low\"\n ,\n \"actions\": []\n}\n"
},
{
"role": "action",
"case": "read_specific_file",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 13.325,
"completion_tokens": 88,
"tokens_per_second": 6.6,
"content_preview": "{\n \"kind\": \"action_directive\",\n \"intent\": \"tool_use\",\n \"risk_level\": \"low\",\n \"actions\": [\n {\n \"tool\": \"file_read\",\n \"args\": {\n \"path\": \"CURRENT_STATE.md\"\n }\n }\n ]\n}\n"
},
{
"role": "memory_policy",
"case": "store_user_preference",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 27.13,
"completion_tokens": 98,
"tokens_per_second": 3.61,
"content_preview": "{\n \"should_store\": true,\n \"memory_type\": \"preference\",\n \"summary\": \"Пользователь требует, чтобы ответы всегда были на русском языке и чтобы не запускать sudo без явного подтверждения.\",\n \"importance\": 0.95,\n \"scope\": \"global\",\n \"metadata\": {\n \"task_id\": \"task_pref\"\n }\n}\n"
},
{
"role": "memory_policy",
"case": "ignore_trivial_tool_call",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 10.229,
"completion_tokens": 90,
"tokens_per_second": 8.8,
"content_preview": "{\n \"should_store\": false,\n \"memory_type\": \"note\",\n \"summary\": \"The 'pwd' command was executed, and the output was sent to the user.\",\n \"importance\": 0.1,\n \"scope\": \"conversation\",\n \"metadata\": {\n \"task_id\": \"task_tmp\"\n }\n}\n"
},
{
"role": "recall",
"case": "select_relevant_memory",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 19.391,
"completion_tokens": 62,
"tokens_per_second": 3.2,
"content_preview": "{\n \"relevant_ids\": [\n \"m1\"\n ],\n \"reasoning\": \"Воспоминание m1 напрямую отвечает на вопрос о том, как пользователь хочет, чтобы запускался sudo (с требованием отдельного подтверждения).\"\n}\n"
},
{
"role": "summary",
"case": "preserve_decisions",
"score": 1.0,
"note": "ok",
"elapsed_seconds": 14.371,
"completion_tokens": 88,
"tokens_per_second": 6.12,
"content_preview": "Вот сжатый контекст в 3 пунктах:\n\n1. **Порты:** WebChat использует порт 8000, llama-server — 8081.\n2. **Безопасность:** Для внешних путей требуется предварительное одобрение (approval).\n3. **Хранение разрешений:** `allow_forever` сохраняется на основе хеша нормализованного действия."
},
{
"role": "critic",
"case": "reflection_quality",
"score": 0.8,
"note": "missing=['lesson']",
"elapsed_seconds": 28.484,
"completion_tokens": 220,
"tokens_per_second": 7.72,
"content_preview": "## DuckLM: Critic Reflection\n\n**Task:** Fix WebChat.\n**Observed Root Cause:** API was not running.\n**Action Taken:** Started API manually.\n**Observed Deficiency:** Lack of a unified service script.\n\n---\n\n### 🔍 Reflection Analysis\n\n#### 1. Risk Assessment (What went wrong/could go wrong?)\n\n* **Operational Risk (High):** The immediate risk was service unavailability (WebChat down). Manually starting the API is a brittle, high-touch workaround. If the system restarts, or if the API needs to be re"
}
]
}
]