# Utility Role Model Benchmark

Scope: service roles only (`action`, `memory_policy`, `recall`, `summary`, `critic`).
The main user-facing thinker is not evaluated for replacement here.

| Model | Quality | Avg latency, s | Avg tok/s | Notes |
| --- | ---: | ---: | ---: | --- |
| Qwen3.6-35B nonMTP GPU baseline | 0.97 | 17.94 | 4.51 | critic/reflection_quality: missing=['lesson'] |
| Menlo_Lucy-Q4_K_M CPU | 0.77 | 4.41 | 16.21 | memory_policy/ignore_trivial_tool_call: stored_trivial={'should_store': True, 'memory_type': 'fact', 'summary': 'Password was successfully launched and user was informed.', 'importance': 0.7, 'scope': 'global', 'metadata': {}}; recall/select_relevant_memory: wrong_ids=[]; summary/preserve_decisions: missing=['approval'] |
| Qwen3.5-9B-GLM5.1-Distill-v1-Q4_K_M CPU | 0.40 | 61.94 | 2.56 | memory_policy/store_user_preference: invalid_json: Expecting value: line 1 column 1 (char 0); memory_policy/ignore_trivial_tool_call: invalid_json: Expecting value: line 1 column 1 (char 0); recall/select_relevant_memory: invalid_json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1) |
| X-Coder-SFT-Qwen3-8B.Q6_K CPU | 0.76 | 60.12 | 2.51 | action/direct_answer_no_tools: invalid_json: Expecting ',' delimiter: line 13 column 6 (char 632); memory_policy/ignore_trivial_tool_call: stored_trivial={'should_store': True, 'memory_type': 'event', 'summary': 'User executed pwd command and received /tmp/project as output.', 'importance': 0.8, 'scope': 'conversation', 'metadata': {}} |
| gemma-4-E4B-it-Q4_K_M CPU | 0.97 | 21.23 | 5.36 | critic/reflection_quality: missing=['lesson'] |

## Case Details
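
The summary figures at the top of this report appear to be unweighted means over the seven per-case rows of each model. Note that under this reading "Avg tok/s" is the mean of the per-case rates, not total tokens divided by total time. The following is a minimal sketch that checks the reading for one model, using rows copied from the Menlo_Lucy-Q4_K_M CPU table in this section. The names `summarize` and `menlo_rows` are illustrative and are not taken from the benchmark harness.

```python
from statistics import mean

# Per-case rows for "Menlo_Lucy-Q4_K_M CPU", copied from its table:
# (score, latency in seconds, tokens per second)
menlo_rows = [
    (1.00, 3.23, 9.60),    # action / direct_answer_no_tools
    (1.00, 3.03, 15.84),   # action / read_specific_file
    (1.00, 3.62, 14.92),   # memory_policy / store_user_preference
    (0.30, 3.19, 18.17),   # memory_policy / ignore_trivial_tool_call
    (0.30, 3.74, 16.05),   # recall / select_relevant_memory
    (0.80, 3.33, 18.29),   # summary / preserve_decisions
    (1.00, 10.70, 20.57),  # critic / reflection_quality
]


def summarize(rows):
    """Return (quality, avg latency, avg tok/s) as plain means, rounded to 2 places."""
    scores, latencies, rates = zip(*rows)
    return round(mean(scores), 2), round(mean(latencies), 2), round(mean(rates), 2)


# Matches the summary row for this model: 0.77, 4.41, 16.21
print(summarize(menlo_rows))
```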
### Qwen3.6-35B nonMTP GPU baseline
| Role | Case | Score | Latency, s | tok/s | Note |
| --- | --- | ---: | ---: | ---: | --- |
| action | direct_answer_no_tools | 1.00 | 15.31 | 2.94 | ok |
| action | read_specific_file | 1.00 | 19.61 | 4.13 | ok |
| memory_policy | store_user_preference | 1.00 | 18.53 | 4.75 | ok |
| memory_policy | ignore_trivial_tool_call | 1.00 | 15.00 | 4.07 | ok |
| recall | select_relevant_memory | 1.00 | 15.09 | 4.38 | ok |
| summary | preserve_decisions | 1.00 | 9.95 | 4.42 | ok |
| critic | reflection_quality | 0.80 | 32.09 | 6.86 | missing=['lesson'] |

### Menlo_Lucy-Q4_K_M CPU
| Role | Case | Score | Latency, s | tok/s | Note |
| --- | --- | ---: | ---: | ---: | --- |
| action | direct_answer_no_tools | 1.00 | 3.23 | 9.60 | ok |
| action | read_specific_file | 1.00 | 3.03 | 15.84 | ok |
| memory_policy | store_user_preference | 1.00 | 3.62 | 14.92 | ok |
| memory_policy | ignore_trivial_tool_call | 0.30 | 3.19 | 18.17 | stored_trivial={'should_store': True, 'memory_type': 'fact', 'summary': 'Password was successfully launched and user was informed.', 'importance': 0.7, 'scope': 'global', 'metadata': {}} |
| recall | select_relevant_memory | 0.30 | 3.74 | 16.05 | wrong_ids=[] |
| summary | preserve_decisions | 0.80 | 3.33 | 18.29 | missing=['approval'] |
| critic | reflection_quality | 1.00 | 10.70 | 20.57 | ok |

### Qwen3.5-9B-GLM5.1-Distill-v1-Q4_K_M CPU
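
The `invalid_json` notes recorded for this model have the same format as the messages raised by Python's standard `json` decoder, which suggests (as an assumption, since the harness is not shown) that the raw model output was passed to `json.loads`. The sketch below reproduces the two messages from inputs chosen for illustration: an empty string and an object with single-quoted keys. The model's actual raw outputs are not recorded in this report, and `decode_error` is a hypothetical helper name.

```python
import json


def decode_error(raw: str) -> str:
    """Return the json decoder's error message for `raw`, or "ok" if it parses."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        return str(exc)
    return "ok"


# Illustrative inputs only; not the model's real outputs.
# An empty reply (or one starting with non-JSON text) fails at the first character.
print(decode_error(""))
# A Python-style dict with single-quoted keys fails right after the opening brace.
print(decode_error("{'should_store': true}"))
# A well-formed object parses.
print(decode_error('{"should_store": true}'))
```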
| Role | Case | Score | Latency, s | tok/s | Note |
| --- | --- | ---: | ---: | ---: | --- |
| action | direct_answer_no_tools | 1.00 | 68.08 | 1.06 | ok |
| action | read_specific_file | 1.00 | 72.15 | 1.19 | ok |
| memory_policy | store_user_preference | 0.00 | 67.76 | 2.66 | invalid_json: Expecting value: line 1 column 1 (char 0) |
| memory_policy | ignore_trivial_tool_call | 0.00 | 64.65 | 2.47 | invalid_json: Expecting value: line 1 column 1 (char 0) |
| recall | select_relevant_memory | 0.00 | 59.45 | 2.69 | invalid_json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1) |
| summary | preserve_decisions | 0.20 | 47.05 | 3.83 | missing=['8000', '8081', 'approval', 'allow_forever'] |
| critic | reflection_quality | 0.60 | 54.43 | 4.04 | missing=['risk', 'lesson'] |

### X-Coder-SFT-Qwen3-8B.Q6_K CPU
| Role | Case | Score | Latency, s | tok/s | Note |
| --- | --- | ---: | ---: | ---: | --- |
| action | direct_answer_no_tools | 0.00 | 121.05 | 1.49 | invalid_json: Expecting ',' delimiter: line 13 column 6 (char 632) |
| action | read_specific_file | 1.00 | 37.56 | 3.57 | ok |
| memory_policy | store_user_preference | 1.00 | 66.98 | 1.19 | ok |
| memory_policy | ignore_trivial_tool_call | 0.30 | 21.77 | 2.85 | stored_trivial={'should_store': True, 'memory_type': 'event', 'summary': 'User executed pwd command and received /tmp/project as output.', 'importance': 0.8, 'scope': 'conversation', 'metadata': {}} |
| recall | select_relevant_memory | 1.00 | 58.66 | 1.53 | ok |
| summary | preserve_decisions | 1.00 | 53.24 | 3.38 | ok |
| critic | reflection_quality | 1.00 | 61.55 | 3.57 | ok |

### gemma-4-E4B-it-Q4_K_M CPU
| Role | Case | Score | Latency, s | tok/s | Note |
| --- | --- | ---: | ---: | ---: | --- |
| action | direct_answer_no_tools | 1.00 | 35.72 | 1.48 | ok |
| action | read_specific_file | 1.00 | 13.32 | 6.60 | ok |
| memory_policy | store_user_preference | 1.00 | 27.13 | 3.61 | ok |
| memory_policy | ignore_trivial_tool_call | 1.00 | 10.23 | 8.80 | ok |
| recall | select_relevant_memory | 1.00 | 19.39 | 3.20 | ok |
| summary | preserve_decisions | 1.00 | 14.37 | 6.12 | ok |
| critic | reflection_quality | 0.80 | 28.48 | 7.72 | missing=['lesson'] |