Utility Role Model Benchmark

Scope: service roles only (action, memory_policy, recall, summary, critic). The main user-facing thinker is not evaluated for replacement here.

| Model | Quality | Avg latency, s | Avg tok/s | Notes |
|---|---|---|---|---|
| Qwen3.6-35B nonMTP GPU baseline | 0.97 | 17.94 | 4.51 | critic/reflection_quality: missing=['lesson'] |
| Menlo_Lucy-Q4_K_M CPU | 0.77 | 4.41 | 16.21 | memory_policy/ignore_trivial_tool_call: stored_trivial={'should_store': True, 'memory_type': 'fact', 'summary': 'Password was successfully launched and user was informed.', 'importance': 0.7, 'scope': 'global', 'metadata': {}}; recall/select_relevant_memory: wrong_ids=[]; summary/preserve_decisions: missing=['approval'] |
| Qwen3.5-9B-GLM5.1-Distill-v1-Q4_K_M CPU | 0.40 | 61.94 | 2.56 | memory_policy/store_user_preference: invalid_json: Expecting value: line 1 column 1 (char 0); memory_policy/ignore_trivial_tool_call: invalid_json: Expecting value: line 1 column 1 (char 0); recall/select_relevant_memory: invalid_json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1) |
| X-Coder-SFT-Qwen3-8B.Q6_K CPU | 0.76 | 60.12 | 2.51 | action/direct_answer_no_tools: invalid_json: Expecting ',' delimiter: line 13 column 6 (char 632); memory_policy/ignore_trivial_tool_call: stored_trivial={'should_store': True, 'memory_type': 'event', 'summary': 'User executed pwd command and received /tmp/project as output.', 'importance': 0.8, 'scope': 'conversation', 'metadata': {}} |
| gemma-4-E4B-it-Q4_K_M CPU | 0.97 | 21.23 | 5.36 | critic/reflection_quality: missing=['lesson'] |
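
The summary figures are consistent with a plain unweighted mean over the seven cases listed for each model under Case Details. The sketch below recomputes the Menlo_Lucy-Q4_K_M CPU row from its per-case values. The `summarize` helper and the tuple layout are illustrative assumptions, not the benchmark harness's actual code.

```python
from statistics import mean

# Per-case rows for Menlo_Lucy-Q4_K_M CPU, copied from Case Details:
# (role, case, score, latency_s, tok_per_s)
rows = [
    ("action", "direct_answer_no_tools", 1.00, 3.23, 9.60),
    ("action", "read_specific_file", 1.00, 3.03, 15.84),
    ("memory_policy", "store_user_preference", 1.00, 3.62, 14.92),
    ("memory_policy", "ignore_trivial_tool_call", 0.30, 3.19, 18.17),
    ("recall", "select_relevant_memory", 0.30, 3.74, 16.05),
    ("summary", "preserve_decisions", 0.80, 3.33, 18.29),
    ("critic", "reflection_quality", 1.00, 10.70, 20.57),
]


def summarize(rows):
    """Hypothetical helper: unweighted mean of each metric across cases."""
    return {
        "quality": round(mean(r[2] for r in rows), 2),
        "avg_latency_s": round(mean(r[3] for r in rows), 2),
        "avg_tok_s": round(mean(r[4] for r in rows), 2),
    }


summary = summarize(rows)
print(summary)
# {'quality': 0.77, 'avg_latency_s': 4.41, 'avg_tok_s': 16.21}
```

Because every case carries equal weight, a single slow case (for example the 10.70 s critic run here) shifts the average latency noticeably, and Avg tok/s is a mean of per-case rates rather than total tokens divided by total time.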

Case Details

Qwen3.6-35B nonMTP GPU baseline

| Role | Case | Score | Latency, s | tok/s | Note |
|---|---|---|---|---|---|
| action | direct_answer_no_tools | 1.00 | 15.31 | 2.94 | ok |
| action | read_specific_file | 1.00 | 19.61 | 4.13 | ok |
| memory_policy | store_user_preference | 1.00 | 18.53 | 4.75 | ok |
| memory_policy | ignore_trivial_tool_call | 1.00 | 15.00 | 4.07 | ok |
| recall | select_relevant_memory | 1.00 | 15.09 | 4.38 | ok |
| summary | preserve_decisions | 1.00 | 9.95 | 4.42 | ok |
| critic | reflection_quality | 0.80 | 32.09 | 6.86 | missing=['lesson'] |

Menlo_Lucy-Q4_K_M CPU

| Role | Case | Score | Latency, s | tok/s | Note |
|---|---|---|---|---|---|
| action | direct_answer_no_tools | 1.00 | 3.23 | 9.60 | ok |
| action | read_specific_file | 1.00 | 3.03 | 15.84 | ok |
| memory_policy | store_user_preference | 1.00 | 3.62 | 14.92 | ok |
| memory_policy | ignore_trivial_tool_call | 0.30 | 3.19 | 18.17 | stored_trivial={'should_store': True, 'memory_type': 'fact', 'summary': 'Password was successfully launched and user was informed.', 'importance': 0.7, 'scope': 'global', 'metadata': {}} |
| recall | select_relevant_memory | 0.30 | 3.74 | 16.05 | wrong_ids=[] |
| summary | preserve_decisions | 0.80 | 3.33 | 18.29 | missing=['approval'] |
| critic | reflection_quality | 1.00 | 10.70 | 20.57 | ok |

Qwen3.5-9B-GLM5.1-Distill-v1-Q4_K_M CPU

| Role | Case | Score | Latency, s | tok/s | Note |
|---|---|---|---|---|---|
| action | direct_answer_no_tools | 1.00 | 68.08 | 1.06 | ok |
| action | read_specific_file | 1.00 | 72.15 | 1.19 | ok |
| memory_policy | store_user_preference | 0.00 | 67.76 | 2.66 | invalid_json: Expecting value: line 1 column 1 (char 0) |
| memory_policy | ignore_trivial_tool_call | 0.00 | 64.65 | 2.47 | invalid_json: Expecting value: line 1 column 1 (char 0) |
| recall | select_relevant_memory | 0.00 | 59.45 | 2.69 | invalid_json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1) |
| summary | preserve_decisions | 0.20 | 47.05 | 3.83 | missing=['8000', '8081', 'approval', 'allow_forever'] |
| critic | reflection_quality | 0.60 | 54.43 | 4.04 | missing=['risk', 'lesson'] |
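
The `invalid_json` notes match the message format of Python's standard `json` module, so the failure modes can be reproduced locally. The sketch below assumes the harness parses raw model output with `json.loads`; the `parse_or_note` helper and the sample inputs are hypothetical and are not the model's actual responses.

```python
import json


def parse_or_note(raw: str) -> str:
    """Hypothetical helper: return 'ok' or an invalid_json note for raw output."""
    try:
        json.loads(raw)
        return "ok"
    except json.JSONDecodeError as exc:
        return f"invalid_json: {exc}"


# Empty output, or output that opens with prose or a code fence instead of
# a JSON value, fails at the very first character.
print(parse_or_note(""))
# invalid_json: Expecting value: line 1 column 1 (char 0)
print(parse_or_note("```json\n{}\n```"))
# invalid_json: Expecting value: line 1 column 1 (char 0)

# A Python-style dict with single-quoted keys fails right after the brace.
print(parse_or_note("{'ids': [1]}"))
# invalid_json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
```

An error at `char 0` therefore suggests the response did not start with JSON at all, while an error at `char 1` suggests an object was opened but its first key was not a double-quoted string.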

X-Coder-SFT-Qwen3-8B.Q6_K CPU

| Role | Case | Score | Latency, s | tok/s | Note |
|---|---|---|---|---|---|
| action | direct_answer_no_tools | 0.00 | 121.05 | 1.49 | invalid_json: Expecting ',' delimiter: line 13 column 6 (char 632) |
| action | read_specific_file | 1.00 | 37.56 | 3.57 | ok |
| memory_policy | store_user_preference | 1.00 | 66.98 | 1.19 | ok |
| memory_policy | ignore_trivial_tool_call | 0.30 | 21.77 | 2.85 | stored_trivial={'should_store': True, 'memory_type': 'event', 'summary': 'User executed pwd command and received /tmp/project as output.', 'importance': 0.8, 'scope': 'conversation', 'metadata': {}} |
| recall | select_relevant_memory | 1.00 | 58.66 | 1.53 | ok |
| summary | preserve_decisions | 1.00 | 53.24 | 3.38 | ok |
| critic | reflection_quality | 1.00 | 61.55 | 3.57 | ok |

gemma-4-E4B-it-Q4_K_M CPU

| Role | Case | Score | Latency, s | tok/s | Note |
|---|---|---|---|---|---|
| action | direct_answer_no_tools | 1.00 | 35.72 | 1.48 | ok |
| action | read_specific_file | 1.00 | 13.32 | 6.60 | ok |
| memory_policy | store_user_preference | 1.00 | 27.13 | 3.61 | ok |
| memory_policy | ignore_trivial_tool_call | 1.00 | 10.23 | 8.80 | ok |
| recall | select_relevant_memory | 1.00 | 19.39 | 3.20 | ok |
| summary | preserve_decisions | 1.00 | 14.37 | 6.12 | ok |
| critic | reflection_quality | 0.80 | 28.48 | 7.72 | missing=['lesson'] |
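
The `missing=[...]` notes on summary and critic cases line up with the scores in 0.20 steps: one missing keyword scores 0.80, two score 0.60, four score 0.20. That pattern is consistent with a keyword-coverage score over five required keywords per case, although the report does not state the scoring rule. The sketch below is an assumption along those lines. `keyword_coverage` is a hypothetical helper, the fifth keyword (`"port"`) is a placeholder, and the sample text is invented for illustration.

```python
def keyword_coverage(text: str, required: list[str]) -> tuple[float, list[str]]:
    """Hypothetical scorer: fraction of required keywords found in the text."""
    lowered = text.lower()
    missing = [kw for kw in required if kw.lower() not in lowered]
    score = round(1 - len(missing) / len(required), 2)
    return score, missing


# Four keywords come from the missing= notes in this report;
# "port" is a placeholder for the unknown fifth keyword.
required = ["8000", "8081", "approval", "allow_forever", "port"]

sample = "Moved the service from port 8000 to port 8081; allow_forever was granted."
score, missing = keyword_coverage(sample, required)
print(score, missing)
# 0.8 ['approval']
```

A substring check of this kind rewards mentioning a term, not using it correctly, so a score of 1.00 on these cases should be read as "all expected terms present" rather than as a judgment of summary quality.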