Local Llama Server

DuckLM expects an OpenAI-compatible llama-server at http://127.0.0.1:8081/v1 by default.
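To confirm that something is listening at that address, one option is to request the model list, which OpenAI-compatible servers conventionally expose at /models under the base URL. The sketch below is a minimal check and not part of DuckLM: the variable DUCK_LLAMA_BASE_URL and both function names are hypothetical, chosen only for illustration.

```shell
# Hypothetical helpers; DUCK_LLAMA_BASE_URL is an illustrative name,
# not a documented DuckLM setting.

# Print the URL of the OpenAI-style model listing endpoint.
models_url() {
  base="${DUCK_LLAMA_BASE_URL:-http://127.0.0.1:8081/v1}"
  # Strip one trailing slash so the result never contains "//models".
  printf '%s/models\n' "${base%/}"
}

# Return 0 if the server answers the model listing request within 5 seconds.
server_is_up() {
  curl --silent --fail --max-time 5 "$(models_url)" >/dev/null
}

# Example:
#   server_is_up && echo "llama-server reachable"
```

A failing check only says that the request did not succeed; it does not distinguish a stopped server from one that is still loading the model.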

On the current Radeon RX 580 system, llama.cpp is built locally with the Vulkan backend:

bash scripts/llama/build_vulkan.sh

The main model is Qwen3.6 35B A3B (non-MTP variant, Q4_K_M quantization):

models/Qwen3.6/nonMTP/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

Start it in the background with:

bash scripts/llama/start_main.sh start

Manage the process:

bash scripts/llama/start_main.sh status
bash scripts/llama/start_main.sh logs
bash scripts/llama/start_main.sh logs --follow
bash scripts/llama/start_main.sh restart
bash scripts/llama/start_main.sh stop
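A large model can take a while to load, so a caller may want to wait until the server is ready before sending requests. The sketch below polls a URL until it succeeds or an attempt budget runs out. It assumes the start command returns before loading has finished (the script may already wait), and it relies on llama-server's /health endpoint at the server root. The function name wait_for_server and its defaults are hypothetical.

```shell
# Poll a URL until it returns a success status or the attempts run out.
# Arguments: URL, max attempts (default 60), delay in seconds (default 2).
wait_for_server() {
  url="$1"
  attempts="${2:-60}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP errors such as 503.
    if curl --silent --fail --max-time 2 "$url" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example:
#   bash scripts/llama/start_main.sh start
#   wait_for_server "http://127.0.0.1:8081/health" || echo "server not ready" >&2
```

With the defaults the helper gives up after roughly two minutes of sleeping plus request time; a slower disk or a larger model may need a bigger budget.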

The local .env sets the following values:

DUCK_LLAMA_SERVER_BIN=./vendor/llama.cpp/build/bin/llama-server
DUCK_CTX_SIZE=4096
DUCK_N_GPU_LAYERS=20
DUCK_PARALLEL=1
DUCK_LLAMA_DEVICE=Vulkan0
DUCK_LLAMA_EXTRA_ARGS="--reasoning off --cache-ram 0"
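To make the settings concrete, the sketch below prints the llama-server command line they would plausibly produce. The mapping is an assumption: start_main.sh may assemble its command differently. DUCK_MODEL_PATH, DUCK_HOST, and DUCK_PORT are hypothetical names introduced here, with values taken from the model path and default address given earlier. The flags themselves (--model, --host, --port, --ctx-size, --n-gpu-layers, --parallel, --device) are standard llama-server options.

```shell
# Defaults mirror the .env values; each can be overridden from the environment.
: "${DUCK_LLAMA_SERVER_BIN:=./vendor/llama.cpp/build/bin/llama-server}"
: "${DUCK_CTX_SIZE:=4096}"
: "${DUCK_N_GPU_LAYERS:=20}"
: "${DUCK_PARALLEL:=1}"
: "${DUCK_LLAMA_DEVICE:=Vulkan0}"
: "${DUCK_LLAMA_EXTRA_ARGS:=--reasoning off --cache-ram 0}"

# Hypothetical variables, not part of the documented .env.
: "${DUCK_MODEL_PATH:=models/Qwen3.6/nonMTP/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf}"
: "${DUCK_HOST:=127.0.0.1}"
: "${DUCK_PORT:=8081}"

# Print the assumed command line on a single line, without running it.
print_llama_command() {
  printf '%s --model %s --host %s --port %s' \
    "$DUCK_LLAMA_SERVER_BIN" "$DUCK_MODEL_PATH" "$DUCK_HOST" "$DUCK_PORT"
  printf ' --ctx-size %s --n-gpu-layers %s --parallel %s --device %s' \
    "$DUCK_CTX_SIZE" "$DUCK_N_GPU_LAYERS" "$DUCK_PARALLEL" "$DUCK_LLAMA_DEVICE"
  # Extra arguments are appended verbatim at the end.
  printf ' %s\n' "$DUCK_LLAMA_EXTRA_ARGS"
}

print_llama_command
```

Printing rather than executing keeps the sketch safe to run on a machine without the model. Read this way, DUCK_N_GPU_LAYERS=20 offloads only part of the model to the GPU and leaves the remaining layers on the CPU.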

MTP is available only through the experimental script scripts/llama/start_thinker_mtp_experimental.sh, and the action JSON endpoint does not use it by default.