# Local Llama Server

DuckLM expects an OpenAI-compatible `llama-server` at `http://127.0.0.1:8081/v1` by default.

On the current Radeon RX580 system, `llama.cpp` is built locally with Vulkan:

```bash
bash scripts/llama/build_vulkan.sh
```

The main model is Qwen3.6 35B A3B nonMTP:

```text
models/Qwen3.6/nonMTP/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
```

Start it in the background with:

```bash
bash scripts/llama/start_main.sh start
```
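
Because the server is started in the background, the command may return before the model has finished loading. A minimal sketch of a readiness check, assuming the stock `llama-server` `/health` and `/v1/models` routes and the default address `http://127.0.0.1:8081`; the helper names `wait_for_llama` and `list_models` are hypothetical and not part of the DuckLM scripts:

```bash
#!/usr/bin/env bash
# Hypothetical helpers; not part of scripts/llama/start_main.sh.
# Host and port are taken from the default endpoint http://127.0.0.1:8081/v1.
server_root="http://127.0.0.1:8081"

# Poll llama-server's /health route once per second, up to $1 attempts
# (default 60). Returns 0 as soon as the server answers successfully,
# or 1 if it never does within the allowed attempts.
wait_for_llama() {
  attempts="${1:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS -o /dev/null "${server_root}/health" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# List the models exposed through the OpenAI-compatible API.
list_models() {
  curl -fsS "${server_root}/v1/models"
}

# Usage (after sourcing this file):
#   bash scripts/llama/start_main.sh start
#   wait_for_llama 120 && list_models
```

The helpers only define functions, so sourcing the file has no side effects until one of them is called.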

Manage the process:

```bash
bash scripts/llama/start_main.sh status
bash scripts/llama/start_main.sh logs
bash scripts/llama/start_main.sh logs --follow
bash scripts/llama/start_main.sh restart
bash scripts/llama/start_main.sh stop
```

The local `.env` uses:

```env
DUCK_LLAMA_SERVER_BIN=./vendor/llama.cpp/build/bin/llama-server
DUCK_CTX_SIZE=4096
DUCK_N_GPU_LAYERS=20
DUCK_PARALLEL=1
DUCK_LLAMA_DEVICE=Vulkan0
DUCK_LLAMA_EXTRA_ARGS="--reasoning off --cache-ram 0"
```
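
To illustrate how these settings could reach `llama-server`, the sketch below prints one plausible command line. The variable-to-flag mapping is an assumption, since the real mapping lives in `scripts/llama/start_main.sh` and is not shown here. `print_llama_command` and `DUCK_MODEL_PATH` are hypothetical names; `--model`, `--host`, `--port`, `--ctx-size`, `--n-gpu-layers`, `--parallel`, and `--device` are standard `llama-server` options.

```bash
#!/usr/bin/env bash
# Illustrative only: NOT the contents of scripts/llama/start_main.sh.
# Values copied from the .env shown above.
DUCK_LLAMA_SERVER_BIN=./vendor/llama.cpp/build/bin/llama-server
DUCK_CTX_SIZE=4096
DUCK_N_GPU_LAYERS=20
DUCK_PARALLEL=1
DUCK_LLAMA_DEVICE=Vulkan0
DUCK_LLAMA_EXTRA_ARGS="--reasoning off --cache-ram 0"
# Hypothetical variable holding the main model path named in this document.
DUCK_MODEL_PATH=models/Qwen3.6/nonMTP/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

# Print (do not run) the command line these settings would produce
# under the assumed mapping. Host and port match the default endpoint.
print_llama_command() {
  # DUCK_LLAMA_EXTRA_ARGS is intentionally unquoted so it splits into words.
  echo "$DUCK_LLAMA_SERVER_BIN" \
    --model "$DUCK_MODEL_PATH" \
    --host 127.0.0.1 --port 8081 \
    --ctx-size "$DUCK_CTX_SIZE" \
    --n-gpu-layers "$DUCK_N_GPU_LAYERS" \
    --parallel "$DUCK_PARALLEL" \
    --device "$DUCK_LLAMA_DEVICE" \
    $DUCK_LLAMA_EXTRA_ARGS
}

print_llama_command
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without the GPU or the model file.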

MTP is available only through `scripts/llama/start_thinker_mtp_experimental.sh` and is not used by the action JSON endpoint by default.