README MrTrenchTrucker/turbohaul-manager
Turbohaul-Manager
Ollama-shape inference manager using Tom's TurboQuant fork of llama.cpp.
FIFO queue + grace + IDLE_HOT hot-hold + model swap on Nvidia RTX GPU's including Blackwell.
What it does
- Accepts OpenAI / Ollama-shape
/v1/chat/completionsrequests - Single-slot serial sidecar (one llama-server child holds the model)
- MTP speculative decoding (draft-mtp) composes with TurboQuant turbo2/3/4 KV cache quantization in a single llama.cpp binary — faster decode without sacrificing the quantized KV-cache memory footprint
- ACTIVE_MATCH cascade for same-thread follow-ups within a grace window (warm-process reuse)
- IDLE_HOT 5-minute warm-hold after grace expires: same-model follow-ups inherit the warm process; different-model swap tears down + spawns new
- Multiplexed multi-agent serialization on one shared GPU (proven with three concurrent agents — see docs/MULTI_AGENT_SHARING.md)
- Transparent tool-call recovery for jinja-templated GGUFs that emit calls as text JSON in
message.contentinstead of the structuredtool_callsfield (notably Qwen3-family per upstream llama.cpp issues #20809 / #20837 / #20260) — see docs/TOOL_CALL_HANDLING.md - Safety guardrails: refuses spawn when VRAM / RAM / CPU / IO-wait would put the host at risk
Quick start
# Run it (build locally first — see below; no prebuilt registry image is published yet)
docker run --gpus all -p 11401:11401 \
-v $(pwd)/state:/var/lib/turbohaul \
-v $(pwd)/models:/var/lib/turbohaul/import-staging \
turbohaul-manager:v0.3.0
# Build locally (required)
git clone https://github.com/MrTrenchTrucker/turbohaul-manager.git
cd turbohaul-manager
docker build -f Dockerfile.cuda -t turbohaul-manager:v0.3.0 .The -v $(pwd)/state:/var/lib/turbohaul mount is required for production deployment — without it, state.sqlite, manifests/*.yaml, and the blobs/ store live inside the container layer and are destroyed by docker rm or container-layer corruption. See docs/PERSISTENCE_CHECKLIST.md for the full hardening checklist.
API
Compatible with Ollama-shape clients:
GET /api/tags-- list modelsGET /api/show?name=<tag>-- model detailPOST /v1/chat/completions-- OpenAI-shape inference (supportsresponse_formatjson_object + json_schema)POST /api/chat-- Ollama-shape inferencePOST /v1/embeddings-- llama-server embeddings passthroughGET /v1/logging-- paginated audit eventsPUT /api/manifests/{tag}-- register a new model (requires GGUF blob in store; ETag/If-Match atomic concurrency)POST /api/pull-hf-- pull a GGUF from HuggingFacePOST /api/pull-url-- pull a GGUF from arbitrary HTTPS URL (SSRF-guarded)POST /api/import-- import a local GGUF fileGET /status-- live queue + active + idle_hot snapshot
Setting up AI Agents
Pointing an AI agent (Hermes, langchain, llama-index, LiteLLM, raw OpenAI SDK, Ollama clients, etc.) at Turbohaul is two lines:
base_url: http://<turbohaul-host>:11401/v1
api_key: dummy # no auth required on the internal-network portTurbohaul ships with sane defaults for multi-tool-call agent loops — idle_hot_load_seconds=600, grace_seconds=30, streaming SSE pass-through, tool-call field forwarding on both /v1/chat/completions and /api/chat, text-JSON tool-call recovery for jinja-template models that emit calls as content text, and ACTIVE_MATCH warm-slot reuse for same-thread_id follow-ups (sub-second after the first turn).
Full guide: docs/AI_AGENT_SETUP.md — per-agent config recipes (Hermes / OpenAI SDK / langchain / llama-index / LiteLLM / Ollama / curl), multi-tool-call workflow notes, production setup, validation smoke tests, and a troubleshooting table. For the recovery layer specifically, see docs/TOOL_CALL_HANDLING.md.
Multi-agent shared-GPU
Multiple agents can target the same Turbohaul endpoint at the same time. Turbohaul queues their requests, holds the warm model when possible, and cleanly swaps models when a different agent needs a different one. Proven with three concurrent agents running on one Blackwell card with zero force-evictions during a multi-model serialization smoke.
This is sharing-via-serialization, not concurrent-tensor-parallelism. See docs/MULTI_AGENT_SHARING.md for the architecture, the proof, and when this does (and does not) fit your workload.
TurboQuant flag doctrine
The Turbohaul manifest schema includes spawn-time TurboQuant KV-cache flags (turbo2/3/4) and an MTP speculative-decoding flag (--spec-type draft-mtp); the KV-cache flags below should be on by default for production manifests: flash_attn, no_context_shift, cache_reuse: 256, slot_prompt_similarity: 0.5, no_perf. These are spawn argv — manifest PUT does not affect a running llama-server; a cold-spawn (request with body "keep_alive": 0, natural IDLE_HOT teardown, or container restart) is required to pick up changes.
See docs/TURBOQUANT_FLAGS.md for the spawn-vs-request distinction, patching recipe, and verification recipe.
Persistence
Production deployments must bind-mount /var/lib/turbohaul, ship an image tarball backup, mirror configs to a separate host, and have an auto-recovery entry. See docs/PERSISTENCE_CHECKLIST.md for the full deployment persistence checklist.
License
MIT (see LICENSE). All third-party deps audited MIT-compatible (see THIRD_PARTY_NOTICES.md).
Contributors
See CONTRIBUTORS.md. MrTrench (founder) shipped v0.2.3. Release notes in CHANGELOG.md.
