Serving and routing local LLMs with LLMhop

High-performance inference servers like vLLM and sglang are built around a simple assumption: one model per process. Each worker loads a single set of weights, binds to its own port, and exposes an OpenAI-compatible API. This is great for throughput, but it leaves you with a fleet of disconnected endpoints the moment you want to serve more than one model on a machine. LLMhop closes that gap from two sides at once: a tiny router that puts a single endpoint in front of those workers, and a NixOS module that can provision the workers themselves.

The router

At its core, LLMhop is a single stateless Go binary with zero third-party dependencies and no CGO. It reads the model field of an incoming OpenAI-compatible request, looks that name up in its config, and reverse-proxies the request verbatim to the matching backend. Because it keeps no state, there is no database, no cache, and no background worker, so it is safe to place behind any load balancer. A minimal config.json maps model names to upstream URLs:

{
  "listen": ":8080",
  "models": {
    "qwen3-8b": { "url": "http://localhost:18001" },
    "openai-gpt-4o": {
      "url": "https://api.openai.com",
      "headers": {
        "Authorization": "Bearer ${env:OPENAI_KEY}"
      }
    }
  }
}

Two optional features solve common authentication use cases. First, incoming requests can be gated with a list of bearer tokens, which are compared in constant time and stripped before forwarding, so the client-facing token never leaks upstream. Second, per-model headers can inject an Authorization (or any other) header when forwarding, which lets you consolidate self-hosted workers and hosted providers like OpenAI behind the same endpoint. Secrets never have to live in the config file in plaintext: string values are expanded at startup from ${env:NAME} or ${file:path} references, with the latter resolving against systemd’s $CREDENTIALS_DIRECTORY when present.

The NixOS module

The router alone already removes a lot of glue code, but wiring up the inference servers by hand is still tedious. The NixOS module therefore goes one step further and can run the workers themselves. You declare your models once, and each entry becomes an isolated worker bound to a loopback port, with the matching route registered with the router automatically.

services.llmhop = {
  enable = true;
  llama-cpp.models."qwen3-8b" = {
    port = 18001;
    settings.hf-repo = "unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL";
  };
  vllm.models."llama-3-8b" = {
    port = 20001;
    model = "meta-llama/Meta-Llama-3-8B-Instruct";
  };
};

Three backends are supported and can be enabled side by side: llama.cpp, sglang, and vLLM. The choice of how to run each one reflects how it is best deployed. llama.cpp runs as a native, hardened systemd system unit under DynamicUser. sglang and vLLM are launched as rootless Podman containers through quadlet-nix, each owned by a dedicated lingering system user that holds its own cache directory and rootless container store. No worker ever runs as root, and the router itself runs under the same aggressive sandboxing as the llama.cpp units.

Running the GPU backends as systemd user units rather than system units is a deliberate workaround. When systemd’s system manager launches a rootless container, Podman ends up in a UID-mapped namespace, and nvidia-cdi-hook then fails to read the OCI bundle’s config, so the GPU never gets exposed. Running each Quadlet unit under a real, lingering user’s systemd instance keeps Podman in a keep-id-style mapping where the CDI hook works and the GPU is passed through correctly.

sglang already ships its own router, the SGL Model Gateway, so it is worth being clear about why LLMhop exists alongside it. That gateway speaks sglang’s own API rather than a generic OpenAI surface: it probes each worker through sglang-specific endpoints like /get_model_info, and it only adds value when you actually want sglang’s IGW dispatch features such as custom routing or prefix caching across workers. It is also a much heavier component, a full Rust service running as its own container, whereas LLMhop is a single dependency-free Go binary that fronts llama.cpp, vLLM, and hosted providers just as happily as sglang. For maximum flexibility, the module does not force the choice: the SGL Model Gateway can be enabled with a single option and placed in front of the sglang workers, while LLMhop keeps routing across every backend.

The result is that the whole stack, from pulling weights to the public endpoint, lives in one declarative configuration. The full list of per-backend options is available in the reference, and feedback on the project is always welcome.