Stephan Berbig b47a09db91 demo + playground + docs

One-command demo so the gateway can be exercised end-to-end without a GPU or a
real model download:

- demo/mock-ollama/ — tiny FastAPI service emulating Ollama (/api/tags,
  /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count
  and eval_count on the final frame, /api/embed, /api/show, /api/version).
  Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml — postgres + redis + mock-ollama + gateway, with
  PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground.
  Mirrors the prod posture (mock-ollama not exposed).
- demo.sh — brings the stack up, waits on /healthz, creates a demo tenant with
  allow_all_models and a fresh API key via the bootstrap CLI inside the
  container, then prints the key, the playground URL, and five ready-to-paste
  curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull).
  ./demo.sh --down tears everything back down with volumes.
- playground/index.html — single-file dark-themed UI served same-origin by
  the gateway at /playground (CORS-free). Per-endpoint About card with method/
  auth/streaming badges, a real description, sample request body, sample
  response, and a footer note. Live SSE/NDJSON rendering of the response.
  A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh
  are visibly gated until an API key is in the field; the Base URL is
  force-pinned to location.origin three times to defeat browser autofill.
- docs/ — API.md (full endpoint reference with curl, streaming formats, error
  model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery
  + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule,
  pointing at a real Ollama backend, env reference), THREAT_MODEL.md
  (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md
  (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md.
  mkdocs.yml (Material theme) wires them together.

2026-05-26 20:52:33 +02:00

6.7 KiB

neuronetz-gateway — Operations Runbook

Day-2 operations for the gateway: managing tenants and keys, budgets, model policy, usage, and the fail-closed behaviors you'll encounter. All administration is via the bootstrap CLI (SPEC §11), run inside the gateway container. There are no admin HTTP endpoints in the gateway (that's neuronetz-console's job).

Run the CLI inside the running container:
docker compose exec gateway neuronetz-gateway <command> …
In the demo stack, swap the compose file: docker compose -f docker-compose.demo.yml exec gateway …

Keys

Create a key

docker compose exec gateway neuronetz-gateway create-key --tenant acme --name prod-server-1
# optional: --scopes chat,embeddings   (default: chat,embeddings)

The full key is printed exactly once in the form nz_<prefix><secret>. Store it immediately in your secret manager — it is argon2id-hashed and cannot be recovered. Only the 12-char prefix is retained server-side.

List keys (never shows full keys)

docker compose exec gateway neuronetz-gateway list-keys --tenant acme
# prints: <prefix>  status=active  name='prod-server-1'  created=…

Revoke a key

docker compose exec gateway neuronetz-gateway revoke-key --prefix nz_abc12345

This sets the key status to revoked and writes a row to the gateway.revocations outbox. A Postgres NOTIFY on channel key_revoked then fires and the gateway evicts the key's Redis cache entry, so revocation takes effect within ~1 second (SPEC §4.5) without restarting anything. A subsequent request with that key returns 401.

The console (neuronetz-console) revokes keys the same way — by inserting into gateway.revocations. The trigger-driven NOTIFY makes it immediate without any cross-service HTTP call.
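The eviction step described above can be sketched with an in-memory stand-in for the Redis key cache. This is only an illustration: `KeyCache` and `on_key_revoked` are hypothetical names, and the sketch assumes the key_revoked notification payload is the revoked key's prefix, which the runbook does not state.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KeyCache:
    """In-memory stand-in for the gateway's Redis key cache (hypothetical)."""

    entries: dict = field(default_factory=dict)

    def put(self, prefix: str, record: dict) -> None:
        self.entries[prefix] = record

    def get(self, prefix: str) -> Optional[dict]:
        return self.entries.get(prefix)

    def evict(self, prefix: str) -> bool:
        # True only if an entry was actually removed.
        return self.entries.pop(prefix, None) is not None


def on_key_revoked(cache: KeyCache, payload: str) -> bool:
    """Handle one notification from the key_revoked channel.

    Assumption: the payload carries the revoked key's prefix.
    """
    return cache.evict(payload.strip())


cache = KeyCache()
cache.put("nz_abc12345", {"tenant": "acme", "status": "active"})
evicted = on_key_revoked(cache, "nz_abc12345")
```

Once the entry is gone, the next request with that key has to be checked against Postgres, where the status is already revoked, so it is rejected with 401.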

Rotate a key

There is no in-place rotate. Rotate by: create a new key → deploy it to the client → verify traffic on the new prefix → revoke the old prefix.

Tenants & limits

Create a tenant

docker compose exec gateway neuronetz-gateway create-tenant --name acme \
  --rpm 120 --tpm 200000 --concurrent 8
# add --allow-all-models to opt into using any installed model (default: off)

Limits are resolved key first, then tenant: a NULL key-level limit falls back to the tenant value.
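The fallback rule can be sketched as a small function. `effective_limit` and the example key are hypothetical; the tenant values are the ones from the create-tenant example above.

```python
from typing import Optional


def effective_limit(key_limit: Optional[int], tenant_limit: int) -> int:
    """Return the key-level limit, or the tenant value when the key's is NULL (None)."""
    # Compare against None explicitly so that an explicit key limit of 0 is respected.
    return tenant_limit if key_limit is None else key_limit


# Tenant created with --rpm 120 --tpm 200000 --concurrent 8.
tenant = {"rpm": 120, "tpm": 200_000, "concurrent": 8}
# Hypothetical key that overrides only rpm; the other two limits are NULL.
key = {"rpm": 30, "tpm": None, "concurrent": None}

resolved = {name: effective_limit(key[name], tenant[name]) for name in tenant}
print(resolved)  # {'rpm': 30, 'tpm': 200000, 'concurrent': 8}
```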

Budgets

Set per-key token budgets (any combination of daily / monthly / total):

docker compose exec gateway neuronetz-gateway set-budget --key nz_abc12345 \
  --daily 1000000 --monthly 30000000 --total 500000000

Budgets are enforced fail-closed: when the binding period hits zero remaining, requests return 429 with a descriptive error and a Retry-After header. The binding period and remaining balance are surfaced on every response via X-Budget-Period and X-Budget-Tokens-Remaining (SPEC §6.5).
Live counters live in Redis; the Postgres ledger (gateway.budget_usage) is the source of truth on period rollover/reset.
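A minimal sketch of how a binding period could be chosen and reported, assuming "binding" means the configured period with the fewest tokens remaining (the runbook does not define it further). `binding_period` and `check_budget` are hypothetical names; the header names come from the text above. Retry-After is left out because its value depends on when the period rolls over.

```python
from typing import Optional


def binding_period(limits: dict, used: dict) -> Optional[tuple]:
    """Return (period, tokens_remaining) for the tightest configured budget.

    Periods whose limit is None are treated as not configured.
    """
    remaining = {
        period: max(limit - used.get(period, 0), 0)
        for period, limit in limits.items()
        if limit is not None
    }
    if not remaining:
        return None  # no budget configured for this key
    period = min(remaining, key=remaining.get)
    return period, remaining[period]


def check_budget(limits: dict, used: dict) -> tuple:
    """Return (status, headers) for the budget check only."""
    binding = binding_period(limits, used)
    if binding is None:
        return 200, {}
    period, left = binding
    headers = {"X-Budget-Period": period, "X-Budget-Tokens-Remaining": str(left)}
    # Fail closed: zero remaining in the binding period means deny.
    return (429 if left == 0 else 200), headers


limits = {"daily": 1_000_000, "monthly": 30_000_000, "total": 500_000_000}
status_ok, headers_ok = check_budget(limits, {"daily": 250_000, "monthly": 4_000_000, "total": 4_000_000})
status_out, headers_out = check_budget(limits, {"daily": 1_000_000, "monthly": 4_000_000, "total": 4_000_000})
```

In the second call the daily budget is exhausted, so the daily period binds and the check returns 429 even though the monthly and total budgets still have room.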

Model policy

Set an explicit allowlist (default-deny)

docker compose exec gateway neuronetz-gateway set-models --tenant acme \
  --models llama3.1:8b,mistral:7b

The tenant's effective set is allowed_models ∩ discovered: allowlist entries that aren't actually installed on the backend are silently ignored. A request for a model outside the effective set returns a generic 403, the same response as for a model that doesn't exist, so the response cannot be used to enumerate installed models.

Toggle `allow_all_models`

docker compose exec gateway neuronetz-gateway set-models --tenant acme --allow-all      # opt in
docker compose exec gateway neuronetz-gateway set-models --tenant acme --no-allow-all   # back to allowlist

With allow_all_models on, the effective set is the live discovered set — any model pulled into Ollama becomes usable on the next discovery refresh, with no further config change. This is an audited convenience; prefer explicit allowlists for untrusted tenants (see THREAT_MODEL.md).
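Both policies reduce to one set computation, sketched here under stated assumptions: `effective_models` and `authorize` are hypothetical names, and `qwen2.5:7b` is just an example of a model that is installed but not allowlisted.

```python
def effective_models(allowed: set, discovered: set, allow_all: bool) -> set:
    """Effective set for a tenant: the allowlist intersected with what is installed,
    or the whole discovered set when allow_all_models is on."""
    if allow_all:
        return set(discovered)
    return allowed & discovered


def authorize(model: str, allowed: set, discovered: set, allow_all: bool = False) -> int:
    """Return 200 or a generic 403; 'not installed' and 'not allowed' look identical."""
    return 200 if model in effective_models(allowed, discovered, allow_all) else 403


allowed = {"llama3.1:8b", "mistral:7b"}      # from set-models --models …
discovered = {"llama3.1:8b", "qwen2.5:7b"}   # hypothetical discovery result

print(effective_models(allowed, discovered, allow_all=False))  # {'llama3.1:8b'}
```

With this data, `mistral:7b` (allowlisted but not installed) and `qwen2.5:7b` (installed but not allowlisted) both get 403. An empty discovered set denies everything, even with allow_all_models on.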

Inspect discovery and effective sets

docker compose exec gateway neuronetz-gateway list-models                 # live-discovered models
docker compose exec gateway neuronetz-gateway list-models --tenant acme   # + that tenant's effective set

Usage

docker compose exec gateway neuronetz-gateway show-usage --tenant acme --period day
# prints: requests=…  tokens_in=…  tokens_out=…   (period: day|month|total)

For per-key forensics and finer slicing, query gateway.audit_log directly (it records request_id, key_prefix, model, tokens_in/out, status, latency_ms, client_ip).
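As an illustration of per-key slicing, the sketch below runs an aggregate query against an in-memory SQLite table with the columns listed above. It is a stand-in, not the real schema: in production the table is gateway.audit_log in Postgres, and the column names `tokens_in`/`tokens_out`, the column types, and the sample rows are assumptions.

```python
import sqlite3

# Requests and token totals per key prefix, heaviest output first.
PER_KEY_USAGE = """
SELECT key_prefix,
       COUNT(*)        AS requests,
       SUM(tokens_in)  AS tokens_in,
       SUM(tokens_out) AS tokens_out
FROM audit_log
GROUP BY key_prefix
ORDER BY SUM(tokens_out) DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log (request_id TEXT, key_prefix TEXT, model TEXT, "
    "tokens_in INTEGER, tokens_out INTEGER, status INTEGER, "
    "latency_ms INTEGER, client_ip TEXT)"
)
conn.executemany(
    "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("r1", "nz_abc12345", "llama3.1:8b", 120, 300, 200, 850, "10.0.0.5"),
        ("r2", "nz_abc12345", "mistral:7b", 80, 150, 200, 640, "10.0.0.5"),
        ("r3", "nz_def67890", "llama3.1:8b", 50, 0, 429, 3, "10.0.0.9"),
    ],
)
rows = conn.execute(PER_KEY_USAGE).fetchall()
for row in rows:
    print(row)
```

The same SELECT should carry over to Postgres with the table name changed to gateway.audit_log, provided the column names match.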

How model discovery refresh works (SPEC §4.6)

A background task polls Ollama GET /api/tags every MODEL_DISCOVERY_REFRESH_S seconds and caches the result in Redis (gateway:models:discovered, TTL MODEL_DISCOVERY_CACHE_TTL_S) plus an in-process copy for hot reads.
A model pulled into Ollama out-of-band appears in allow_all_models tenants' effective sets within one refresh interval — no config change.
Discovery is read-only and uses only the allowlisted /api/tags endpoint; it never triggers a pull.
To force a faster pickup, lower MODEL_DISCOVERY_REFRESH_S (the demo uses 15 s).
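The refresh cycle can be sketched in two pieces: parsing the `/api/tags` response (Ollama returns a `models` array whose entries carry a `name`), and an in-process copy that expires. `DiscoveryCache` is a hypothetical class, the 60 s TTL is an arbitrary example value, and the clock is passed in explicitly to keep the sketch deterministic.

```python
import json


def parse_tags(body: str) -> set:
    """Extract model names from an Ollama GET /api/tags response body."""
    payload = json.loads(body)
    return {entry["name"] for entry in payload.get("models", [])}


class DiscoveryCache:
    """In-process copy of the discovered set with a TTL (hypothetical)."""

    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self._models = set()
        self._stored_at = None

    def store(self, models: set, now: float) -> None:
        self._models = set(models)
        self._stored_at = now

    def get(self, now: float) -> set:
        # A never-filled or expired cache yields the empty set, which denies by design.
        if self._stored_at is None or now - self._stored_at > self.ttl_s:
            return set()
        return set(self._models)


body = '{"models": [{"name": "llama3.1:8b"}, {"name": "mistral:7b"}]}'
cache = DiscoveryCache(ttl_s=60)
cache.store(parse_tags(body), now=0)

fresh = cache.get(now=15)    # one demo refresh interval later
stale = cache.get(now=120)   # no refresh happened before the TTL ran out
```

If the background task stops refreshing, the cached set eventually expires and every model request is denied, which matches the "empty discovery = deny" row in the table below.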

Fail-closed behaviors to expect

| Symptom | Cause | Correct behavior |
| --- | --- | --- |
| `503` on every request | Redis or Postgres-read down | Fail-closed: rate-limit/budget/auth can't be checked, so deny. Restore the backend. |
| `502` with `Retry-After` | Ollama unreachable | Circuit breaker opens after 5 consecutive failures, half-opens after 30 s. Check the backend / `OLLAMA_BASE_URL`. |
| `403` for a model you "know" exists | Model not in the tenant's effective set, or discovery cache empty/expired | Check `list-models --tenant …`; verify the backend is reachable and the model is installed. Empty discovery = deny by design. |
| `429` with `Retry-After` | Rate limit or budget exhausted | Inspect headers (`X-RateLimit-*`, `X-Budget-*`); raise limits/budget or wait. |
| `401` immediately after revoke | Working as intended | Revocation propagated via NOTIFY → Redis eviction. |
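The circuit-breaker behavior in the 502 row (open after 5 consecutive failures, half-open after 30 s) can be sketched as a small state machine. This is not the gateway's implementation: the class and method names are hypothetical, the clock is passed in explicitly, and the half-open state here admits any number of trial requests.

```python
class CircuitBreaker:
    """Opens after N consecutive failures, half-opens after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def state(self, now: float) -> str:
        if self.opened_at is None:
            return "closed"
        if now - self.opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def allow(self, now: float) -> bool:
        # Requests pass while closed or half-open; an open breaker short-circuits them.
        return self.state(now) != "open"

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        if self.state(now) == "half-open":
            # The trial request failed: re-open and restart the cool-down.
            self.opened_at = now
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now


breaker = CircuitBreaker()
for t in range(4):
    breaker.record_failure(now=t)
state_after_four = breaker.state(now=3)    # still closed
breaker.record_failure(now=4)              # fifth consecutive failure
state_after_five = breaker.state(now=10)   # open
state_after_wait = breaker.state(now=34)   # half-open, 30 s after opening
```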

/readyz returns 503 when any dependency (DB, Redis, Ollama) is unreachable; use it as the load-balancer health gate. /healthz only checks process liveness.
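The readiness rule amounts to "all dependencies up, or 503". A minimal sketch, assuming each dependency check has already been reduced to a boolean; the `readyz` function name and the response body shape are hypothetical.

```python
def readyz(checks: dict) -> tuple:
    """Return (status, body): 200 only when every dependency check passed."""
    failing = sorted(name for name, ok in checks.items() if not ok)
    status = 503 if failing else 200
    return status, {"ready": not failing, "failing": failing}


all_up = readyz({"db": True, "redis": True, "ollama": True})
ollama_down = readyz({"db": True, "redis": True, "ollama": False})
print(ollama_down)  # (503, {'ready': False, 'failing': ['ollama']})
```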

Logs, metrics, audit

Logs: structured (structlog), JSON in production, to stdout. Keys/secrets are never logged.
Metrics: Prometheus at /metrics (loopback only): gateway_requests_total, gateway_tokens_total, gateway_request_duration_seconds, labelled by tenant and model (never key_id).
Audit log: always-on in gateway.audit_log. Prompt log is opt-in per key and TTL'd (PROMPT_LOG_DEFAULT_RETENTION_DAYS); a sweeper enforces retention.
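The retention sweep can be pictured as a cutoff comparison. A sketch only: `expired_ids`, the row shape `(id, logged_at)`, and the 30-day value are assumptions, not the gateway's actual sweeper or default.

```python
from datetime import datetime, timedelta, timezone


def expired_ids(rows, retention_days: int, now: datetime) -> list:
    """Return ids of prompt-log rows older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [row_id for row_id, logged_at in rows if logged_at < cutoff]


now = datetime(2026, 5, 26, tzinfo=timezone.utc)
rows = [
    ("p1", now - timedelta(days=40)),  # past a 30-day window
    ("p2", now - timedelta(days=5)),   # still inside the window
]
to_delete = expired_ids(rows, retention_days=30, now=now)
print(to_delete)  # ['p1']
```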
