One-command demo so the gateway can be exercised end-to-end without a GPU or a real model download:
- demo/mock-ollama/: tiny FastAPI service emulating Ollama (/api/tags, /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count and eval_count on the final frame, /api/embed, /api/show, /api/version). Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml: postgres + redis + mock-ollama + gateway, with PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground. Mirrors the prod posture (mock-ollama not exposed).
- demo.sh: brings the stack up, waits on /healthz, creates a demo tenant with allow_all_models and a fresh API key via the bootstrap CLI inside the container, then prints the key, the playground URL, and five ready-to-paste curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull). ./demo.sh --down tears everything back down with volumes.
- playground/index.html: single-file dark-themed UI served same-origin by the gateway at /playground (CORS-free). Per-endpoint About card with method/auth/streaming badges, a real description, sample request body, sample response, and a footer note. Live SSE/NDJSON rendering of the response. A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh are visibly gated until an API key is in the field; the Base URL is force-pinned to location.origin three times to defeat browser autofill.
- docs/: API.md (full endpoint reference with curl, streaming formats, error model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule, pointing at a real Ollama backend, env reference), THREAT_MODEL.md (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md. mkdocs.yml (Material theme) wires them together.
neuronetz-gateway — Operations Runbook
Day-2 operations for the gateway: managing tenants and keys, budgets, model policy, usage,
and the fail-closed behaviors you'll encounter. All administration is via the bootstrap
CLI (SPEC §11), run inside the gateway container. There are no admin HTTP endpoints in the
gateway (that's neuronetz-console's job).
Run the CLI inside the running container:
docker compose exec gateway neuronetz-gateway <command> …
In the demo stack, swap the compose file:
docker compose -f docker-compose.demo.yml exec gateway …
Keys
Create a key
docker compose exec gateway neuronetz-gateway create-key --tenant acme --name prod-server-1
# optional: --scopes chat,embeddings (default: chat,embeddings)
The full key is printed exactly once in the form nz_<prefix><secret>. Store it
immediately in your secret manager — it is argon2id-hashed and cannot be recovered. Only the
12-char prefix is retained server-side.
List keys (never shows full keys)
docker compose exec gateway neuronetz-gateway list-keys --tenant acme
# prints: <prefix> status=active name='prod-server-1' created=…
Revoke a key
docker compose exec gateway neuronetz-gateway revoke-key --prefix nz_abc12345
This sets the key status to revoked and writes the gateway.revocations outbox row. A
Postgres NOTIFY on channel key_revoked fires; the gateway evicts the key's Redis cache
entry, so revocation takes effect within ~1 second (SPEC §4.5) without restarting anything.
A subsequent request with that key returns 401.
The console (neuronetz-console) revokes keys the same way, by inserting into gateway.revocations. The trigger-driven NOTIFY makes it immediate without any cross-service HTTP call.
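The revocation path above can be illustrated with an in-memory sketch. It is not the gateway's code: plain dicts stand in for Postgres and Redis, the `Gateway` class and `revoke` helper are hypothetical names, and it assumes the notification carries the key prefix. Hash verification of the secret is left out.

```python
from typing import Callable, Dict


class Gateway:
    """Toy auth path: cache first, then the database (both are plain dicts here)."""

    def __init__(self, db_status: Dict[str, str]) -> None:
        self.db_status = db_status          # stand-in for Postgres: prefix -> status
        self.cache: Dict[str, str] = {}     # stand-in for the Redis key cache

    def authenticate(self, prefix: str) -> int:
        status = self.cache.get(prefix)
        if status is None:                  # cache miss: read the source of truth
            status = self.db_status.get(prefix)
            if status is not None:
                self.cache[prefix] = status
        return 200 if status == "active" else 401

    def on_key_revoked(self, prefix: str) -> None:
        """Handler for a key_revoked notification: evict the cached entry."""
        self.cache.pop(prefix, None)


def revoke(db_status: Dict[str, str], prefix: str, notify: Callable[[str], None]) -> None:
    """Mark the key revoked, then notify listeners (assumed payload: the prefix)."""
    db_status[prefix] = "revoked"
    notify(prefix)


db = {"nz_abc12345": "active"}
gateway = Gateway(db)
before = gateway.authenticate("nz_abc12345")    # fills the cache
revoke(db, "nz_abc12345", gateway.on_key_revoked)
after = gateway.authenticate("nz_abc12345")     # cache was evicted, so the DB is re-read
print(before, after)
```

Without the eviction step the cached "active" entry would keep authenticating until it expired, which is why the notification matters.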
Rotate a key
There is no in-place rotate. Rotate by: create a new key → deploy it to the client → verify traffic on the new prefix → revoke the old prefix.
Tenants & limits
Create a tenant
docker compose exec gateway neuronetz-gateway create-tenant --name acme \
--rpm 120 --tpm 200000 --concurrent 8
# add --allow-all-models to opt into using any installed model (default: off)
Limits inherit key → tenant: a NULL key-level limit uses the tenant value.
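As a minimal sketch of that inheritance rule (the `Limits` dataclass and `effective_limits` function are hypothetical names, not the gateway's actual code), a key-level value of None, i.e. NULL, falls back to the tenant value:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    rpm: Optional[int] = None
    tpm: Optional[int] = None
    concurrent: Optional[int] = None


def effective_limits(key: Limits, tenant: Limits) -> Limits:
    """Use the key-level value when set; otherwise inherit the tenant value."""
    def pick(key_value: Optional[int], tenant_value: Optional[int]) -> Optional[int]:
        # "is None" rather than truthiness, so an explicit 0 is not overridden
        return tenant_value if key_value is None else key_value

    return Limits(
        rpm=pick(key.rpm, tenant.rpm),
        tpm=pick(key.tpm, tenant.tpm),
        concurrent=pick(key.concurrent, tenant.concurrent),
    )


tenant = Limits(rpm=120, tpm=200000, concurrent=8)
key = Limits(rpm=30)  # only rpm is set at key level
resolved = effective_limits(key, tenant)
print(resolved)
```

Here the key keeps its own rpm of 30 and inherits tpm and concurrent from the tenant.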
Budgets
Set per-key token budgets (any combination of daily / monthly / total):
docker compose exec gateway neuronetz-gateway set-budget --key nz_abc12345 \
--daily 1000000 --monthly 30000000 --total 500000000
- Budgets are enforced fail-closed: when the binding period hits zero remaining, requests return 429 with a descriptive error and a Retry-After header. The binding period and remaining balance are surfaced on every response via X-Budget-Period and X-Budget-Tokens-Remaining (SPEC §6.5).
- Live counters live in Redis; the Postgres ledger (gateway.budget_usage) is the source of truth on period rollover/reset.
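The budget check can be sketched roughly as follows. This assumes the "binding period" is the configured period with the fewest tokens remaining, which the runbook does not spell out. The function names are hypothetical, and the Retry-After value is omitted because its computation is not described here.

```python
from typing import Dict, Optional, Tuple


def binding_period(
    limits: Dict[str, Optional[int]], used: Dict[str, int]
) -> Optional[Tuple[str, int]]:
    """Return (period, remaining) for the configured period with the least left."""
    remaining = {
        period: max(limit - used.get(period, 0), 0)
        for period, limit in limits.items()
        if limit is not None                 # unset periods impose no budget
    }
    if not remaining:
        return None
    period = min(remaining, key=remaining.get)
    return period, remaining[period]


def check_budget(
    limits: Dict[str, Optional[int]], used: Dict[str, int]
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers); deny with 429 once the binding period is empty."""
    binding = binding_period(limits, used)
    if binding is None:
        return 200, {}
    period, left = binding
    headers = {"X-Budget-Period": period, "X-Budget-Tokens-Remaining": str(left)}
    return (429 if left == 0 else 200), headers


limits = {"daily": 1_000_000, "monthly": 30_000_000, "total": 500_000_000}
ok_status, ok_headers = check_budget(
    limits, {"daily": 250_000, "monthly": 4_000_000, "total": 4_000_000}
)
denied_status, denied_headers = check_budget(
    limits, {"daily": 1_000_000, "monthly": 4_000_000, "total": 4_000_000}
)
print(ok_status, ok_headers)
print(denied_status, denied_headers)
```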
Model policy
Set an explicit allowlist (default-deny)
docker compose exec gateway neuronetz-gateway set-models --tenant acme \
--models llama3.1:8b,mistral:7b
The tenant's effective set is allowed_models ∩ discovered — entries that aren't
actually installed on the backend silently never resolve. A request for a model outside the
effective set returns a generic 403 (same response as "doesn't exist" — no enumeration).
Toggle allow_all_models
docker compose exec gateway neuronetz-gateway set-models --tenant acme --allow-all # opt in
docker compose exec gateway neuronetz-gateway set-models --tenant acme --no-allow-all # back to allowlist
With allow_all_models on, the effective set is the live discovered set — any model
pulled into Ollama becomes usable on the next discovery refresh, with no further config
change. This is an audited convenience; prefer explicit allowlists for untrusted tenants
(see THREAT_MODEL.md).
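The two policy modes reduce to simple set operations. The sketch below is illustrative only: `effective_models` and `authorize_model` are hypothetical names, and qwen2.5:7b is just an example of a model that is installed but not on the allowlist.

```python
from typing import Iterable, Set


def effective_models(
    allowed: Iterable[str], discovered: Iterable[str], allow_all: bool = False
) -> Set[str]:
    """allow_all: the live discovered set; otherwise allowed ∩ discovered."""
    discovered_set = set(discovered)
    if allow_all:
        return discovered_set
    return set(allowed) & discovered_set


def authorize_model(
    model: str, allowed: Iterable[str], discovered: Iterable[str], allow_all: bool = False
) -> int:
    """Return 200 or a generic 403; 'not allowed' and 'not installed' look identical."""
    return 200 if model in effective_models(allowed, discovered, allow_all) else 403


allowed = ["llama3.1:8b", "mistral:7b"]
discovered = ["llama3.1:8b", "qwen2.5:7b"]   # mistral:7b is not installed

print(effective_models(allowed, discovered))
print(authorize_model("mistral:7b", allowed, discovered))        # allowlisted, not installed
print(authorize_model("qwen2.5:7b", allowed, discovered))        # installed, not allowlisted
print(authorize_model("qwen2.5:7b", allowed, discovered, True))  # allow_all_models on
print(authorize_model("llama3.1:8b", allowed, []))               # empty discovery
```

Both denied cases return the same 403, and an empty discovered set denies everything in either mode.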
Inspect discovery and effective sets
docker compose exec gateway neuronetz-gateway list-models # live-discovered models
docker compose exec gateway neuronetz-gateway list-models --tenant acme # + that tenant's effective set
Usage
docker compose exec gateway neuronetz-gateway show-usage --tenant acme --period day
# prints: requests=… tokens_in=… tokens_out=… (period: day|month|total)
For per-key forensics and finer slicing, query gateway.audit_log directly (it records
request_id, key_prefix, model, tokens_in/out, status, latency_ms, client_ip).
How model discovery refresh works (SPEC §4.6)
- A background task polls Ollama GET /api/tags every MODEL_DISCOVERY_REFRESH_S seconds and caches the result in Redis (gateway:models:discovered, TTL MODEL_DISCOVERY_CACHE_TTL_S) plus an in-process copy for hot reads.
- A model pulled into Ollama out-of-band appears in allow_all_models tenants' effective sets within one refresh interval, with no config change.
- Discovery is read-only and uses only the allowlisted /api/tags endpoint; it never triggers a pull.
- To force a faster pickup, lower MODEL_DISCOVERY_REFRESH_S (the demo uses 15 s).
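The refresh-and-expire behavior can be sketched with an in-process cache only. This is an assumption-laden illustration: `DiscoveryCache` and its injected `fetch_tags` and `clock` callables are hypothetical, Redis is left out, and keeping the previous copy after a failed poll until the TTL lapses is a guess at reasonable behavior rather than something the runbook states. The payload shape (a `models` list of objects with a `name`) follows Ollama's /api/tags response.

```python
import time
from typing import Callable, Dict, FrozenSet


def parse_tags(payload: Dict) -> FrozenSet[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return frozenset(m["name"] for m in payload.get("models", []))


class DiscoveryCache:
    def __init__(
        self,
        fetch_tags: Callable[[], Dict],
        ttl_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_tags = fetch_tags
        self._ttl_s = ttl_s
        self._clock = clock
        self._models: FrozenSet[str] = frozenset()
        self._expires_at = 0.0

    def refresh(self) -> None:
        """One poll. On failure, keep the current copy until its TTL runs out."""
        try:
            models = parse_tags(self._fetch_tags())
        except Exception:
            return
        self._models = models
        self._expires_at = self._clock() + self._ttl_s

    def discovered(self) -> FrozenSet[str]:
        """Hot read. An expired or never-filled cache is empty, which means deny."""
        if self._clock() >= self._expires_at:
            return frozenset()
        return self._models


now = [0.0]                                   # fake clock for the example
backend = {"models": [{"name": "llama3.1:8b"}]}
cache = DiscoveryCache(lambda: backend, ttl_s=60, clock=lambda: now[0])

cache.refresh()
first = cache.discovered()
backend["models"].append({"name": "mistral:7b"})   # pulled out-of-band
now[0] += 15                                  # one refresh interval later
cache.refresh()
second = cache.discovered()
now[0] += 61                                  # no refresh within the TTL
expired = cache.discovered()
print(sorted(first), sorted(second), sorted(expired))
```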
Fail-closed behaviors to expect
| Symptom | Cause | Correct behavior |
|---|---|---|
| 503 on every request | Redis or Postgres-read down | Fail-closed: rate-limit/budget/auth can't be checked, so deny. Restore the backend. |
| 502 with retry-after | Ollama unreachable | Circuit breaker opens after 5 consecutive failures, half-opens after 30 s. Check the backend / OLLAMA_BASE_URL. |
| 403 for a model you "know" exists | Model not in the tenant's effective set, or discovery cache empty/expired | Check list-models --tenant …; verify the backend is reachable and the model is installed. Empty discovery = deny by design. |
| 429 with Retry-After | Rate limit or budget exhausted | Inspect headers (X-RateLimit-*, X-Budget-*); raise limits/budget or wait. |
| 401 immediately after revoke | Working as intended | Revocation propagated via NOTIFY → Redis eviction. |
/readyz returns 503 when any dependency (DB, Redis, Ollama) is unreachable; use it as
the load-balancer health gate. /healthz only checks process liveness.
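The circuit-breaker row in the table above (open after 5 consecutive failures, half-open after 30 s) can be sketched like this. The `CircuitBreaker` class is a hypothetical illustration of the general pattern, not the gateway's implementation, and for brevity it lets every request through while half-open rather than a single probe.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        """Any success closes the breaker and resets the consecutive-failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        state = self.state
        if state == "open":
            return                              # already rejecting; nothing to count
        if state == "half-open":
            self._opened_at = self._clock()     # trial request failed: reopen
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


now = [0.0]                                     # fake clock for the example
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
state_after_failures = breaker.state
now[0] += 30
state_after_wait = breaker.state
breaker.record_success()
state_after_probe = breaker.state
print(state_after_failures, state_after_wait, state_after_probe)
```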
Logs, metrics, audit
- Logs: structured (structlog), JSON in production, to stdout. Keys/secrets are never logged.
- Metrics: Prometheus at /metrics (loopback only): gateway_requests_total, gateway_tokens_total, gateway_request_duration_seconds, labelled by tenant and model (never key_id).
- Audit log: always-on in gateway.audit_log. Prompt log is opt-in per key and TTL'd (PROMPT_LOG_DEFAULT_RETENTION_DAYS); a sweeper enforces retention.