demo + playground + docs

One-command demo so the gateway can be exercised end-to-end without a GPU or a real model download:

- demo/mock-ollama/: tiny FastAPI service emulating Ollama (/api/tags, /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count and eval_count on the final frame, /api/embed, /api/show, /api/version). Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml: postgres + redis + mock-ollama + gateway, with PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground. Mirrors the prod posture (mock-ollama not exposed).
- demo.sh: brings the stack up, waits on /healthz, creates a demo tenant with allow_all_models and a fresh API key via the bootstrap CLI inside the container, then prints the key, the playground URL, and five ready-to-paste curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull). ./demo.sh --down tears everything back down with volumes.
- playground/index.html: single-file dark-themed UI served same-origin by the gateway at /playground (CORS-free). Per-endpoint About card with method/auth/streaming badges, a real description, sample request body, sample response, and a footer note. Live SSE/NDJSON rendering of the response. A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh are visibly gated until an API key is in the field; the Base URL is force-pinned to location.origin three times to defeat browser autofill.
- docs/: API.md (full endpoint reference with curl, streaming formats, error model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule, pointing at a real Ollama backend, env reference), THREAT_MODEL.md (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md. mkdocs.yml (Material theme) wires them together.
2026-05-26 20:52:33 +02:00
parent 844b02aade
commit b47a09db91
13 changed files with 2501 additions and 0 deletions
--- a/docs/OPERATIONS.md
+++ b/docs/OPERATIONS.md
@@ -0,0 +1,172 @@
+# neuronetz-gateway — Operations Runbook
+
+Day-2 operations for the gateway: managing tenants and keys, budgets, model policy, usage,
+and the fail-closed behaviors you'll encounter. All administration is via the **bootstrap
+CLI** (SPEC §11), run inside the gateway container. There are no admin HTTP endpoints in the
+gateway (that's `neuronetz-console`'s job).
+
+> Run the CLI inside the running container:
+> ```bash
+> docker compose exec gateway neuronetz-gateway <command> …
+> ```
+> In the demo stack, swap the compose file: `docker compose -f docker-compose.demo.yml exec gateway …`
+
+---
+
+## Keys
+
+### Create a key
+
+```bash
+docker compose exec gateway neuronetz-gateway create-key --tenant acme --name prod-server-1
+# optional: --scopes chat,embeddings   (default: chat,embeddings)
+```
+
+The **full key is printed exactly once** in the form `nz_<prefix><secret>`. Store it
+immediately in your secret manager: only an argon2id hash and the 12-char `prefix` are
+retained server-side, so the full key cannot be recovered.
+
+### List keys (never shows full keys)
+
+```bash
+docker compose exec gateway neuronetz-gateway list-keys --tenant acme
+# prints: <prefix>  status=active  name='prod-server-1'  created=…
+```
+
+### Revoke a key
+
+```bash
+docker compose exec gateway neuronetz-gateway revoke-key --prefix nz_abc12345
+```
+
+This sets the key status to `revoked` and writes the `gateway.revocations` outbox row. A
+Postgres `NOTIFY` on channel `key_revoked` fires; the gateway evicts the key's Redis cache
+entry, so revocation takes effect within ~1 second (SPEC §4.5) without restarting anything.
+A subsequent request with that key returns **401**.
+
+> The console (`neuronetz-console`) revokes keys the same way — by inserting into
+> `gateway.revocations`. The trigger-driven NOTIFY makes it immediate without any
+> cross-service HTTP call.
+
+### Rotate a key
+
+There is no in-place rotate. Rotate by: create a new key → deploy it to the client → verify
+traffic on the new prefix → revoke the old prefix.
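The rotation sequence above can be scripted. The following is a minimal sketch that only assembles the two CLI invocations documented in this runbook (`create-key` and `revoke-key`). The helper names `gateway_cmd` and `rotation_commands` are hypothetical, and the deploy and verify steps between the two commands are left to the operator.

```python
from __future__ import annotations

import shlex

# Base invocation as shown in this runbook; the demo stack would also need
# "-f docker-compose.demo.yml" after "compose".
BASE = ["docker", "compose", "exec", "gateway", "neuronetz-gateway"]


def gateway_cmd(*args: str) -> list[str]:
    """Return the full argv for one bootstrap-CLI call (hypothetical helper)."""
    return BASE + list(args)


def rotation_commands(tenant: str, new_name: str, old_prefix: str) -> list[list[str]]:
    """Build the create and revoke calls for a rotation, in execution order.

    The new key must be deployed and its traffic verified between the two
    commands; this sketch does not automate that part.
    """
    return [
        gateway_cmd("create-key", "--tenant", tenant, "--name", new_name),
        gateway_cmd("revoke-key", "--prefix", old_prefix),
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for argv in rotation_commands("acme", "prod-server-2", "nz_abc12345"):
        print(shlex.join(argv))
```

Printing rather than executing keeps the revoke step a deliberate manual action, which matches the "verify traffic first" ordering.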
+
+---
+
+## Tenants & limits
+
+### Create a tenant
+
+```bash
+docker compose exec gateway neuronetz-gateway create-tenant --name acme \
+  --rpm 120 --tpm 200000 --concurrent 8
+# add --allow-all-models to opt into using any installed model (default: off)
+```
+
+Limits fall back **key → tenant**: a key-level limit that is `NULL` uses the tenant's value.
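The key → tenant fallback can be illustrated with a small sketch. It assumes that a `NULL` column maps to `None`; the `effective_limit` function and the example key row are hypothetical and do not reflect the gateway's actual code or schema.

```python
from typing import Optional


def effective_limit(key_limit: Optional[int], tenant_limit: int) -> int:
    """Use the key-level limit when set; a NULL (None) key limit uses the tenant value."""
    return tenant_limit if key_limit is None else key_limit


# Tenant limits taken from the create-tenant example above.
tenant = {"rpm": 120, "tpm": 200_000, "concurrent": 8}
# Hypothetical key row: only rpm is overridden, the other limits are NULL.
key = {"rpm": 30, "tpm": None, "concurrent": None}

resolved = {name: effective_limit(key[name], tenant[name]) for name in tenant}
print(resolved)
```

The sketch compares against `None` rather than relying on truthiness, so an explicit key-level value of `0` would not be mistaken for "unset". Whether the gateway accepts `0` as a limit is not stated here.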
+
+---
+
+## Budgets
+
+Set per-key token budgets (any combination of daily / monthly / total):
+
+```bash
+docker compose exec gateway neuronetz-gateway set-budget --key nz_abc12345 \
+  --daily 1000000 --monthly 30000000 --total 500000000
+```
+
+- Budgets are enforced **fail-closed**: when the binding period has no tokens remaining,
+  requests return **429** with a descriptive error and a `Retry-After` header. The binding
+  period and remaining balance are surfaced on every response via `X-Budget-Period` and
+  `X-Budget-Tokens-Remaining` (SPEC §6.5).
+- Live counters live in Redis; the Postgres ledger (`gateway.budget_usage`) is the source of
+  truth on period rollover/reset.
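As an illustration of how a binding period might be chosen, here is a sketch under a stated assumption: the binding period is taken to be the configured budget with the fewest tokens remaining. The period labels (`daily`, `monthly`, `total`) mirror the `set-budget` flags and are assumptions about the header values. Both function names are hypothetical, and `Retry-After` is left out because its computation is not described here.

```python
from __future__ import annotations

from typing import Optional


def binding_period(limits: dict[str, Optional[int]], used: dict[str, int]) -> Optional[tuple[str, int]]:
    """Return (period, tokens_remaining) for the tightest configured budget, or None."""
    remaining = {
        period: max(limit - used.get(period, 0), 0)
        for period, limit in limits.items()
        if limit is not None  # an unset budget never binds
    }
    if not remaining:
        return None
    period = min(remaining, key=remaining.get)
    return period, remaining[period]


def budget_response(limits: dict[str, Optional[int]], used: dict[str, int]) -> tuple[int, dict[str, str]]:
    """Return (status, headers): 429 once the binding period is exhausted, else 200."""
    binding = binding_period(limits, used)
    if binding is None:
        return 200, {}
    period, left = binding
    headers = {"X-Budget-Period": period, "X-Budget-Tokens-Remaining": str(left)}
    return (429 if left == 0 else 200), headers


# Budget values from the set-budget example above; usage figures are made up.
limits = {"daily": 1_000_000, "monthly": 30_000_000, "total": 500_000_000}
print(budget_response(limits, {"daily": 999_000, "monthly": 5_000_000, "total": 5_000_000}))
print(budget_response(limits, {"daily": 1_000_000, "monthly": 6_000_000, "total": 6_000_000}))
```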
+
+---
+
+## Model policy
+
+### Set an explicit allowlist (default-deny)
+
+```bash
+docker compose exec gateway neuronetz-gateway set-models --tenant acme \
+  --models llama3.1:8b,mistral:7b
+```
+
+The tenant's **effective set** is `allowed_models ∩ discovered` — entries that aren't
+actually installed on the backend silently never resolve. A request for a model outside the
+effective set returns a generic **403** (same response as "doesn't exist" — no enumeration).
+
+### Toggle `allow_all_models`
+
+```bash
+docker compose exec gateway neuronetz-gateway set-models --tenant acme --allow-all      # opt in
+docker compose exec gateway neuronetz-gateway set-models --tenant acme --no-allow-all   # back to allowlist
+```
+
+With `allow_all_models` on, the effective set **is** the live discovered set — any model
+pulled into Ollama becomes usable on the next discovery refresh, with no further config
+change. This is an audited convenience; prefer explicit allowlists for untrusted tenants
+(see [`THREAT_MODEL.md`](THREAT_MODEL.md)).
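The two policy modes described above reduce to a small set computation. This is a sketch only; `effective_models` and `is_permitted` are hypothetical names, and the model lists are illustrative.

```python
from __future__ import annotations


def effective_models(allowed: set[str], discovered: set[str], allow_all: bool = False) -> set[str]:
    """allow_all: the live discovered set; otherwise allowed ∩ discovered."""
    return set(discovered) if allow_all else allowed & discovered


def is_permitted(model: str, allowed: set[str], discovered: set[str], allow_all: bool = False) -> bool:
    """True when the model is in the effective set; the caller answers 403 otherwise."""
    return model in effective_models(allowed, discovered, allow_all)


allowed = {"llama3.1:8b", "mistral:7b"}
discovered = {"llama3.1:8b", "qwen2:7b"}  # mistral:7b is allowlisted but not installed

print(effective_models(allowed, discovered))                  # allowlist mode
print(effective_models(allowed, discovered, allow_all=True))  # allow_all_models mode
```

Note that an empty discovered set yields an empty effective set in both modes, which is consistent with the deny-by-default posture.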
+
+### Inspect discovery and effective sets
+
+```bash
+docker compose exec gateway neuronetz-gateway list-models                 # live-discovered models
+docker compose exec gateway neuronetz-gateway list-models --tenant acme   # + that tenant's effective set
+```
+
+---
+
+## Usage
+
+```bash
+docker compose exec gateway neuronetz-gateway show-usage --tenant acme --period day
+# prints: requests=…  tokens_in=…  tokens_out=…   (period: day|month|total)
+```
+
+For per-key forensics and finer slicing, query `gateway.audit_log` directly (it records
+`request_id`, `key_prefix`, `model`, `tokens_in/out`, `status`, `latency_ms`, `client_ip`).
+
+---
+
+## How model discovery refresh works (SPEC §4.6)
+
+- A background task polls Ollama `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds and
+  caches the result in Redis (`gateway:models:discovered`, TTL `MODEL_DISCOVERY_CACHE_TTL_S`)
+  plus an in-process copy for hot reads.
+- A model pulled into Ollama out-of-band appears in `allow_all_models` tenants' effective sets
+  within one refresh interval — no config change.
+- Discovery is **read-only** and uses only the allowlisted `/api/tags` endpoint; it never
+  triggers a pull.
+- To force a faster pickup, lower `MODEL_DISCOVERY_REFRESH_S` (the demo uses 15 s).
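The discovery behavior described above can be approximated with a short sketch. It parses an Ollama `/api/tags` body (a JSON object with a `models` array whose entries carry a `name`) and keeps an in-process copy that expires after a TTL. The `DiscoveryCache` class is hypothetical; the Redis layer and the background polling task are not modeled.

```python
from __future__ import annotations

import json
import time
from typing import Callable


def model_names(tags_body: str) -> set[str]:
    """Extract model names from an Ollama /api/tags JSON body."""
    payload = json.loads(tags_body)
    return {entry["name"] for entry in payload.get("models", [])}


class DiscoveryCache:
    """In-process copy of the discovered set that expires after ttl_s seconds."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._models: set[str] = set()
        self._expires_at = float("-inf")  # nothing stored yet, so already expired

    def store(self, models: set[str]) -> None:
        """Replace the cached set and restart the TTL."""
        self._models = set(models)
        self._expires_at = self._clock() + self._ttl_s

    def get(self) -> set[str]:
        """Return the cached set, or an empty set once the TTL has passed."""
        if self._clock() >= self._expires_at:
            return set()  # expired cache: every model lookup is denied
        return set(self._models)


if __name__ == "__main__":
    cache = DiscoveryCache(ttl_s=60)
    cache.store(model_names('{"models": [{"name": "llama3.1:8b"}]}'))
    print(cache.get())
```

Returning an empty set on expiry, rather than stale data, is a design choice of this sketch that keeps a stalled refresh from silently extending access.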
+
+---
+
+## Fail-closed behaviors to expect
+
+| Symptom | Cause | Correct behavior |
+|---|---|---|
+| `503` on every request | Redis or the Postgres read path is down | Fail-closed: rate-limit/budget/auth can't be checked, so deny. Restore the backend. |
+| `502` with `Retry-After` | Ollama unreachable | Circuit breaker opens after 5 consecutive failures, half-opens after 30 s. Check the backend / `OLLAMA_BASE_URL`. |
+| `403` for a model you "know" exists | Model not in the tenant's effective set, **or** discovery cache empty/expired | Check `list-models --tenant …`; verify the backend is reachable and the model is installed. Empty discovery = deny by design. |
+| `429` with `Retry-After` | Rate limit or budget exhausted | Inspect headers (`X-RateLimit-*`, `X-Budget-*`); raise limits/budget or wait. |
+| `401` immediately after revoke | Revocation propagated via NOTIFY → Redis eviction | Working as intended; no action needed. |
+
+`/readyz` returns `503` when **any** dependency (DB, Redis, Ollama) is unreachable; use it as
+the load-balancer health gate. `/healthz` only checks process liveness.
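The circuit-breaker row in the table above (open after 5 consecutive failures, half-open after 30 s) follows a common pattern, sketched below. The `CircuitBreaker` class and its half-open semantics (a failed trial reopens the breaker, a success closes it) are assumptions for illustration, not the gateway's implementation.

```python
from __future__ import annotations

import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, half-opens after `cooldown_s`."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = threshold
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._cooldown_s:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        """Requests pass when closed or half-open; they are rejected while open."""
        return self.state != "open"

    def record_success(self) -> None:
        """Any success resets the failure count and closes the breaker."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        state = self.state
        if state == "open":
            return  # late failures while open do not extend the cooldown
        if state == "half-open":
            self._opened_at = self._clock()  # trial failed: reopen, restart cooldown
            return
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


if __name__ == "__main__":
    breaker = CircuitBreaker()
    for _ in range(5):
        breaker.record_failure()
    print(breaker.state, breaker.allow_request())
```

The injectable `clock` parameter exists so the cooldown can be tested without sleeping.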
+
+---
+
+## Logs, metrics, audit
+
+- **Logs:** structured (structlog), JSON in production, to stdout. Keys/secrets are never
+  logged.
+- **Metrics:** Prometheus at `/metrics` (loopback only): `gateway_requests_total`,
+  `gateway_tokens_total`, `gateway_request_duration_seconds`, labelled by `tenant` and
+  `model` (never `key_id`).
+- **Audit log:** always-on in `gateway.audit_log`. **Prompt log** is opt-in per key and TTL'd
+  (`PROMPT_LOG_DEFAULT_RETENTION_DAYS`); a sweeper enforces retention.