neuronetz-gateway/docs/OPERATIONS.md
Stephan Berbig b47a09db91 demo + playground + docs
One-command demo so the gateway can be exercised end-to-end without a GPU or a
real model download:

- demo/mock-ollama/ — tiny FastAPI service emulating Ollama (/api/tags,
  /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count
  and eval_count on the final frame, /api/embed, /api/show, /api/version).
  Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml — postgres + redis + mock-ollama + gateway, with
  PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground.
  Mirrors the prod posture (mock-ollama not exposed).
- demo.sh — brings the stack up, waits on /healthz, creates a demo tenant with
  allow_all_models and a fresh API key via the bootstrap CLI inside the
  container, then prints the key, the playground URL, and five ready-to-paste
  curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull).
  ./demo.sh --down tears everything back down with volumes.
- playground/index.html — single-file dark-themed UI served same-origin by
  the gateway at /playground (CORS-free). Per-endpoint About card with method/
  auth/streaming badges, a real description, sample request body, sample
  response, and a footer note. Live SSE/NDJSON rendering of the response.
  A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh
  are visibly gated until an API key is in the field; the Base URL is
  force-pinned to location.origin three times to defeat browser autofill.
- docs/ — API.md (full endpoint reference with curl, streaming formats, error
  model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery
  + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule,
  pointing at a real Ollama backend, env reference), THREAT_MODEL.md
  (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md
  (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md.
  mkdocs.yml (Material theme) wires them together.
2026-05-26 20:52:33 +02:00


# neuronetz-gateway — Operations Runbook
Day-2 operations for the gateway: managing tenants and keys, budgets, model policy, usage,
and the fail-closed behaviors you'll encounter. All administration is via the **bootstrap
CLI** (SPEC §11), run inside the gateway container. There are no admin HTTP endpoints in the
gateway (that's `neuronetz-console`'s job).
> Run the CLI inside the running container:
> ```bash
> docker compose exec gateway neuronetz-gateway <command> …
> ```
> In the demo stack, swap the compose file: `docker compose -f docker-compose.demo.yml exec gateway …`
---
## Keys
### Create a key
```bash
docker compose exec gateway neuronetz-gateway create-key --tenant acme --name prod-server-1
# optional: --scopes chat,embeddings (default: chat,embeddings)
```
The **full key is printed exactly once** in the form `nz_<prefix><secret>`. Store it
immediately in your secret manager: only an argon2id hash is kept, so the key cannot be
recovered. The 12-char `prefix` is the only part retained in plaintext server-side.
### List keys (never shows full keys)
```bash
docker compose exec gateway neuronetz-gateway list-keys --tenant acme
# prints: <prefix> status=active name='prod-server-1' created=…
```
### Revoke a key
```bash
docker compose exec gateway neuronetz-gateway revoke-key --prefix nz_abc12345
```
This sets the key status to `revoked` and writes a row to the `gateway.revocations` outbox. A
Postgres `NOTIFY` on channel `key_revoked` fires and the gateway evicts the key's Redis cache
entry, so revocation takes effect within ~1 second (SPEC §4.5) without restarting anything.
Any subsequent request with that key returns **401**.
> The console (`neuronetz-console`) revokes keys the same way — by inserting into
> `gateway.revocations`. The trigger-driven NOTIFY makes it immediate without any
> cross-service HTTP call.
### Rotate a key
There is no in-place rotation. To rotate: create a new key → deploy it to the client → verify
traffic on the new prefix → revoke the old prefix.

---
## Tenants & limits
### Create a tenant
```bash
docker compose exec gateway neuronetz-gateway create-tenant --name acme \
  --rpm 120 --tpm 200000 --concurrent 8
# add --allow-all-models to opt into using any installed model (default: off)
```
Limits inherit **key → tenant**: a `NULL` key-level limit uses the tenant value.
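The inheritance rule can be sketched as a small fallback function. This is an illustration only: the `Limits` class and its field names are hypothetical (they mirror the CLI flags above, not the gateway's actual schema), and `None` stands in for a SQL `NULL`. A key-level value of `0` is treated as set, so it is not replaced by the tenant value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Hypothetical container; field names mirror the CLI flags, not the real schema."""
    rpm: Optional[int] = None
    tpm: Optional[int] = None
    concurrent: Optional[int] = None


def effective_limits(key: Limits, tenant: Limits) -> Limits:
    """Return the limits applied to a key: the key-level value if set, else the tenant's."""
    def pick(key_value: Optional[int], tenant_value: Optional[int]) -> Optional[int]:
        # Only None (NULL) falls back; an explicit 0 is kept as-is.
        return tenant_value if key_value is None else key_value

    return Limits(
        rpm=pick(key.rpm, tenant.rpm),
        tpm=pick(key.tpm, tenant.tpm),
        concurrent=pick(key.concurrent, tenant.concurrent),
    )


tenant = Limits(rpm=120, tpm=200_000, concurrent=8)
key = Limits(rpm=30)  # only rpm is overridden at key level
print(effective_limits(key, tenant))  # Limits(rpm=30, tpm=200000, concurrent=8)
```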

---
## Budgets
Set per-key token budgets (any combination of daily / monthly / total):
```bash
docker compose exec gateway neuronetz-gateway set-budget --key nz_abc12345 \
  --daily 1000000 --monthly 30000000 --total 500000000
```
- Budgets are enforced **fail-closed**: when the binding period hits zero remaining, requests
  return **429** with a descriptive error and a `Retry-After` header. The binding period and
  remaining balance are surfaced on every response via `X-Budget-Period` and
  `X-Budget-Tokens-Remaining` (SPEC §6.5).
- Live counters are kept in Redis; the Postgres ledger (`gateway.budget_usage`) is the source of
  truth on period rollover/reset.
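One way to picture the "binding period" is as the configured period with the fewest tokens left. That reading is an assumption made for this sketch; the runbook does not define the term, and SPEC §6.5 is authoritative. The function names and the dictionary layout are hypothetical and do not reflect the gateway's Redis or Postgres structures.

```python
from typing import Dict, Optional, Tuple


def binding_period(
    limits: Dict[str, Optional[int]], used: Dict[str, int]
) -> Optional[Tuple[str, int]]:
    """Return (period, tokens_remaining) for the tightest configured budget.

    Assumption: "binding" means the configured period with the least remaining.
    Periods whose limit is None are treated as not configured.
    """
    remaining = {
        period: max(limit - used.get(period, 0), 0)
        for period, limit in limits.items()
        if limit is not None
    }
    if not remaining:
        return None  # no budget configured for this key
    period = min(remaining, key=remaining.get)
    return period, remaining[period]


def is_allowed(limits: Dict[str, Optional[int]], used: Dict[str, int]) -> bool:
    """Deny once the binding period has nothing left (the gateway answers 429)."""
    binding = binding_period(limits, used)
    return binding is None or binding[1] > 0


limits = {"daily": 1_000_000, "monthly": 30_000_000, "total": 500_000_000}
used = {"daily": 999_000, "monthly": 5_000_000, "total": 5_000_000}
print(binding_period(limits, used))  # ('daily', 1000)
```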
---
## Model policy
### Set an explicit allowlist (default-deny)
```bash
docker compose exec gateway neuronetz-gateway set-models --tenant acme \
  --models llama3.1:8b,mistral:7b
```
The tenant's **effective set** is `allowed_models ∩ discovered` — entries that aren't
actually installed on the backend silently never resolve. A request for a model outside the
effective set returns a generic **403** (same response as "doesn't exist" — no enumeration).
### Toggle `allow_all_models`
```bash
docker compose exec gateway neuronetz-gateway set-models --tenant acme --allow-all # opt in
docker compose exec gateway neuronetz-gateway set-models --tenant acme --no-allow-all # back to allowlist
```
With `allow_all_models` on, the effective set **is** the live discovered set — any model
pulled into Ollama becomes usable on the next discovery refresh, with no further config
change. This is an audited convenience; prefer explicit allowlists for untrusted tenants
(see [`THREAT_MODEL.md`](THREAT_MODEL.md)).
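The two policies reduce to simple set operations. The following is a minimal sketch with hypothetical function names, not the gateway's code; it only restates the rules described above (allowlist ∩ discovered, or the discovered set itself when `allow_all_models` is on).

```python
from typing import FrozenSet, Iterable


def effective_models(
    allowed: Iterable[str], discovered: Iterable[str], allow_all: bool = False
) -> FrozenSet[str]:
    """Compute a tenant's effective model set from its policy and live discovery."""
    discovered_set = frozenset(discovered)
    if allow_all:
        # allow_all_models: every discovered model is usable.
        return discovered_set
    # Default-deny allowlist: entries not installed on the backend drop out.
    return frozenset(allowed) & discovered_set


def is_model_permitted(
    model: str, allowed: Iterable[str], discovered: Iterable[str], allow_all: bool = False
) -> bool:
    """True if a request for `model` would pass the policy check (otherwise 403)."""
    return model in effective_models(allowed, discovered, allow_all)


discovered = {"llama3.1:8b", "qwen2.5:7b"}
allowed = ["llama3.1:8b", "mistral:7b"]
print(sorted(effective_models(allowed, discovered)))        # ['llama3.1:8b']
print(sorted(effective_models(allowed, discovered, True)))  # ['llama3.1:8b', 'qwen2.5:7b']
```

Note that an empty discovered set yields an empty effective set under both policies, which matches the deny-by-default behavior when discovery has no data.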
### Inspect discovery and effective sets
```bash
docker compose exec gateway neuronetz-gateway list-models # live-discovered models
docker compose exec gateway neuronetz-gateway list-models --tenant acme # + that tenant's effective set
```
---
## Usage
```bash
docker compose exec gateway neuronetz-gateway show-usage --tenant acme --period day
# prints: requests=… tokens_in=… tokens_out=… (period: day|month|total)
```
For per-key forensics and finer slicing, query `gateway.audit_log` directly (it records
`request_id`, `key_prefix`, `model`, `tokens_in/out`, `status`, `latency_ms`, `client_ip`).

---
## How model discovery refresh works (SPEC §4.6)
- A background task polls Ollama `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds and
  caches the result in Redis (`gateway:models:discovered`, TTL `MODEL_DISCOVERY_CACHE_TTL_S`)
  plus an in-process copy for hot reads.
- A model pulled into Ollama out-of-band appears in `allow_all_models` tenants' effective sets
  within one refresh interval, with no config change.
- Discovery is **read-only** and uses only the allowlisted `/api/tags` endpoint; it never
  triggers a pull.
- To force a faster pickup, lower `MODEL_DISCOVERY_REFRESH_S` (the demo uses 15 s).
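To make the discovery step concrete, here is a minimal sketch of turning an Ollama `/api/tags` response body into a set of model names. It relies on the `models[].name` shape of Ollama's response. The function name `parse_tags` is hypothetical, and mapping a malformed payload to an empty set is an assumption chosen to match the deny-on-empty-discovery behavior, not confirmed gateway logic.

```python
import json
from typing import FrozenSet


def parse_tags(body: str) -> FrozenSet[str]:
    """Extract model names from an Ollama /api/tags JSON body.

    Assumption: anything unparseable yields an empty set, so a broken
    discovery response can only shrink what tenants are allowed to use.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return frozenset()
    models = payload.get("models") if isinstance(payload, dict) else None
    if not isinstance(models, list):
        return frozenset()
    return frozenset(
        entry["name"]
        for entry in models
        if isinstance(entry, dict) and isinstance(entry.get("name"), str)
    )


sample = '{"models": [{"name": "llama3.1:8b"}, {"name": "mistral:7b"}]}'
print(sorted(parse_tags(sample)))  # ['llama3.1:8b', 'mistral:7b']
```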
---
## Fail-closed behaviors to expect
| Symptom | Cause | Correct behavior |
|---|---|---|
| `503` on every request | Redis or Postgres-read down | Fail-closed: rate-limit/budget/auth can't be checked, so requests are denied. Restore the failed dependency. |
| `502` with `Retry-After` | Ollama unreachable | Circuit breaker opens after 5 consecutive failures, half-opens after 30 s. Check the backend / `OLLAMA_BASE_URL`. |
| `403` for a model you "know" exists | Model not in the tenant's effective set, **or** discovery cache empty/expired | Check `list-models --tenant …`; verify the backend is reachable and the model is installed. Empty discovery = deny by design. |
| `429` with `Retry-After` | Rate limit or budget exhausted | Inspect headers (`X-RateLimit-*`, `X-Budget-*`); raise limits/budget or wait. |
| `401` immediately after revoke | Working as intended | Revocation propagated via NOTIFY → Redis eviction. |

`/readyz` returns `503` when **any** dependency (DB, Redis, Ollama) is unreachable; use it as
the load-balancer health gate. `/healthz` only checks process liveness.
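The circuit-breaker row in the table (open after 5 consecutive failures, half-open after 30 s) follows the standard closed/open/half-open pattern. The sketch below illustrates that pattern with those two numbers as defaults. The class and method names are hypothetical, and details such as how many probe requests the real gateway admits while half-open are not covered by this runbook; this version admits any request in that state. The clock is injectable so the timing can be exercised without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Illustrative breaker: closed -> open after N failures -> half-open after a delay."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        # Only the fully open state rejects calls to the backend.
        return self.state != "open"

    def record_success(self) -> None:
        # Any success closes the breaker and clears the failure streak.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        # A failed probe while (half-)open restarts the timer; otherwise
        # the breaker opens once the consecutive-failure threshold is reached.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
print(breaker.state)  # open
now[0] = 30.0
print(breaker.state)  # half-open
```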

---
## Logs, metrics, audit
- **Logs:** structured (structlog), JSON in production, to stdout. Keys/secrets are never
  logged.
- **Metrics:** Prometheus at `/metrics` (loopback only): `gateway_requests_total`,
  `gateway_tokens_total`, `gateway_request_duration_seconds`, labelled by `tenant` and
  `model` (never `key_id`).
- **Audit log:** always-on in `gateway.audit_log`. **Prompt log** is opt-in per key and TTL'd
  (`PROMPT_LOG_DEFAULT_RETENTION_DAYS`); a sweeper enforces retention.