demo + playground + docs
One-command demo so the gateway can be exercised end-to-end without a GPU or a real model download:

- demo/mock-ollama/ — tiny FastAPI service emulating Ollama (/api/tags, /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count and eval_count on the final frame, /api/embed, /api/show, /api/version). Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml — postgres + redis + mock-ollama + gateway, with PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground. Mirrors the prod posture (mock-ollama not exposed).
- demo.sh — brings the stack up, waits on /healthz, creates a demo tenant with allow_all_models and a fresh API key via the bootstrap CLI inside the container, then prints the key, the playground URL, and five ready-to-paste curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull). ./demo.sh --down tears everything back down with volumes.
- playground/index.html — single-file dark-themed UI served same-origin by the gateway at /playground (CORS-free). Per-endpoint About card with method/auth/streaming badges, a real description, sample request body, sample response, and a footer note. Live SSE/NDJSON rendering of the response. A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh are visibly gated until an API key is in the field; the Base URL is force-pinned to location.origin three times to defeat browser autofill.
- docs/ — API.md (full endpoint reference with curl, streaming formats, error model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule, pointing at a real Ollama backend, env reference), THREAT_MODEL.md (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md. mkdocs.yml (Material theme) wires them together.
172
docs/OPERATIONS.md
Normal file
@@ -0,0 +1,172 @@
# neuronetz-gateway — Operations Runbook

Day-2 operations for the gateway: managing tenants and keys, budgets, model policy, usage,
and the fail-closed behaviors you'll encounter. All administration is via the **bootstrap
CLI** (SPEC §11), run inside the gateway container. There are no admin HTTP endpoints in the
gateway (that's `neuronetz-console`'s job).

> Run the CLI inside the running container:
> ```bash
> docker compose exec gateway neuronetz-gateway <command> …
> ```
> In the demo stack, swap the compose file: `docker compose -f docker-compose.demo.yml exec gateway …`

---

## Keys

### Create a key

```bash
docker compose exec gateway neuronetz-gateway create-key --tenant acme --name prod-server-1
# optional: --scopes chat,embeddings (default: chat,embeddings)
```

The **full key is printed exactly once** in the form `nz_<prefix><secret>`. Store it
immediately in your secret manager — it is argon2id-hashed and cannot be recovered. Only the
12-char `prefix` is retained server-side in plaintext.

### List keys (never shows full keys)

```bash
docker compose exec gateway neuronetz-gateway list-keys --tenant acme
# prints: <prefix> status=active name='prod-server-1' created=…
```

### Revoke a key

```bash
docker compose exec gateway neuronetz-gateway revoke-key --prefix nz_abc12345
```

This sets the key status to `revoked` and writes a row to the `gateway.revocations` outbox. A
Postgres `NOTIFY` on channel `key_revoked` fires; the gateway evicts the key's Redis cache
entry, so revocation takes effect within ~1 second (SPEC §4.5) without restarting anything.
A subsequent request with that key returns **401**.

> The console (`neuronetz-console`) revokes keys the same way — by inserting into
> `gateway.revocations`. The trigger-driven NOTIFY makes it immediate without any
> cross-service HTTP call.
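
The eviction step can be pictured with a minimal Python sketch. It uses an in-memory dict in place of Redis and a direct function call in place of the Postgres listener. The payload format (the key prefix as a plain string), the cache layout, and the names `KeyCache`, `on_key_revoked`, and `authenticate` are assumptions for illustration, not the gateway's actual code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class KeyCache:
    """In-memory stand-in for the gateway's Redis key cache (hypothetical)."""

    entries: Dict[str, dict] = field(default_factory=dict)

    def put(self, prefix: str, record: dict) -> None:
        self.entries[prefix] = record

    def get(self, prefix: str) -> Optional[dict]:
        return self.entries.get(prefix)

    def evict(self, prefix: str) -> bool:
        # True only if an entry was actually removed.
        return self.entries.pop(prefix, None) is not None


def on_key_revoked(cache: KeyCache, payload: str) -> bool:
    """Handle one `key_revoked` notification.

    Assumes the NOTIFY payload is the revoked key's prefix.
    """
    return cache.evict(payload.strip())


def authenticate(cache: KeyCache, prefix: str) -> int:
    """Return the HTTP status a request would get, looking only at the cache.

    A real gateway would fall back to Postgres on a cache miss and find
    status=revoked there; this sketch treats a miss as 401 to stay short.
    """
    record = cache.get(prefix)
    if record is None or record.get("status") != "active":
        return 401
    return 200


cache = KeyCache()
cache.put("nz_abc12345", {"status": "active", "tenant": "acme"})
status_before = authenticate(cache, "nz_abc12345")   # cached and active
evicted = on_key_revoked(cache, "nz_abc12345")       # notification arrives
status_after = authenticate(cache, "nz_abc12345")    # entry is gone
```

The point of the design is that the cache never needs a short TTL to make revocation fast: the notification removes the entry directly.
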

### Rotate a key

There is no in-place rotate. Rotate by: create a new key → deploy it to the client → verify
traffic on the new prefix → revoke the old prefix.

---

## Tenants & limits

### Create a tenant

```bash
docker compose exec gateway neuronetz-gateway create-tenant --name acme \
  --rpm 120 --tpm 200000 --concurrent 8
# add --allow-all-models to opt into using any installed model (default: off)
```

Limits inherit **key → tenant**: a `NULL` key-level limit uses the tenant value.
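
As a rough illustration of that inheritance rule, the sketch below resolves each limit independently, with Python's `None` standing in for SQL `NULL`. The `Limits` type and `effective_limits` function are hypothetical names; the field names simply mirror the `--rpm`, `--tpm`, and `--concurrent` flags above. Whether a key-level value may exceed the tenant value is not stated here, so the sketch does not clamp.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    rpm: Optional[int] = None
    tpm: Optional[int] = None
    concurrent: Optional[int] = None


def effective_limits(key: Limits, tenant: Limits) -> Limits:
    """Resolve key -> tenant inheritance: None on the key falls back to the tenant."""

    def pick(key_value: Optional[int], tenant_value: Optional[int]) -> Optional[int]:
        # Test `is None` rather than truthiness, so an explicit 0 on the key is kept.
        return tenant_value if key_value is None else key_value

    return Limits(
        rpm=pick(key.rpm, tenant.rpm),
        tpm=pick(key.tpm, tenant.tpm),
        concurrent=pick(key.concurrent, tenant.concurrent),
    )


tenant = Limits(rpm=120, tpm=200_000, concurrent=8)
key = Limits(rpm=30)  # only rpm is set at key level; tpm and concurrent are NULL
resolved = effective_limits(key, tenant)
```

Here `resolved` carries the key's `rpm` and the tenant's `tpm` and `concurrent`.
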

---

## Budgets

Set per-key token budgets (any combination of daily / monthly / total):

```bash
docker compose exec gateway neuronetz-gateway set-budget --key nz_abc12345 \
  --daily 1000000 --monthly 30000000 --total 500000000
```

- Budgets are enforced **fail-closed**: when the binding period hits zero remaining, requests
  return **429** with a descriptive error and a `Retry-After` header. The binding period and
  remaining balance are surfaced on every response via `X-Budget-Period` and
  `X-Budget-Tokens-Remaining` (SPEC §6.5).
- Live counters live in Redis; the Postgres ledger (`gateway.budget_usage`) is the source of
  truth on period rollover/reset.
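
One way to read "binding period" is the configured period with the fewest tokens left. The Python sketch below works under that assumption and shows how the status code and the two budget headers could be derived from per-period remaining balances. The function names and the `Retry-After` value are placeholders; the real value would depend on when the binding period resets.

```python
from typing import Dict, Optional, Tuple

# Remaining tokens per configured period; a period that is not set is absent.
Remaining = Dict[str, int]


def binding_period(remaining: Remaining) -> Optional[Tuple[str, int]]:
    """Return (period, tokens_remaining) for the tightest configured budget."""
    if not remaining:
        return None  # no budget configured for this key
    period = min(remaining, key=lambda p: remaining[p])
    return period, max(remaining[period], 0)


def check_budget(remaining: Remaining) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers), denying once the binding period is exhausted."""
    binding = binding_period(remaining)
    if binding is None:
        return 200, {}
    period, left = binding
    headers = {
        "X-Budget-Period": period,
        "X-Budget-Tokens-Remaining": str(left),
    }
    if left <= 0:
        # Placeholder value (assumption): seconds until the period resets.
        headers["Retry-After"] = "60"
        return 429, headers
    return 200, headers


ok = check_budget({"daily": 1000, "monthly": 50000, "total": 900000})
denied = check_budget({"daily": 0, "monthly": 49000, "total": 899000})
```

In the second call the daily budget is exhausted even though the monthly and total budgets still have room, so the request is denied and `X-Budget-Period` names `daily`.
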

---

## Model policy

### Set an explicit allowlist (default-deny)

```bash
docker compose exec gateway neuronetz-gateway set-models --tenant acme \
  --models llama3.1:8b,mistral:7b
```

The tenant's **effective set** is `allowed_models ∩ discovered` — entries that aren't
actually installed on the backend silently never resolve. A request for a model outside the
effective set returns a generic **403** (same response as "doesn't exist" — no enumeration).

### Toggle `allow_all_models`

```bash
docker compose exec gateway neuronetz-gateway set-models --tenant acme --allow-all      # opt in
docker compose exec gateway neuronetz-gateway set-models --tenant acme --no-allow-all   # back to allowlist
```

With `allow_all_models` on, the effective set **is** the live discovered set — any model
pulled into Ollama becomes usable on the next discovery refresh, with no further config
change. This is an audited convenience; prefer explicit allowlists for untrusted tenants
(see [`THREAT_MODEL.md`](THREAT_MODEL.md)).
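
The two policy modes reduce to a small set computation. The following Python sketch is illustrative only: `effective_models` and `authorize_model` are hypothetical names, and `qwen2.5:7b` is just an example of a model that is installed but not allowlisted.

```python
from typing import FrozenSet, Iterable


def effective_models(
    allowed: Iterable[str],
    discovered: Iterable[str],
    allow_all_models: bool = False,
) -> FrozenSet[str]:
    """Effective set: discovered models, narrowed by the allowlist unless opted out."""
    discovered_set = frozenset(discovered)
    if allow_all_models:
        return discovered_set
    return frozenset(allowed) & discovered_set


def authorize_model(model: str, effective: FrozenSet[str]) -> int:
    # Same 403 whether the model is disallowed or not installed,
    # so a caller cannot use the response to enumerate models.
    return 200 if model in effective else 403


discovered = {"llama3.1:8b", "qwen2.5:7b"}
allowlisted = effective_models(["llama3.1:8b", "mistral:7b"], discovered)
opted_in = effective_models(["llama3.1:8b", "mistral:7b"], discovered, allow_all_models=True)
nothing_discovered = effective_models(["llama3.1:8b"], [], allow_all_models=True)
```

`mistral:7b` drops out of `allowlisted` because it is not installed, and an empty discovery result yields an empty effective set in either mode, which is what makes an empty discovery cache deny by default.
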

### Inspect discovery and effective sets

```bash
docker compose exec gateway neuronetz-gateway list-models                 # live-discovered models
docker compose exec gateway neuronetz-gateway list-models --tenant acme   # + that tenant's effective set
```

---

## Usage

```bash
docker compose exec gateway neuronetz-gateway show-usage --tenant acme --period day
# prints: requests=… tokens_in=… tokens_out=…   (period: day|month|total)
```

For per-key forensics and finer slicing, query `gateway.audit_log` directly (it records
`request_id`, `key_prefix`, `model`, `tokens_in/out`, `status`, `latency_ms`, `client_ip`).
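
As a sketch of what such a query might look like, the example below builds a throwaway copy of the table in SQLite (attached under the schema name `gateway` so the qualified table name matches) and aggregates usage per key prefix. The column types, the split of `tokens_in/out` into `tokens_in` and `tokens_out`, and the sample rows are assumptions; the production table lives in Postgres and may have more columns, such as a timestamp to filter on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database named "gateway" so the
# schema-qualified table name matches the one used in Postgres.
conn.execute("ATTACH DATABASE ':memory:' AS gateway")
conn.execute(
    """
    CREATE TABLE gateway.audit_log (
        request_id TEXT, key_prefix TEXT, model TEXT,
        tokens_in INTEGER, tokens_out INTEGER,
        status INTEGER, latency_ms INTEGER, client_ip TEXT
    )
    """
)
# Made-up sample rows for illustration only.
conn.executemany(
    "INSERT INTO gateway.audit_log VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("r1", "nz_abc12345", "llama3.1:8b", 120, 300, 200, 850, "10.0.0.5"),
        ("r2", "nz_abc12345", "llama3.1:8b", 0, 0, 429, 3, "10.0.0.5"),
        ("r3", "nz_def67890", "mistral:7b", 50, 40, 200, 410, "10.0.0.9"),
    ],
)

# Per-key request counts, token totals, and error counts.
PER_KEY_USAGE = """
    SELECT key_prefix,
           COUNT(*)        AS requests,
           SUM(tokens_in)  AS tokens_in,
           SUM(tokens_out) AS tokens_out,
           SUM(CASE WHEN status >= 400 THEN 1 ELSE 0 END) AS errors
    FROM gateway.audit_log
    GROUP BY key_prefix
    ORDER BY key_prefix
"""
rows = conn.execute(PER_KEY_USAGE).fetchall()
```

The `SELECT` uses only standard SQL, so it should carry over to Postgres unchanged as long as the column names match.
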

---

## How model discovery refresh works (SPEC §4.6)

- A background task polls Ollama `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds and
  caches the result in Redis (`gateway:models:discovered`, TTL `MODEL_DISCOVERY_CACHE_TTL_S`)
  plus an in-process copy for hot reads.
- A model pulled into Ollama out-of-band appears in `allow_all_models` tenants' effective sets
  within one refresh interval — no config change.
- Discovery is **read-only** and uses only the allowlisted `/api/tags` endpoint; it never
  triggers a pull.
- To force a faster pickup, lower `MODEL_DISCOVERY_REFRESH_S` (the demo uses 15 s).
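
The refresh-and-expire behavior described in the list above can be sketched in a few lines of Python. This is a single-process simplification: it keeps only the in-process copy, takes the fetch function and clock as parameters so it can run without Ollama or Redis, and leaves the background scheduling out. `DiscoveryCache` and its methods are hypothetical names.

```python
import time
from typing import Callable, FrozenSet, Iterable, Optional


class DiscoveryCache:
    """Keeps the last discovered model set; treats it as empty once the TTL lapses."""

    def __init__(
        self,
        fetch_tags: Callable[[], Iterable[str]],
        ttl_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_tags = fetch_tags  # stand-in for GET /api/tags on the backend
        self._ttl_s = ttl_s
        self._clock = clock
        self._models: FrozenSet[str] = frozenset()
        self._fetched_at: Optional[float] = None

    def refresh(self) -> bool:
        """One poll. On failure the old copy is kept and simply ages out."""
        try:
            models = frozenset(self._fetch_tags())
        except Exception:
            return False
        self._models = models
        self._fetched_at = self._clock()
        return True

    def discovered(self) -> FrozenSet[str]:
        """Hot read: empty set (deny) if nothing was fetched or the copy expired."""
        if self._fetched_at is None:
            return frozenset()
        if self._clock() - self._fetched_at > self._ttl_s:
            return frozenset()
        return self._models


now = [0.0]                       # fake clock, in seconds
installed = ["llama3.1:8b"]       # what the backend currently has
cache = DiscoveryCache(lambda: list(installed), ttl_s=60, clock=lambda: now[0])

before = cache.discovered()       # nothing fetched yet
cache.refresh()
installed.append("mistral:7b")    # pulled out-of-band
stale = cache.discovered()        # not visible until the next poll
now[0] = 15.0
cache.refresh()
fresh = cache.discovered()        # visible after one refresh interval
now[0] = 76.0
expired = cache.discovered()      # 61 s since the last poll, past the 60 s TTL
```

In a deployment, a background task would call `refresh()` every `MODEL_DISCOVERY_REFRESH_S` seconds; keeping the TTL longer than the refresh interval means a single failed poll does not immediately empty the set.
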

---

## Fail-closed behaviors to expect

| Symptom | Cause | Correct behavior |
|---|---|---|
| `503` on every request | Redis or Postgres-read down | Fail-closed — rate-limit/budget/auth can't be checked, so deny. Restore the backend. |
| `502` with `Retry-After` | Ollama unreachable | Circuit breaker opens after 5 consecutive failures, half-opens after 30 s. Check the backend / `OLLAMA_BASE_URL`. |
| `403` for a model you "know" exists | Model not in the tenant's effective set, **or** discovery cache empty/expired | Check `list-models --tenant …`; verify the backend is reachable and the model is installed. Empty discovery = deny by design. |
| `429` with `Retry-After` | Rate limit or budget exhausted | Inspect headers (`X-RateLimit-*`, `X-Budget-*`); raise limits/budget or wait. |
| `401` immediately after revoke | Working as intended | Revocation propagated via NOTIFY → Redis eviction. |

`/readyz` returns `503` when **any** dependency (DB, Redis, Ollama) is unreachable; use it as
the load-balancer health gate. `/healthz` only checks process liveness.
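
For the circuit-breaker row in the table above, a minimal Python sketch of the state machine may help when reading `502` bursts in the logs. The thresholds (5 consecutive failures, 30 s) come from the table; the class itself, its method names, and the simplification that every request passes while half-open are assumptions, not the gateway's implementation.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        # closed: pass; half-open: let probes through; open: reject up front.
        # This sketch does not limit how many probes pass while half-open.
        return self.state != "open"

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self.state == "half-open" or self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)open and restart the timer


now = [0.0]  # fake clock, in seconds
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(4):
    breaker.record_failure()
state_after_4 = breaker.state      # still closed
breaker.record_failure()
state_after_5 = breaker.state      # fifth consecutive failure opens it
now[0] = 30.0
state_after_wait = breaker.state   # half-open after 30 s
breaker.record_failure()
state_after_probe_fail = breaker.state
now[0] = 60.0
breaker.record_success()
state_after_recovery = breaker.state
```

A failed probe reopens the breaker and restarts the 30 s timer, while a successful one closes it and clears the failure count.
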

---

## Logs, metrics, audit

- **Logs:** structured (structlog), JSON in production, to stdout. Keys/secrets are never
  logged.
- **Metrics:** Prometheus at `/metrics` (loopback only): `gateway_requests_total`,
  `gateway_tokens_total`, `gateway_request_duration_seconds`, labelled by `tenant` and
  `model` (never `key_id`).
- **Audit log:** always-on in `gateway.audit_log`. **Prompt log** is opt-in per key and TTL'd
  (`PROMPT_LOG_DEFAULT_RETENTION_DAYS`); a sweeper enforces retention.