demo + playground + docs

One-command demo so the gateway can be exercised end-to-end without a GPU or a real model download:

- demo/mock-ollama/: tiny FastAPI service emulating Ollama (/api/tags, /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count and eval_count on the final frame, /api/embed, /api/show, /api/version). Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml: postgres + redis + mock-ollama + gateway, with PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground. Mirrors the prod posture (mock-ollama not exposed).
- demo.sh: brings the stack up, waits on /healthz, creates a demo tenant with allow_all_models and a fresh API key via the bootstrap CLI inside the container, then prints the key, the playground URL, and five ready-to-paste curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull). ./demo.sh --down tears everything back down with volumes.
- playground/index.html: single-file dark-themed UI served same-origin by the gateway at /playground (CORS-free). Per-endpoint About card with method/auth/streaming badges, a real description, sample request body, sample response, and a footer note. Live SSE/NDJSON rendering of the response. A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh are visibly gated until an API key is in the field; the Base URL is force-pinned to location.origin three times to defeat browser autofill.
- docs/: API.md (full endpoint reference with curl, streaming formats, error model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule, pointing at a real Ollama backend, env reference), THREAT_MODEL.md (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md. mkdocs.yml (Material theme) wires them together.
2026-05-26 20:52:33 +02:00
parent 844b02aade
commit b47a09db91
13 changed files with 2501 additions and 0 deletions
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,168 @@
+# neuronetz-gateway — Architecture
+
+Distilled from [`scope-docs/SPEC.md`](../scope-docs/SPEC.md) §4. The SPEC is the source of truth.
+
+The gateway is the **hot path** of the Neuronetz API: a secure, multi-tenant proxy in front
+of an Ollama instance. The Ollama backend must never be reachable directly from the public
+internet — all access flows through this gateway. Administration (dashboards, tenant
+self-service) lives in a separate service, `neuronetz-console`, and is out of scope here.
+
+---
+
+## Component diagram (SPEC §4.1)
+
+```
+                          Internet
+                              │ TLS
+                              ▼
+                  ┌──────────────────────┐
+                  │ Caddy (sidecar)      │  Let's Encrypt for api.neuronetz.ai
+                  │ - TLS termination    │  HSTS, security headers
+                  │ - HTTP/2, HTTP/3     │
+                  └──────────┬───────────┘
+                             │ HTTP/1.1 internal
+                  ┌──────────▼───────────┐
+                  │ neuronetz-gateway    │  FastAPI + uvicorn
+                  │  - authn             │
+                  │  - rate limit        │
+                  │  - budget check      │
+                  │  - proxy + stream    │
+                  │  - token count       │
+                  │  - audit write       │
+                  └──┬────────┬──────┬───┘
+                     │        │      │
+              ┌──────▼──┐  ┌──▼───┐  │
+              │Postgres │  │Redis │  │
+              │ schema: │  │ keys │  │
+              │ gateway │  │bucket│  │
+              └─────────┘  └──────┘  │
+                                     │ internal network only
+                              ┌──────▼──────┐
+                              │   Ollama    │
+                              │ 127.0.0.1   │
+                              └─────────────┘
+
+Same Compose stack also hosts (separate from this SPEC):
+  - neuronetz-console (PHP/Nibiru) → reads schema `console`, reads schema `gateway` (SELECT)
+```
+
+Only **Caddy** publishes ports. Postgres, Redis and (critically) **Ollama** have no
+published ports and are reachable only on the internal Docker network.
+
+---
+
+## Database schemas (SPEC §4.2)
+
+A single Postgres instance with two schemas:
+
+- **`gateway`** — owned by this service; full DDL. Tables: `tenants`, `tenant_limits`,
+  `api_keys`, `key_limits`, `budget_usage`, `audit_log`, `prompt_log`, `revocations`
+  (see SPEC §5 for the full DDL).
+- **`console`** — owned by `neuronetz-console` (out of scope). The console role gets
+  `SELECT` on all `gateway.*` tables and `INSERT` on `gateway.revocations` only.
+
+If the console needs to mutate gateway state (e.g. revoke a key), it does so by inserting
+into the `gateway.revocations` **outbox** table, which the gateway tails (see Revocation below).
+
+**Limit inheritance:** limits and budgets resolve key → tenant. A `NULL` key-level value
+inherits the tenant value. For `allow_all_models`, a non-`NULL` key value overrides the
+tenant flag; otherwise the tenant flag applies (SPEC §13.7).
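A minimal sketch of the key → tenant inheritance rule described above, for illustration only. The dataclasses and the `rpm` field are hypothetical stand-ins for rows in `key_limits` and `tenant_limits`; only `allow_all_models` is a name taken from the document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TenantLimits:
    # Hypothetical shape; the real columns are defined in SPEC §5.
    rpm: int
    allow_all_models: bool


@dataclass
class KeyLimits:
    # None models a NULL column, meaning "inherit from the tenant".
    rpm: Optional[int] = None
    allow_all_models: Optional[bool] = None


def effective_rpm(key: KeyLimits, tenant: TenantLimits) -> int:
    """Key-level value wins when set; NULL falls back to the tenant."""
    return tenant.rpm if key.rpm is None else key.rpm


def effective_allow_all(key: KeyLimits, tenant: TenantLimits) -> bool:
    # Explicit None check: a key-level False must still override a
    # tenant-level True, so truthiness ("key.x or tenant.x") would be wrong.
    if key.allow_all_models is None:
        return tenant.allow_all_models
    return key.allow_all_models
```

The `is None` test rather than `or` is the point of the sketch: a key that explicitly sets `allow_all_models = False` (or a limit of `0`) should not silently inherit the tenant value.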
+
+---
+
+## Request lifecycle (SPEC §4.3)
+
+1. Caddy terminates TLS and forwards to the gateway on the internal port.
+2. Middleware extracts `Authorization: Bearer <key>`.
+3. The 12-char prefix is the Redis cache key. On miss, look up `gateway.api_keys` by prefix,
+   verify the full key with argon2id, and cache resolved metadata in Redis (TTL 60 s).
+4. **Rate limit** check — sliding window in Redis (Lua-atomic): per-key RPM + per-tenant RPM.
+5. **Budget** check — Redis counter for the current period; Postgres ledger is the source of
+   truth on reset.
+6. **Concurrency** semaphore — Redis `INCR` with TTL.
+7. **Model allowlist** check — resolve the effective set (see below); the request `model`
+   must be in it, else a generic `403`.
+8. **Endpoint allowlist** check — mutating endpoints are hard-blocked.
+9. **Body validation** — size, schema, `num_predict` cap.
+10. If an OpenAI-compat path, translate the request to the Ollama schema.
+11. Open an httpx async stream to Ollama.
+12. Stream the response back to the client, accumulating the final `prompt_eval_count` +
+    `eval_count`.
+13. On stream close: write the `gateway.audit_log` row; decrement the budget; release the
+    semaphore; if prompt logging is enabled, write `gateway.prompt_log`.
+14. On any failure: sanitized error to the client, audit row with the status code, semaphore
+    released.
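Step 4 can be illustrated with an in-memory sliding-window log. This is only a sketch: the document states that the real check runs atomically in Redis via Lua, and the class name, the 60-second window, and the choice that a denied request consumes no quota are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """In-memory stand-in for the Redis/Lua sliding window (illustrative)."""

    def __init__(self, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_s = window_s
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def _count(self, name: str, now: float) -> int:
        hits = self._hits[name]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window_s:
            hits.popleft()
        return len(hits)

    def allow(self, key_id: str, key_rpm: int,
              tenant_id: str, tenant_rpm: int) -> bool:
        now = self.clock()
        k, t = f"key:{key_id}", f"tenant:{tenant_id}"
        # Check both limits before recording anything, so a denied
        # request does not count against either window (an assumption).
        if self._count(k, now) >= key_rpm or self._count(t, now) >= tenant_rpm:
            return False
        self._hits[k].append(now)
        self._hits[t].append(now)
        return True
```

In Redis the same check-then-record sequence has to be a single script, otherwise two concurrent requests could both pass the check before either is recorded.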
+
+**Streaming integrity:** token counting and the audit write happen **after** stream close,
+never on the hot path — time-to-first-byte is not degraded by bookkeeping (SPEC §9).
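As a rough illustration of steps 12 and 13, the generator below forwards each NDJSON line to the client first and only then inspects it, remembering the counters from the final frame. `UsageTally` and `relay_ndjson` are hypothetical names; the `done`, `prompt_eval_count`, and `eval_count` fields are the ones Ollama puts on its last streaming frame.

```python
import json
from typing import Iterable, Iterator


class UsageTally:
    """Holds the token counts read from the final frame (hypothetical)."""

    def __init__(self) -> None:
        self.prompt_tokens = 0
        self.completion_tokens = 0


def relay_ndjson(lines: Iterable[bytes], tally: UsageTally) -> Iterator[bytes]:
    for line in lines:
        # Forward the bytes untouched before doing any bookkeeping.
        yield line
        try:
            frame = json.loads(line)
        except ValueError:
            # Not valid JSON (blank or partial line): nothing to count.
            continue
        if isinstance(frame, dict) and frame.get("done"):
            tally.prompt_tokens = int(frame.get("prompt_eval_count") or 0)
            tally.completion_tokens = int(frame.get("eval_count") or 0)
```

Once the generator is exhausted, the caller would use `tally` for the audit row and the budget decrement, which keeps those writes off the path that determines time-to-first-byte.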
+
+---
+
+## Model discovery (SPEC §4.6)
+
+The set of usable models is **never hand-maintained**; it is extracted live from Ollama.
+
+- A background task (started in the app lifespan, alongside the revocation listener) polls
+  Ollama `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds.
+- The parsed set (names + sanitized metadata: family, parameter size, quantization, size,
+  modified-at) is cached in Redis under `gateway:models:discovered` with TTL
+  `MODEL_DISCOVERY_CACHE_TTL_S`, and held in-process for hot reads on the request path.
+- An initial fetch runs at startup; if Ollama is unreachable the discovered set is empty.
+- **Fail-closed:** an empty or expired-and-unrefreshable discovered set means *no model
+  resolves* and requests are denied. Discovery never opens access on failure.
+- **Auto-grant:** because the effective set intersects with `discovered` (or *is*
+  `discovered` when `allow_all_models`), a model pulled into Ollama out-of-band becomes
+  usable to `allow_all` tenants on the next refresh — no per-tenant config change.
+- Discovery is **read-only** against Ollama and uses only the allowlisted `/api/tags`
+  endpoint; it never triggers a model pull.
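The fail-closed behavior of discovery can be sketched as a small in-process holder that returns an empty set whenever its data is missing or stale. `parse_tags` and `DiscoveredModels` are hypothetical names, and the in-process TTL here merely stands in for `MODEL_DISCOVERY_CACHE_TTL_S`; the payload fields follow the shape of Ollama's `/api/tags` response.

```python
import time
from typing import Any, Callable, Dict, FrozenSet


def parse_tags(payload: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Keep model names plus a whitelisted subset of their metadata."""
    models: Dict[str, Dict[str, Any]] = {}
    for entry in payload.get("models") or []:
        name = entry.get("name")
        if not isinstance(name, str) or not name:
            continue  # skip malformed entries instead of guessing
        details = entry.get("details") or {}
        models[name] = {
            "family": details.get("family"),
            "parameter_size": details.get("parameter_size"),
            "quantization": details.get("quantization_level"),
            "size": entry.get("size"),
            "modified_at": entry.get("modified_at"),
        }
    return models


class DiscoveredModels:
    def __init__(self, ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_s = ttl_s
        self.clock = clock
        self._models: Dict[str, Dict[str, Any]] = {}
        self._expires_at = float("-inf")  # nothing discovered yet

    def refresh(self, payload: Dict[str, Any]) -> None:
        self._models = parse_tags(payload)
        self._expires_at = self.clock() + self.ttl_s

    def names(self) -> FrozenSet[str]:
        # Fail closed: stale or never-fetched data resolves to no models.
        if self.clock() >= self._expires_at:
            return frozenset()
        return frozenset(self._models)
```

Because `names()` never falls back to a previous result after expiry, a long Ollama outage shrinks the usable set to nothing rather than leaving an old listing in force.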
+
+### Effective-set resolution (SPEC §4.3 step 7)
+
+```
+allow_all := key.allow_all_models ?? tenant.allow_all_models
+effective := discovered                                          if allow_all
+             (key.allowed_models ?? tenant.allowed_models) ∩ discovered   otherwise
+```
+
+`/api/tags` and `/v1/models` return exactly this effective set, so the listing never reveals
+models outside the tenant's reach. A model that is installed-but-unpermitted and one that is
+not installed both return the same generic `403` — no existence disclosure (SPEC §13.6).
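The resolution rule above translates fairly directly into Python. This is a sketch under the assumption that `NULL` columns arrive as `None`; the function names and the integer status return are illustrative, not the gateway's actual interface.

```python
from typing import FrozenSet, Iterable, Optional


def effective_models(
    discovered: Iterable[str],
    tenant_allow_all: bool,
    tenant_allowed: Optional[Iterable[str]],
    key_allow_all: Optional[bool] = None,
    key_allowed: Optional[Iterable[str]] = None,
) -> FrozenSet[str]:
    found = frozenset(discovered)
    # "??" in the pseudocode: use the key value unless it is NULL/None.
    allow_all = tenant_allow_all if key_allow_all is None else key_allow_all
    if allow_all:
        return found
    allowed = tenant_allowed if key_allowed is None else key_allowed
    # Intersect with discovery so an allowlisted but uninstalled model
    # never resolves.
    return frozenset(allowed or ()) & found


def authorize(model: str, effective: FrozenSet[str]) -> int:
    # Same generic 403 whether the model is unpermitted or not installed.
    return 200 if model in effective else 403
```

An empty `discovered` set yields an empty effective set in both branches, which is how the fail-closed rule from the discovery section carries through to authorization.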
+
+---
+
+## Failure modes — fail-closed (SPEC §4.4)
+
+| Subsystem | If down | Behavior |
+|---|---|---|
+| Postgres (read) | Key lookup fails | `503` with retry-after; nothing proxied. |
+| Postgres (write) | Audit write fails | Request still succeeds; the audit row is buffered in an in-memory ring (max 1000 rows) and drained on recovery; if the buffer fills, the gateway switches to deny mode. |
+| Redis | Rate limit / budget unavailable | `503` — fail closed. Never "allow because we can't check." |
+| Ollama | Upstream unreachable | `502` with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30 s. |
+| Caddy | Not a gateway concern | — |
+
+The governing rule (AGENT_PROMPT non-negotiable #1): **if a security or budgeting check
+cannot be performed, deny.** Never default to allow.
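The Ollama row of the table describes a conventional circuit breaker. The sketch below uses the thresholds from the table (5 consecutive failures, half-open after 30 s); the class and method names are hypothetical, and letting every request through while half-open, rather than a single probe, is a simplification.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: refuse until the cooldown has elapsed, then act as
        # half-open and let a probe request through.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        # Any success closes the breaker and resets the streak.
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            # Open the breaker, or re-open it after a failed probe,
            # restarting the cooldown.
            self.opened_at = self.clock()
```

While `allow_request()` returns `False` the gateway would answer `502` with retry-after without contacting Ollama at all.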
+
+---
+
+## Cache invalidation / key revocation (SPEC §4.5)
+
+The console revokes a key by inserting into `gateway.revocations(key_id, ts, reason)`.
+A background task in the gateway lifespan:
+
+- `LISTEN`s on the Postgres channel `key_revoked`. The gateway issues `NOTIFY` directly on
+  its own write path; for the console's `INSERT`, a trigger on `gateway.revocations` issues it.
+- On notification, evicts the Redis cache entry for that key's prefix.
+
+This makes revocation effectively immediate (one `NOTIFY` delivery plus one Redis round trip) with no cross-service HTTP.
+
+---
+
+## Observability
+
+- **Structured logs** (structlog), JSON in production. Secrets/keys are never logged.
+- **Prometheus** `/metrics` (loopback only): `gateway_requests_total{tenant,model,status}`,
+  `gateway_tokens_total{tenant,model,direction}`,
+  `gateway_request_duration_seconds{tenant,model}` (histogram). Labelled by `tenant`, never
+  by `key_id` (cardinality — SPEC §13.3); per-key data lives in Postgres.
+- **Audit log** — always-on request metadata. **Prompt log** — opt-in per key, TTL'd.