demo + playground + docs

One-command demo so the gateway can be exercised end-to-end without a GPU or a
real model download:

- demo/mock-ollama/ — tiny FastAPI service emulating Ollama (/api/tags,
  /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count
  and eval_count on the final frame, /api/embed, /api/show, /api/version).
  Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml — postgres + redis + mock-ollama + gateway, with
  PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground.
  Mirrors the prod posture (mock-ollama not exposed).
- demo.sh — brings the stack up, waits on /healthz, creates a demo tenant with
  allow_all_models and a fresh API key via the bootstrap CLI inside the
  container, then prints the key, the playground URL, and five ready-to-paste
  curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull).
  ./demo.sh --down tears everything back down with volumes.
- playground/index.html — single-file dark-themed UI served same-origin by
  the gateway at /playground (CORS-free). Per-endpoint About card with method/
  auth/streaming badges, a real description, sample request body, sample
  response, and a footer note. Live SSE/NDJSON rendering of the response.
  A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh
  are visibly gated until an API key is in the field; the Base URL is
  force-pinned to location.origin three times to defeat browser autofill.
- docs/ — API.md (full endpoint reference with curl, streaming formats, error
  model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery
  + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule,
  pointing at a real Ollama backend, env reference), THREAT_MODEL.md
  (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md
  (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md.
  mkdocs.yml (Material theme) wires them together.
This commit is contained in:
Stephan Berbig
2026-05-26 20:52:33 +02:00
parent 844b02aade
commit b47a09db91
13 changed files with 2501 additions and 0 deletions

docs/API.md Normal file

@@ -0,0 +1,253 @@
# neuronetz-gateway — API Reference
The gateway exposes two compatible API surfaces in front of the Ollama backend:
- **Native Ollama** under `/api/*` — NDJSON streaming, identical request shapes to Ollama.
- **OpenAI-compatible** under `/v1/*` — SSE streaming, drop-in for the OpenAI SDKs.
Plus unauthenticated health endpoints. Everything else is blocked.
> Source of truth: [`scope-docs/SPEC.md`](../scope-docs/SPEC.md) §6. Where this doc and the
> SPEC disagree, the SPEC wins.
---
## Authentication
Every model endpoint requires an API key as a Bearer token:
```
Authorization: Bearer nz_<12-char-prefix><32-char-random>
```
- **Key format:** `nz_` namespace + random base62 body. The first 12 characters
(`nz_` + entropy) are the **prefix**, stored in cleartext and indexed for O(1) lookup.
The full key is **argon2id**-hashed; it is shown **exactly once** at creation
(`neuronetz-gateway create-key`) and never stored or logged.
- **Fail-closed:** a missing, malformed, expired, disabled, or revoked key returns **401**.
No upstream/Ollama detail is ever leaked in the error.
- Health endpoints (`/healthz`, `/readyz`) require **no** auth.
The placeholder key `nz_demoKEY...` is used throughout this doc. `./demo.sh` prints a
**real** key for the local demo.
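
The prefix lookup described above can be sketched on the client or server side as a simple split. This is a minimal illustration, not the gateway's code: it assumes the prefix is the first 12 characters of the key *including* `nz_` (as the key-format bullet states), and `split_key` and `PREFIX_LEN` are hypothetical names.

```python
PREFIX_LEN = 12  # assumption: "nz_" plus the first 9 characters of the random body


def split_key(authorization: str) -> tuple[str, str]:
    """Return (prefix, full_key) from an Authorization header value.

    Raises ValueError for anything that is not a well-formed Bearer nz_ key;
    a gateway would map that case to a generic 401.
    """
    scheme, _, key = authorization.partition(" ")
    if scheme.lower() != "bearer" or not key.startswith("nz_") or len(key) <= PREFIX_LEN:
        raise ValueError("malformed key")
    # The prefix is the cleartext lookup handle; the full key is only ever hash-verified.
    return key[:PREFIX_LEN], key
```

Only the prefix would ever be used as a cache or index key; the full key is passed to the argon2id verifier and then discarded.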
---
## Response headers (SPEC §6.5)
Every proxied response carries:
| Header | Meaning |
|---|---|
| `X-Request-ID` | Correlates the response with the audit log row. Present on errors too. |
| `X-RateLimit-Limit-Requests` | Effective RPM limit for this key/tenant. |
| `X-RateLimit-Remaining-Requests` | Requests remaining in the current window. |
| `X-RateLimit-Limit-Tokens` | Effective TPM limit. |
| `X-RateLimit-Remaining-Tokens` | Tokens remaining in the current window. |
| `X-Budget-Period` | `day` \| `month` \| `total` — the binding budget period. |
| `X-Budget-Tokens-Remaining` | Tokens left in the binding budget period. |
`429 Too Many Requests` responses additionally carry `Retry-After: <seconds>`.
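
A client can use these headers to pace itself. The sketch below is an assumption-laden example rather than part of the gateway: `read_quota`, `retry_delay`, and the one-second fallback are hypothetical, and header values are assumed to be plain integers.

```python
from typing import Optional


def read_quota(headers: dict[str, str]) -> dict[str, Optional[int]]:
    """Pull the remaining-quota counters out of a response's headers."""
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_int(name: str) -> Optional[int]:
        raw = lowered.get(name.lower())
        try:
            return int(raw) if raw is not None else None
        except ValueError:
            return None  # header present but not an integer

    return {
        "requests": as_int("X-RateLimit-Remaining-Requests"),
        "tokens": as_int("X-RateLimit-Remaining-Tokens"),
        "budget_tokens": as_int("X-Budget-Tokens-Remaining"),
    }


def retry_delay(status: int, headers: dict[str, str], fallback_s: float = 1.0) -> float:
    """Seconds to pause before retrying; 0.0 when the response is not a 429."""
    if status != 429:
        return 0.0
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        return max(0.0, float(lowered["retry-after"]))
    except (KeyError, ValueError):
        return fallback_s  # assumed client-side default, not a gateway value
```

Header names are matched case-insensitively because HTTP clients differ in how they normalize them.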
---
## Error model
Errors are **sanitized** at the gateway boundary — Ollama internals are never reflected.
The body is a small generic JSON object and the `X-Request-ID` header ties it to the audit log.
```json
{ "error": { "message": "forbidden", "type": "forbidden", "code": 403 }, "request_id": "b3f1…" }
```
| Status | When |
|---|---|
| `400` | Malformed body, schema violation, or `num_predict` over the cap. |
| `401` | Missing / invalid / expired / revoked key. |
| `403` | Endpoint hard-blocked, or model outside the tenant's effective set (no existence disclosure). |
| `413` | Request body over `MAX_REQUEST_BODY_BYTES` (default 256 KiB). |
| `429` | Rate limit or budget exceeded (carries `Retry-After`). |
| `502` | Ollama upstream unreachable / circuit breaker open. |
| `503` | A required subsystem (Postgres read, Redis) is down — **fail-closed**, never "allow". |
A model that is *installed-but-unpermitted* and a model that is *not installed* return the
**same** generic `403`, to prevent enumeration (SPEC §13.6).
---
## Native Ollama endpoints (`/api/*`)
### `POST /api/chat`
Streamed (NDJSON, default) or non-streamed chat completion.
```bash
curl -N http://localhost:8080/api/chat \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"llama3.1:8b","stream":true,
"messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```
**Streaming response** (`Content-Type: application/x-ndjson`), one JSON object per line:
```
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":"Echo:"},"done":false}
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":" Say"},"done":false}
{"model":"llama3.1:8b","done":true,"done_reason":"stop",
"prompt_eval_count":6,"eval_count":7,"total_duration":1234567890,"eval_duration":34567890}
```
The **final** object carries `prompt_eval_count` (tokens in) and `eval_count` (tokens out);
the gateway uses these for precise token accounting (SPEC §4.3 step 12).
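
Consuming the NDJSON stream shown above might look like the following sketch. It works on any iterable of text lines (for example the line iterator of an HTTP client); `collect_ndjson` is a hypothetical helper name, and missing counters are assumed to default to zero.

```python
import json
from typing import Iterable


def collect_ndjson(lines: Iterable[str]) -> tuple[str, int, int]:
    """Join streamed message content and read token counts from the final object."""
    parts: list[str] = []
    tokens_in = tokens_out = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank keep-alive lines
        frame = json.loads(line)
        parts.append(frame.get("message", {}).get("content", ""))
        if frame.get("done"):
            # Only the final object carries the token counters.
            tokens_in = frame.get("prompt_eval_count", 0)
            tokens_out = frame.get("eval_count", 0)
            break
    return "".join(parts), tokens_in, tokens_out
```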
**Non-streaming** (`"stream": false`) returns a single JSON object of the same shape with
`"done": true`.
### `POST /api/generate`
Same semantics as `/api/chat` but uses a flat `prompt` string and returns `response`
fields instead of `message` objects.
```bash
curl -N http://localhost:8080/api/generate \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"llama3.1:8b","stream":true,"prompt":"Write a haiku about routers."}'
```
### `POST /api/embed` / `POST /api/embeddings`
Non-streamed embeddings. `/api/embed` is the newer batch endpoint (field `embeddings`,
a list of vectors); `/api/embeddings` is the legacy single-vector endpoint (field
`embedding`). Ollama returns no `eval_count` for embeddings; cost is charged on
`prompt_eval_count` only (SPEC §13.1).
```bash
curl http://localhost:8080/api/embed \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"nomic-embed-text","input":["hello","world"]}'
```
```json
{ "model": "nomic-embed-text", "embeddings": [[0.0, 0.1, …], [0.0, 0.1, …]], "prompt_eval_count": 2 }
```
### `GET /api/tags`
Returns the tenant's **effective** model set — the live-discovered set intersected with the
tenant's allowlist, or *all* discovered models when `allow_all_models` is on. Sourced from
discovery (SPEC §4.6), never a static list.
```bash
curl http://localhost:8080/api/tags -H "Authorization: Bearer nz_demoKEY..."
```
### `POST /api/show`
Allowed only for models in the effective set; returns **sanitized** model info.
The system prompt and template that Ollama returns are **stripped** by the gateway.
### `GET /api/version`
Returns the **gateway** version, not the Ollama version.
```json
{ "version": "0.1.0" }
```
---
## Hard-blocked endpoints (always `403`)
These model-mutating endpoints are blocked at the gateway. **Not configurable, not behind a
flag** (SPEC §6.2, AGENT_PROMPT non-negotiable #5):
```
/api/pull /api/push /api/create /api/copy /api/delete /api/blobs/*
```
```bash
# Always 403, even with a valid key:
curl -i http://localhost:8080/api/pull \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" -d '{"model":"llama3.1:8b"}'
```
`GET /api/ps` is also blocked (it would leak which models are loaded).
---
## OpenAI-compatible endpoints (`/v1/*`)
| Path | Method | Maps to |
|---|---|---|
| `/v1/chat/completions` | POST | `/api/chat` |
| `/v1/completions` | POST | `/api/generate` |
| `/v1/embeddings` | POST | `/api/embed` |
| `/v1/models` | GET | `/api/tags` (effective set, OpenAI list format) |
Streaming uses **SSE**: `data: {…}\n\n` events terminated by a literal `data: [DONE]\n\n`.
### `POST /v1/chat/completions`
```bash
curl -N http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"llama3.1:8b","stream":true,
"messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```
**Streaming response** (`Content-Type: text/event-stream`):
```
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Echo:"},"finish_reason":null}]}
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" Say"},"finish_reason":null}]}
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":6,"completion_tokens":7,"total_tokens":13}}
data: [DONE]
```
Works with the OpenAI Python SDK by pointing `base_url` at `http://localhost:8080/v1`.
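
For clients that do not use an SDK, the SSE framing above can be parsed by hand. This is a minimal sketch that handles only single-line `data:` events as shown in the sample; `iter_sse_chunks` and `collect_text` are hypothetical helper names.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each data: event until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> tuple[str, Optional[dict]]:
    """Concatenate delta content and return it with the last usage block seen."""
    text: list[str] = []
    usage: Optional[dict] = None
    for chunk in iter_sse_chunks(lines):
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content", ""))
        usage = chunk.get("usage", usage)
    return "".join(text), usage
```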
### `GET /v1/models`
```bash
curl http://localhost:8080/v1/models -H "Authorization: Bearer nz_demoKEY..."
```
```json
{ "object": "list", "data": [
{ "id": "llama3.1:8b", "object": "model", "owned_by": "neuronetz" },
{ "id": "mistral:7b", "object": "model", "owned_by": "neuronetz" }
] }
```
---
## Health endpoints
| Path | Method | Auth | Purpose |
|---|---|---|---|
| `/healthz` | GET | none | Liveness — process responsive (`200`). |
| `/readyz` | GET | none | Readiness — DB + Redis + Ollama reachable, else `503`. |
| `/metrics` | GET | none (loopback only) | Prometheus exposition. |
```bash
curl -i http://localhost:8080/healthz # 200 {"status":"ok"}
curl -i http://localhost:8080/readyz # 200 when all deps up; 503 otherwise
```
---
## Quick reference: streaming formats
| Surface | Content-Type | Frame | Terminator |
|---|---|---|---|
| Native `/api/*` | `application/x-ndjson` | one JSON object per `\n` | final object has `"done": true` |
| OpenAI `/v1/*` | `text/event-stream` | `data: {…}\n\n` | `data: [DONE]\n\n` |

docs/ARCHITECTURE.md Normal file

@@ -0,0 +1,168 @@
# neuronetz-gateway — Architecture
Distilled from [`scope-docs/SPEC.md`](../scope-docs/SPEC.md) §4. The SPEC is the source of truth.
The gateway is the **hot path** of the Neuronetz API: a secure, multi-tenant proxy in front
of an Ollama instance. The Ollama backend must never be reachable directly from the public
internet — all access flows through this gateway. Administration (dashboards, tenant
self-service) lives in a separate service, `neuronetz-console`, and is out of scope here.
---
## Component diagram (SPEC §4.1)
```
                   Internet
                       │ TLS
            ┌──────────────────────┐
            │ Caddy (sidecar)      │  Let's Encrypt for api.neuronetz.ai
            │ - TLS termination    │  HSTS, security headers
            │ - HTTP/2, HTTP/3     │
            └──────────┬───────────┘
                       │ HTTP/1.1 internal
            ┌──────────▼───────────┐
            │ neuronetz-gateway    │  FastAPI + uvicorn
            │ - authn              │
            │ - rate limit         │
            │ - budget check       │
            │ - proxy + stream     │
            │ - token count        │
            │ - audit write        │
            └──┬────────┬──────┬───┘
               │        │      │
        ┌──────▼──┐  ┌──▼───┐  │
        │Postgres │  │Redis │  │
        │ schema: │  │ keys │  │
        │ gateway │  │bucket│  │
        └─────────┘  └──────┘  │
                               │ internal network only
                        ┌──────▼──────┐
                        │   Ollama    │
                        │  127.0.0.1  │
                        └─────────────┘
Same Compose stack also hosts (separate from this SPEC):
- neuronetz-console (PHP/Nibiru) → reads schema `console`, reads schema `gateway` (SELECT)
```
Only **Caddy** publishes ports. Postgres, Redis and (critically) **Ollama** have no
published ports and are reachable only on the internal Docker network.
---
## Database schemas (SPEC §4.2)
A single Postgres instance with two schemas:
- **`gateway`** — owned by this service; full DDL. Tables: `tenants`, `tenant_limits`,
`api_keys`, `key_limits`, `budget_usage`, `audit_log`, `prompt_log`, `revocations`
(see SPEC §5 for the full DDL).
- **`console`** — owned by `neuronetz-console` (out of scope). The console role gets
`SELECT` on all `gateway.*` tables and `INSERT` on `gateway.revocations` only.
If the console needs to mutate gateway state (e.g. revoke a key), it does so by inserting
into the `gateway.revocations` **outbox** table, which the gateway tails (see Revocation below).
**Limit inheritance:** limits and budgets resolve key → tenant. A `NULL` key-level value
inherits the tenant value. For `allow_all_models`, a non-`NULL` key value overrides the
tenant flag; otherwise the tenant flag applies (SPEC §13.7).
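
The inheritance rule can be sketched as a single "key value unless NULL" helper. The field names `rpm`, `tpm`, and `concurrent` below mirror the CLI flags but are assumptions about the column names; `inherit` and `resolve_limits` are hypothetical.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def inherit(key_value: Optional[T], tenant_value: T) -> T:
    """Key-level value wins unless it is NULL (None); then the tenant value applies."""
    return tenant_value if key_value is None else key_value


def resolve_limits(key_limits: dict, tenant_limits: dict) -> dict:
    """Resolve every tenant-level field against the matching key-level override."""
    return {name: inherit(key_limits.get(name), value) for name, value in tenant_limits.items()}
```

The check is `is None` rather than truthiness, so a key-level `False` for `allow_all_models` still overrides a tenant-level `True`.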
---
## Request lifecycle (SPEC §4.3)
1. Caddy terminates TLS and forwards to the gateway on the internal port.
2. Middleware extracts `Authorization: Bearer <key>`.
3. The 12-char prefix is the Redis cache key. On miss, look up `gateway.api_keys` by prefix,
verify the full key with argon2id, and cache resolved metadata in Redis (TTL 60 s).
4. **Rate limit** check — sliding window in Redis (Lua-atomic): per-key RPM + per-tenant RPM.
5. **Budget** check — Redis counter for the current period; Postgres ledger is the source of
truth on reset.
6. **Concurrency** semaphore — Redis `INCR` with TTL.
7. **Model allowlist** check — resolve the effective set (see below); the request `model`
must be in it, else a generic `403`.
8. **Endpoint allowlist** check — mutating endpoints are hard-blocked.
9. **Body validation** — size, schema, `num_predict` cap.
10. If an OpenAI-compat path, translate the request to the Ollama schema.
11. Open an httpx async stream to Ollama.
12. Stream the response back to the client, accumulating the final `prompt_eval_count` +
`eval_count`.
13. On stream close: write the `gateway.audit_log` row; decrement the budget; release the
semaphore; if prompt logging is enabled, write `gateway.prompt_log`.
14. On any failure: sanitized error to the client, audit row with the status code, semaphore
released.
**Streaming integrity:** token counting and the audit write happen **after** stream close,
never on the hot path — time-to-first-byte is not degraded by bookkeeping (SPEC §9).
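
To make step 4 concrete, here is a sliding-window limiter reduced to a single in-process structure. It is a stand-in for illustration only: the gateway's version runs atomically in Redis via Lua, whereas this sketch keeps timestamps in a local deque, and the `SlidingWindow` class is hypothetical.

```python
from collections import deque


class SlidingWindow:
    """Allow at most `limit` hits within any trailing `window_s` seconds."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have slid out of the window.
        while self._hits and self._hits[0] <= now - self.window_s:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            return False  # rejected requests are not recorded
        self._hits.append(now)
        return True
```

A per-key and a per-tenant instance would both have to allow the request for it to proceed.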
---
## Model discovery (SPEC §4.6)
The set of usable models is **never hand-maintained**; it is extracted live from Ollama.
- A background task (started in the app lifespan, alongside the revocation listener) polls
Ollama `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds.
- The parsed set (names + sanitized metadata: family, parameter size, quantization, size,
modified-at) is cached in Redis under `gateway:models:discovered` with TTL
`MODEL_DISCOVERY_CACHE_TTL_S`, and held in-process for hot reads on the request path.
- An initial fetch runs at startup; if Ollama is unreachable the discovered set is empty.
- **Fail-closed:** an empty or expired-and-unrefreshable discovered set means *no model
resolves* and requests are denied. Discovery never opens access on failure.
- **Auto-grant:** because the effective set intersects with `discovered` (or *is*
`discovered` when `allow_all_models`), a model pulled into Ollama out-of-band becomes
usable to `allow_all` tenants on the next refresh — no per-tenant config change.
- Discovery is **read-only** against Ollama and uses only the allowlisted `/api/tags`
endpoint; it never triggers a model pull.
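
The parsing step of discovery can be sketched as below, under the assumption that the backend returns Ollama's usual `/api/tags` shape (a `models` list whose entries carry `name`, `size`, `modified_at`, and a `details` object). `parse_tags` is a hypothetical name, and returning an empty mapping on malformed input is this sketch's reading of the fail-closed rule.

```python
def parse_tags(payload: object) -> dict[str, dict]:
    """Map model name -> sanitized metadata; anything malformed yields no models."""
    if not isinstance(payload, dict) or not isinstance(payload.get("models"), list):
        return {}  # fail closed: an unreadable response discovers nothing
    discovered: dict[str, dict] = {}
    for entry in payload["models"]:
        if not isinstance(entry, dict) or not isinstance(entry.get("name"), str):
            continue  # skip entries without a usable name
        details = entry.get("details") if isinstance(entry.get("details"), dict) else {}
        # Copy only the fields listed above; everything else (e.g. digests) is dropped.
        discovered[entry["name"]] = {
            "family": details.get("family"),
            "parameter_size": details.get("parameter_size"),
            "quantization": details.get("quantization_level"),
            "size": entry.get("size"),
            "modified_at": entry.get("modified_at"),
        }
    return discovered
```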
### Effective-set resolution (SPEC §4.3 step 7)
```
allow_all := key.allow_all_models ?? tenant.allow_all_models
effective := discovered                                                  if allow_all
             (key.allowed_models ?? tenant.allowed_models) ∩ discovered  otherwise
```
`/api/tags` and `/v1/models` return exactly this effective set, so the listing never reveals
models outside the tenant's reach. A model that is installed-but-unpermitted and one that is
not installed both return the same generic `403` — no existence disclosure (SPEC §13.6).
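
The resolution rule translates almost directly into set operations. In this sketch `None` stands in for SQL `NULL`, and the function name and parameter names are hypothetical.

```python
from typing import Iterable, Optional


def effective_models(
    discovered: Iterable[str],
    key_allow_all: Optional[bool],
    tenant_allow_all: bool,
    key_allowed: Optional[Iterable[str]],
    tenant_allowed: Optional[Iterable[str]],
) -> set[str]:
    """Return the set of models a request may use; empty means deny everything."""
    live = set(discovered)
    allow_all = tenant_allow_all if key_allow_all is None else key_allow_all
    if allow_all:
        return live
    allowed = tenant_allowed if key_allowed is None else key_allowed
    # Intersecting with the live set means an empty discovery result denies all models.
    return set(allowed or ()) & live
```

Because both branches are bounded by the discovered set, a failed discovery can only shrink access, never widen it.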
---
## Failure modes — fail-closed (SPEC §4.4)
| Subsystem | If down | Behavior |
|---|---|---|
| Postgres (read) | Key lookup fails | `503` with retry-after; nothing proxied. |
| Postgres (write) | Audit write fails | Request still succeeds; audit row buffered in-memory ring (max 1000), drained on recovery; if the buffer fills, switch to deny mode. |
| Redis | Rate limit / budget unavailable | `503` — fail closed. Never "allow because we can't check." |
| Ollama | Upstream unreachable | `502` with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30 s. |
| Caddy | Not a gateway concern | — |
The governing rule (AGENT_PROMPT non-negotiable #1): **if a security or budgeting check
cannot be performed, deny.** Never default to allow.
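
The Ollama row's circuit breaker can be illustrated with a small state machine. This is a sketch under stated assumptions, not the gateway's implementation: the `CircuitBreaker` class is hypothetical, time is passed in explicitly, and "half-open" is modelled as letting requests through again once the cooldown has elapsed, with any further failure re-opening the circuit.

```python
from typing import Optional


class CircuitBreaker:
    """Open after `threshold` consecutive failures; probe again after `cooldown_s`."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows
        # Open: deny until the cooldown elapses, then let a probe through (half-open).
        return now - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # a successful call closes the circuit

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now  # (re)open and restart the cooldown
```

While `allow` returns `False`, the caller would answer `502` without contacting the upstream at all.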
---
## Cache invalidation / key revocation (SPEC §4.5)
The console revokes a key by inserting into `gateway.revocations(key_id, ts, reason)`.
A background task in the gateway lifespan:
- `LISTEN`s on the Postgres channel `key_revoked`. The `NOTIFY` has two sources: the gateway
emits it on its own write path, and an INSERT by the console fires a trigger that emits it.
- On notification, evicts the Redis cache entry for that key's prefix.
This makes revocation effectively immediate (≤ Redis RTT) with no cross-service HTTP.
---
## Observability
- **Structured logs** (structlog), JSON in production. Secrets/keys are never logged.
- **Prometheus** `/metrics` (loopback only): `gateway_requests_total{tenant,model,status}`,
`gateway_tokens_total{tenant,model,direction}`,
`gateway_request_duration_seconds{tenant,model}` (histogram). Labelled by `tenant`, never
by `key_id` (cardinality — SPEC §13.3); per-key data lives in Postgres.
- **Audit log** — always-on request metadata. **Prompt log** — opt-in per key, TTL'd.

docs/DEPLOYMENT.md Normal file

@@ -0,0 +1,188 @@
# neuronetz-gateway — Deployment
Production deployment is a single Docker Compose stack: **Caddy + gateway + Postgres + Redis
+ Ollama**. Caddy is the only public-facing component; it terminates TLS via Let's Encrypt
for `api.neuronetz.ai` and reverse-proxies to the internal-only gateway.
> For the local, no-GPU demo (mock Ollama + playground), see [`PLAYGROUND.md`](PLAYGROUND.md)
> and run `./demo.sh`. This document is the **production** path.
---
## The one rule that must never break
> ## ⛔ Ollama is NEVER exposed to the host or the internet.
>
> The `ollama` service in `docker-compose.yml` has **no `ports:` mapping** and must never
> get one. Ollama is reachable only on the internal Docker network as `ollama:11434`.
> Publishing it would re-open the exact unauthenticated exposure this whole project exists
> to close (SPEC §1, §3; AGENT_PROMPT non-negotiable #2).
The same posture applies to **Postgres** and **Redis** in the production compose file — no
published ports. Only **Caddy** binds host ports (80/443, 443/udp for HTTP/3).
---
## Prerequisites
- A host with Docker + Docker Compose.
- DNS: `api.neuronetz.ai` → the host's public IP (for Let's Encrypt).
- Ports 80 and 443 reachable from the internet (ACME HTTP/TLS challenge + serving).
---
## Steps
```bash
git clone <repo> neuronetz-gateway && cd neuronetz-gateway
# 1. Configure. Copy the example env and change EVERY secret.
cp .env.example .env
# - POSTGRES_PASSWORD: a strong, unique value
# - DATABASE_URL: must match the POSTGRES_* values
# - GATEWAY_LOG_FORMAT=json for production
# 2. Configure Caddy for your domain + ACME email.
cp ops/caddy/Caddyfile.example ops/caddy/Caddyfile # then edit the site + email
# (docker-compose.yml mounts Caddyfile.example by default; point it at your edited file
# or edit in place.)
# 3. Bring up the full stack. The gateway runs `alembic upgrade head`, then serves.
docker compose up -d --build
# 4. Bootstrap a tenant + key (CLI runs inside the gateway container).
docker compose exec gateway neuronetz-gateway create-tenant --name acme --rpm 120 --tpm 200000
docker compose exec gateway neuronetz-gateway create-key --tenant acme --name prod-server-1
# ^ prints the full key ONCE — store it in your secret manager now.
# 5. Smoke test (through Caddy / TLS).
curl https://api.neuronetz.ai/healthz
curl -N https://api.neuronetz.ai/v1/chat/completions \
-H "Authorization: Bearer nz_…" -H "Content-Type: application/json" \
-d '{"model":"llama3.1:8b","stream":true,"messages":[{"role":"user","content":"hi"}]}'
```
Caddy obtains and renews the certificate automatically. For local testing without a public
domain, use the `localhost { tls internal … }` block documented in `Caddyfile.example`
(trust Caddy's local CA or pass `-k` to curl).
---
## Pointing at a real Ollama backend
The gateway reaches Ollama via `OLLAMA_BASE_URL`. In the bundled stack this is the in-stack
`ollama` service: `OLLAMA_BASE_URL=http://ollama:11434`.
To use an **existing/external** Ollama host instead:
1. Remove the `ollama` service from `docker-compose.yml` (or leave it; it just won't be used).
2. Set `OLLAMA_BASE_URL` to the backend address reachable from the gateway container, e.g.
`http://10.0.0.5:11434` or an internal DNS name.
3. Ensure that backend is itself **not** exposed to the internet — the gateway is the only
thing that should ever reach it. Use a private network / firewall rule, not a public port.
4. Pull the models you want available on that backend. They appear in tenants' effective sets
automatically on the next discovery refresh (SPEC §4.6) — no gateway config change for
`allow_all_models` tenants.
Discovery polls `OLLAMA_BASE_URL/api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds. If the
backend is unreachable, the discovered set is empty and requests **fail closed**.
---
## Environment reference (SPEC §7)
All configuration is via environment variables, validated by Pydantic Settings on boot. Boot
**fails loudly** on invalid config. See [`.env.example`](../.env.example) for a copyable file.
### Service
| Var | Default | Notes |
|---|---|---|
| `GATEWAY_BIND_HOST` | `0.0.0.0` | Bind-all inside the container. |
| `GATEWAY_BIND_PORT` | `8080` | Internal port; never published directly in prod. |
| `GATEWAY_LOG_LEVEL` | `INFO` | |
| `GATEWAY_LOG_FORMAT` | `json` | `json` in prod, `console` for local dev. |
| `GATEWAY_REQUEST_ID_HEADER` | `X-Request-ID` | |
| `GATEWAY_TRUSTED_PROXIES` | `127.0.0.1,caddy` | Sources trusted for `X-Forwarded-For`. |
### Upstream (Ollama)
| Var | Default | Notes |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://ollama:11434` | Internal address of the backend. |
| `OLLAMA_CONNECT_TIMEOUT_S` | `5` | |
| `OLLAMA_READ_TIMEOUT_S` | `600` | Long, for slow generations. |
| `OLLAMA_MAX_CONNECTIONS` | `64` | httpx pool size. |
### Model discovery (§4.6)
| Var | Default | Notes |
|---|---|---|
| `MODEL_DISCOVERY_REFRESH_S` | `60` | How often to re-query `/api/tags`. |
| `MODEL_DISCOVERY_CACHE_TTL_S` | `120` | Redis TTL for the discovered set. |
### Database
| Var | Default | Notes |
|---|---|---|
| `DATABASE_URL` | `postgresql+asyncpg://…` | asyncpg driver. |
| `DATABASE_POOL_SIZE` | `10` | |
| `DATABASE_POOL_OVERFLOW` | `20` | |
### Redis
| Var | Default | Notes |
|---|---|---|
| `REDIS_URL` | `redis://redis:6379/0` | |
| `REDIS_KEY_CACHE_TTL_S` | `60` | Resolved-key cache TTL. |
### Limits (defaults; per-tenant/key DB overrides win)
| Var | Default | Notes |
|---|---|---|
| `DEFAULT_RPM` | `60` | |
| `DEFAULT_TPM` | `100000` | |
| `DEFAULT_CONCURRENT` | `8` | |
| `MAX_REQUEST_BODY_BYTES` | `262144` | 256 KiB request cap. |
| `MAX_NUM_PREDICT` | `4096` | Hard cap on requested completion tokens. |
### Security
| Var | Default | Notes |
|---|---|---|
| `ARGON2_TIME_COST` | `3` | |
| `ARGON2_MEMORY_COST_KIB` | `65536` | 64 MiB. |
| `ARGON2_PARALLELISM` | `4` | |
| `AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN` | `20` | Throttles auth brute-force per source IP. |
### Audit
| Var | Default | Notes |
|---|---|---|
| `AUDIT_BUFFER_SIZE` | `1000` | Ring buffer; full ⇒ deny mode. |
| `PROMPT_LOG_DEFAULT_RETENTION_DAYS` | `30` | |
| `AUDIT_LOG_DEFAULT_RETENTION_DAYS` | `365` | |
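
How `AUDIT_BUFFER_SIZE` relates to deny mode can be shown with a bounded in-memory queue. This is a simplified sketch, not the gateway's buffer: `AuditBuffer` and its methods are hypothetical, and the `write` callable stands in for the Postgres insert.

```python
from collections import deque
from typing import Callable


class AuditBuffer:
    """Hold audit rows while the database is down; a full buffer means deny mode."""

    def __init__(self, capacity: int = 1000) -> None:
        self.capacity = capacity
        self._rows = deque()

    @property
    def deny_mode(self) -> bool:
        return len(self._rows) >= self.capacity

    def add(self, row: dict) -> bool:
        """Buffer a row; False means the buffer is full and the request must be denied."""
        if self.deny_mode:
            return False
        self._rows.append(row)
        return True

    def drain(self, write: Callable[[dict], None]) -> int:
        """Flush rows oldest-first; stop at the first write error and keep the rest."""
        flushed = 0
        while self._rows:
            try:
                write(self._rows[0])
            except Exception:
                break  # database still unavailable
            self._rows.popleft()
            flushed += 1
        return flushed
```

Rows are removed only after a successful write, so a drain that fails midway loses nothing.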
---
## TLS & security headers (Caddy)
`ops/caddy/Caddyfile.example` already sets:
- **HSTS** `max-age=63072000; includeSubDomains; preload`
- `X-Content-Type-Options: nosniff`
- `X-Frame-Options: DENY`
- `Referrer-Policy: no-referrer`
- strips `Server` and `X-Powered-By`
Edit the site address and ACME `email` before deploying.
---
## Non-Compose (systemd)
A systemd unit is provided for hosts that run the image directly (`ops/systemd/`). The
gateway still requires reachable Postgres, Redis, and Ollama, and the same environment
variables. TLS in that topology is whatever fronts the host (Caddy, nginx, a load balancer) —
**Ollama still must not be publicly reachable.**
---
## Upgrades & migrations
The gateway runs `alembic upgrade head` on container start, so a normal
`docker compose up -d --build` after pulling a new version applies pending migrations. For
zero-downtime upgrades, run migrations as a one-off
(`docker compose run --rm gateway alembic upgrade head`) before rolling the service.

docs/OPERATIONS.md Normal file

@@ -0,0 +1,172 @@
# neuronetz-gateway — Operations Runbook
Day-2 operations for the gateway: managing tenants and keys, budgets, model policy, usage,
and the fail-closed behaviors you'll encounter. All administration is via the **bootstrap
CLI** (SPEC §11), run inside the gateway container. There are no admin HTTP endpoints in the
gateway (that's `neuronetz-console`'s job).
> Run the CLI inside the running container:
> ```bash
> docker compose exec gateway neuronetz-gateway <command> …
> ```
> In the demo stack, swap the compose file: `docker compose -f docker-compose.demo.yml exec gateway …`
---
## Keys
### Create a key
```bash
docker compose exec gateway neuronetz-gateway create-key --tenant acme --name prod-server-1
# optional: --scopes chat,embeddings (default: chat,embeddings)
```
The **full key is printed exactly once** in the form `nz_<prefix><secret>`. Store it
immediately in your secret manager — it is argon2id-hashed and cannot be recovered. Only the
12-char `prefix` is retained server-side.
### List keys (never shows full keys)
```bash
docker compose exec gateway neuronetz-gateway list-keys --tenant acme
# prints: <prefix> status=active name='prod-server-1' created=…
```
### Revoke a key
```bash
docker compose exec gateway neuronetz-gateway revoke-key --prefix nz_abc12345
```
This sets the key status to `revoked` and writes the `gateway.revocations` outbox row. A
Postgres `NOTIFY` on channel `key_revoked` fires; the gateway evicts the key's Redis cache
entry, so revocation takes effect within ~1 second (SPEC §4.5) without restarting anything.
A subsequent request with that key returns **401**.
> The console (`neuronetz-console`) revokes keys the same way — by inserting into
> `gateway.revocations`. The trigger-driven NOTIFY makes it immediate without any
> cross-service HTTP call.
### Rotate a key
There is no in-place rotate. Rotate by: create a new key → deploy it to the client → verify
traffic on the new prefix → revoke the old prefix.
---
## Tenants & limits
### Create a tenant
```bash
docker compose exec gateway neuronetz-gateway create-tenant --name acme \
--rpm 120 --tpm 200000 --concurrent 8
# add --allow-all-models to opt into using any installed model (default: off)
```
Limits inherit **key → tenant**: a `NULL` key-level limit uses the tenant value.
---
## Budgets
Set per-key token budgets (any combination of daily / monthly / total):
```bash
docker compose exec gateway neuronetz-gateway set-budget --key nz_abc12345 \
--daily 1000000 --monthly 30000000 --total 500000000
```
- Budgets are enforced **fail-closed**: when the binding period hits zero remaining, requests
return **429** with a descriptive error and a `Retry-After` header. The binding period and
remaining balance are surfaced on every response via `X-Budget-Period` and
`X-Budget-Tokens-Remaining` (SPEC §6.5).
- Live counters live in Redis; the Postgres ledger (`gateway.budget_usage`) is the source of
truth on period rollover/reset.
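
One plausible reading of "binding period" is the configured period with the fewest tokens remaining. The sketch below encodes that assumption; it is not confirmed by the SPEC excerpted here, and `binding_period` is a hypothetical helper.

```python
from typing import Optional


def binding_period(remaining: dict[str, Optional[int]]) -> Optional[tuple[str, int]]:
    """Return (period, tokens_remaining) for the tightest configured budget.

    `remaining` maps "day" / "month" / "total" to tokens left, or None when that
    period has no budget set. Returns None when no budget is configured at all.
    """
    configured = {period: left for period, left in remaining.items() if left is not None}
    if not configured:
        return None
    period = min(configured, key=configured.get)  # assumption: least remaining binds
    return period, configured[period]
```

Under this reading, a request would be rejected with `429` once the returned balance reaches zero.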
---
## Model policy
### Set an explicit allowlist (default-deny)
```bash
docker compose exec gateway neuronetz-gateway set-models --tenant acme \
--models llama3.1:8b,mistral:7b
```
The tenant's **effective set** is `allowed_models ∩ discovered` — entries that aren't
actually installed on the backend silently never resolve. A request for a model outside the
effective set returns a generic **403** (same response as "doesn't exist" — no enumeration).
### Toggle `allow_all_models`
```bash
docker compose exec gateway neuronetz-gateway set-models --tenant acme --allow-all # opt in
docker compose exec gateway neuronetz-gateway set-models --tenant acme --no-allow-all # back to allowlist
```
With `allow_all_models` on, the effective set **is** the live discovered set — any model
pulled into Ollama becomes usable on the next discovery refresh, with no further config
change. This is an audited convenience; prefer explicit allowlists for untrusted tenants
(see [`THREAT_MODEL.md`](THREAT_MODEL.md)).
### Inspect discovery and effective sets
```bash
docker compose exec gateway neuronetz-gateway list-models # live-discovered models
docker compose exec gateway neuronetz-gateway list-models --tenant acme # + that tenant's effective set
```
---
## Usage
```bash
docker compose exec gateway neuronetz-gateway show-usage --tenant acme --period day
# prints: requests=… tokens_in=… tokens_out=… (period: day|month|total)
```
For per-key forensics and finer slicing, query `gateway.audit_log` directly (it records
`request_id`, `key_prefix`, `model`, `tokens_in/out`, `status`, `latency_ms`, `client_ip`).
---
## How model discovery refresh works (SPEC §4.6)
- A background task polls Ollama `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds and
caches the result in Redis (`gateway:models:discovered`, TTL `MODEL_DISCOVERY_CACHE_TTL_S`)
plus an in-process copy for hot reads.
- A model pulled into Ollama out-of-band appears in `allow_all_models` tenants' effective sets
within one refresh interval — no config change.
- Discovery is **read-only** and uses only the allowlisted `/api/tags` endpoint; it never
triggers a pull.
- To force a faster pickup, lower `MODEL_DISCOVERY_REFRESH_S` (the demo uses 15 s).
---
## Fail-closed behaviors to expect
| Symptom | Cause | Correct behavior |
|---|---|---|
| `503` on every request | Redis or Postgres-read down | Fail-closed — rate-limit/budget/auth can't be checked, so deny. Restore the backend. |
| `502` with retry-after | Ollama unreachable | Circuit breaker opens after 5 consecutive failures, half-opens after 30 s. Check the backend / `OLLAMA_BASE_URL`. |
| `403` for a model you "know" exists | Model not in the tenant's effective set, **or** discovery cache empty/expired | Check `list-models --tenant …`; verify the backend is reachable and the model is installed. Empty discovery = deny by design. |
| `429` with `Retry-After` | Rate limit or budget exhausted | Inspect headers (`X-RateLimit-*`, `X-Budget-*`); raise limits/budget or wait. |
| `401` immediately after revoke | Working as intended | Revocation propagated via NOTIFY → Redis eviction. |
`/readyz` returns `503` when **any** dependency (DB, Redis, Ollama) is unreachable; use it as
the load-balancer health gate. `/healthz` only checks process liveness.
---
## Logs, metrics, audit
- **Logs:** structured (structlog), JSON in production, to stdout. Keys/secrets are never
logged.
- **Metrics:** Prometheus at `/metrics` (loopback only): `gateway_requests_total`,
`gateway_tokens_total`, `gateway_request_duration_seconds`, labelled by `tenant` and
`model` (never `key_id`).
- **Audit log:** always-on in `gateway.audit_log`. **Prompt log** is opt-in per key and TTL'd
(`PROMPT_LOG_DEFAULT_RETENTION_DAYS`); a sweeper enforces retention.

113
docs/PLAYGROUND.md Normal file

@@ -0,0 +1,113 @@
# neuronetz-gateway — Demo & Playground
The fastest way to see the gateway working end-to-end, with **no GPU and no model downloads**.
`./demo.sh` brings up the gateway against a mock Ollama backend, mints a demo API key, and
prints ready-to-paste curl commands and a link to an interactive browser playground.
---
## Launch the demo
From the repo root:
```bash
./demo.sh
```
This will:
1. Build and start the demo stack (`docker-compose.demo.yml`): **postgres + redis +
mock-ollama + gateway**. No Caddy; the gateway is published on `127.0.0.1:8080`.
2. Wait for the gateway to report healthy at `/healthz`.
3. Create a demo tenant (`--allow-all-models`) and an API key via the bootstrap CLI **inside
the gateway container**, capturing the key (which is printed exactly once).
4. Print a summary: the **API key**, the **playground URL**
   `http://localhost:8080/playground`, and five ready-to-paste curl commands:
   - streaming `/v1/chat/completions` (OpenAI SSE),
   - streaming `/api/chat` (native NDJSON),
   - `GET /v1/models`,
   - a **401** example (no/bad key),
   - a **403** example (`POST /api/pull`, hard-blocked).

The script is **re-runnable**: an existing tenant is reused, and each run mints a fresh,
uniquely-named key (the full key only ever prints at creation).

Tear everything down (containers + volumes):
```bash
./demo.sh --down
```
### What's running
| Service | Exposed? | Notes |
|---|---|---|
| `gateway` | `127.0.0.1:8080` | The real gateway image, built from the repo `Dockerfile`. |
| `mock-ollama` | **no** | Internal network only — mirrors the prod "Ollama is never exposed" rule. |
| `postgres` | **no** | Internal only. |
| `redis` | **no** | Internal only. |

The mock backend (`demo/mock-ollama/`) emulates Ollama's API shapes, including realistic
`prompt_eval_count` / `eval_count` on the final stream object, so token counting, model
discovery, and `/api/show` sanitization all exercise real gateway code paths. It serves a
small catalogue: `llama3.1:8b`, `mistral:7b`, `qwen2.5:3b`, `nomic-embed-text`.
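To make the "final stream object" concrete, here is a rough sketch of how a consumer could pull the two counters out of an Ollama-style NDJSON stream. The function name `final_token_counts` is hypothetical; the `done`, `prompt_eval_count`, and `eval_count` fields follow Ollama's streaming format, and the sample frames in the usage note are made up.

```python
import json
from typing import Iterable, Optional, Tuple


def final_token_counts(ndjson_lines: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return (prompt_eval_count, eval_count) from the frame marked done, if any."""
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        frame = json.loads(line)
        if frame.get("done"):
            # Only the last frame of a stream carries the counters.
            return frame.get("prompt_eval_count", 0), frame.get("eval_count", 0)
    return None  # stream ended without a final frame
```

For example, a two-line stream whose last line is `{"done": true, "prompt_eval_count": 12, "eval_count": 5}` yields `(12, 5)`, while a stream cut off before the final frame yields `None`.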
---
## Use the playground
Open **http://localhost:8080/playground** in a browser. It is a single self-contained HTML
page, served **same-origin** by the gateway (so no CORS to worry about).
1. **Base URL** is pre-filled with the current origin; leave it as is for the demo.
2. Paste the **API key** from the `./demo.sh` output into the Bearer field. (Typing a key
auto-loads the model dropdown; you can also hit **↻ Refresh**.)
3. Pick an **endpoint** tab: `/v1/chat/completions`, `/api/chat`, `/api/generate`,
`/v1/models`, `/api/tags`, `/healthz`, `/readyz`.
4. Choose a **model** from the auto-populated dropdown, type a prompt, toggle **stream**.
5. Hit **▶ Run**. The streamed output renders **live** — SSE `data:` deltas (incl. `[DONE]`)
for `/v1/*`, NDJSON lines for `/api/*`.
6. The panel shows the **response status** and the rate-limit / budget **response headers**
(`X-Request-ID`, `X-RateLimit-*`, `X-Budget-*`; SPEC §6.5).
7. The **Exact curl** box mirrors precisely what **Run** sends; copy it to reproduce in a
terminal.

Try the 403 path too: there's no mutating-endpoint tab by design, but the printed `curl` for
`POST /api/pull` shows the hard block, and an invalid key in the Bearer field demonstrates the
401 fail-closed response.
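The SSE rendering in step 5 boils down to reading `data:` lines, stopping at `[DONE]`, and concatenating the content deltas. The sketch below shows that logic in Python rather than the playground's own JavaScript; the function name is hypothetical, and the chunk shape (`choices[].delta.content`) is the OpenAI streaming format the `/v1/*` endpoints are described as emitting.

```python
import json
from typing import Iterable, Iterator


def sse_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text deltas from an OpenAI-style SSE stream, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Joining the yielded pieces reproduces the full assistant message; chunks that carry only a role or an empty delta contribute nothing.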
---
## ⚠️ Security note: the playground is OFF by default in production
The playground route is **flag-gated** and **disabled by default**. The demo stack turns it on
explicitly:
```yaml
# docker-compose.demo.yml (gateway service)
GATEWAY_PLAYGROUND_ENABLED: "true"
GATEWAY_PLAYGROUND_FILE: /app/playground/index.html
```
with the file mounted read-only into the container:
```yaml
volumes:
- ./playground:/app/playground:ro
```
The production stack (`docker-compose.yml`) does **not** set `GATEWAY_PLAYGROUND_ENABLED`, so
the route is absent. Do not enable it on a public deployment: it is a convenience for demos and
local development, not a production surface. Leaving it off keeps the public attack surface to
the documented API only.
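A default-off flag of this kind might be read roughly as follows. This is a sketch only: the helper names, the example route list, and the rule that only the literal string `true` enables the route are assumptions, not the gateway's actual settings code.

```python
from typing import List, Mapping


def playground_enabled(env: Mapping[str, str]) -> bool:
    """Default-off: only an explicit "true" turns the route on (assumed parsing rule)."""
    return env.get("GATEWAY_PLAYGROUND_ENABLED", "").strip().lower() == "true"


def mounted_paths(env: Mapping[str, str]) -> List[str]:
    """Illustrative route list; /playground is added only when the flag is set."""
    paths = ["/v1/chat/completions", "/api/chat", "/healthz"]
    if playground_enabled(env):
        paths.append("/playground")
    return paths
```

With an empty environment the playground path never appears in the list, which matches the "route is absent" behavior described for the production stack.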
---
## Files behind the demo
| Path | What it is |
|---|---|
| `demo.sh` | The one-command entrypoint (up / `--down`). |
| `docker-compose.demo.yml` | The demo stack definition. |
| `demo/mock-ollama/` | The standalone mock Ollama service (FastAPI app + Dockerfile). |
| `playground/index.html` | The self-contained browser playground served at `/playground`. |

77
docs/THREAT_MODEL.md Normal file

@@ -0,0 +1,77 @@
# neuronetz-gateway — Threat Model
From [`scope-docs/SPEC.md`](../scope-docs/SPEC.md) §3. The governing principle, in one line:
> **Fail closed, always.** If a security or budgeting check cannot be performed (Redis down,
> DB unreachable, ambiguous state), **deny** the request. Never default to allow.
> (AGENT_PROMPT non-negotiable #1.)

The gateway exists because the Ollama instance at `api.neuronetz.ai` was exposed without
authentication, a standing security incident. Every defense below traces back to closing
that gap and keeping it closed.

---
## Threats & mitigations (SPEC §3)
| Threat | Mitigation |
|---|---|
| Internet scanners hitting Ollama directly | Ollama bound to the internal Docker network; **never published**. No `ports:` mapping in any shipped compose file. |
| Unauthenticated API abuse | Mandatory Bearer token; **fail-closed** on auth errors (401). |
| API key brute force | Argon2id hashing; constant-time compare; rate limit on auth failures per source IP (`AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN`). |
| GPU/token exhaustion (cost attack) | Per-key TPM + token budget; per-tenant ceiling; concurrent-connection cap. |
| Resource exhaustion via large payloads | Request body size limit (default 256 KiB); `num_predict` cap (default 4096). |
| Model enumeration / training-data exfil via uncommon models | Model allowlist, **default-deny**. Discovery only exposes models actually installed; `/api/tags` and `/v1/models` never reveal models outside the tenant's effective set; "not allowed" and "doesn't exist" return the **same** generic response. |
| Discovery backend unreachable | **Fail-closed:** an empty/stale-expired discovered set means no model resolves, so requests are denied — never "allow because we couldn't list models." |
| Ollama mutation (model pull/delete) by attacker | Endpoint allowlist; mutating endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`) **hard-blocked** at the gateway, not configurable. |
| Information disclosure via error messages | Upstream errors **sanitized** at the boundary; Ollama internals never proxied to the client. Each error carries an `X-Request-ID` for correlation. |
| Audit log tampering | Append-only at the app layer; DB role separation; optional WAL archiving. |
| Prompt data leakage | Prompt logging **off by default**; opt-in per key; TTL'd retention; redaction hook. |
| Redis outage causing "fail open" | **Fail-closed:** if the rate-limit/budget backend is unavailable, deny (503), not allow. |
| Compromised admin token | There is **no admin endpoint** in the gateway. Admin lives in `neuronetz-console`; the gateway has nothing to compromise here. |
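As an illustration of the "hard-blocked, not configurable" row, the check can be a constant lookup with no settings involved. The constant and function names below are hypothetical, and the path normalization (dropping a query string and trailing slash) is an assumption about how such a check might be made robust.

```python
BLOCKED_EXACT = frozenset(
    {"/api/pull", "/api/push", "/api/create", "/api/copy", "/api/delete"}
)
BLOCKED_PREFIXES = ("/api/blobs/",)


def is_hard_blocked(path: str) -> bool:
    """True for Ollama's mutating endpoints, regardless of tenant or key settings."""
    # Drop any query string and trailing slash before matching (assumed normalization).
    normalized = path.split("?", 1)[0].rstrip("/") or "/"
    if normalized in BLOCKED_EXACT:
        return True
    # Appending "/" makes a bare "/api/blobs" match the prefix as well.
    return any((normalized + "/").startswith(prefix) for prefix in BLOCKED_PREFIXES)
```

Because the lists are module constants rather than configuration, there is no setting an operator or attacker could flip to re-enable a mutating endpoint.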
---
## Notes on selected defenses
### `allow_all_models` is an audited opt-in
`allow_all_models` lets a tenant use any currently-installed model, so models newly pulled
into Ollama are auto-granted on the next discovery refresh. This is convenient but widens the
attack surface for *that tenant*, so it is:
- **opt-in per tenant** (default `false`), set explicitly via the CLI
(`create-tenant --allow-all-models` or `set-models --allow-all`);
- **overridable per key**: a non-`NULL` key-level `allow_all_models` overrides the tenant
flag; otherwise the tenant flag applies (SPEC §13.7);
- **audited**: every request records the model used in `gateway.audit_log`.

Default-deny tenants instead see only `allowed_models ∩ discovered`. Either way the effective
set is always intersected with the *live* discovered set, so stale or typo'd allowlist entries
never resolve.
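The resolution rule above can be written down in a few lines. This is a sketch under assumptions: the function name and parameters are hypothetical, and `None` stands in for a `NULL` key-level flag.

```python
from typing import AbstractSet, FrozenSet, Optional


def effective_models(
    discovered: AbstractSet[str],
    allowed_models: AbstractSet[str],
    tenant_allow_all: bool,
    key_allow_all: Optional[bool] = None,
) -> FrozenSet[str]:
    """Resolve the set of models a request may use."""
    # A non-None key-level flag overrides the tenant flag; None means "inherit".
    allow_all = tenant_allow_all if key_allow_all is None else key_allow_all
    if allow_all:
        return frozenset(discovered)
    # Default-deny: only allowlisted models that are actually installed.
    return frozenset(allowed_models) & frozenset(discovered)
```

Both branches are bounded by `discovered`, so an empty discovery result yields an empty effective set even for `allow_all_models` tenants.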
### No existence disclosure
A model that is installed-but-unpermitted and a model that is not installed both return the
**same** generic `403`. An attacker cannot use the gateway to enumerate which models exist on
the backend (SPEC §13.6).
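One way to get this property is to make the decision from the effective set alone, so the code never has a branch that distinguishes the two cases. The names and the response body below are hypothetical; only the shared `403` status comes from the text above.

```python
from typing import AbstractSet, Dict, Optional, Tuple

Response = Tuple[int, Dict[str, str]]

# Single shared denial; the body text is an assumed placeholder.
GENERIC_FORBIDDEN: Response = (403, {"error": "model not available"})


def model_denial(requested: str, effective: AbstractSet[str]) -> Optional[Response]:
    """Return None if the model may be used, else the one generic denial."""
    if requested in effective:
        return None
    # Installed-but-unpermitted and not-installed both land here.
    return GENERIC_FORBIDDEN
```

Since the discovered set is never consulted at decision time, the response cannot vary with whether the model is installed.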
### Sanitized errors + request IDs
Clients never receive Ollama's error text, stack traces, or internal hostnames. Errors are
mapped to generic `4xx`/`5xx` JSON with a `request_id`. Operators correlate that ID with the
audit log to investigate without leaking internals to callers (SPEC §4.3 step 14).
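A minimal sketch of that boundary mapping follows. The function name, the status mapping (upstream `4xx` to `400`, anything else to `502`), and the message strings are all assumptions chosen for illustration; the point is only that the upstream body is never copied into the client response.

```python
from typing import Dict, Tuple

# Assumed generic messages; the gateway's actual wording is not specified here.
GENERIC_MESSAGES = {400: "bad request", 502: "upstream error"}


def sanitize_upstream_error(
    upstream_status: int, upstream_body: str, request_id: str
) -> Tuple[int, Dict[str, str]]:
    """Map an upstream failure to a generic client-facing error."""
    status = 400 if 400 <= upstream_status < 500 else 502
    # upstream_body is deliberately unused: it stays in server-side logs only.
    return status, {"error": GENERIC_MESSAGES[status], "request_id": request_id}
```

The caller would log `upstream_body` alongside the same `request_id`, which is what lets an operator investigate without the client ever seeing the original text.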
### Streaming integrity is also a safety property
Token counting and audit writes happen **after** stream close, never on the hot path. This
keeps time-to-first-byte honest and ensures budget decrements and audit rows reflect the true
final token counts reported by Ollama (`prompt_eval_count` + `eval_count`), not estimates.
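The ordering described above (relay first, account afterwards) can be sketched with a generator that forwards frames untouched and defers bookkeeping to a `finally` block. The function name and the `account` callback are hypothetical; a real gateway would also need a policy for streams that end without a final frame, which this sketch simply skips.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional


def relay_then_account(
    frames: Iterable[Dict], account: Callable[[int], None]
) -> Iterator[Dict]:
    """Forward each frame immediately; record token usage only after the stream ends."""
    total: Optional[int] = None
    try:
        for frame in frames:
            if frame.get("done"):
                # True totals reported by the backend on the final frame.
                total = frame.get("prompt_eval_count", 0) + frame.get("eval_count", 0)
            yield frame  # no bookkeeping runs before the client sees this frame
    finally:
        # Runs when the stream is exhausted or the consumer closes the generator.
        if total is not None:
            account(total)
```

Because `account` is called only after the last `yield` has returned, the budget decrement and audit write never delay a frame on its way to the client.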
---
## Out of scope (v0.1.0)
Documented as future work, **not** mitigations present today: content moderation /
prompt-injection filtering, response caching, multi-backend routing, billing, SSO/OAuth2 for
admin, and any web admin UI (that lives in `neuronetz-console`).