scaffold: project skeleton, schema, healthz/readyz, CI

Initial project structure for neuronetz-gateway per scope-docs/SPEC.md:

- Python 3.12 / FastAPI / SQLAlchemy 2.0 (async) / Redis / Postgres stack
  managed by uv. Multi-stage non-root Dockerfile, prod + dev compose files
  (ollama service is NEVER published in either), Caddyfile + systemd unit,
  justfile, GitHub Actions CI (ruff, mypy --strict, pytest, bandit, pip-audit).
- Pydantic-Settings config covering every env var from SPEC §7, including the
  MODEL_DISCOVERY_* keys for the dynamic-discovery feature (§4.6).
- Alembic 0001_initial creates the full gateway schema (8 tables, 3 enums,
  notify_key_revoked() trigger), incl. allow_all_models on tenant_limits and
  key_limits for the per-tenant auto-grant toggle.
- Working /healthz, /readyz (fail-closed when deps unreachable), and a
  Prometheus /metrics stub. Sanitizing error handlers that attach X-Request-ID
  to every response and never leak upstream internals.
- SPEC + AGENT_PROMPT included under scope-docs/ (source of truth).
2026-05-26 20:50:35 +02:00
commit d79f17b3bb
32 changed files with 3610 additions and 0 deletions
--- a/scope-docs/AGENT_PROMPT.md
+++ b/scope-docs/AGENT_PROMPT.md
@@ -0,0 +1,121 @@
+# Build Order: neuronetz-gateway v0.1.0
+
+## Context
+
+The Ollama instance at `https://api.neuronetz.ai` is currently exposed without authentication. This is a security incident in waiting. Your job is to build the gateway that closes that gap and forms the commercial API surface of the Neuronetz AI platform.
+
+The full specification is in **`SPEC.md`** in this repository. Read it before writing any code. It is the source of truth; if anything below conflicts with it, SPEC.md wins.
+
+## Mission
+
+Implement `neuronetz-gateway` per SPEC.md to a state that satisfies **§12 Acceptance Criteria**. Nothing less ships.
+
+## Non-Negotiables
+
+These are hard constraints. Violating any of them is a build failure regardless of feature completeness.
+
+1. **Fail closed, always.** If a security or budgeting check cannot be performed (Redis down, DB unreachable, ambiguous state), deny the request. Never default to allow.
+2. **Ollama never reachable from outside the Docker internal network.** No `ports:` mapping for the ollama service in any compose file shipped with the project. Document this prominently.
+3. **No secrets in code, no secrets in logs, no secrets in errors.** Argon2id for key storage. Constant-time comparison only. Keys printed exactly once at creation.
+4. **No reflected upstream errors.** Ollama errors are sanitized at the gateway boundary. Map to generic 4xx/5xx with a request ID.
+5. **Mutating Ollama endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`) are hard-blocked.** Not configurable. Not behind a feature flag. Blocked.
+6. **Streaming integrity.** Token counting and audit writes happen **after** stream close, never on the hot path. Time-to-first-byte must not be degraded by gateway bookkeeping.
+7. **`mypy --strict` and `ruff check` clean before any PR is opened.** No `# type: ignore` without an inline justification comment.
+8. **Test coverage targets (§9) are a gate, not a goal.** 100% on `auth/`, `ratelimit/`, `budget/`. CI fails below threshold.
+9. **Apache 2.0 license file present from commit one.** No GPL dependencies.
+10. **The bootstrap CLI must work before the first manual `curl`.** No "I'll create a key by hand in the DB just to test it" — if the CLI can't create a key, fix the CLI first.
+
+## Phasing
+
+Five phases. Each phase has an explicit exit criterion. **Do not start phase N+1 until phase N's exit criterion is verifiably met.** PM/Control: enforce this.
+
+### Phase 1 — Scaffold
+
+- Repo layout per SPEC §8
+- `pyproject.toml`, `uv.lock`, Dockerfile, docker-compose.yml, docker-compose.dev.yml, .env.example, README, LICENSE
+- Alembic configured; migration `0001_initial.py` creates schema `gateway` and all tables per SPEC §5
+- `make` or `just` targets: `dev`, `test`, `lint`, `typecheck`, `migrate`, `compose-up`, `compose-down`
+- CI workflow runs: ruff, mypy, pytest, bandit, pip-audit
+- **Exit criterion:** `docker compose -f docker-compose.dev.yml up` brings up postgres + redis + a stub gateway that responds 200 on `/healthz` and 503 on `/readyz` (because no Ollama yet). Migrations apply cleanly. CI is green on an empty test suite.
+
+### Phase 2 — Core proxy + auth
+
+- Bootstrap CLI (`create-tenant`, `create-key`, `list-keys`, `revoke-key`) working end-to-end
+- Argon2id hashing module with unit tests covering: hash, verify, constant-time behavior, rehash-on-parameter-change
+- Auth middleware: Bearer extraction, prefix lookup, hash verify, Redis cache with TTL
+- Ollama proxy for `/api/chat` and `/api/generate` — both streamed (NDJSON) and non-streamed
+- Endpoint allowlist enforced
+- **Model discovery (SPEC §4.6):** background poll of Ollama `/api/tags`, cached in Redis + in-process, fail-closed when unavailable
+- Model allowlist enforced per-tenant via the **effective set** (allow_all → all discovered; else `allowed_models ∩ discovered`); key-level `allow_all_models` overrides tenant
+- Error handler: sanitized responses, request ID in every error
+- Audit log writer (buffered, async)
+- Mock Ollama in `tests/integration/mock_ollama.py` (no real model required for CI)
+- **Exit criterion:** A key created via CLI can call `/api/chat` and `/api/generate` through Caddy → gateway → mock Ollama, streaming works, audit rows land in Postgres with correct token counts, `/api/pull` returns 403, no-auth returns 401, wrong-key returns 401. Model discovery populates from the (mock) Ollama `/api/tags`; `/api/tags` returns the tenant's effective set; an `allow_all_models` tenant sees all discovered models, a default-deny tenant sees only `allowed ∩ discovered`, and a non-effective model returns 403; discovery-unavailable fails closed. Integration tests cover all of the above.
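Review note: the Phase 2 auth bullets imply a key shape and a Bearer extraction step, which might look roughly like the sketch below. The `nz_` tag is taken from the curl example later in this prompt and the 12-character prefix from SPEC §4.3; the function names and the 32 bytes of entropy are hypothetical choices, and Argon2id hashing is deliberately left out.

```python
from __future__ import annotations

import secrets

KEY_TAG = "nz_"   # assumption: borrowed from the curl example in this prompt
PREFIX_LEN = 12   # SPEC §4.3: the first 12 chars act as the lookup prefix


def generate_key() -> tuple[str, str]:
    """Return (full_key, prefix); the full key is shown once and never stored."""
    full_key = KEY_TAG + secrets.token_urlsafe(32)
    return full_key, full_key[:PREFIX_LEN]


def extract_bearer(header: str | None) -> str | None:
    """Return the token from an Authorization header, or None (caller answers 401)."""
    if not header:
        return None
    scheme, _, token = header.partition(" ")
    token = token.strip()
    # Reject other schemes and tokens too short to carry a full prefix.
    if scheme.lower() != "bearer" or len(token) <= PREFIX_LEN:
        return None
    return token
```

Returning `None` for every malformed header keeps the middleware on a single 401 path, which fits the fail-closed rule above.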
+
+### Phase 3 — Rate limit + budget + OpenAI-compat
+
+- Sliding window rate limit (Redis Lua script) — per-key RPM, per-tenant RPM, per-key TPM
+- Concurrency semaphore (Redis-backed) with TTL guard
+- Token budget counters in Redis with Postgres ledger reconciliation on period rollover
+- OpenAI-compatibility layer: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models` with full SSE streaming and `data: [DONE]` terminator
+- Schema translation tests with golden fixtures (request in OpenAI → expected Ollama request; response from Ollama → expected OpenAI response)
+- Rate-limit and budget response headers per SPEC §6.5
+- **Exit criterion:** Locust test (100 concurrent users, 5 min) shows correct 429 behavior at the limit, correct token accounting, p99 gateway overhead < 25 ms. OpenAI Python SDK pointed at `/v1` successfully completes streaming chat. Killing Redis mid-test produces 503 (fail closed), not 200.
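Review note: the sliding-window limit named in Phase 3 can be illustrated with an in-memory stand-in. This is not the Redis Lua script the spec calls for; it only shows the window arithmetic that such a script would need to perform atomically. The class name and the choice of a timestamp log (rather than weighted counters) are assumptions.

```python
from __future__ import annotations

from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory sliding-window log; a stand-in for the Redis Lua script."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, subject: str, now: float) -> tuple[bool, float]:
        """Return (allowed, retry_after_s) for one request at time `now`."""
        hits = self._hits[subject]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            # The oldest hit leaves the window at hits[0] + window_s.
            return False, hits[0] + self.window_s - now
        hits.append(now)
        return True, 0.0
```

The second tuple element is what a 429 response could report in `Retry-After`. Denied requests are not recorded here, so a client that keeps retrying does not extend its own lockout.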
+
+### Phase 4 — Audit, prompt log, revocation
+
+- Prompt log (opt-in per key, TTL) with daily sweeper task
+- Audit log retention sweeper (TTL per tenant config)
+- Buffered audit writer with ring-buffer overflow → deny-mode behavior
+- Revocation flow: console (simulated via direct INSERT in tests) writes `gateway.revocations` → NOTIFY → gateway evicts Redis cache → next request with revoked key returns 401 within 1 second
+- Prometheus `/metrics` (loopback only) with: `gateway_requests_total{tenant,model,status}`, `gateway_tokens_total{tenant,model,direction}`, `gateway_request_duration_seconds{tenant,model}` (histogram)
+- `/readyz` checks DB + Redis + Ollama all reachable
+- Circuit breaker on Ollama failures
+- **Exit criterion:** Revocation E2E test green. Prompt log retention TTL works (use freeze-time to simulate). Metrics scrape returns valid Prometheus exposition. `/readyz` flips to 503 when any dependency is down.
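Review note: a small sketch of the buffered audit writer's overflow behavior, assuming a bounded queue that refuses new rows when full instead of overwriting old ones (the spec says "ring buffer" but does not state whether overwrite is intended). `AuditBuffer` and its method names are hypothetical.

```python
from __future__ import annotations

from collections import deque
from typing import Any


class AuditBuffer:
    """Bounded in-memory queue for audit rows that could not be written yet."""

    def __init__(self, capacity: int = 1000) -> None:
        self.capacity = capacity
        self._rows: deque[dict[str, Any]] = deque()

    @property
    def deny_mode(self) -> bool:
        """True once the buffer is full: new requests should be refused."""
        return len(self._rows) >= self.capacity

    def push(self, row: dict[str, Any]) -> bool:
        """Queue a row; returns False (row not queued) when the buffer is full."""
        if self.deny_mode:
            return False
        self._rows.append(row)
        return True

    def drain(self, batch: int = 100) -> list[dict[str, Any]]:
        """Pop up to `batch` oldest rows for a bulk INSERT after recovery."""
        n = min(batch, len(self._rows))
        return [self._rows.popleft() for _ in range(n)]
```

A request handler would check `deny_mode` before proxying, so the gateway stops accepting work it can no longer account for.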
+
+### Phase 5 — Harden, document, release
+
+- `docs/ARCHITECTURE.md`, `docs/DEPLOYMENT.md`, `docs/API.md`, `docs/THREAT_MODEL.md`, `docs/OPERATIONS.md` complete
+- Caddyfile example with Let's Encrypt for `api.neuronetz.ai` and security headers (HSTS, X-Content-Type-Options, no Server header, no X-Powered-By)
+- Systemd unit file for non-Compose deployments
+- Multi-stage Dockerfile with non-root user, distroless or `python:3.12-slim` final stage, no build tools in final image
+- `pip-audit` and `bandit` clean in CI
+- Image scan (Trivy or Grype) clean of HIGH/CRITICAL
+- Tag `v0.1.0`, build and push image, GitHub release with changelog
+- **Exit criterion:** Every box in SPEC §12 checked, signed off by Control. Image runnable from a fresh host with only docker + a `.env`. README quickstart works for someone who has never seen the repo.
+
+## Agent Role Assignments
+
+For the multi-agent orchestrator (Fritz/UI-UX/DevOps/QA/Control/Timo/PM):
+
+| Agent | Owns |
+|---|---|
+| **Backend / Fritz** | All Python code under `src/neuronetz_gateway/`, Alembic migrations, CLI. Primary author. |
+| **DevOps** | Dockerfile, docker-compose.yml(s), Caddyfile, systemd unit, CI workflows, image scanning, release tagging. |
+| **QA** | All tests under `tests/`. Owns coverage gate. Writes the locust scenarios. Verifies acceptance criteria at each phase exit. |
+| **UI-UX** | Not active this project (no UI surface here). Console project will pick this up. |
+| **Control / Timo** | Enforces phase gates. Refuses to advance a phase whose exit criterion isn't met. Runs the acceptance checklist at end of Phase 5. |
+| **PM** | Tracks the phase progression, opens YouTrack tickets per phase, runs daily standups against this prompt, surfaces blockers. |
+
+## Working Agreements
+
+- **Branch per phase.** `phase-1-scaffold`, `phase-2-proxy-auth`, etc. Merge to `main` only after phase exit criterion is verified.
+- **PRs are reviewed against SPEC.md.** "Does this match the spec? If not, is SPEC.md wrong or is the PR wrong?" — that's the review question.
+- **SPEC changes are explicit.** If a phase reveals a spec mistake, amend SPEC.md in a separate PR before changing the implementation. Never drift silently.
+- **Commit messages reference the section.** e.g. `auth: implement argon2id verify per SPEC §5, §9`.
+- **No TODOs in main.** If something is deferred, it becomes a tracked issue, not a code comment.
+- **Open questions (SPEC §13) are resolved in writing.** Decision goes in SPEC.md, not in a Slack message that gets lost.
+
+## What "Done" Looks Like
+
+A fresh clone, a fresh host, a domain pointing at it, and a `.env` file. `docker compose up`. Five minutes later, `curl -H "Authorization: Bearer nz_..." https://api.neuronetz.ai/v1/chat/completions -d '...'` streams a response. The Ollama port is not open. The audit log has a row. The budget counter decremented. The metrics endpoint shows the request. The locust suite passes. The threat model document explains every defense.
+
+When all of that is true and SPEC §12 is fully ticked, ship v0.1.0.
+
+## When You Get Stuck
+
+- **Ambiguity in the spec → ask, don't guess.** Open a question in the PM channel; if resolved, amend SPEC.md.
+- **Conflict between speed and correctness → correctness wins.** This is security infrastructure. We do not ship "good enough."
+- **Conflict between scope creep and v0.1.0 → defer.** New ideas go in a follow-up issue. v0.1.0 ships per spec.
+
+Start with Phase 1. Read SPEC.md first.
--- a/scope-docs/SPEC.md
+++ b/scope-docs/SPEC.md
@@ -0,0 +1,593 @@
+# neuronetz-gateway — SPEC.md
+
+**Project:** `neuronetz-gateway`
+**Version:** 0.1.0 (target)
+**Status:** Specification — not yet implemented
+**License:** Apache 2.0
+**Owner:** Stephan Berbig / Neuronetz
+
+---
+
+## 1. Purpose
+
+A secure, multi-tenant API gateway in front of an Ollama instance currently exposed at `https://api.neuronetz.ai`. The Ollama endpoint must never be reachable directly from the public internet again. All access flows through this gateway.
+
+The gateway is the **hot path** of the Neuronetz API. A separate service (`neuronetz-console`, built on the Nibiru PHP framework) handles administration, dashboards, and tenant self-service. This SPEC covers only the gateway.
+
+## 2. Scope
+
+### In scope (v0.1.0)
+
+- Authentication via API keys (Bearer tokens)
+- Multi-tenant data model (tenants → keys, with inheritance)
+- Per-key and per-tenant rate limiting (RPM, TPM, concurrent)
+- Per-key and per-tenant token budgets (daily, monthly, total)
+- Streaming and non-streaming proxy to Ollama
+- Dual API surface: native Ollama (`/api/*`) and OpenAI-compatible (`/v1/*`)
+- Endpoint allowlist (block all model-mutating Ollama endpoints)
+- **Dynamic model discovery** from the Ollama backend — the live set of installed models is queried, cached, and auto-refreshed; nothing about the model list is hand-maintained
+- Model allowlist (per-tenant override), **default-deny, resolved against the live discovered set** (stale/typo'd entries never resolve)
+- **Per-tenant `allow_all_models` toggle** — opt-in: a flagged tenant may use any currently-installed model, so models newly pulled into Ollama are auto-granted on the next discovery refresh
+- Request size limits, response size limits, timeouts
+- Token counting from Ollama responses (precise, not heuristic)
+- Audit log (always-on metadata)
+- Prompt log (opt-in per key, TTL'd retention)
+- Bootstrap CLI: create tenants, keys, set budgets
+- Health and readiness endpoints
+- Docker Compose deployment (gateway + caddy + postgres + redis + ollama)
+- Caddy as TLS terminator (Let's Encrypt for `api.neuronetz.ai`)
+
+### Out of scope (v0.1.0, document as future)
+
+- Web admin UI (lives in `neuronetz-console`, separate repo)
+- Billing / Stripe integration (budgets only, no money yet)
+- Multi-region / HA / k8s
+- Content moderation / prompt-injection filtering
+- Response caching
+- Multi-backend routing (one Ollama; pluggable backend interface stays for later)
+- Webhook notifications
+- SSO / OAuth2 for admin
+
+## 3. Threat Model (abbreviated)
+
+| Threat | Mitigation |
+|---|---|
+| Internet scanners hitting Ollama directly | Ollama bound to internal Docker network; never published |
+| Unauthenticated API abuse | Mandatory Bearer token; fail-closed on auth errors |
+| API key brute force | Argon2id hashing; constant-time compare; rate limit on auth failures per source IP |
+| GPU/token exhaustion (cost attack) | Per-key TPM + token budget; per-tenant ceiling; concurrent connection cap |
+| Resource exhaustion via large payloads | Request body size limit (default 256 KiB); `num_predict` cap (default 4096) |
+| Model enumeration / training-data exfil via uncommon models | Model allowlist; default-deny. `allow_all_models` is **opt-in per tenant and audited**. Discovery only ever exposes models actually installed on the backend; `/api/tags` and `/v1/models` never reveal models outside the tenant's effective set; "not allowed" and "doesn't exist" return the same generic response |
+| Discovery backend unreachable | Fail-closed: an empty/stale-expired discovered set means no model resolves, so requests are denied — never "allow because we couldn't list models" |
+| Ollama mutation (model pull/delete) by attacker | Endpoint allowlist; mutating endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`) hard-blocked at the gateway |
+| Information disclosure via error messages | Sanitize upstream errors; never proxy Ollama internals to client |
+| Audit log tampering | Append-only at app layer; DB role separation; optional WAL archiving |
+| Prompt data leakage | Prompt logging off by default; opt-in per key; TTL'd; redaction hook |
+| Redis outage causing "fail open" | Fail-closed: if rate-limit/budget backend is unavailable, deny |
+| Compromised admin token | Admin token lives in `neuronetz-console`, not in gateway; gateway has no admin endpoints |
+
+## 4. Architecture
+
+### 4.1 Component diagram
+
+```
+                          Internet
+                              │ TLS
+                              ▼
+                  ┌──────────────────────┐
+                  │ Caddy (sidecar)      │  Let's Encrypt for api.neuronetz.ai
+                  │ - TLS termination    │  HSTS, security headers
+                  │ - HTTP/2, HTTP/3     │
+                  └──────────┬───────────┘
+                             │ HTTP/1.1 internal
+                  ┌──────────▼───────────┐
+                  │ neuronetz-gateway    │  FastAPI + uvicorn
+                  │  - authn             │
+                  │  - rate limit        │
+                  │  - budget check      │
+                  │  - proxy + stream    │
+                  │  - token count       │
+                  │  - audit write       │
+                  └──┬────────┬──────┬───┘
+                     │        │      │
+              ┌──────▼──┐  ┌──▼───┐  │
+              │Postgres │  │Redis │  │
+              │ schema: │  │ keys │  │
+              │ gateway │  │bucket│  │
+              └─────────┘  └──────┘  │
+                                     │ internal network only
+                              ┌──────▼──────┐
+                              │   Ollama    │
+                              │ 127.0.0.1   │
+                              └─────────────┘
+
+Same Compose stack also hosts (separate from this SPEC):
+  - neuronetz-console (PHP/Nibiru) → reads schema `console`, reads schema `gateway` (SELECT)
+```
+
+### 4.2 Database schemas
+
+**Single Postgres instance, two schemas:**
+
+- `gateway` — owned by the gateway service; gateway role has full DDL
+- `console` — owned by `neuronetz-console` (out of scope here); console role has full DDL
+- Both services connect with their own role. Cross-schema access is explicit GRANT.
+
+**Console role gets `SELECT` on all `gateway.*` tables.** Console writes go only to `console.*` tables. If the console needs to mutate gateway state (e.g. revoke a key), it does so by writing to a `gateway.revocations` outbox table that the gateway tails (see §4.5).
+
+### 4.3 Request lifecycle
+
+1. Caddy terminates TLS, forwards to gateway on internal port.
+2. Gateway middleware extracts `Authorization: Bearer <key>`.
+3. The key prefix (first 12 chars) is used as the Redis cache key. On a miss, look up `gateway.api_keys` by prefix, verify the full key with argon2id `verify`, and cache the resolved key metadata in Redis (TTL 60s).
+4. Rate limit check (sliding window in Redis, Lua-atomic) — per-key RPM + per-tenant RPM.
+5. Budget check (Redis counter for current period; Postgres ledger is source of truth on reset).
+6. Concurrent-connection semaphore (Redis `INCR` with TTL).
+7. Model allowlist check. Resolve the **effective model set** for the key:
+   `allow_all := key.allow_all_models ?? tenant.allow_all_models`;
+   `effective := discovered` if `allow_all` else `(key.allowed_models ?? tenant.allowed_models) ∩ discovered`,
+   where `discovered` is the cached live model set from discovery (§4.6). The request's
+   `model` must be in `effective`, else a generic 403 with no disclosure of whether the
+   model exists but is unpermitted vs. is not installed.
+8. Endpoint allowlist check.
+9. Request body validation (size, schema, `num_predict` cap).
+10. If OpenAI-compat path, translate request to Ollama schema.
+11. Open httpx async stream to Ollama.
+12. Stream response back to client, accumulating final `prompt_eval_count` + `eval_count`.
+13. On stream close: write `gateway.audit_log` row; decrement budget; release semaphore; if prompt logging enabled, write `gateway.prompt_log` row.
+14. On any failure: sanitized error to client, audit row with status code, semaphore released.
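Review note: step 7 is the least obvious part of the lifecycle, so here is a minimal sketch of the effective-set resolution as written. `None` stands in for SQL `NULL` (inherit from tenant); the function and parameter names are hypothetical.

```python
from __future__ import annotations


def effective_models(
    key_allow_all: bool | None,
    tenant_allow_all: bool,
    key_allowed: list[str] | None,
    tenant_allowed: list[str],
    discovered: set[str],
) -> set[str]:
    """Resolve the effective model set per step 7; None means 'inherit tenant'."""
    allow_all = tenant_allow_all if key_allow_all is None else key_allow_all
    if allow_all:
        return set(discovered)
    allowed = tenant_allowed if key_allowed is None else key_allowed
    # Stale or misspelled allowlist entries drop out of the intersection.
    return set(allowed) & discovered


def is_model_permitted(model: str, effective: set[str]) -> bool:
    """False covers both 'not allowed' and 'not installed'; the caller returns one generic 403."""
    return model in effective
```

Note that an empty `discovered` set yields an empty effective set even for `allow_all` tenants, which is the fail-closed behavior §4.6 asks for. An explicit empty list on the key (`[]`, not `None`) overrides the tenant and denies everything.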
+
+### 4.4 Failure modes (fail-closed)
+
+| Subsystem | If down | Behavior |
+|---|---|---|
+| Postgres (read) | Key lookup fails | 503 with retry-after; no requests proxied |
+| Postgres (write) | Audit write fails | Request still succeeds, audit row buffered in-memory ring (max 1000), drained on recovery; if buffer fills, switch to deny mode |
+| Redis | Rate limit / budget unavailable | 503 — fail closed. Never "allow because we can't check." |
+| Ollama | Upstream unreachable | 502 with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30s |
+| Caddy | Not a gateway concern | — |
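Review note: the Ollama row describes a consecutive-failure circuit breaker (open after 5 failures, half-open after 30 s). One way to sketch it, with an injected clock so it can be tested without sleeping, is shown below. The class is hypothetical and simplifies half-open: after the cooldown it lets requests through until the next recorded failure, rather than admitting exactly one probe.

```python
from __future__ import annotations


class CircuitBreaker:
    """Consecutive-failure breaker: closed -> open -> half-open -> closed."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self._failures = 0
        self._opened_at: float | None = None

    def allow(self, now: float) -> bool:
        """True if a request may be sent upstream at time `now`."""
        if self._opened_at is None:
            return True
        # Open: reject until the cooldown has passed, then let probes through.
        return now - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        """Any success closes the breaker and resets the failure streak."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self, now: float) -> None:
        """Count a failure; at the threshold, (re)open and restart the cooldown."""
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = now
```

While `allow()` returns `False` the gateway would answer 502 with `Retry-After` without touching Ollama at all.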
+
+### 4.5 Cache invalidation (key revocation)
+
+Console can revoke a key by inserting into `gateway.revocations(key_id, ts, reason)`. Gateway has a background task (`asyncio.create_task` in lifespan) that:
+- LISTENs on Postgres channel `key_revoked` (gateway emits NOTIFY on its own write path; console emits via INSERT trigger)
+- On notification, evicts the Redis cache entry for that key's prefix
+- This makes revocation effectively immediate (NOTIFY delivery plus one Redis round trip) without cross-service HTTP
+
+### 4.6 Model discovery
+
+The set of usable models is **never hand-maintained**; it is extracted live from the
+Ollama backend.
+
+- A background task (started in lifespan, like the revocation listener) polls Ollama
+  `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds.
+- The parsed model set (names + sanitized metadata: family, parameter size, quantization,
+  size bytes, modified-at) is cached in Redis under `gateway:models:discovered` with TTL
+  `MODEL_DISCOVERY_CACHE_TTL_S`, and held in-process for hot reads on the request path.
+- On startup an initial fetch runs; if Ollama is unreachable the discovered set is empty.
+- **Fail-closed:** if the discovered set is empty or its cache has expired and cannot be
+  refreshed, no model resolves and requests are denied (consistent with default-deny).
+  Discovery never opens access on failure.
+- "Auto-grant": because the effective set (§4.3 step 7) intersects with `discovered` (or
+  *is* `discovered` when `allow_all_models`), a model pulled into Ollama out-of-band
+  becomes usable to `allow_all` tenants on the next refresh — no per-tenant config change.
+- Discovery is **read-only** against Ollama and uses only the allowlisted `/api/tags`
+  endpoint; it never triggers a model pull.
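Review note: a sketch of the in-process half of discovery, covering the sanitizing parse and the fail-closed expiry. The payload field names (`models`, `name`, `size`, `modified_at`, `details.family`, `details.parameter_size`, `details.quantization_level`) reflect the Ollama `/api/tags` response as I understand it and should be checked against the Ollama version in use; `ModelInfo`, `parse_tags`, and `DiscoveredSet` are hypothetical names. Polling and the Redis cache are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelInfo:
    family: str | None
    parameter_size: str | None
    quantization: str | None
    size_bytes: int | None
    modified_at: str | None


def parse_tags(payload: dict) -> dict[str, ModelInfo]:
    """Keep only the sanitized fields listed in §4.6; skip malformed entries."""
    models: dict[str, ModelInfo] = {}
    for entry in payload.get("models") or []:
        if not isinstance(entry, dict):
            continue
        name = entry.get("name")
        if not isinstance(name, str) or not name:
            continue
        details = entry.get("details") or {}
        models[name] = ModelInfo(
            family=details.get("family"),
            parameter_size=details.get("parameter_size"),
            quantization=details.get("quantization_level"),
            size_bytes=entry.get("size"),
            modified_at=entry.get("modified_at"),
        )
    return models


@dataclass
class DiscoveredSet:
    ttl_s: float
    models: dict[str, ModelInfo] = field(default_factory=dict)
    fetched_at: float | None = None

    def update(self, payload: dict, now: float) -> None:
        """Replace the cached set with a freshly fetched /api/tags payload."""
        self.models, self.fetched_at = parse_tags(payload), now

    def names(self, now: float) -> set[str]:
        """Fail closed: never fetched or expired means the empty set."""
        if self.fetched_at is None or now - self.fetched_at > self.ttl_s:
            return set()
        return set(self.models)
```

Because `names()` returns the empty set once the TTL has lapsed, a stalled poller degrades to "no model resolves" rather than serving an arbitrarily old list.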
+
+## 5. Data Model (schema `gateway`)
+
+```sql
+CREATE SCHEMA gateway;
+
+CREATE TYPE gateway.key_status AS ENUM ('active', 'disabled', 'revoked');
+CREATE TYPE gateway.tenant_status AS ENUM ('active', 'suspended', 'closed');
+CREATE TYPE gateway.budget_period AS ENUM ('day', 'month', 'total');
+
+CREATE TABLE gateway.tenants (
+    id              uuid PRIMARY KEY DEFAULT gen_random_uuid(),
+    name            text NOT NULL UNIQUE,
+    status          gateway.tenant_status NOT NULL DEFAULT 'active',
+    created_at      timestamptz NOT NULL DEFAULT now(),
+    metadata        jsonb NOT NULL DEFAULT '{}'::jsonb
+);
+
+CREATE TABLE gateway.tenant_limits (
+    tenant_id           uuid PRIMARY KEY REFERENCES gateway.tenants(id) ON DELETE CASCADE,
+    rpm                 integer NOT NULL DEFAULT 60,
+    tpm                 integer NOT NULL DEFAULT 100000,
+    concurrent          integer NOT NULL DEFAULT 8,
+    tokens_daily        bigint,
+    tokens_monthly      bigint,
+    tokens_total        bigint,
+    allowed_models      text[] NOT NULL DEFAULT '{}',
+    allow_all_models    boolean NOT NULL DEFAULT false,  -- opt-in: allow any installed model
+    log_prompts_default boolean NOT NULL DEFAULT false,
+    prompt_retention_days integer NOT NULL DEFAULT 30,
+    audit_retention_days  integer NOT NULL DEFAULT 365
+);
+
+CREATE TABLE gateway.api_keys (
+    id              uuid PRIMARY KEY DEFAULT gen_random_uuid(),
+    tenant_id       uuid NOT NULL REFERENCES gateway.tenants(id) ON DELETE CASCADE,
+    prefix          text NOT NULL UNIQUE,          -- first 12 chars, indexed
+    key_hash        text NOT NULL,                  -- argon2id
+    name            text NOT NULL,
+    status          gateway.key_status NOT NULL DEFAULT 'active',
+    scopes          text[] NOT NULL DEFAULT '{chat,embeddings}',
+    created_at      timestamptz NOT NULL DEFAULT now(),
+    last_used_at    timestamptz,
+    expires_at      timestamptz,
+    log_prompts     boolean,                        -- NULL = inherit from tenant
+    metadata        jsonb NOT NULL DEFAULT '{}'::jsonb
+);
+
+CREATE INDEX idx_api_keys_prefix ON gateway.api_keys(prefix) WHERE status = 'active';
+CREATE INDEX idx_api_keys_tenant ON gateway.api_keys(tenant_id);
+
+CREATE TABLE gateway.key_limits (
+    key_id              uuid PRIMARY KEY REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
+    rpm                 integer,            -- NULL = inherit tenant
+    tpm                 integer,
+    concurrent          integer,
+    tokens_daily        bigint,
+    tokens_monthly      bigint,
+    tokens_total        bigint,
+    allowed_models      text[],             -- NULL = inherit tenant
+    allow_all_models    boolean             -- NULL = inherit tenant
+);
+
+CREATE TABLE gateway.budget_usage (
+    key_id          uuid NOT NULL REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
+    period          gateway.budget_period NOT NULL,
+    period_start    timestamptz NOT NULL,
+    tokens_in       bigint NOT NULL DEFAULT 0,
+    tokens_out      bigint NOT NULL DEFAULT 0,
+    requests        bigint NOT NULL DEFAULT 0,
+    PRIMARY KEY (key_id, period, period_start)
+);
+
+CREATE INDEX idx_budget_usage_period ON gateway.budget_usage(period, period_start);
+
+CREATE TABLE gateway.audit_log (
+    id              bigserial PRIMARY KEY,
+    ts              timestamptz NOT NULL DEFAULT now(),
+    request_id      uuid NOT NULL,
+    tenant_id       uuid,                          -- nullable for auth-failed rows
+    key_id          uuid,
+    key_prefix      text,                          -- denormalized for forensic queries
+    method          text NOT NULL,
+    path            text NOT NULL,
+    model           text,
+    tokens_in       integer,
+    tokens_out      integer,
+    latency_ms      integer,
+    status          integer NOT NULL,
+    client_ip       inet,
+    user_agent      text,
+    error_code      text
+);
+
+CREATE INDEX idx_audit_ts ON gateway.audit_log(ts);
+CREATE INDEX idx_audit_tenant_ts ON gateway.audit_log(tenant_id, ts);
+CREATE INDEX idx_audit_key_ts ON gateway.audit_log(key_id, ts);
+
+CREATE TABLE gateway.prompt_log (
+    id              bigserial PRIMARY KEY,
+    audit_id        bigint NOT NULL REFERENCES gateway.audit_log(id) ON DELETE CASCADE,
+    ts              timestamptz NOT NULL DEFAULT now(),
+    key_id          uuid NOT NULL,
+    request_body    jsonb NOT NULL,
+    response_text   text,
+    retention_until timestamptz NOT NULL
+);
+
+CREATE INDEX idx_prompt_log_retention ON gateway.prompt_log(retention_until);
+
+CREATE TABLE gateway.revocations (
+    id              bigserial PRIMARY KEY,
+    key_id          uuid NOT NULL,
+    ts              timestamptz NOT NULL DEFAULT now(),
+    reason          text,
+    processed_at    timestamptz
+);
+
+-- Trigger to NOTIFY on revocation insert
+CREATE OR REPLACE FUNCTION gateway.notify_key_revoked() RETURNS trigger AS $$
+BEGIN
+    PERFORM pg_notify('key_revoked', NEW.key_id::text);
+    RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trg_notify_key_revoked
+    AFTER INSERT ON gateway.revocations
+    FOR EACH ROW EXECUTE FUNCTION gateway.notify_key_revoked();
+
+-- Grants for console role (created in console SPEC, referenced here)
+-- GRANT USAGE ON SCHEMA gateway TO console_role;
+-- GRANT SELECT ON ALL TABLES IN SCHEMA gateway TO console_role;
+-- GRANT INSERT ON gateway.revocations TO console_role;
+```
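Review note: several `key_limits` columns use `NULL` to mean "inherit tenant". A minimal sketch of that overlay follows, with `None` standing in for `NULL`. The dataclasses are hypothetical and cover only three columns; whether a key value may exceed the tenant value is not stated in this section, so no clamping is applied.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TenantLimits:
    rpm: int = 60
    tpm: int = 100_000
    concurrent: int = 8


@dataclass(frozen=True)
class KeyLimits:
    rpm: int | None = None         # None mirrors SQL NULL: inherit tenant
    tpm: int | None = None
    concurrent: int | None = None


def resolve_limits(key: KeyLimits | None, tenant: TenantLimits) -> TenantLimits:
    """Overlay non-NULL key limits on the tenant values."""
    if key is None:  # no key_limits row at all
        return tenant
    return TenantLimits(
        rpm=tenant.rpm if key.rpm is None else key.rpm,
        tpm=tenant.tpm if key.tpm is None else key.tpm,
        concurrent=tenant.concurrent if key.concurrent is None else key.concurrent,
    )
```

The explicit `is None` checks matter: using `key.rpm or tenant.rpm` would treat a deliberate limit of `0` as "inherit".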
+
+## 6. API Surface
+
+### 6.1 Native Ollama passthrough (allowlisted)
+
+| Path | Method | Notes |
+|---|---|---|
+| `/api/chat` | POST | Streamed (NDJSON) and non-streamed |
+| `/api/generate` | POST | Streamed (NDJSON) and non-streamed |
+| `/api/embeddings` | POST | Non-streamed |
+| `/api/embed` | POST | Newer Ollama embeddings endpoint |
+| `/api/tags` | GET | Returns the tenant's **effective** model set (live-discovered ∩ allowed, or *all* discovered when `allow_all_models`). Sourced from discovery (§4.6), never a static list |
+| `/api/show` | POST | Allowed only for models in the tenant's effective set; returns sanitized model info (no system prompts, no template) |
+| `/api/ps` | GET | **Blocked** — leaks loaded models |
+| `/api/version` | GET | Returns gateway version, not Ollama version |
+
+### 6.2 Hard-blocked Ollama endpoints (always 403)
+
+`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`
+
+### 6.3 OpenAI-compatible
+
+| Path | Method | Maps to |
+|---|---|---|
+| `/v1/chat/completions` | POST | `/api/chat` |
+| `/v1/completions` | POST | `/api/generate` |
+| `/v1/embeddings` | POST | `/api/embed` |
+| `/v1/models` | GET | `/api/tags` (the tenant's effective discovered set), in OpenAI model-list format |
+
+Translation must preserve streaming. SSE (`data: {...}\n\n`) for OpenAI-compat; NDJSON for native.
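Review note: to make the NDJSON-to-SSE requirement concrete, here is a synchronous sketch of the chat translation. It assumes Ollama `/api/chat` stream lines carry `message.content` and a final `done: true` chunk, and it emits OpenAI-style `chat.completion.chunk` events; mapping every completion to `finish_reason: "stop"` is a simplification. The function name and its arguments are hypothetical, and the real path would be async.

```python
from __future__ import annotations

import json
from collections.abc import Iterable, Iterator


def ndjson_to_sse(lines: Iterable[str], completion_id: str, created: int) -> Iterator[str]:
    """Translate Ollama /api/chat NDJSON lines into OpenAI-style SSE events."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        done = bool(chunk.get("done"))
        content = (chunk.get("message") or {}).get("content", "")
        event = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": chunk.get("model"),
            "choices": [{
                "index": 0,
                "delta": {"content": content} if content else {},
                "finish_reason": "stop" if done else None,
            }],
        }
        yield f"data: {json.dumps(event)}\n\n"
        if done:
            break
    # Terminator named in the Phase 3 build order.
    yield "data: [DONE]\n\n"
```

The final Ollama chunk is also where `prompt_eval_count` and `eval_count` arrive, so the same loop is a natural place to capture them for the post-stream audit write.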
+
+### 6.4 Gateway endpoints
+
+| Path | Method | Auth | Purpose |
+|---|---|---|---|
+| `/healthz` | GET | none | Liveness — process responsive |
+| `/readyz` | GET | none | Readiness — DB + Redis + Ollama all reachable |
+| `/metrics` | GET | none (loopback only) | Prometheus exposition (counters, histograms) |
+
+No admin endpoints. Admin lives in `neuronetz-console`.
+
+### 6.5 Response headers
+
+Every proxied response carries:
+- `X-Request-ID: <uuid>`
+- `X-RateLimit-Limit-Requests: <n>`
+- `X-RateLimit-Remaining-Requests: <n>`
+- `X-RateLimit-Limit-Tokens: <n>`
+- `X-RateLimit-Remaining-Tokens: <n>`
+- `X-Budget-Period: day|month|total`
+- `X-Budget-Tokens-Remaining: <n>`
+
+429 responses additionally carry `Retry-After: <seconds>`.
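Review note: the header set above could be assembled in one place, for example as sketched below. The helper name and its parameters are hypothetical; clamping negative remainders to zero and rounding `Retry-After` up to whole seconds are assumptions, not spec requirements.

```python
from __future__ import annotations

import math

BUDGET_PERIODS = {"day", "month", "total"}


def limit_headers(
    request_id: str,
    rpm_limit: int,
    rpm_remaining: int,
    tpm_limit: int,
    tpm_remaining: int,
    budget_period: str,
    budget_remaining: int,
    retry_after_s: float | None = None,
) -> dict[str, str]:
    """Build the §6.5 header set; remaining values are clamped at zero."""
    if budget_period not in BUDGET_PERIODS:
        raise ValueError(f"unknown budget period: {budget_period!r}")
    headers = {
        "X-Request-ID": request_id,
        "X-RateLimit-Limit-Requests": str(rpm_limit),
        "X-RateLimit-Remaining-Requests": str(max(0, rpm_remaining)),
        "X-RateLimit-Limit-Tokens": str(tpm_limit),
        "X-RateLimit-Remaining-Tokens": str(max(0, tpm_remaining)),
        "X-Budget-Period": budget_period,
        "X-Budget-Tokens-Remaining": str(max(0, budget_remaining)),
    }
    if retry_after_s is not None:  # only set on 429 responses
        headers["Retry-After"] = str(max(1, math.ceil(retry_after_s)))
    return headers
```

Passing `retry_after_s` only on the 429 path keeps successful responses free of a misleading `Retry-After`.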
+
+## 7. Configuration
+
+All via environment variables, validated by Pydantic Settings on boot. Boot fails loudly on invalid config.
+
+```
+# Service
+GATEWAY_BIND_HOST=0.0.0.0
+GATEWAY_BIND_PORT=8080
+GATEWAY_LOG_LEVEL=INFO
+GATEWAY_LOG_FORMAT=json                  # json|console
+GATEWAY_REQUEST_ID_HEADER=X-Request-ID
+GATEWAY_TRUSTED_PROXIES=127.0.0.1,caddy  # for X-Forwarded-For
+
+# Upstream
+OLLAMA_BASE_URL=http://ollama:11434
+OLLAMA_CONNECT_TIMEOUT_S=5
+OLLAMA_READ_TIMEOUT_S=600
+OLLAMA_MAX_CONNECTIONS=64
+
+# Model discovery (§4.6)
+MODEL_DISCOVERY_REFRESH_S=60             # how often to re-query Ollama /api/tags
+MODEL_DISCOVERY_CACHE_TTL_S=120          # Redis cache TTL for the discovered model set
+
+# Database
+DATABASE_URL=postgresql+asyncpg://gateway:...@postgres:5432/neuronetz
+DATABASE_POOL_SIZE=10
+DATABASE_POOL_OVERFLOW=20
+
+# Redis
+REDIS_URL=redis://redis:6379/0
+REDIS_KEY_CACHE_TTL_S=60
+
+# Limits (defaults; per-tenant/key overrides in DB)
+DEFAULT_RPM=60
+DEFAULT_TPM=100000
+DEFAULT_CONCURRENT=8
+MAX_REQUEST_BODY_BYTES=262144
+MAX_NUM_PREDICT=4096
+
+# Security
+ARGON2_TIME_COST=3
+ARGON2_MEMORY_COST_KIB=65536
+ARGON2_PARALLELISM=4
+AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN=20
+
+# Audit
+AUDIT_BUFFER_SIZE=1000
+PROMPT_LOG_DEFAULT_RETENTION_DAYS=30
+AUDIT_LOG_DEFAULT_RETENTION_DAYS=365
+```
+
+## 8. Repository Layout
+
+```
+neuronetz-gateway/
+├── pyproject.toml                # uv-managed, ruff, mypy --strict, pytest
+├── README.md
+├── LICENSE                       # Apache 2.0
+├── docker-compose.yml            # full stack incl. console placeholder
+├── docker-compose.dev.yml        # without caddy, gateway exposed on localhost
+├── Dockerfile                    # multi-stage, python:3.12-slim base
+├── .env.example
+├── .dockerignore
+├── .gitignore
+├── alembic.ini
+├── alembic/
+│   ├── env.py
+│   └── versions/
+│       └── 0001_initial.py       # creates schema `gateway` and all tables
+├── ops/
+│   ├── caddy/
+│   │   └── Caddyfile.example
+│   └── systemd/
+│       └── neuronetz-gateway.service
+├── src/neuronetz_gateway/
+│   ├── __init__.py
+│   ├── __main__.py               # uvicorn entry
+│   ├── app.py                    # FastAPI factory
+│   ├── config.py                 # Pydantic Settings
+│   ├── deps.py                   # DI providers
+│   ├── lifespan.py               # startup/shutdown, NOTIFY listener
+│   ├── errors.py                 # exception types, handlers, sanitization
+│   ├── auth/
+│   │   ├── __init__.py
+│   │   ├── hashing.py            # argon2id wrapper
+│   │   ├── keys.py               # key generation, prefix, verify
+│   │   └── middleware.py
+│   ├── ratelimit/
+│   │   ├── __init__.py
+│   │   ├── sliding_window.py     # Redis Lua script
+│   │   └── concurrency.py        # semaphore via Redis
+│   ├── budget/
+│   │   ├── __init__.py
+│   │   ├── counter.py            # Redis period counters
+│   │   └── ledger.py             # Postgres reconciliation
+│   ├── proxy/
+│   │   ├── __init__.py
+│   │   ├── ollama.py             # httpx streaming client
+│   │   ├── translate.py          # OpenAI <-> Ollama schemas
+│   │   ├── token_counter.py      # parse usage from stream
+│   │   ├── discovery.py          # live model discovery from Ollama /api/tags (§4.6)
+│   │   └── allowlist.py          # effective-set resolution (allow_all / allowed ∩ discovered)
+│   ├── routes/
+│   │   ├── __init__.py
+│   │   ├── ollama_native.py
+│   │   ├── openai_compat.py
+│   │   └── health.py
+│   ├── db/
+│   │   ├── __init__.py
+│   │   ├── session.py
+│   │   ├── models.py             # SQLAlchemy 2.0
+│   │   └── repositories.py
+│   ├── audit/
+│   │   ├── __init__.py
+│   │   ├── writer.py             # buffered async writer
+│   │   └── prompt_log.py
+│   ├── observability/
+│   │   ├── __init__.py
+│   │   ├── logging.py            # structlog config
+│   │   └── metrics.py            # prometheus
+│   └── cli/
+│       ├── __init__.py
+│       └── manage.py             # typer: create-tenant, create-key, ...
+├── tests/
+│   ├── conftest.py               # testcontainers fixtures
+│   ├── unit/
+│   │   ├── test_hashing.py
+│   │   ├── test_translate.py
+│   │   ├── test_token_counter.py
+│   │   ├── test_discovery.py
+│   │   ├── test_allowlist.py
+│   │   └── test_sliding_window.py
+│   ├── integration/
+│   │   ├── test_auth_flow.py
+│   │   ├── test_rate_limit.py
+│   │   ├── test_budget.py
+│   │   ├── test_proxy_stream.py
+│   │   ├── test_openai_compat.py
+│   │   ├── test_revocation.py
+│   │   └── mock_ollama.py        # FastAPI mock with NDJSON/SSE
+│   └── load/
+│       └── locustfile.py
+└── docs/
+    ├── ARCHITECTURE.md
+    ├── DEPLOYMENT.md
+    ├── API.md
+    ├── THREAT_MODEL.md
+    └── OPERATIONS.md              # runbook: revoke key, rotate, check usage
+```
+
+## 9. Non-Functional Requirements
+
+- **Performance:** p50 overhead < 5 ms over direct Ollama call (auth + ratelimit + audit); p99 < 25 ms (excluding upstream latency)
+- **Streaming:** Time-to-first-byte must not be degraded by gateway logic — audit write happens **after** stream close
+- **Memory:** Steady-state RSS < 200 MiB per gateway worker under 100 concurrent streams
+- **Concurrency:** Handle 200 concurrent connections per worker; the default is 4 workers per instance
+- **Test coverage:** ≥ 85% line coverage on `src/neuronetz_gateway/` excluding `__main__` and CLI; 100% on `auth/`, `ratelimit/`, `budget/`
+- **Security:** No `eval`, no `exec`, no shell-out, no `pickle`. Bandit clean. `pip-audit` clean on every CI run.
+- **Type safety:** `mypy --strict` clean
+- **Lint:** `ruff check` clean with project ruleset (E, F, I, B, UP, S, ASYNC)
+
+## 10. Tooling
+
+- Python 3.12
+- `uv` for dependency management (pyproject.toml + uv.lock)
+- FastAPI ≥ 0.115, uvicorn[standard], httpx ≥ 0.27, SQLAlchemy 2.0 (async), asyncpg, redis ≥ 5.0 (with hiredis), structlog, pydantic ≥ 2.9, pydantic-settings, argon2-cffi, typer, prometheus-client
+- Test: pytest, pytest-asyncio, pytest-cov, testcontainers, httpx (test client), respx (mock), locust
+- Lint/format: ruff, mypy --strict, bandit, pip-audit
+- CI: GitHub Actions workflow (lint, type, test with coverage, build image, push on tag)
+
+## 11. Bootstrap CLI (Typer)
+
+```
+neuronetz-gateway create-tenant --name "acme" [--rpm 60] [--tpm 100000]
+neuronetz-gateway create-key --tenant acme --name "prod-server-1" [--scopes chat,embeddings]
+neuronetz-gateway revoke-key --prefix nz_abc12345
+neuronetz-gateway list-keys --tenant acme
+neuronetz-gateway show-usage --tenant acme [--period day|month|total]
+neuronetz-gateway set-budget --key nz_abc12345 --daily 1000000 --monthly 30000000
+neuronetz-gateway set-models --tenant acme --models llama3.1:8b,mistral:7b
+neuronetz-gateway set-models --tenant acme --allow-all          # opt into allow_all_models
+neuronetz-gateway set-models --tenant acme --no-allow-all       # back to explicit allowlist
+neuronetz-gateway list-models [--tenant acme]                   # show live-discovered models
+                                                                # (and the tenant's effective set)
+```
+
+`create-tenant` accepts `--allow-all-models / --no-allow-all-models` (default off).
+`list-models` reads the discovery cache (§4.6); with `--tenant` it also shows that tenant's
+resolved effective set.
+
+Key format: `nz_<12-char-prefix><32-char-random>`. Prefix is stored; full key is hashed (argon2id). On creation, the full key is printed exactly once.
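+
+The key format above can be illustrated with a minimal sketch, under stated assumptions: the character set (lowercase letters and digits) is a guess, the function names `generate_key` and `lookup_prefix` are hypothetical, and the stored prefix is assumed to include the literal `nz_` (as the CLI examples suggest). Argon2id hashing of the full key is deliberately left out so the sketch needs only the standard library.
+
+```python
+import secrets
+import string
+
+# Assumption: the spec does not name the alphabet; lowercase alphanumerics are used here.
+ALPHABET = string.ascii_lowercase + string.digits
+TAG = "nz_"
+PREFIX_LEN = 12  # stored in the database for lookup
+SECRET_LEN = 32  # only ever stored as part of the hashed full key
+
+
+def generate_key() -> tuple[str, str]:
+    """Return (full_key, stored_prefix); the full key is shown to the user once."""
+    prefix = "".join(secrets.choice(ALPHABET) for _ in range(PREFIX_LEN))
+    secret = "".join(secrets.choice(ALPHABET) for _ in range(SECRET_LEN))
+    return f"{TAG}{prefix}{secret}", f"{TAG}{prefix}"
+
+
+def lookup_prefix(presented: str) -> str | None:
+    """Extract the lookup prefix from a presented key, or None if it is malformed."""
+    if not presented.startswith(TAG):
+        return None
+    if len(presented) != len(TAG) + PREFIX_LEN + SECRET_LEN:
+        return None
+    if any(ch not in ALPHABET for ch in presented[len(TAG):]):
+        return None
+    return presented[: len(TAG) + PREFIX_LEN]
+```
+
+A malformed key yields `None` before any database or hashing work is done, which keeps the rejection path cheap. A well-formed key still has to pass the argon2id verification against the stored hash.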
+
+## 12. Acceptance Criteria
+
+The build is "done" when every box below is checked. The orchestrator must verify each before declaring v0.1.0.
+
+- [ ] `docker compose up` from a clean checkout produces a running stack with TLS via Caddy (self-signed in dev, Let's Encrypt-ready in prod).
+- [ ] CLI creates tenant and key; printed key successfully authenticates an `/api/chat` call.
+- [ ] Unauthenticated request returns 401 with no Ollama details leaked.
+- [ ] Request to `/api/pull` returns 403 with generic error message.
+- [ ] Streaming `/api/chat` works end-to-end; the gateway adds less than 10 ms to Ollama's own TTFB.
+- [ ] Streaming `/v1/chat/completions` returns valid SSE with `data: [DONE]` terminator.
+- [ ] Token counts in audit log match Ollama's reported `prompt_eval_count` + `eval_count` exactly.
+- [ ] `/api/tags` and `/v1/models` reflect the **live** Ollama model set (discovery, §4.6): an `allow_all_models` tenant sees every installed model and a newly-pulled model appears within one refresh interval; a default-deny tenant sees only `allowed_models ∩ discovered`; a request for a model outside the effective set returns a generic 403; with discovery unavailable, requests fail closed (deny), not open.
+- [ ] Rate limit triggers at configured RPM with `Retry-After` header.
+- [ ] Token budget is enforced: requests are blocked at zero remaining tokens with a descriptive error.
+- [ ] Redis outage causes 503 (fail-closed), not 200.
+- [ ] Revocation via `INSERT INTO gateway.revocations` evicts the cached key from Redis within 1 second.
+- [ ] `mypy --strict`, `ruff check`, `bandit`, `pip-audit` all clean in CI.
+- [ ] Test coverage ≥ 85% overall, 100% in `auth/`, `ratelimit/`, `budget/`.
+- [ ] `docs/THREAT_MODEL.md`, `docs/DEPLOYMENT.md`, `docs/OPERATIONS.md` present and accurate.
+- [ ] Load test (locust): 100 concurrent users sustained 5 minutes, p99 gateway overhead < 25 ms, zero 5xx outside induced failures.
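+
+The token-count criterion above can be checked with a small parser. The following is a sketch, not the gateway's `token_counter.py`: it assumes the native NDJSON stream ends with a chunk where `done` is true and which carries `prompt_eval_count` and `eval_count`, and that a missing field counts as zero. The function name `billable_tokens` is hypothetical.
+
+```python
+import json
+from collections.abc import Iterable
+
+
+def billable_tokens(ndjson_lines: Iterable[str]) -> int:
+    """Sum prompt_eval_count and eval_count from the final chunk of a stream."""
+    total = 0
+    for raw in ndjson_lines:
+        line = raw.strip()
+        if not line:
+            continue  # tolerate blank separator lines
+        chunk = json.loads(line)
+        if chunk.get("done"):
+            # Assumption: a missing counter is treated as 0, so a response
+            # without eval_count is charged on prompt_eval_count alone.
+            prompt = int(chunk.get("prompt_eval_count", 0))
+            generated = int(chunk.get("eval_count", 0))
+            total = prompt + generated
+    # Assumption: a stream that never reports done=true yields 0 here;
+    # how to bill an aborted stream is not decided by this sketch.
+    return total
+```
+
+For a stream whose final chunk reports `prompt_eval_count` 26 and `eval_count` 10, the function returns 36. Because counting only reads chunks that have already passed through, it can run after the stream closes without affecting time-to-first-byte.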
+
+## 13. Open Questions (decide during build)
+
+1. Embedding cost accounting — Ollama doesn't return `eval_count` for embeddings. Decision: charge based on `prompt_eval_count` only; document as such.
+2. SSE vs NDJSON heuristic for OpenAI-compat — always SSE per OpenAI spec. NDJSON only on native `/api/*`.
+3. Prometheus cardinality — do not label by `key_id` (too many series); label by `tenant_id` only; per-key data lives in Postgres.
+4. **Model discovery source** — the live model list is `GET /api/tags` on the Ollama backend; there is no separate registry. Cached in Redis + in-process, refreshed every `MODEL_DISCOVERY_REFRESH_S`.
+5. **Discovery failure is fail-closed** — empty/expired discovered set ⇒ no model resolves ⇒ deny. Discovery never opens access on error.
+6. **No existence disclosure** — a model that is installed-but-unpermitted and a model that is not installed both return the same generic response, to prevent enumeration.
+7. **`allow_all_models` precedence** — key-level `allow_all_models` (when non-NULL) overrides the tenant flag; otherwise the tenant flag applies. Same NULL-inherits-tenant rule as the other key limits.
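+
+Decisions 5 to 7 combine into one small resolution step. The sketch below assumes an unavailable or expired discovery cache is represented as `None` (an empty set is treated the same way). The names `resolve_allow_all`, `effective_models`, and `is_permitted` are hypothetical and do not describe the actual `allowlist.py`.
+
+```python
+def resolve_allow_all(key_flag: bool | None, tenant_flag: bool) -> bool:
+    """A non-NULL key-level flag wins; None inherits the tenant flag."""
+    return tenant_flag if key_flag is None else key_flag
+
+
+def effective_models(
+    allow_all: bool,
+    allowed: frozenset[str],
+    discovered: frozenset[str] | None,
+) -> frozenset[str]:
+    """Return the set of models a caller may use right now."""
+    if not discovered:
+        # Fail closed: no discovery data means no model resolves.
+        return frozenset()
+    return discovered if allow_all else allowed & discovered
+
+
+def is_permitted(
+    model: str,
+    *,
+    key_flag: bool | None,
+    tenant_flag: bool,
+    allowed: frozenset[str],
+    discovered: frozenset[str] | None,
+) -> bool:
+    """True only if the model is in the caller's effective set."""
+    allow_all = resolve_allow_all(key_flag, tenant_flag)
+    return model in effective_models(allow_all, allowed, discovered)
+```
+
+`is_permitted` returns a plain boolean, so an installed-but-unpermitted model and a model that is not installed are indistinguishable to the caller. The route handler can map `False` to a single generic 403 in both cases.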
+
+## 14. References
+
+- Ollama API: https://github.com/ollama/ollama/blob/main/docs/api.md
+- OpenAI Chat Completions: https://platform.openai.com/docs/api-reference/chat
+- Nibiru (sibling console project): https://nibiru-framework.com
+- Argon2: RFC 9106, https://www.rfc-editor.org/rfc/rfc9106