# neuronetz-gateway — SPEC.md **Project:** `neuronetz-gateway` **Version:** 0.1.0 (target) **Status:** Specification — not yet implemented **License:** Apache 2.0 **Owner:** Stephan Berbig / Neuronetz --- ## 1. Purpose A secure, multi-tenant API gateway in front of an Ollama instance currently exposed at `https://api.neuronetz.ai`. The Ollama endpoint must never be reachable directly from the public internet again. All access flows through this gateway. The gateway is the **hot path** of the Neuronetz API. A separate service (`neuronetz-console`, built on the Nibiru PHP framework) handles administration, dashboards, and tenant self-service. This SPEC covers only the gateway. ## 2. Scope ### In scope (v0.1.0) - Authentication via API keys (Bearer tokens) - Multi-tenant data model (tenants → keys, with inheritance) - Per-key and per-tenant rate limiting (RPM, TPM, concurrent) - Per-key and per-tenant token budgets (daily, monthly, total) - Streaming and non-streaming proxy to Ollama - Dual API surface: native Ollama (`/api/*`) and OpenAI-compatible (`/v1/*`) - Endpoint allowlist (block all model-mutating Ollama endpoints) - **Dynamic model discovery** from the Ollama backend — the live set of installed models is queried, cached, and auto-refreshed; nothing about the model list is hand-maintained - Model allowlist (per-tenant override), **default-deny, resolved against the live discovered set** (stale/typo'd entries never resolve) - **Per-tenant `allow_all_models` toggle** — opt-in: a flagged tenant may use any currently-installed model, so models newly pulled into Ollama are auto-granted on the next discovery refresh - Request size limits, response size limits, timeouts - Token counting from Ollama responses (precise, not heuristic) - Audit log (always-on metadata) - Prompt log (opt-in per key, TTL'd retention) - Bootstrap CLI: create tenants, keys, set budgets - Health and readiness endpoints - Docker Compose deployment (gateway + caddy + postgres + redis + ollama) - Caddy as TLS terminator (Let's Encrypt for `api.neuronetz.ai`) ### Out of scope (v0.1.0, document as future) - Web admin UI (lives in `neuronetz-console`, separate repo) - Billing / Stripe integration (budgets only, no money yet) - Multi-region / HA / k8s - Content moderation / prompt-injection filtering - Response caching - Multi-backend routing (one Ollama; pluggable backend interface stays for later) - Webhook notifications - SSO / OAuth2 for admin ## 3. Threat Model (abbreviated) | Threat | Mitigation | |---|---| | Internet scanners hitting Ollama directly | Ollama bound to internal Docker network; never published | | Unauthenticated API abuse | Mandatory Bearer token; fail-closed on auth errors | | API key brute force | Argon2id hashing; constant-time compare; rate limit on auth failures per source IP | | GPU/token exhaustion (cost attack) | Per-key TPM + token budget; per-tenant ceiling; concurrent connection cap | | Resource exhaustion via large payloads | Request body size limit (default 256 KiB); `num_predict` cap (default 4096) | | Model enumeration / training-data exfil via uncommon models | Model allowlist; default-deny. `allow_all_models` is **opt-in per tenant and audited**. Discovery only ever exposes models actually installed on the backend; `/api/tags` and `/v1/models` never reveal models outside the tenant's effective set; "not allowed" and "doesn't exist" return the same generic response | | Discovery backend unreachable | Fail-closed: an empty/stale-expired discovered set means no model resolves, so requests are denied — never "allow because we couldn't list models" | | Ollama mutation (model pull/delete) by attacker | Endpoint allowlist; mutating endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`) hard-blocked at the gateway | | Information disclosure via error messages | Sanitize upstream errors; never proxy Ollama internals to client | | Audit log tampering | Append-only at app layer; DB role separation; optional WAL archiving | | Prompt data leakage | Prompt logging off by default; opt-in per key; TTL'd; redaction hook | | Redis outage causing "fail open" | Fail-closed: if rate-limit/budget backend is unavailable, deny | | Compromised admin token | Admin token lives in `neuronetz-console`, not in gateway; gateway has no admin endpoints | ## 4. Architecture ### 4.1 Component diagram ``` Internet │ TLS ▼ ┌──────────────────────┐ │ Caddy (sidecar) │ Let's Encrypt for api.neuronetz.ai │ - TLS termination │ HSTS, security headers │ - HTTP/2, HTTP/3 │ └──────────┬───────────┘ │ HTTP/1.1 internal ┌──────────▼───────────┐ │ neuronetz-gateway │ FastAPI + uvicorn │ - authn │ │ - rate limit │ │ - budget check │ │ - proxy + stream │ │ - token count │ │ - audit write │ └──┬────────┬──────┬───┘ │ │ │ ┌──────▼──┐ ┌──▼───┐ │ │Postgres │ │Redis │ │ │ schema: │ │ keys │ │ │ gateway │ │bucket│ │ └─────────┘ └──────┘ │ │ internal network only ┌──────▼──────┐ │ Ollama │ │ 127.0.0.1 │ └─────────────┘ Same Compose stack also hosts (separate from this SPEC): - neuronetz-console (PHP/Nibiru) → reads schema `console`, reads schema `gateway` (SELECT) ``` ### 4.2 Database schemas **Single Postgres instance, two schemas:** - `gateway` — owned by the gateway service; gateway role has full DDL - `console` — owned by `neuronetz-console` (out of scope here); console role has full DDL - Both services connect with their own role. Cross-schema access is explicit GRANT. **Console role gets `SELECT` on all `gateway.*` tables.** Console writes go only to `console.*` tables. If the console needs to mutate gateway state (e.g. revoke a key), it does so by writing to a `gateway.revocations` outbox table that the gateway tails (see §4.5). ### 4.3 Request lifecycle 1. Caddy terminates TLS, forwards to gateway on internal port. 2. Gateway middleware extracts `Authorization: Bearer `. 3. Key prefix (first 12 chars) used as Redis cache key. On miss, lookup `gateway.api_keys` by prefix; verify full key with argon2id `verify`; cache resolved key metadata in Redis (TTL 60s). 4. Rate limit check (sliding window in Redis, Lua-atomic) — per-key RPM + per-tenant RPM. 5. Budget check (Redis counter for current period; Postgres ledger is source of truth on reset). 6. Concurrent-connection semaphore (Redis `INCR` with TTL). 7. Model allowlist check. Resolve the **effective model set** for the key: `allow_all := key.allow_all_models ?? tenant.allow_all_models`; `effective := discovered` if `allow_all` else `(key.allowed_models ?? tenant.allowed_models) ∩ discovered`, where `discovered` is the cached live model set from discovery (§4.6). The request's `model` must be in `effective`, else a generic 403 with no disclosure of whether the model exists but is unpermitted vs. is not installed. 8. Endpoint allowlist check. 9. Request body validation (size, schema, `num_predict` cap). 10. If OpenAI-compat path, translate request to Ollama schema. 11. Open httpx async stream to Ollama. 12. Stream response back to client, accumulating final `prompt_eval_count` + `eval_count`. 13. On stream close: write `gateway.audit_log` row; decrement budget; release semaphore; if prompt logging enabled, write `gateway.prompt_log` row. 14. On any failure: sanitized error to client, audit row with status code, semaphore released. ### 4.4 Failure modes (fail-closed) | Subsystem | If down | Behavior | |---|---|---| | Postgres (read) | Key lookup fails | 503 with retry-after; no requests proxied | | Postgres (write) | Audit write fails | Request still succeeds, audit row buffered in-memory ring (max 1000), drained on recovery; if buffer fills, switch to deny mode | | Redis | Rate limit / budget unavailable | 503 — fail closed. Never "allow because we can't check." | | Ollama | Upstream unreachable | 502 with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30s | | Caddy | Not a gateway concern | — | ### 4.5 Cache invalidation (key revocation) Console can revoke a key by inserting into `gateway.revocations(key_id, ts, reason)`. Gateway has a background task (`asyncio.create_task` in lifespan) that: - LISTENs on Postgres channel `key_revoked` (gateway emits NOTIFY on its own write path; console emits via INSERT trigger) - On notification, evicts the Redis cache entry for that key's prefix - This makes revocation effectively immediate (≤ Redis RTT) without cross-service HTTP ### 4.6 Model discovery The set of usable models is **never hand-maintained**; it is extracted live from the Ollama backend. - A background task (started in lifespan, like the revocation listener) polls Ollama `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds. - The parsed model set (names + sanitized metadata: family, parameter size, quantization, size bytes, modified-at) is cached in Redis under `gateway:models:discovered` with TTL `MODEL_DISCOVERY_CACHE_TTL_S`, and held in-process for hot reads on the request path. - On startup an initial fetch runs; if Ollama is unreachable the discovered set is empty. - **Fail-closed:** if the discovered set is empty or its cache has expired and cannot be refreshed, no model resolves and requests are denied (consistent with default-deny). Discovery never opens access on failure. - "Auto-grant": because the effective set (§4.3 step 7) intersects with `discovered` (or *is* `discovered` when `allow_all_models`), a model pulled into Ollama out-of-band becomes usable to `allow_all` tenants on the next refresh — no per-tenant config change. - Discovery is **read-only** against Ollama and uses only the allowlisted `/api/tags` endpoint; it never triggers a model pull. ## 5. Data Model (schema `gateway`) ```sql CREATE SCHEMA gateway; CREATE TYPE gateway.key_status AS ENUM ('active', 'disabled', 'revoked'); CREATE TYPE gateway.tenant_status AS ENUM ('active', 'suspended', 'closed'); CREATE TYPE gateway.budget_period AS ENUM ('day', 'month', 'total'); CREATE TABLE gateway.tenants ( id uuid PRIMARY KEY DEFAULT gen_random_uuid(), name text NOT NULL UNIQUE, status gateway.tenant_status NOT NULL DEFAULT 'active', created_at timestamptz NOT NULL DEFAULT now(), metadata jsonb NOT NULL DEFAULT '{}'::jsonb ); CREATE TABLE gateway.tenant_limits ( tenant_id uuid PRIMARY KEY REFERENCES gateway.tenants(id) ON DELETE CASCADE, rpm integer NOT NULL DEFAULT 60, tpm integer NOT NULL DEFAULT 100000, concurrent integer NOT NULL DEFAULT 8, tokens_daily bigint, tokens_monthly bigint, tokens_total bigint, allowed_models text[] NOT NULL DEFAULT '{}', allow_all_models boolean NOT NULL DEFAULT false, -- opt-in: allow any installed model log_prompts_default boolean NOT NULL DEFAULT false, prompt_retention_days integer NOT NULL DEFAULT 30, audit_retention_days integer NOT NULL DEFAULT 365 ); CREATE TABLE gateway.api_keys ( id uuid PRIMARY KEY DEFAULT gen_random_uuid(), tenant_id uuid NOT NULL REFERENCES gateway.tenants(id) ON DELETE CASCADE, prefix text NOT NULL UNIQUE, -- first 12 chars, indexed key_hash text NOT NULL, -- argon2id name text NOT NULL, status gateway.key_status NOT NULL DEFAULT 'active', scopes text[] NOT NULL DEFAULT '{chat,embeddings}', created_at timestamptz NOT NULL DEFAULT now(), last_used_at timestamptz, expires_at timestamptz, log_prompts boolean, -- NULL = inherit from tenant metadata jsonb NOT NULL DEFAULT '{}'::jsonb ); CREATE INDEX idx_api_keys_prefix ON gateway.api_keys(prefix) WHERE status = 'active'; CREATE INDEX idx_api_keys_tenant ON gateway.api_keys(tenant_id); CREATE TABLE gateway.key_limits ( key_id uuid PRIMARY KEY REFERENCES gateway.api_keys(id) ON DELETE CASCADE, rpm integer, -- NULL = inherit tenant tpm integer, concurrent integer, tokens_daily bigint, tokens_monthly bigint, tokens_total bigint, allowed_models text[], -- NULL = inherit tenant allow_all_models boolean -- NULL = inherit tenant ); CREATE TABLE gateway.budget_usage ( key_id uuid NOT NULL REFERENCES gateway.api_keys(id) ON DELETE CASCADE, period gateway.budget_period NOT NULL, period_start timestamptz NOT NULL, tokens_in bigint NOT NULL DEFAULT 0, tokens_out bigint NOT NULL DEFAULT 0, requests bigint NOT NULL DEFAULT 0, PRIMARY KEY (key_id, period, period_start) ); CREATE INDEX idx_budget_usage_period ON gateway.budget_usage(period, period_start); CREATE TABLE gateway.audit_log ( id bigserial PRIMARY KEY, ts timestamptz NOT NULL DEFAULT now(), request_id uuid NOT NULL, tenant_id uuid, -- nullable for auth-failed rows key_id uuid, key_prefix text, -- denormalized for forensic queries method text NOT NULL, path text NOT NULL, model text, tokens_in integer, tokens_out integer, latency_ms integer, status integer NOT NULL, client_ip inet, user_agent text, error_code text ); CREATE INDEX idx_audit_ts ON gateway.audit_log(ts); CREATE INDEX idx_audit_tenant_ts ON gateway.audit_log(tenant_id, ts); CREATE INDEX idx_audit_key_ts ON gateway.audit_log(key_id, ts); CREATE TABLE gateway.prompt_log ( id bigserial PRIMARY KEY, audit_id bigint NOT NULL REFERENCES gateway.audit_log(id) ON DELETE CASCADE, ts timestamptz NOT NULL DEFAULT now(), key_id uuid NOT NULL, request_body jsonb NOT NULL, response_text text, retention_until timestamptz NOT NULL ); CREATE INDEX idx_prompt_log_retention ON gateway.prompt_log(retention_until); CREATE TABLE gateway.revocations ( id bigserial PRIMARY KEY, key_id uuid NOT NULL, ts timestamptz NOT NULL DEFAULT now(), reason text, processed_at timestamptz ); -- Trigger to NOTIFY on revocation insert CREATE OR REPLACE FUNCTION gateway.notify_key_revoked() RETURNS trigger AS $$ BEGIN PERFORM pg_notify('key_revoked', NEW.key_id::text); RETURN NEW; END; $$ LANGUAGE plpgsql; CREATE TRIGGER trg_notify_key_revoked AFTER INSERT ON gateway.revocations FOR EACH ROW EXECUTE FUNCTION gateway.notify_key_revoked(); -- Grants for console role (created in console SPEC, referenced here) -- GRANT USAGE ON SCHEMA gateway TO console_role; -- GRANT SELECT ON ALL TABLES IN SCHEMA gateway TO console_role; -- GRANT INSERT ON gateway.revocations TO console_role; ``` ## 6. API Surface ### 6.1 Native Ollama passthrough (allowlisted) | Path | Method | Notes | |---|---|---| | `/api/chat` | POST | Streamed (NDJSON) and non-streamed | | `/api/generate` | POST | Streamed (NDJSON) and non-streamed | | `/api/embeddings` | POST | Non-streamed | | `/api/embed` | POST | Newer Ollama embeddings endpoint | | `/api/tags` | GET | Returns the tenant's **effective** model set (live-discovered ∩ allowed, or *all* discovered when `allow_all_models`). Sourced from discovery (§4.6), never a static list | | `/api/show` | POST | Allowed only for models in the tenant's effective set; returns sanitized model info (no system prompts, no template) | | `/api/ps` | GET | **Blocked** — leaks loaded models | | `/api/version` | GET | Returns gateway version, not Ollama version | ### 6.2 Hard-blocked Ollama endpoints (always 403) `/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*` ### 6.3 OpenAI-compatible | Path | Method | Maps to | |---|---|---| | `/v1/chat/completions` | POST | `/api/chat` | | `/v1/completions` | POST | `/api/generate` | | `/v1/embeddings` | POST | `/api/embed` | | `/v1/models` | GET | `/api/tags` (the tenant's effective discovered set), in OpenAI model-list format | Translation must preserve streaming. SSE (`data: {...}\n\n`) for OpenAI-compat; NDJSON for native. ### 6.4 Gateway endpoints | Path | Method | Auth | Purpose | |---|---|---|---| | `/healthz` | GET | none | Liveness — process responsive | | `/readyz` | GET | none | Readiness — DB + Redis + Ollama all reachable | | `/metrics` | GET | none (loopback only) | Prometheus exposition (counters, histograms) | No admin endpoints. Admin lives in `neuronetz-console`. ### 6.5 Response headers Every proxied response carries: - `X-Request-ID: ` - `X-RateLimit-Limit-Requests: ` - `X-RateLimit-Remaining-Requests: ` - `X-RateLimit-Limit-Tokens: ` - `X-RateLimit-Remaining-Tokens: ` - `X-Budget-Period: day|month|total` - `X-Budget-Tokens-Remaining: ` 429 responses additionally carry `Retry-After: `. ## 7. Configuration All via environment variables, validated by Pydantic Settings on boot. Boot fails loudly on invalid config. ``` # Service GATEWAY_BIND_HOST=0.0.0.0 GATEWAY_BIND_PORT=8080 GATEWAY_LOG_LEVEL=INFO GATEWAY_LOG_FORMAT=json # json|console GATEWAY_REQUEST_ID_HEADER=X-Request-ID GATEWAY_TRUSTED_PROXIES=127.0.0.1,caddy # for X-Forwarded-For # Upstream OLLAMA_BASE_URL=http://ollama:11434 OLLAMA_CONNECT_TIMEOUT_S=5 OLLAMA_READ_TIMEOUT_S=600 OLLAMA_MAX_CONNECTIONS=64 # Model discovery (§4.6) MODEL_DISCOVERY_REFRESH_S=60 # how often to re-query Ollama /api/tags MODEL_DISCOVERY_CACHE_TTL_S=120 # Redis cache TTL for the discovered model set # Database DATABASE_URL=postgresql+asyncpg://gateway:...@postgres:5432/neuronetz DATABASE_POOL_SIZE=10 DATABASE_POOL_OVERFLOW=20 # Redis REDIS_URL=redis://redis:6379/0 REDIS_KEY_CACHE_TTL_S=60 # Limits (defaults; per-tenant/key overrides in DB) DEFAULT_RPM=60 DEFAULT_TPM=100000 DEFAULT_CONCURRENT=8 MAX_REQUEST_BODY_BYTES=262144 MAX_NUM_PREDICT=4096 # Security ARGON2_TIME_COST=3 ARGON2_MEMORY_COST_KIB=65536 ARGON2_PARALLELISM=4 AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN=20 # Audit AUDIT_BUFFER_SIZE=1000 PROMPT_LOG_DEFAULT_RETENTION_DAYS=30 AUDIT_LOG_DEFAULT_RETENTION_DAYS=365 ``` ## 8. Repository Layout ``` neuronetz-gateway/ ├── pyproject.toml # uv-managed, ruff, mypy --strict, pytest ├── README.md ├── LICENSE # Apache 2.0 ├── docker-compose.yml # full stack incl. console placeholder ├── docker-compose.dev.yml # without caddy, gateway exposed on localhost ├── Dockerfile # multi-stage, python:3.12-slim base ├── .env.example ├── .dockerignore ├── .gitignore ├── alembic.ini ├── alembic/ │ ├── env.py │ └── versions/ │ └── 0001_initial.py # creates schema `gateway` and all tables ├── ops/ │ ├── caddy/ │ │ └── Caddyfile.example │ └── systemd/ │ └── neuronetz-gateway.service ├── src/neuronetz_gateway/ │ ├── __init__.py │ ├── __main__.py # uvicorn entry │ ├── app.py # FastAPI factory │ ├── config.py # Pydantic Settings │ ├── deps.py # DI providers │ ├── lifespan.py # startup/shutdown, NOTIFY listener │ ├── errors.py # exception types, handlers, sanitization │ ├── auth/ │ │ ├── __init__.py │ │ ├── hashing.py # argon2id wrapper │ │ ├── keys.py # key generation, prefix, verify │ │ └── middleware.py │ ├── ratelimit/ │ │ ├── __init__.py │ │ ├── sliding_window.py # Redis Lua script │ │ └── concurrency.py # semaphore via Redis │ ├── budget/ │ │ ├── __init__.py │ │ ├── counter.py # Redis period counters │ │ └── ledger.py # Postgres reconciliation │ ├── proxy/ │ │ ├── __init__.py │ │ ├── ollama.py # httpx streaming client │ │ ├── translate.py # OpenAI <-> Ollama schemas │ │ ├── token_counter.py # parse usage from stream │ │ ├── discovery.py # live model discovery from Ollama /api/tags (§4.6) │ │ └── allowlist.py # effective-set resolution (allow_all / allowed ∩ discovered) │ ├── routes/ │ │ ├── __init__.py │ │ ├── ollama_native.py │ │ ├── openai_compat.py │ │ └── health.py │ ├── db/ │ │ ├── __init__.py │ │ ├── session.py │ │ ├── models.py # SQLAlchemy 2.0 │ │ └── repositories.py │ ├── audit/ │ │ ├── __init__.py │ │ ├── writer.py # buffered async writer │ │ └── prompt_log.py │ ├── observability/ │ │ ├── __init__.py │ │ ├── logging.py # structlog config │ │ └── metrics.py # prometheus │ └── cli/ │ ├── __init__.py │ └── manage.py # typer: create-tenant, create-key, ... ├── tests/ │ ├── conftest.py # testcontainers fixtures │ ├── unit/ │ │ ├── test_hashing.py │ │ ├── test_translate.py │ │ ├── test_token_counter.py │ │ ├── test_discovery.py │ │ ├── test_allowlist.py │ │ └── test_sliding_window.py │ ├── integration/ │ │ ├── test_auth_flow.py │ │ ├── test_rate_limit.py │ │ ├── test_budget.py │ │ ├── test_proxy_stream.py │ │ ├── test_openai_compat.py │ │ ├── test_revocation.py │ │ └── mock_ollama.py # FastAPI mock with NDJSON/SSE │ └── load/ │ └── locustfile.py └── docs/ ├── ARCHITECTURE.md ├── DEPLOYMENT.md ├── API.md ├── THREAT_MODEL.md └── OPERATIONS.md # runbook: revoke key, rotate, check usage ``` ## 9. Non-Functional Requirements - **Performance:** p50 overhead < 5 ms over direct Ollama call (auth + ratelimit + audit); p99 < 25 ms (excluding upstream latency) - **Streaming:** Time-to-first-byte must not be degraded by gateway logic — audit write happens **after** stream close - **Memory:** Steady-state RSS < 200 MiB per gateway worker under 100 concurrent streams - **Concurrency:** Handle 200 concurrent connections per worker; 4 workers per instance default - **Test coverage:** ≥ 85% line coverage on `src/neuronetz_gateway/` excluding `__main__` and CLI; 100% on `auth/`, `ratelimit/`, `budget/` - **Security:** No `eval`, no `exec`, no shell-out, no `pickle`. Bandit clean. `pip-audit` clean on every CI run. - **Type safety:** `mypy --strict` clean - **Lint:** `ruff check` clean with project ruleset (E, F, I, B, UP, S, ASYNC) ## 10. Tooling - Python 3.12 - `uv` for dependency management (pyproject.toml + uv.lock) - FastAPI ≥ 0.115, uvicorn[standard], httpx ≥ 0.27, SQLAlchemy 2.0 (async), asyncpg, redis ≥ 5.0 (with hiredis), structlog, pydantic ≥ 2.9, pydantic-settings, argon2-cffi, typer, prometheus-client - Test: pytest, pytest-asyncio, pytest-cov, testcontainers, httpx (test client), respx (mock), locust - Lint/format: ruff, mypy --strict, bandit, pip-audit - CI: GitHub Actions workflow (lint, type, test with coverage, build image, push on tag) ## 11. Bootstrap CLI (Typer) ``` neuronetz-gateway create-tenant --name "acme" [--rpm 60] [--tpm 100000] neuronetz-gateway create-key --tenant acme --name "prod-server-1" [--scopes chat,embeddings] neuronetz-gateway revoke-key --prefix nz_abc12345 neuronetz-gateway list-keys --tenant acme neuronetz-gateway show-usage --tenant acme [--period day|month|total] neuronetz-gateway set-budget --key nz_abc12345 --daily 1000000 --monthly 30000000 neuronetz-gateway set-models --tenant acme --models llama3.1:8b,mistral:7b neuronetz-gateway set-models --tenant acme --allow-all # opt into allow_all_models neuronetz-gateway set-models --tenant acme --no-allow-all # back to explicit allowlist neuronetz-gateway list-models [--tenant acme] # show live-discovered models # (and the tenant's effective set) ``` `create-tenant` accepts `--allow-all-models / --no-allow-all-models` (default off). `list-models` reads the discovery cache (§4.6); with `--tenant` it also shows that tenant's resolved effective set. Key format: `nz_<12-char-prefix><32-char-random>`. Prefix is stored; full key is hashed (argon2id). On creation, the full key is printed exactly once. ## 12. Acceptance Criteria The build is "done" when every box below is checked. The orchestrator must verify each before declaring v0.1.0. - [ ] `docker compose up` from a clean checkout produces a running stack with TLS via Caddy (self-signed in dev, Let's Encrypt-ready in prod). - [ ] CLI creates tenant and key; printed key successfully authenticates an `/api/chat` call. - [ ] Unauthenticated request returns 401 with no Ollama details leaked. - [ ] Request to `/api/pull` returns 403 with generic error message. - [ ] Streaming `/api/chat` works end-to-end; first byte arrives within Ollama's own TTFB + < 10 ms gateway overhead. - [ ] Streaming `/v1/chat/completions` returns valid SSE with `data: [DONE]` terminator. - [ ] Token counts in audit log match Ollama's reported `prompt_eval_count` + `eval_count` exactly. - [ ] `/api/tags` and `/v1/models` reflect the **live** Ollama model set (discovery, §4.6): an `allow_all_models` tenant sees every installed model and a newly-pulled model appears within one refresh interval; a default-deny tenant sees only `allowed_models ∩ discovered`; a request for a model outside the effective set returns a generic 403; with discovery unavailable, requests fail closed (deny), not open. - [ ] Rate limit triggers at configured RPM with `Retry-After` header. - [ ] Token budget enforces and blocks at zero remaining with descriptive error. - [ ] Redis outage causes 503 (fail-closed), not 200. - [ ] Revocation via `INSERT INTO gateway.revocations` evicts Redis cache within 1 second. - [ ] `mypy --strict`, `ruff check`, `bandit`, `pip-audit` all clean in CI. - [ ] Test coverage ≥ 85% overall, 100% in `auth/`, `ratelimit/`, `budget/`. - [ ] `docs/THREAT_MODEL.md`, `docs/DEPLOYMENT.md`, `docs/OPERATIONS.md` present and accurate. - [ ] Load test (locust): 100 concurrent users sustained 5 minutes, p99 gateway overhead < 25 ms, zero 5xx outside induced failures. ## 13. Open Questions (decide during build) 1. Embedding cost accounting — Ollama doesn't return `eval_count` for embeddings. Decision: charge based on `prompt_eval_count` only; document as such. 2. SSE vs NDJSON heuristic for OpenAI-compat — always SSE per OpenAI spec. NDJSON only on native `/api/*`. 3. Prometheus cardinality — do not label by `key_id` (too many series); label by `tenant_id` only; per-key data lives in Postgres. 4. **Model discovery source** — the live model list is `GET /api/tags` on the Ollama backend; there is no separate registry. Cached in Redis + in-process, refreshed every `MODEL_DISCOVERY_REFRESH_S`. 5. **Discovery failure is fail-closed** — empty/expired discovered set ⇒ no model resolves ⇒ deny. Discovery never opens access on error. 6. **No existence disclosure** — a model that is installed-but-unpermitted and a model that is not installed both return the same generic response, to prevent enumeration. 7. **`allow_all_models` precedence** — key-level `allow_all_models` (when non-NULL) overrides the tenant flag; otherwise the tenant flag applies. Same NULL-inherits-tenant rule as the other key limits. ## 14. References - Ollama API: https://github.com/ollama/ollama/blob/main/docs/api.md - OpenAI Chat Completions: https://platform.openai.com/docs/api-reference/chat - Nibiru (sibling console project): https://nibiru-framework.com - Argon2 RFC 9106