# neuronetz-gateway — SPEC.md

**Project:** `neuronetz-gateway`
**Version:** 0.1.0 (target)
**Status:** Specification — not yet implemented
**License:** Apache 2.0
**Owner:** Stephan Berbig / Neuronetz

---

## 1. Purpose

A secure, multi-tenant API gateway in front of an Ollama instance currently exposed at `https://api.neuronetz.ai`. The Ollama endpoint must never be reachable directly from the public internet again. All access flows through this gateway.

The gateway is the **hot path** of the Neuronetz API. A separate service (`neuronetz-console`, built on the Nibiru PHP framework) handles administration, dashboards, and tenant self-service. This SPEC covers only the gateway.

## 2. Scope

### In scope (v0.1.0)

- Authentication via API keys (Bearer tokens)
- Multi-tenant data model (tenants → keys, with inheritance)
- Per-key and per-tenant rate limiting (RPM, TPM, concurrent)
- Per-key and per-tenant token budgets (daily, monthly, total)
- Streaming and non-streaming proxy to Ollama
- Dual API surface: native Ollama (`/api/*`) and OpenAI-compatible (`/v1/*`)
- Endpoint allowlist (block all model-mutating Ollama endpoints)
- **Dynamic model discovery** from the Ollama backend — the live set of installed models is queried, cached, and auto-refreshed; nothing about the model list is hand-maintained
- Model allowlist (per-tenant override), **default-deny, resolved against the live discovered set** (stale/typo'd entries never resolve)
- **Per-tenant `allow_all_models` toggle** — opt-in: a flagged tenant may use any currently-installed model, so models newly pulled into Ollama are auto-granted on the next discovery refresh
- Request size limits, response size limits, timeouts
- Token counting from Ollama responses (precise, not heuristic)
- Audit log (always-on metadata)
- Prompt log (opt-in per key, TTL'd retention)
- Bootstrap CLI: create tenants, keys, set budgets
- Health and readiness endpoints
- Docker Compose deployment (gateway + caddy + postgres + redis + ollama)
- Caddy as TLS terminator (Let's Encrypt for `api.neuronetz.ai`)

### Out of scope (v0.1.0, document as future)

- Web admin UI (lives in `neuronetz-console`, separate repo)
- Billing / Stripe integration (budgets only, no money yet)
- Multi-region / HA / k8s
- Content moderation / prompt-injection filtering
- Response caching
- Multi-backend routing (one Ollama; pluggable backend interface stays for later)
- Webhook notifications
- SSO / OAuth2 for admin

## 3. Threat Model (abbreviated)

| Threat | Mitigation |
|---|---|
| Internet scanners hitting Ollama directly | Ollama bound to internal Docker network; never published |
| Unauthenticated API abuse | Mandatory Bearer token; fail-closed on auth errors |
| API key brute force | Argon2id hashing; constant-time compare; rate limit on auth failures per source IP |
| GPU/token exhaustion (cost attack) | Per-key TPM + token budget; per-tenant ceiling; concurrent connection cap |
| Resource exhaustion via large payloads | Request body size limit (default 256 KiB); `num_predict` cap (default 4096) |
| Model enumeration / training-data exfil via uncommon models | Model allowlist; default-deny. `allow_all_models` is **opt-in per tenant and audited**. Discovery only ever exposes models actually installed on the backend; `/api/tags` and `/v1/models` never reveal models outside the tenant's effective set; "not allowed" and "doesn't exist" return the same generic response |
| Discovery backend unreachable | Fail-closed: an empty/stale-expired discovered set means no model resolves, so requests are denied — never "allow because we couldn't list models" |
| Ollama mutation (model pull/delete) by attacker | Endpoint allowlist; mutating endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`) hard-blocked at the gateway (§6.2) |
| Information disclosure via error messages | Sanitize upstream errors; never proxy Ollama internals to client |
| Audit log tampering | Append-only at app layer; DB role separation; optional WAL archiving |
| Prompt data leakage | Prompt logging off by default; opt-in per key; TTL'd; redaction hook |
| Redis outage causing "fail open" | Fail-closed: if rate-limit/budget backend is unavailable, deny |
| Compromised admin token | Admin token lives in `neuronetz-console`, not in gateway; gateway has no admin endpoints |

## 4. Architecture

### 4.1 Component diagram

```
                          Internet
                              │ TLS
                              ▼
                  ┌──────────────────────┐
                  │ Caddy (sidecar)      │  Let's Encrypt for api.neuronetz.ai
                  │ - TLS termination    │  HSTS, security headers
                  │ - HTTP/2, HTTP/3     │
                  └──────────┬───────────┘
                             │ HTTP/1.1 internal
                  ┌──────────▼───────────┐
                  │ neuronetz-gateway    │  FastAPI + uvicorn
                  │  - authn             │
                  │  - rate limit        │
                  │  - budget check      │
                  │  - proxy + stream    │
                  │  - token count       │
                  │  - audit write       │
                  └──┬────────┬──────┬───┘
                     │        │      │
              ┌──────▼──┐  ┌──▼───┐  │
              │Postgres │  │Redis │  │
              │ schema: │  │ keys │  │
              │ gateway │  │bucket│  │
              └─────────┘  └──────┘  │
                                     │ internal network only
                              ┌──────▼──────┐
                              │   Ollama    │
                              │ ollama:11434│
                              └─────────────┘

Same Compose stack also hosts (separate from this SPEC):
  - neuronetz-console (PHP/Nibiru) → owns schema `console`, reads schema `gateway` (SELECT), inserts into `gateway.revocations`
```

### 4.2 Database schemas

**Single Postgres instance, two schemas:**

- `gateway` — owned by the gateway service; gateway role has full DDL
- `console` — owned by `neuronetz-console` (out of scope here); console role has full DDL
- Both services connect with their own role. Cross-schema access is explicit GRANT.

**Console role gets `SELECT` on all `gateway.*` tables.** Console writes go to `console.*` tables, with one exception: when the console needs to mutate gateway state (e.g. revoke a key), it inserts into the `gateway.revocations` outbox table, which the gateway consumes (see §4.5).

### 4.3 Request lifecycle

1. Caddy terminates TLS, forwards to gateway on internal port.
2. Gateway middleware extracts `Authorization: Bearer <key>`.
3. The key prefix (first 12 chars) is used as the Redis cache key. On a miss, look up `gateway.api_keys` by prefix, verify the full key with argon2id `verify`, and cache the resolved key metadata in Redis (TTL 60s).
4. Rate limit check (sliding window in Redis, Lua-atomic) — per-key RPM + per-tenant RPM.
5. Budget check (Redis counter for current period; Postgres ledger is source of truth on reset).
6. Concurrent-connection semaphore (Redis `INCR` with TTL).
7. Model allowlist check. Resolve the **effective model set** for the key:
   `allow_all := key.allow_all_models ?? tenant.allow_all_models`;
   `effective := discovered` if `allow_all` else `(key.allowed_models ?? tenant.allowed_models) ∩ discovered`,
   where `discovered` is the cached live model set from discovery (§4.6). The request's
   `model` must be in `effective`, else a generic 403 with no disclosure of whether the
   model exists but is unpermitted vs. is not installed.
8. Endpoint allowlist check.
9. Request body validation (size, schema, `num_predict` cap).
10. If OpenAI-compat path, translate request to Ollama schema.
11. Open httpx async stream to Ollama.
12. Stream response back to client, accumulating final `prompt_eval_count` + `eval_count`.
13. On stream close: write `gateway.audit_log` row; decrement budget; release semaphore; if prompt logging enabled, write `gateway.prompt_log` row.
14. On any failure: sanitized error to client, audit row with status code, semaphore released.

### 4.4 Failure modes (fail-closed)

| Subsystem | If down | Behavior |
|---|---|---|
| Postgres (read) | Key lookup fails | 503 with retry-after; no requests proxied |
| Postgres (write) | Audit write fails | Request still succeeds, audit row buffered in-memory ring (max 1000), drained on recovery; if buffer fills, switch to deny mode |
| Redis | Rate limit / budget unavailable | 503 — fail closed. Never "allow because we can't check." |
| Ollama | Upstream unreachable | 502 with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30s |
| Caddy | Not a gateway concern | — |
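
The Ollama row describes a circuit breaker (open after 5 consecutive failures, half-open after 30s). A minimal in-process sketch of that state machine follows. The class name, the single-probe policy in the half-open state, and the injectable clock are assumptions made for illustration; the SPEC does not prescribe them.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Opens after N consecutive failures; allows one probe after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: float | None = None
        self._probe_in_flight = False

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        if self._clock() - self._opened_at < self._reset_after_s:
            return False  # open: reject without touching the upstream
        # Half-open: let exactly one probe request through.
        if self._probe_in_flight:
            return False
        self._probe_in_flight = True
        return True

    def record_success(self) -> None:
        # Any success closes the breaker and resets the failure streak.
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        self._probe_in_flight = False
        self._failures += 1
        if self._failures >= self._threshold:
            # Open (or re-open after a failed probe) and restart the cool-down.
            self._opened_at = self._clock()
```

A request handler would call `allow_request()` before opening the upstream stream, return the 502 from the table when it is refused, and report the outcome with `record_success()` or `record_failure()`.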

### 4.5 Cache invalidation (key revocation)

The console revokes a key by inserting into `gateway.revocations(key_id, ts, reason)`. The gateway runs a background task (`asyncio.create_task` in lifespan) that:
- LISTENs on the Postgres channel `key_revoked` (the gateway emits NOTIFY on its own write path; console inserts emit it via the INSERT trigger in §5)
- On notification, evicts the Redis cache entry for that key's prefix

Revocation therefore takes effect almost immediately (NOTIFY delivery plus one Redis round trip) without cross-service HTTP.

### 4.6 Model discovery

The set of usable models is **never hand-maintained**; it is extracted live from the
Ollama backend.

- A background task (started in lifespan, like the revocation listener) polls Ollama
  `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds.
- The parsed model set (names + sanitized metadata: family, parameter size, quantization,
  size bytes, modified-at) is cached in Redis under `gateway:models:discovered` with TTL
  `MODEL_DISCOVERY_CACHE_TTL_S`, and held in-process for hot reads on the request path.
- On startup an initial fetch runs; if Ollama is unreachable the discovered set is empty.
- **Fail-closed:** if the discovered set is empty or its cache has expired and cannot be
  refreshed, no model resolves and requests are denied (consistent with default-deny).
  Discovery never opens access on failure.
- "Auto-grant": because the effective set (§4.3 step 7) intersects with `discovered` (or
  *is* `discovered` when `allow_all_models`), a model pulled into Ollama out-of-band
  becomes usable to `allow_all` tenants on the next refresh — no per-tenant config change.
- Discovery is **read-only** against Ollama and uses only the allowlisted `/api/tags`
  endpoint; it never triggers a model pull.

## 5. Data Model (schema `gateway`)

```sql
CREATE SCHEMA gateway;

CREATE TYPE gateway.key_status AS ENUM ('active', 'disabled', 'revoked');
CREATE TYPE gateway.tenant_status AS ENUM ('active', 'suspended', 'closed');
CREATE TYPE gateway.budget_period AS ENUM ('day', 'month', 'total');

CREATE TABLE gateway.tenants (
    id              uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    name            text NOT NULL UNIQUE,
    status          gateway.tenant_status NOT NULL DEFAULT 'active',
    created_at      timestamptz NOT NULL DEFAULT now(),
    metadata        jsonb NOT NULL DEFAULT '{}'::jsonb
);

CREATE TABLE gateway.tenant_limits (
    tenant_id           uuid PRIMARY KEY REFERENCES gateway.tenants(id) ON DELETE CASCADE,
    rpm                 integer NOT NULL DEFAULT 60,
    tpm                 integer NOT NULL DEFAULT 100000,
    concurrent          integer NOT NULL DEFAULT 8,
    tokens_daily        bigint,
    tokens_monthly      bigint,
    tokens_total        bigint,
    allowed_models      text[] NOT NULL DEFAULT '{}',
    allow_all_models    boolean NOT NULL DEFAULT false,  -- opt-in: allow any installed model
    log_prompts_default boolean NOT NULL DEFAULT false,
    prompt_retention_days integer NOT NULL DEFAULT 30,
    audit_retention_days  integer NOT NULL DEFAULT 365
);

CREATE TABLE gateway.api_keys (
    id              uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       uuid NOT NULL REFERENCES gateway.tenants(id) ON DELETE CASCADE,
    prefix          text NOT NULL UNIQUE,          -- first 12 chars, indexed
    key_hash        text NOT NULL,                  -- argon2id
    name            text NOT NULL,
    status          gateway.key_status NOT NULL DEFAULT 'active',
    scopes          text[] NOT NULL DEFAULT '{chat,embeddings}',
    created_at      timestamptz NOT NULL DEFAULT now(),
    last_used_at    timestamptz,
    expires_at      timestamptz,
    log_prompts     boolean,                        -- NULL = inherit from tenant
    metadata        jsonb NOT NULL DEFAULT '{}'::jsonb
);

CREATE INDEX idx_api_keys_prefix ON gateway.api_keys(prefix) WHERE status = 'active';
CREATE INDEX idx_api_keys_tenant ON gateway.api_keys(tenant_id);

CREATE TABLE gateway.key_limits (
    key_id              uuid PRIMARY KEY REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
    rpm                 integer,            -- NULL = inherit tenant
    tpm                 integer,
    concurrent          integer,
    tokens_daily        bigint,
    tokens_monthly      bigint,
    tokens_total        bigint,
    allowed_models      text[],             -- NULL = inherit tenant
    allow_all_models    boolean             -- NULL = inherit tenant
);

CREATE TABLE gateway.budget_usage (
    key_id          uuid NOT NULL REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
    period          gateway.budget_period NOT NULL,
    period_start    timestamptz NOT NULL,
    tokens_in       bigint NOT NULL DEFAULT 0,
    tokens_out      bigint NOT NULL DEFAULT 0,
    requests        bigint NOT NULL DEFAULT 0,
    PRIMARY KEY (key_id, period, period_start)
);

CREATE INDEX idx_budget_usage_period ON gateway.budget_usage(period, period_start);

CREATE TABLE gateway.audit_log (
    id              bigserial PRIMARY KEY,
    ts              timestamptz NOT NULL DEFAULT now(),
    request_id      uuid NOT NULL,
    tenant_id       uuid,                          -- nullable for auth-failed rows
    key_id          uuid,
    key_prefix      text,                          -- denormalized for forensic queries
    method          text NOT NULL,
    path            text NOT NULL,
    model           text,
    tokens_in       integer,
    tokens_out      integer,
    latency_ms      integer,
    status          integer NOT NULL,
    client_ip       inet,
    user_agent      text,
    error_code      text
);

CREATE INDEX idx_audit_ts ON gateway.audit_log(ts);
CREATE INDEX idx_audit_tenant_ts ON gateway.audit_log(tenant_id, ts);
CREATE INDEX idx_audit_key_ts ON gateway.audit_log(key_id, ts);

CREATE TABLE gateway.prompt_log (
    id              bigserial PRIMARY KEY,
    audit_id        bigint NOT NULL REFERENCES gateway.audit_log(id) ON DELETE CASCADE,
    ts              timestamptz NOT NULL DEFAULT now(),
    key_id          uuid NOT NULL,
    request_body    jsonb NOT NULL,
    response_text   text,
    retention_until timestamptz NOT NULL
);

CREATE INDEX idx_prompt_log_retention ON gateway.prompt_log(retention_until);

CREATE TABLE gateway.revocations (
    id              bigserial PRIMARY KEY,
    key_id          uuid NOT NULL,
    ts              timestamptz NOT NULL DEFAULT now(),
    reason          text,
    processed_at    timestamptz
);

-- Trigger to NOTIFY on revocation insert
CREATE OR REPLACE FUNCTION gateway.notify_key_revoked() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('key_revoked', NEW.key_id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_notify_key_revoked
    AFTER INSERT ON gateway.revocations
    FOR EACH ROW EXECUTE FUNCTION gateway.notify_key_revoked();

-- Grants for console role (created in console SPEC, referenced here)
-- GRANT USAGE ON SCHEMA gateway TO console_role;
-- GRANT SELECT ON ALL TABLES IN SCHEMA gateway TO console_role;
-- GRANT INSERT ON gateway.revocations TO console_role;
```

## 6. API Surface

### 6.1 Native Ollama passthrough (allowlisted)

| Path | Method | Notes |
|---|---|---|
| `/api/chat` | POST | Streamed (NDJSON) and non-streamed |
| `/api/generate` | POST | Streamed (NDJSON) and non-streamed |
| `/api/embeddings` | POST | Non-streamed |
| `/api/embed` | POST | Newer Ollama embeddings endpoint |
| `/api/tags` | GET | Returns the tenant's **effective** model set (live-discovered ∩ allowed, or *all* discovered when `allow_all_models`). Sourced from discovery (§4.6), never a static list |
| `/api/show` | POST | Allowed only for models in the tenant's effective set; returns sanitized model info (no system prompts, no template) |
| `/api/ps` | GET | **Blocked** — leaks loaded models |
| `/api/version` | GET | Returns gateway version, not Ollama version |

### 6.2 Hard-blocked Ollama endpoints (always 403)

`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`

### 6.3 OpenAI-compatible

| Path | Method | Maps to |
|---|---|---|
| `/v1/chat/completions` | POST | `/api/chat` |
| `/v1/completions` | POST | `/api/generate` |
| `/v1/embeddings` | POST | `/api/embed` |
| `/v1/models` | GET | `/api/tags` (the tenant's effective discovered set), in OpenAI model-list format |

Translation must preserve streaming. SSE (`data: {...}\n\n`) for OpenAI-compat; NDJSON for native.
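
As an illustration of the streaming translation, the sketch below converts Ollama `/api/chat` NDJSON lines into OpenAI-style SSE events. It is synchronous for brevity (the gateway itself would use an async generator over the httpx stream), it handles text content only, and the function name plus the `completion_id` and `created` parameters are assumptions. The `done_reason` field is read if present and otherwise defaults to `stop`.

```python
import json
from typing import Iterable, Iterator


def ollama_chat_to_sse(
    ndjson_lines: Iterable[str], completion_id: str, created: int
) -> Iterator[str]:
    """Translate Ollama chat NDJSON chunks into OpenAI chat.completion.chunk events."""
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        done = bool(chunk.get("done"))
        content = (chunk.get("message") or {}).get("content", "")
        event = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": chunk.get("model"),
            "choices": [
                {
                    "index": 0,
                    "delta": {"content": content} if content else {},
                    "finish_reason": (chunk.get("done_reason") or "stop") if done else None,
                }
            ],
        }
        yield f"data: {json.dumps(event)}\n\n"
        if done:
            # The terminator is sent only after Ollama's final chunk, so a
            # truncated upstream stream is not reported as complete.
            yield "data: [DONE]\n\n"
            return
```

Each upstream line is forwarded as soon as it is parsed, so the translation adds no buffering between Ollama and the client.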

### 6.4 Gateway endpoints

| Path | Method | Auth | Purpose |
|---|---|---|---|
| `/healthz` | GET | none | Liveness — process responsive |
| `/readyz` | GET | none | Readiness — DB + Redis + Ollama all reachable |
| `/metrics` | GET | none (loopback only) | Prometheus exposition (counters, histograms) |

No admin endpoints. Admin lives in `neuronetz-console`.

### 6.5 Response headers

Every proxied response carries:
- `X-Request-ID: <uuid>`
- `X-RateLimit-Limit-Requests: <n>`
- `X-RateLimit-Remaining-Requests: <n>`
- `X-RateLimit-Limit-Tokens: <n>`
- `X-RateLimit-Remaining-Tokens: <n>`
- `X-Budget-Period: day|month|total`
- `X-Budget-Tokens-Remaining: <n>`

429 responses additionally carry `Retry-After: <seconds>`, as do the 502 and 503 responses that §4.4 marks with retry-after.
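
The request-limit headers and `Retry-After` fall out of the rate-limit decision itself. The sketch below derives them from an in-process sliding-window log. The SPEC's limiter runs as a Lua script in Redis (§4.3 step 4), so this only illustrates the arithmetic; `SlidingWindow` and `rate_limit_headers` are hypothetical names.

```python
import math
import time
from collections import deque
from typing import Callable


class SlidingWindow:
    """Sliding-window log: at most `limit` hits in any `window_s` interval."""

    def __init__(
        self,
        limit: int,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self._window_s = window_s
        self._clock = clock
        self._hits: deque[float] = deque()

    def check(self) -> tuple[bool, int, int]:
        """Return (allowed, remaining, retry_after_s) and record the hit if allowed."""
        now = self._clock()
        cutoff = now - self._window_s
        # Drop hits that have slid out of the window.
        while self._hits and self._hits[0] <= cutoff:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            # The next slot frees up when the oldest hit leaves the window.
            retry_after = math.ceil(self._hits[0] + self._window_s - now)
            return False, 0, max(retry_after, 1)
        self._hits.append(now)
        return True, self.limit - len(self._hits), 0


def rate_limit_headers(limit: int, remaining: int, retry_after_s: int) -> dict[str, str]:
    """Build the request-limit headers; Retry-After only on a rejection."""
    headers = {
        "X-RateLimit-Limit-Requests": str(limit),
        "X-RateLimit-Remaining-Requests": str(remaining),
    }
    if retry_after_s > 0:
        headers["Retry-After"] = str(retry_after_s)
    return headers
```

The token headers (`X-RateLimit-*-Tokens`) would be produced the same way from the TPM window.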

## 7. Configuration

All via environment variables, validated by Pydantic Settings on boot. Boot fails loudly on invalid config.

```
# Service
GATEWAY_BIND_HOST=0.0.0.0
GATEWAY_BIND_PORT=8080
GATEWAY_LOG_LEVEL=INFO
GATEWAY_LOG_FORMAT=json                  # json|console
GATEWAY_REQUEST_ID_HEADER=X-Request-ID
GATEWAY_TRUSTED_PROXIES=127.0.0.1,caddy  # for X-Forwarded-For

# Upstream
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_CONNECT_TIMEOUT_S=5
OLLAMA_READ_TIMEOUT_S=600
OLLAMA_MAX_CONNECTIONS=64

# Model discovery (§4.6)
MODEL_DISCOVERY_REFRESH_S=60             # how often to re-query Ollama /api/tags
MODEL_DISCOVERY_CACHE_TTL_S=120          # Redis cache TTL for the discovered model set

# Database
DATABASE_URL=postgresql+asyncpg://gateway:...@postgres:5432/neuronetz
DATABASE_POOL_SIZE=10
DATABASE_POOL_OVERFLOW=20

# Redis
REDIS_URL=redis://redis:6379/0
REDIS_KEY_CACHE_TTL_S=60

# Limits (defaults; per-tenant/key overrides in DB)
DEFAULT_RPM=60
DEFAULT_TPM=100000
DEFAULT_CONCURRENT=8
MAX_REQUEST_BODY_BYTES=262144
MAX_NUM_PREDICT=4096

# Security
ARGON2_TIME_COST=3
ARGON2_MEMORY_COST_KIB=65536
ARGON2_PARALLELISM=4
AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN=20

# Audit
AUDIT_BUFFER_SIZE=1000
PROMPT_LOG_DEFAULT_RETENTION_DAYS=30
AUDIT_LOG_DEFAULT_RETENTION_DAYS=365
```

## 8. Repository Layout

```
neuronetz-gateway/
├── pyproject.toml                # uv-managed, ruff, mypy --strict, pytest
├── README.md
├── LICENSE                       # Apache 2.0
├── docker-compose.yml            # full stack incl. console placeholder
├── docker-compose.dev.yml        # without caddy, gateway exposed on localhost
├── Dockerfile                    # multi-stage, python:3.12-slim base
├── .env.example
├── .dockerignore
├── .gitignore
├── alembic.ini
├── alembic/
│   ├── env.py
│   └── versions/
│       └── 0001_initial.py       # creates schema `gateway` and all tables
├── ops/
│   ├── caddy/
│   │   └── Caddyfile.example
│   └── systemd/
│       └── neuronetz-gateway.service
├── src/neuronetz_gateway/
│   ├── __init__.py
│   ├── __main__.py               # uvicorn entry
│   ├── app.py                    # FastAPI factory
│   ├── config.py                 # Pydantic Settings
│   ├── deps.py                   # DI providers
│   ├── lifespan.py               # startup/shutdown, NOTIFY listener
│   ├── errors.py                 # exception types, handlers, sanitization
│   ├── auth/
│   │   ├── __init__.py
│   │   ├── hashing.py            # argon2id wrapper
│   │   ├── keys.py               # key generation, prefix, verify
│   │   └── middleware.py
│   ├── ratelimit/
│   │   ├── __init__.py
│   │   ├── sliding_window.py     # Redis Lua script
│   │   └── concurrency.py        # semaphore via Redis
│   ├── budget/
│   │   ├── __init__.py
│   │   ├── counter.py            # Redis period counters
│   │   └── ledger.py             # Postgres reconciliation
│   ├── proxy/
│   │   ├── __init__.py
│   │   ├── ollama.py             # httpx streaming client
│   │   ├── translate.py          # OpenAI <-> Ollama schemas
│   │   ├── token_counter.py      # parse usage from stream
│   │   ├── discovery.py          # live model discovery from Ollama /api/tags (§4.6)
│   │   └── allowlist.py          # effective-set resolution (allow_all / allowed ∩ discovered)
│   ├── routes/
│   │   ├── __init__.py
│   │   ├── ollama_native.py
│   │   ├── openai_compat.py
│   │   └── health.py
│   ├── db/
│   │   ├── __init__.py
│   │   ├── session.py
│   │   ├── models.py             # SQLAlchemy 2.0
│   │   └── repositories.py
│   ├── audit/
│   │   ├── __init__.py
│   │   ├── writer.py             # buffered async writer
│   │   └── prompt_log.py
│   ├── observability/
│   │   ├── __init__.py
│   │   ├── logging.py            # structlog config
│   │   └── metrics.py            # prometheus
│   └── cli/
│       ├── __init__.py
│       └── manage.py             # typer: create-tenant, create-key, ...
├── tests/
│   ├── conftest.py               # testcontainers fixtures
│   ├── unit/
│   │   ├── test_hashing.py
│   │   ├── test_translate.py
│   │   ├── test_token_counter.py
│   │   ├── test_discovery.py
│   │   ├── test_allowlist.py
│   │   └── test_sliding_window.py
│   ├── integration/
│   │   ├── test_auth_flow.py
│   │   ├── test_rate_limit.py
│   │   ├── test_budget.py
│   │   ├── test_proxy_stream.py
│   │   ├── test_openai_compat.py
│   │   ├── test_revocation.py
│   │   └── mock_ollama.py        # FastAPI mock with NDJSON/SSE
│   └── load/
│       └── locustfile.py
└── docs/
    ├── ARCHITECTURE.md
    ├── DEPLOYMENT.md
    ├── API.md
    ├── THREAT_MODEL.md
    └── OPERATIONS.md              # runbook: revoke key, rotate, check usage
```

## 9. Non-Functional Requirements

- **Performance:** p50 overhead < 5 ms over direct Ollama call (auth + ratelimit + audit); p99 < 25 ms (excluding upstream latency)
- **Streaming:** Time-to-first-byte must not be degraded by gateway logic — audit write happens **after** stream close
- **Memory:** Steady-state RSS < 200 MiB per gateway worker under 100 concurrent streams
- **Concurrency:** Handle 200 concurrent connections per worker; 4 workers per instance default
- **Test coverage:** ≥ 85% line coverage on `src/neuronetz_gateway/` excluding `__main__` and CLI; 100% on `auth/`, `ratelimit/`, `budget/`
- **Security:** No `eval`, no `exec`, no shell-out, no `pickle`. Bandit clean. `pip-audit` clean on every CI run.
- **Type safety:** `mypy --strict` clean
- **Lint:** `ruff check` clean with project ruleset (E, F, I, B, UP, S, ASYNC)

## 10. Tooling

- Python 3.12
- `uv` for dependency management (pyproject.toml + uv.lock)
- FastAPI ≥ 0.115, uvicorn[standard], httpx ≥ 0.27, SQLAlchemy 2.0 (async), asyncpg, redis ≥ 5.0 (with hiredis), structlog, pydantic ≥ 2.9, pydantic-settings, argon2-cffi, typer, prometheus-client
- Test: pytest, pytest-asyncio, pytest-cov, testcontainers, httpx (test client), respx (mock), locust
- Lint/format: ruff, mypy --strict, bandit, pip-audit
- CI: GitHub Actions workflow (lint, type, test with coverage, build image, push on tag)

## 11. Bootstrap CLI (Typer)

```
neuronetz-gateway create-tenant --name "acme" [--rpm 60] [--tpm 100000]
neuronetz-gateway create-key --tenant acme --name "prod-server-1" [--scopes chat,embeddings]
neuronetz-gateway revoke-key --prefix nz_abc12345
neuronetz-gateway list-keys --tenant acme
neuronetz-gateway show-usage --tenant acme [--period day|month|total]
neuronetz-gateway set-budget --key nz_abc12345 --daily 1000000 --monthly 30000000
neuronetz-gateway set-models --tenant acme --models llama3.1:8b,mistral:7b
neuronetz-gateway set-models --tenant acme --allow-all          # opt into allow_all_models
neuronetz-gateway set-models --tenant acme --no-allow-all       # back to explicit allowlist
neuronetz-gateway list-models [--tenant acme]                   # show live-discovered models
                                                                # (and the tenant's effective set)
```

`create-tenant` accepts `--allow-all-models / --no-allow-all-models` (default off).
`list-models` reads the discovery cache (§4.6); with `--tenant` it also shows that tenant's
resolved effective set.

Key format: `nz_<12-char-prefix><32-char-random>`. Prefix is stored; full key is hashed (argon2id). On creation, the full key is printed exactly once.

## 12. Acceptance Criteria

The build is "done" when every box below is checked. The orchestrator must verify each before declaring v0.1.0.

- [ ] `docker compose up` from a clean checkout produces a running stack with TLS via Caddy (self-signed in dev, Let's Encrypt-ready in prod).
- [ ] CLI creates tenant and key; printed key successfully authenticates an `/api/chat` call.
- [ ] Unauthenticated request returns 401 with no Ollama details leaked.
- [ ] Request to `/api/pull` returns 403 with generic error message.
- [ ] Streaming `/api/chat` works end-to-end; the gateway adds less than 10 ms to Ollama's own time-to-first-byte.
- [ ] Streaming `/v1/chat/completions` returns valid SSE with `data: [DONE]` terminator.
- [ ] Token counts in audit log match Ollama's reported `prompt_eval_count` + `eval_count` exactly.
- [ ] `/api/tags` and `/v1/models` reflect the **live** Ollama model set (discovery, §4.6): an `allow_all_models` tenant sees every installed model and a newly-pulled model appears within one refresh interval; a default-deny tenant sees only `allowed_models ∩ discovered`; a request for a model outside the effective set returns a generic 403; with discovery unavailable, requests fail closed (deny), not open.
- [ ] Rate limit triggers at configured RPM with `Retry-After` header.
- [ ] Token budget enforces and blocks at zero remaining with descriptive error.
- [ ] Redis outage causes 503 (fail-closed), not 200.
- [ ] Revocation via `INSERT INTO gateway.revocations` evicts Redis cache within 1 second.
- [ ] `mypy --strict`, `ruff check`, `bandit`, `pip-audit` all clean in CI.
- [ ] Test coverage ≥ 85% overall, 100% in `auth/`, `ratelimit/`, `budget/`.
- [ ] `docs/THREAT_MODEL.md`, `docs/DEPLOYMENT.md`, `docs/OPERATIONS.md` present and accurate.
- [ ] Load test (locust): 100 concurrent users sustained 5 minutes, p99 gateway overhead < 25 ms, zero 5xx outside induced failures.

## 13. Open Questions (decide during build)

1. Embedding cost accounting — Ollama doesn't return `eval_count` for embeddings. Decision: charge based on `prompt_eval_count` only; document as such.
2. SSE vs NDJSON heuristic for OpenAI-compat — always SSE per OpenAI spec. NDJSON only on native `/api/*`.
3. Prometheus cardinality — do not label by `key_id` (too many series); label by `tenant_id` only; per-key data lives in Postgres.
4. **Model discovery source** — the live model list is `GET /api/tags` on the Ollama backend; there is no separate registry. Cached in Redis + in-process, refreshed every `MODEL_DISCOVERY_REFRESH_S`.
5. **Discovery failure is fail-closed** — empty/expired discovered set ⇒ no model resolves ⇒ deny. Discovery never opens access on error.
6. **No existence disclosure** — a model that is installed-but-unpermitted and a model that is not installed both return the same generic response, to prevent enumeration.
7. **`allow_all_models` precedence** — key-level `allow_all_models` (when non-NULL) overrides the tenant flag; otherwise the tenant flag applies. Same NULL-inherits-tenant rule as the other key limits.
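
Decision 1 can be sketched as a small parser over the final upstream JSON object. `count_tokens` is a hypothetical helper name; it treats a missing `eval_count` as zero, so embedding responses are charged on `prompt_eval_count` only.

```python
import json


def count_tokens(final_chunk: str) -> tuple[int, int]:
    """Return (tokens_in, tokens_out) from the last JSON object Ollama sent."""
    data = json.loads(final_chunk)
    # Missing or null counters are treated as zero rather than estimated.
    tokens_in = int(data.get("prompt_eval_count") or 0)
    tokens_out = int(data.get("eval_count") or 0)
    return tokens_in, tokens_out
```

For a streamed chat response the input is the chunk with `"done": true`; for a non-streamed response it is the whole body.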

## 14. References

- Ollama API: https://github.com/ollama/ollama/blob/main/docs/api.md
- OpenAI Chat Completions: https://platform.openai.com/docs/api-reference/chat
- Nibiru (sibling console project): https://nibiru-framework.com
- Argon2 (RFC 9106): https://www.rfc-editor.org/rfc/rfc9106