Stephan Berbig d79f17b3bb scaffold: project skeleton, schema, healthz/readyz, CI
Initial project structure for neuronetz-gateway per scope-docs/SPEC.md:

- Python 3.12 / FastAPI / SQLAlchemy 2.0 (async) / Redis / Postgres stack
  managed by uv. Multi-stage non-root Dockerfile, prod + dev compose files
  (ollama service is NEVER published in either), Caddyfile + systemd unit,
  justfile, GitHub Actions CI (ruff, mypy --strict, pytest, bandit, pip-audit).
- Pydantic-Settings config covering every env var from SPEC §7, including the
  MODEL_DISCOVERY_* keys for the dynamic-discovery feature (§4.6).
- Alembic 0001_initial creates the full gateway schema (8 tables, 3 enums,
  notify_key_revoked() trigger), incl. allow_all_models on tenant_limits and
  key_limits for the per-tenant auto-grant toggle.
- Working /healthz, /readyz (fail-closed when deps unreachable), and a
  Prometheus /metrics stub. Sanitizing error handlers that attach X-Request-ID
  to every response and never leak upstream internals.
- SPEC + AGENT_PROMPT included under scope-docs/ (source of truth).
2026-05-26 20:50:35 +02:00


# neuronetz-gateway — SPEC.md
**Project:** `neuronetz-gateway`
**Version:** 0.1.0 (target)
**Status:** Specification — not yet implemented
**License:** Apache 2.0
**Owner:** Stephan Berbig / Neuronetz
---
## 1. Purpose
A secure, multi-tenant API gateway in front of an Ollama instance currently exposed at `https://api.neuronetz.ai`. The Ollama endpoint must never be reachable directly from the public internet again. All access flows through this gateway.
The gateway is the **hot path** of the Neuronetz API. A separate service (`neuronetz-console`, built on the Nibiru PHP framework) handles administration, dashboards, and tenant self-service. This SPEC covers only the gateway.
## 2. Scope
### In scope (v0.1.0)
- Authentication via API keys (Bearer tokens)
- Multi-tenant data model (tenants → keys, with inheritance)
- Per-key and per-tenant rate limiting (RPM, TPM, concurrent)
- Per-key and per-tenant token budgets (daily, monthly, total)
- Streaming and non-streaming proxy to Ollama
- Dual API surface: native Ollama (`/api/*`) and OpenAI-compatible (`/v1/*`)
- Endpoint allowlist (block all model-mutating Ollama endpoints)
- **Dynamic model discovery** from the Ollama backend — the live set of installed models is queried, cached, and auto-refreshed; nothing about the model list is hand-maintained
- Model allowlist (per-tenant override), **default-deny, resolved against the live discovered set** (stale/typo'd entries never resolve)
- **Per-tenant `allow_all_models` toggle** — opt-in: a flagged tenant may use any currently-installed model, so models newly pulled into Ollama are auto-granted on the next discovery refresh
- Request size limits, response size limits, timeouts
- Token counting from Ollama responses (precise, not heuristic)
- Audit log (always-on metadata)
- Prompt log (opt-in per key, TTL'd retention)
- Bootstrap CLI: create tenants, keys, set budgets
- Health and readiness endpoints
- Docker Compose deployment (gateway + caddy + postgres + redis + ollama)
- Caddy as TLS terminator (Let's Encrypt for `api.neuronetz.ai`)
### Out of scope (v0.1.0, document as future)
- Web admin UI (lives in `neuronetz-console`, separate repo)
- Billing / Stripe integration (budgets only, no money yet)
- Multi-region / HA / k8s
- Content moderation / prompt-injection filtering
- Response caching
- Multi-backend routing (one Ollama; pluggable backend interface stays for later)
- Webhook notifications
- SSO / OAuth2 for admin
## 3. Threat Model (abbreviated)
| Threat | Mitigation |
|---|---|
| Internet scanners hitting Ollama directly | Ollama bound to internal Docker network; never published |
| Unauthenticated API abuse | Mandatory Bearer token; fail-closed on auth errors |
| API key brute force | Argon2id hashing; constant-time compare; rate limit on auth failures per source IP |
| GPU/token exhaustion (cost attack) | Per-key TPM + token budget; per-tenant ceiling; concurrent connection cap |
| Resource exhaustion via large payloads | Request body size limit (default 256 KiB); `num_predict` cap (default 4096) |
| Model enumeration / training-data exfil via uncommon models | Model allowlist; default-deny. `allow_all_models` is **opt-in per tenant and audited**. Discovery only ever exposes models actually installed on the backend; `/api/tags` and `/v1/models` never reveal models outside the tenant's effective set; "not allowed" and "doesn't exist" return the same generic response |
| Discovery backend unreachable | Fail-closed: an empty/stale-expired discovered set means no model resolves, so requests are denied — never "allow because we couldn't list models" |
| Ollama mutation (model pull/delete) by attacker | Endpoint allowlist; mutating endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`) hard-blocked at the gateway |
| Information disclosure via error messages | Sanitize upstream errors; never proxy Ollama internals to client |
| Audit log tampering | Append-only at app layer; DB role separation; optional WAL archiving |
| Prompt data leakage | Prompt logging off by default; opt-in per key; TTL'd; redaction hook |
| Redis outage causing "fail open" | Fail-closed: if rate-limit/budget backend is unavailable, deny |
| Compromised admin token | Admin token lives in `neuronetz-console`, not in gateway; gateway has no admin endpoints |
## 4. Architecture
### 4.1 Component diagram
```
                 Internet
                    │ TLS
         ┌──────────────────────┐
         │  Caddy (sidecar)     │  Let's Encrypt for api.neuronetz.ai
         │  - TLS termination   │  HSTS, security headers
         │  - HTTP/2, HTTP/3    │
         └──────────┬───────────┘
                    │ HTTP/1.1 internal
         ┌──────────▼───────────┐
         │  neuronetz-gateway   │  FastAPI + uvicorn
         │  - authn             │
         │  - rate limit        │
         │  - budget check      │
         │  - proxy + stream    │
         │  - token count       │
         │  - audit write       │
         └──┬────────┬──────┬───┘
            │        │      │
     ┌──────▼──┐  ┌──▼───┐  │
     │Postgres │  │Redis │  │
     │ schema: │  │ keys │  │
     │ gateway │  │bucket│  │
     └─────────┘  └──────┘  │
                            │ internal network only
                     ┌──────▼──────┐
                     │   Ollama    │
                     │  127.0.0.1  │
                     └─────────────┘
Same Compose stack also hosts (separate from this SPEC):
- neuronetz-console (PHP/Nibiru) → reads schema `console`, reads schema `gateway` (SELECT)
```
### 4.2 Database schemas
**Single Postgres instance, two schemas:**
- `gateway` — owned by the gateway service; gateway role has full DDL
- `console` — owned by `neuronetz-console` (out of scope here); console role has full DDL
- Both services connect with their own role. Cross-schema access is explicit GRANT.
**Console role gets `SELECT` on all `gateway.*` tables.** Console writes go only to `console.*` tables. If the console needs to mutate gateway state (e.g. revoke a key), it does so by writing to a `gateway.revocations` outbox table that the gateway tails (see §4.5).
### 4.3 Request lifecycle
1. Caddy terminates TLS, forwards to gateway on internal port.
2. Gateway middleware extracts `Authorization: Bearer <key>`.
3. The key prefix (first 12 chars) is used as the Redis cache key. On a miss, look up `gateway.api_keys` by prefix, verify the full key with argon2id `verify`, and cache the resolved key metadata in Redis (TTL 60s).
4. Rate limit check (sliding window in Redis, Lua-atomic) — per-key RPM + per-tenant RPM.
5. Budget check (Redis counter for current period; Postgres ledger is source of truth on reset).
6. Concurrent-connection semaphore (Redis `INCR` with TTL).
7. Model allowlist check. Resolve the **effective model set** for the key:
`allow_all := key.allow_all_models ?? tenant.allow_all_models`;
`effective := discovered` if `allow_all` else `(key.allowed_models ?? tenant.allowed_models) ∩ discovered`,
where `discovered` is the cached live model set from discovery (§4.6). The request's
`model` must be in `effective`, else a generic 403 with no disclosure of whether the
model exists but is unpermitted vs. is not installed.
8. Endpoint allowlist check.
9. Request body validation (size, schema, `num_predict` cap).
10. If OpenAI-compat path, translate request to Ollama schema.
11. Open httpx async stream to Ollama.
12. Stream response back to client, accumulating final `prompt_eval_count` + `eval_count`.
13. On stream close: write `gateway.audit_log` row; decrement budget; release semaphore; if prompt logging enabled, write `gateway.prompt_log` row.
14. On any failure: sanitized error to client, audit row with status code, semaphore released.
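The effective-set rule in step 7 can be sketched in a few lines of Python. This is a minimal illustration, not the gateway's code: `ModelPolicy`, `effective_models`, and `is_model_allowed` are hypothetical names, and `None` stands in for a SQL `NULL` on the key row.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPolicy:
    """Hypothetical holder for the model columns of one limits row."""

    allowed_models: frozenset[str] | None  # None = inherit (key rows only)
    allow_all_models: bool | None  # None = inherit (key rows only)


def effective_models(
    key: ModelPolicy, tenant: ModelPolicy, discovered: frozenset[str]
) -> frozenset[str]:
    """Apply step 7: key value wins when non-NULL, else the tenant value."""
    allow_all = (
        key.allow_all_models
        if key.allow_all_models is not None
        else tenant.allow_all_models
    )
    if allow_all:
        return discovered
    allowed = (
        key.allowed_models
        if key.allowed_models is not None
        else tenant.allowed_models
    )
    # Intersecting with the live set drops stale or misspelled entries.
    return (allowed or frozenset()) & discovered


def is_model_allowed(
    model: str, key: ModelPolicy, tenant: ModelPolicy, discovered: frozenset[str]
) -> bool:
    """True only if the requested model is in the effective set."""
    return model in effective_models(key, tenant, discovered)
```

Because an empty `discovered` set yields an empty effective set in both branches, the same function also gives the fail-closed behavior required when discovery is unavailable.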
### 4.4 Failure modes (fail-closed)
| Subsystem | If down | Behavior |
|---|---|---|
| Postgres (read) | Key lookup fails | 503 with retry-after; no requests proxied |
| Postgres (write) | Audit write fails | Request still succeeds; the audit row is buffered in an in-memory ring (max 1000 rows, `AUDIT_BUFFER_SIZE`) and drained on recovery; if the buffer fills, switch to deny mode |
| Redis | Rate limit / budget unavailable | 503 — fail closed. Never "allow because we can't check." |
| Ollama | Upstream unreachable | 502 with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30s |
| Caddy | Not a gateway concern | — |
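The Ollama row describes a circuit breaker that opens after 5 consecutive failures and goes half-open after 30s. One possible shape for it is sketched below. The class name, method names, and the injectable `clock` are assumptions made for illustration; the SPEC only fixes the two thresholds.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Consecutive-failure breaker (hypothetical sketch, single process)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Closed: always allow. Open: allow only once the cool-down passed."""
        if self._opened_at is None:
            return True
        # Half-open: after the cool-down, probes are let through. This simple
        # version does not limit how many probes run before one reports back.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self) -> None:
        """Any success closes the breaker and resets the failure streak."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) once the threshold is reached."""
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()
```

A caller would check `allow_request()` before opening the upstream stream, return the 502 directly when it is `False`, and report the outcome with `record_success()` or `record_failure()` afterwards.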
### 4.5 Cache invalidation (key revocation)
Console can revoke a key by inserting into `gateway.revocations(key_id, ts, reason)`. Gateway has a background task (`asyncio.create_task` in lifespan) that:
- LISTENs on Postgres channel `key_revoked` (gateway emits NOTIFY on its own write path; console emits via INSERT trigger)
- On notification, evicts the Redis cache entry for that key's prefix
- This makes revocation effectively immediate (NOTIFY delivery plus one Redis round trip; the acceptance target in §12 is within 1 second) without cross-service HTTP
### 4.6 Model discovery
The set of usable models is **never hand-maintained**; it is extracted live from the
Ollama backend.
- A background task (started in lifespan, like the revocation listener) polls Ollama
`GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds.
- The parsed model set (names + sanitized metadata: family, parameter size, quantization,
size bytes, modified-at) is cached in Redis under `gateway:models:discovered` with TTL
`MODEL_DISCOVERY_CACHE_TTL_S`, and held in-process for hot reads on the request path.
- On startup an initial fetch runs; if Ollama is unreachable the discovered set is empty.
- **Fail-closed:** if the discovered set is empty or its cache has expired and cannot be
refreshed, no model resolves and requests are denied (consistent with default-deny).
Discovery never opens access on failure.
- "Auto-grant": because the effective set (§4.3 step 7) intersects with `discovered` (or
*is* `discovered` when `allow_all_models`), a model pulled into Ollama out-of-band
becomes usable to `allow_all` tenants on the next refresh — no per-tenant config change.
- Discovery is **read-only** against Ollama and uses only the allowlisted `/api/tags`
endpoint; it never triggers a model pull.
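The in-process half of the discovery cache could look roughly like the sketch below. It assumes the `/api/tags` body has the shape `{"models": [{"name": ..., "size": ..., "modified_at": ..., "details": {...}}]}`; `parse_tags`, `DiscoveredModels`, and the injectable `clock` are hypothetical names, and the Redis copy and the polling task are left out.

```python
from __future__ import annotations

import time
from typing import Any, Callable


def parse_tags(payload: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Reduce an /api/tags body to name -> sanitized metadata."""
    models: dict[str, dict[str, Any]] = {}
    for entry in payload.get("models") or []:
        name = entry.get("name")
        if not isinstance(name, str) or not name:
            continue  # skip malformed entries instead of guessing a name
        details = entry.get("details") or {}
        models[name] = {
            "family": details.get("family"),
            "parameter_size": details.get("parameter_size"),
            "quantization": details.get("quantization_level"),
            "size_bytes": entry.get("size"),
            "modified_at": entry.get("modified_at"),
        }
    return models


class DiscoveredModels:
    """In-process copy of the discovered set with a fail-closed TTL."""

    def __init__(
        self, ttl_s: float = 120.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._models: dict[str, dict[str, Any]] = {}
        self._fetched_at: float | None = None

    def update(self, payload: dict[str, Any]) -> None:
        """Called by the refresh task after a successful GET /api/tags."""
        self._models = parse_tags(payload)
        self._fetched_at = self._clock()

    def names(self) -> frozenset[str]:
        """Live model names, or the empty set if never fetched or expired."""
        if self._fetched_at is None:
            return frozenset()
        if self._clock() - self._fetched_at > self._ttl_s:
            return frozenset()  # stale: deny rather than trust old data
        return frozenset(self._models)
```

With the defaults from §7 (refresh every 60s, TTL 120s), one missed refresh still leaves a valid set, while a second consecutive miss lets it expire and requests start being denied.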
## 5. Data Model (schema `gateway`)
```sql
CREATE SCHEMA gateway;
CREATE TYPE gateway.key_status AS ENUM ('active', 'disabled', 'revoked');
CREATE TYPE gateway.tenant_status AS ENUM ('active', 'suspended', 'closed');
CREATE TYPE gateway.budget_period AS ENUM ('day', 'month', 'total');
CREATE TABLE gateway.tenants (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
name text NOT NULL UNIQUE,
status gateway.tenant_status NOT NULL DEFAULT 'active',
created_at timestamptz NOT NULL DEFAULT now(),
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE TABLE gateway.tenant_limits (
tenant_id uuid PRIMARY KEY REFERENCES gateway.tenants(id) ON DELETE CASCADE,
rpm integer NOT NULL DEFAULT 60,
tpm integer NOT NULL DEFAULT 100000,
concurrent integer NOT NULL DEFAULT 8,
tokens_daily bigint,
tokens_monthly bigint,
tokens_total bigint,
allowed_models text[] NOT NULL DEFAULT '{}',
allow_all_models boolean NOT NULL DEFAULT false, -- opt-in: allow any installed model
log_prompts_default boolean NOT NULL DEFAULT false,
prompt_retention_days integer NOT NULL DEFAULT 30,
audit_retention_days integer NOT NULL DEFAULT 365
);
CREATE TABLE gateway.api_keys (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id uuid NOT NULL REFERENCES gateway.tenants(id) ON DELETE CASCADE,
prefix text NOT NULL UNIQUE, -- first 12 chars, indexed
key_hash text NOT NULL, -- argon2id
name text NOT NULL,
status gateway.key_status NOT NULL DEFAULT 'active',
scopes text[] NOT NULL DEFAULT '{chat,embeddings}',
created_at timestamptz NOT NULL DEFAULT now(),
last_used_at timestamptz,
expires_at timestamptz,
log_prompts boolean, -- NULL = inherit from tenant
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE INDEX idx_api_keys_prefix ON gateway.api_keys(prefix) WHERE status = 'active';
CREATE INDEX idx_api_keys_tenant ON gateway.api_keys(tenant_id);
CREATE TABLE gateway.key_limits (
key_id uuid PRIMARY KEY REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
rpm integer, -- NULL = inherit tenant
tpm integer,
concurrent integer,
tokens_daily bigint,
tokens_monthly bigint,
tokens_total bigint,
allowed_models text[], -- NULL = inherit tenant
allow_all_models boolean -- NULL = inherit tenant
);
CREATE TABLE gateway.budget_usage (
key_id uuid NOT NULL REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
period gateway.budget_period NOT NULL,
period_start timestamptz NOT NULL,
tokens_in bigint NOT NULL DEFAULT 0,
tokens_out bigint NOT NULL DEFAULT 0,
requests bigint NOT NULL DEFAULT 0,
PRIMARY KEY (key_id, period, period_start)
);
CREATE INDEX idx_budget_usage_period ON gateway.budget_usage(period, period_start);
CREATE TABLE gateway.audit_log (
id bigserial PRIMARY KEY,
ts timestamptz NOT NULL DEFAULT now(),
request_id uuid NOT NULL,
tenant_id uuid, -- nullable for auth-failed rows
key_id uuid,
key_prefix text, -- denormalized for forensic queries
method text NOT NULL,
path text NOT NULL,
model text,
tokens_in integer,
tokens_out integer,
latency_ms integer,
status integer NOT NULL,
client_ip inet,
user_agent text,
error_code text
);
CREATE INDEX idx_audit_ts ON gateway.audit_log(ts);
CREATE INDEX idx_audit_tenant_ts ON gateway.audit_log(tenant_id, ts);
CREATE INDEX idx_audit_key_ts ON gateway.audit_log(key_id, ts);
CREATE TABLE gateway.prompt_log (
id bigserial PRIMARY KEY,
audit_id bigint NOT NULL REFERENCES gateway.audit_log(id) ON DELETE CASCADE,
ts timestamptz NOT NULL DEFAULT now(),
key_id uuid NOT NULL,
request_body jsonb NOT NULL,
response_text text,
retention_until timestamptz NOT NULL
);
CREATE INDEX idx_prompt_log_retention ON gateway.prompt_log(retention_until);
CREATE TABLE gateway.revocations (
id bigserial PRIMARY KEY,
key_id uuid NOT NULL,
ts timestamptz NOT NULL DEFAULT now(),
reason text,
processed_at timestamptz
);
-- Trigger to NOTIFY on revocation insert
CREATE OR REPLACE FUNCTION gateway.notify_key_revoked() RETURNS trigger AS $$
BEGIN
PERFORM pg_notify('key_revoked', NEW.key_id::text);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_notify_key_revoked
AFTER INSERT ON gateway.revocations
FOR EACH ROW EXECUTE FUNCTION gateway.notify_key_revoked();
-- Grants for console role (created in console SPEC, referenced here)
-- GRANT USAGE ON SCHEMA gateway TO console_role;
-- GRANT SELECT ON ALL TABLES IN SCHEMA gateway TO console_role;
-- GRANT INSERT ON gateway.revocations TO console_role;
```
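The `NULL = inherit tenant` convention on `gateway.key_limits` amounts to a per-column fallback. A minimal Python sketch of that resolution follows; `Limits` and `resolve_limits` are hypothetical names, and `None` represents SQL `NULL`. A field that is still `None` after resolution is treated here as "no limit configured", which is an assumption the SPEC does not state explicitly.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Numeric limit columns shared by tenant_limits and key_limits."""

    rpm: Optional[int]
    tpm: Optional[int]
    concurrent: Optional[int]
    tokens_daily: Optional[int]
    tokens_monthly: Optional[int]
    tokens_total: Optional[int]


def resolve_limits(key: Limits, tenant: Limits) -> Limits:
    """Per column: use the key value unless it is NULL, else the tenant's."""
    resolved = {}
    for f in fields(Limits):
        key_value = getattr(key, f.name)
        # Compare against None explicitly so an explicit 0 on the key
        # is kept rather than silently replaced by the tenant value.
        resolved[f.name] = (
            key_value if key_value is not None else getattr(tenant, f.name)
        )
    return Limits(**resolved)
```

The resolved values only describe the per-key limits; the per-tenant ceiling from the threat model would still be checked separately against the tenant row.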
## 6. API Surface
### 6.1 Native Ollama passthrough (allowlisted)
| Path | Method | Notes |
|---|---|---|
| `/api/chat` | POST | Streamed (NDJSON) and non-streamed |
| `/api/generate` | POST | Streamed (NDJSON) and non-streamed |
| `/api/embeddings` | POST | Non-streamed |
| `/api/embed` | POST | Newer Ollama embeddings endpoint |
| `/api/tags` | GET | Returns the tenant's **effective** model set (live-discovered ∩ allowed, or *all* discovered when `allow_all_models`). Sourced from discovery (§4.6), never a static list |
| `/api/show` | POST | Allowed only for models in the tenant's effective set; returns sanitized model info (no system prompts, no template) |
| `/api/ps` | GET | **Blocked** — leaks loaded models |
| `/api/version` | GET | Returns gateway version, not Ollama version |
### 6.2 Hard-blocked Ollama endpoints (always 403)
`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`
### 6.3 OpenAI-compatible
| Path | Method | Maps to |
|---|---|---|
| `/v1/chat/completions` | POST | `/api/chat` |
| `/v1/completions` | POST | `/api/generate` |
| `/v1/embeddings` | POST | `/api/embed` |
| `/v1/models` | GET | `/api/tags` (the tenant's effective discovered set), in OpenAI model-list format |
Translation must preserve streaming. SSE (`data: {...}\n\n`) for OpenAI-compat; NDJSON for native.
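As a rough illustration of the streaming translation, the generator below turns native `/api/chat` NDJSON lines into OpenAI-style SSE events. It is a sketch under assumptions: `ollama_chat_to_sse` is a hypothetical name, `completion_id` and `created` are supplied by the caller, and `finish_reason` is always reported as `"stop"` rather than derived from the upstream response.

```python
import json
from typing import Iterable, Iterator


def ollama_chat_to_sse(
    lines: Iterable[str], completion_id: str, created: int
) -> Iterator[str]:
    """Translate Ollama chat NDJSON lines to chat.completion.chunk events."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        done = bool(chunk.get("done"))
        content = (chunk.get("message") or {}).get("content", "")
        event = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": chunk.get("model"),
            "choices": [
                {
                    "index": 0,
                    "delta": {"content": content} if content else {},
                    # Simplification: every finished stream reports "stop".
                    "finish_reason": "stop" if done else None,
                }
            ],
        }
        yield f"data: {json.dumps(event)}\n\n"
        if done:
            # The terminator is sent only after a real final chunk, so a
            # truncated upstream stream is not presented as complete.
            yield "data: [DONE]\n\n"
            return
```

Each upstream line is emitted as soon as it is read, so the translation adds no buffering between Ollama and the client.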
### 6.4 Gateway endpoints
| Path | Method | Auth | Purpose |
|---|---|---|---|
| `/healthz` | GET | none | Liveness — process responsive |
| `/readyz` | GET | none | Readiness — DB + Redis + Ollama all reachable |
| `/metrics` | GET | none (loopback only) | Prometheus exposition (counters, histograms) |
No admin endpoints. Admin lives in `neuronetz-console`.
### 6.5 Response headers
Every proxied response carries:
- `X-Request-ID: <uuid>`
- `X-RateLimit-Limit-Requests: <n>`
- `X-RateLimit-Remaining-Requests: <n>`
- `X-RateLimit-Limit-Tokens: <n>`
- `X-RateLimit-Remaining-Tokens: <n>`
- `X-Budget-Period: day|month|total`
- `X-Budget-Tokens-Remaining: <n>`
429 responses additionally carry `Retry-After: <seconds>`.
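To show where the request-limit headers and `Retry-After` come from, here is an in-memory sliding-window limiter. It only illustrates the arithmetic: the gateway itself is specified to do this atomically in Redis with a Lua script (§4.3 step 4), and `SlidingWindowLimiter`, `Decision`, and `rate_limit_headers` are hypothetical names.

```python
import math
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict


@dataclass(frozen=True)
class Decision:
    allowed: bool
    limit: int
    remaining: int
    retry_after_s: int  # 0 when the request is allowed


class SlidingWindowLimiter:
    """Sliding-window log: keep one timestamp per accepted request."""

    def __init__(
        self,
        limit: int,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window_s = window_s
        self._clock = clock
        self._hits: Deque[float] = deque()

    def check(self) -> Decision:
        now = self._clock()
        # Drop timestamps that have slid out of the window.
        while self._hits and self._hits[0] <= now - self._window_s:
            self._hits.popleft()
        if len(self._hits) < self._limit:
            self._hits.append(now)
            return Decision(True, self._limit, self._limit - len(self._hits), 0)
        # Denied: the next slot frees up when the oldest hit expires.
        wait = self._hits[0] + self._window_s - now
        return Decision(False, self._limit, 0, max(1, math.ceil(wait)))


def rate_limit_headers(d: Decision) -> Dict[str, str]:
    """Build the request-limit headers listed in §6.5 from a decision."""
    headers = {
        "X-RateLimit-Limit-Requests": str(d.limit),
        "X-RateLimit-Remaining-Requests": str(d.remaining),
    }
    if not d.allowed:
        headers["Retry-After"] = str(d.retry_after_s)
    return headers
```

Rounding the wait up to a whole second keeps `Retry-After` from telling a client to retry slightly before a slot is actually free.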
## 7. Configuration
All via environment variables, validated by Pydantic Settings on boot. Boot fails loudly on invalid config.
```
# Service
GATEWAY_BIND_HOST=0.0.0.0
GATEWAY_BIND_PORT=8080
GATEWAY_LOG_LEVEL=INFO
GATEWAY_LOG_FORMAT=json # json|console
GATEWAY_REQUEST_ID_HEADER=X-Request-ID
GATEWAY_TRUSTED_PROXIES=127.0.0.1,caddy # for X-Forwarded-For
# Upstream
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_CONNECT_TIMEOUT_S=5
OLLAMA_READ_TIMEOUT_S=600
OLLAMA_MAX_CONNECTIONS=64
# Model discovery (§4.6)
MODEL_DISCOVERY_REFRESH_S=60 # how often to re-query Ollama /api/tags
MODEL_DISCOVERY_CACHE_TTL_S=120 # Redis cache TTL for the discovered model set
# Database
DATABASE_URL=postgresql+asyncpg://gateway:...@postgres:5432/neuronetz
DATABASE_POOL_SIZE=10
DATABASE_POOL_OVERFLOW=20
# Redis
REDIS_URL=redis://redis:6379/0
REDIS_KEY_CACHE_TTL_S=60
# Limits (defaults; per-tenant/key overrides in DB)
DEFAULT_RPM=60
DEFAULT_TPM=100000
DEFAULT_CONCURRENT=8
MAX_REQUEST_BODY_BYTES=262144
MAX_NUM_PREDICT=4096
# Security
ARGON2_TIME_COST=3
ARGON2_MEMORY_COST_KIB=65536
ARGON2_PARALLELISM=4
AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN=20
# Audit
AUDIT_BUFFER_SIZE=1000
PROMPT_LOG_DEFAULT_RETENTION_DAYS=30
AUDIT_LOG_DEFAULT_RETENTION_DAYS=365
```
## 8. Repository Layout
```
neuronetz-gateway/
├── pyproject.toml  # uv-managed, ruff, mypy --strict, pytest
├── README.md
├── LICENSE  # Apache 2.0
├── docker-compose.yml  # full stack incl. console placeholder
├── docker-compose.dev.yml  # without caddy, gateway exposed on localhost
├── Dockerfile  # multi-stage, python:3.12-slim base
├── .env.example
├── .dockerignore
├── .gitignore
├── alembic.ini
├── alembic/
│   ├── env.py
│   └── versions/
│       └── 0001_initial.py  # creates schema `gateway` and all tables
├── ops/
│   ├── caddy/
│   │   └── Caddyfile.example
│   └── systemd/
│       └── neuronetz-gateway.service
├── src/neuronetz_gateway/
│   ├── __init__.py
│   ├── __main__.py  # uvicorn entry
│   ├── app.py  # FastAPI factory
│   ├── config.py  # Pydantic Settings
│   ├── deps.py  # DI providers
│   ├── lifespan.py  # startup/shutdown, NOTIFY listener
│   ├── errors.py  # exception types, handlers, sanitization
│   ├── auth/
│   │   ├── __init__.py
│   │   ├── hashing.py  # argon2id wrapper
│   │   ├── keys.py  # key generation, prefix, verify
│   │   └── middleware.py
│   ├── ratelimit/
│   │   ├── __init__.py
│   │   ├── sliding_window.py  # Redis Lua script
│   │   └── concurrency.py  # semaphore via Redis
│   ├── budget/
│   │   ├── __init__.py
│   │   ├── counter.py  # Redis period counters
│   │   └── ledger.py  # Postgres reconciliation
│   ├── proxy/
│   │   ├── __init__.py
│   │   ├── ollama.py  # httpx streaming client
│   │   ├── translate.py  # OpenAI <-> Ollama schemas
│   │   ├── token_counter.py  # parse usage from stream
│   │   ├── discovery.py  # live model discovery from Ollama /api/tags (§4.6)
│   │   └── allowlist.py  # effective-set resolution (allow_all / allowed ∩ discovered)
│   ├── routes/
│   │   ├── __init__.py
│   │   ├── ollama_native.py
│   │   ├── openai_compat.py
│   │   └── health.py
│   ├── db/
│   │   ├── __init__.py
│   │   ├── session.py
│   │   ├── models.py  # SQLAlchemy 2.0
│   │   └── repositories.py
│   ├── audit/
│   │   ├── __init__.py
│   │   ├── writer.py  # buffered async writer
│   │   └── prompt_log.py
│   ├── observability/
│   │   ├── __init__.py
│   │   ├── logging.py  # structlog config
│   │   └── metrics.py  # prometheus
│   └── cli/
│       ├── __init__.py
│       └── manage.py  # typer: create-tenant, create-key, ...
├── tests/
│   ├── conftest.py  # testcontainers fixtures
│   ├── unit/
│   │   ├── test_hashing.py
│   │   ├── test_translate.py
│   │   ├── test_token_counter.py
│   │   ├── test_discovery.py
│   │   ├── test_allowlist.py
│   │   └── test_sliding_window.py
│   ├── integration/
│   │   ├── test_auth_flow.py
│   │   ├── test_rate_limit.py
│   │   ├── test_budget.py
│   │   ├── test_proxy_stream.py
│   │   ├── test_openai_compat.py
│   │   ├── test_revocation.py
│   │   └── mock_ollama.py  # FastAPI mock with NDJSON/SSE
│   └── load/
│       └── locustfile.py
└── docs/
    ├── ARCHITECTURE.md
    ├── DEPLOYMENT.md
    ├── API.md
    ├── THREAT_MODEL.md
    └── OPERATIONS.md  # runbook: revoke key, rotate, check usage
```
## 9. Non-Functional Requirements
- **Performance:** p50 overhead < 5 ms over direct Ollama call (auth + ratelimit + audit); p99 < 25 ms (excluding upstream latency)
- **Streaming:** Time-to-first-byte must not be degraded by gateway logic — audit write happens **after** stream close
- **Memory:** Steady-state RSS < 200 MiB per gateway worker under 100 concurrent streams
- **Concurrency:** Handle 200 concurrent connections per worker; 4 workers per instance default
- **Test coverage:** ≥ 85% line coverage on `src/neuronetz_gateway/` excluding `__main__` and CLI; 100% on `auth/`, `ratelimit/`, `budget/`
- **Security:** No `eval`, no `exec`, no shell-out, no `pickle`. Bandit clean. `pip-audit` clean on every CI run.
- **Type safety:** `mypy --strict` clean
- **Lint:** `ruff check` clean with project ruleset (E, F, I, B, UP, S, ASYNC)
## 10. Tooling
- Python 3.12
- `uv` for dependency management (pyproject.toml + uv.lock)
- FastAPI ≥ 0.115, uvicorn[standard], httpx ≥ 0.27, SQLAlchemy 2.0 (async), asyncpg, redis ≥ 5.0 (with hiredis), structlog, pydantic ≥ 2.9, pydantic-settings, argon2-cffi, typer, prometheus-client
- Test: pytest, pytest-asyncio, pytest-cov, testcontainers, httpx (test client), respx (mock), locust
- Lint/format: ruff, mypy --strict, bandit, pip-audit
- CI: GitHub Actions workflow (lint, type, test with coverage, build image, push on tag)
## 11. Bootstrap CLI (Typer)
```
neuronetz-gateway create-tenant --name "acme" [--rpm 60] [--tpm 100000]
neuronetz-gateway create-key --tenant acme --name "prod-server-1" [--scopes chat,embeddings]
neuronetz-gateway revoke-key --prefix nz_abc12345
neuronetz-gateway list-keys --tenant acme
neuronetz-gateway show-usage --tenant acme [--period day|month|total]
neuronetz-gateway set-budget --key nz_abc12345 --daily 1000000 --monthly 30000000
neuronetz-gateway set-models --tenant acme --models llama3.1:8b,mistral:7b
neuronetz-gateway set-models --tenant acme --allow-all # opt into allow_all_models
neuronetz-gateway set-models --tenant acme --no-allow-all # back to explicit allowlist
neuronetz-gateway list-models [--tenant acme] # show live-discovered models
# (and the tenant's effective set)
```
`create-tenant` accepts `--allow-all-models / --no-allow-all-models` (default off).
`list-models` reads the discovery cache (§4.6); with `--tenant` it also shows that tenant's
resolved effective set.
Key format: `nz_<12-char-prefix><32-char-random>`. Prefix is stored; full key is hashed (argon2id). On creation, the full key is printed exactly once.
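A key in this format can be generated with the standard library alone. The sketch below makes several assumptions that the SPEC leaves open: the alphabet is ASCII letters and digits, the stored prefix is the 12 characters that follow `nz_` (§4.3 says only "first 12 chars"), and the names `generate_key` and `extract_prefix` are hypothetical. Hashing the full key with argon2id is omitted.

```python
import secrets
import string
from typing import Optional

ALPHABET = string.ascii_letters + string.digits  # assumed character set
KEY_TAG = "nz_"
PREFIX_LEN = 12
SECRET_LEN = 32


def _random_text(length: int) -> str:
    """Draw characters from a CSPRNG, never from the `random` module."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def generate_key() -> tuple[str, str]:
    """Return (full_key, prefix); the full key is shown to the user once."""
    prefix = _random_text(PREFIX_LEN)
    return KEY_TAG + prefix + _random_text(SECRET_LEN), prefix


def extract_prefix(presented: str) -> Optional[str]:
    """Return the lookup prefix, or None if the key is malformed."""
    expected_len = len(KEY_TAG) + PREFIX_LEN + SECRET_LEN
    if not presented.startswith(KEY_TAG) or len(presented) != expected_len:
        return None
    body = presented[len(KEY_TAG):]
    if any(ch not in ALPHABET for ch in body):
        return None
    return body[:PREFIX_LEN]
```

Rejecting malformed keys before any lookup means obviously invalid tokens never reach Redis, Postgres, or the argon2id verifier.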
## 12. Acceptance Criteria
The build is "done" when every box below is checked. The orchestrator must verify each before declaring v0.1.0.
- [ ] `docker compose up` from a clean checkout produces a running stack with TLS via Caddy (self-signed in dev, Let's Encrypt-ready in prod).
- [ ] CLI creates tenant and key; printed key successfully authenticates an `/api/chat` call.
- [ ] Unauthenticated request returns 401 with no Ollama details leaked.
- [ ] Request to `/api/pull` returns 403 with generic error message.
- [ ] Streaming `/api/chat` works end-to-end; the first byte arrives no later than Ollama's own TTFB plus 10 ms of gateway overhead.
- [ ] Streaming `/v1/chat/completions` returns valid SSE with `data: [DONE]` terminator.
- [ ] Token counts in audit log match Ollama's reported `prompt_eval_count` + `eval_count` exactly.
- [ ] `/api/tags` and `/v1/models` reflect the **live** Ollama model set (discovery, §4.6): an `allow_all_models` tenant sees every installed model and a newly-pulled model appears within one refresh interval; a default-deny tenant sees only `allowed_models ∩ discovered`; a request for a model outside the effective set returns a generic 403; with discovery unavailable, requests fail closed (deny), not open.
- [ ] Rate limit triggers at configured RPM with `Retry-After` header.
- [ ] Token budget is enforced: requests are blocked at zero remaining tokens with a descriptive error.
- [ ] Redis outage causes 503 (fail-closed), not 200.
- [ ] Revocation via `INSERT INTO gateway.revocations` evicts Redis cache within 1 second.
- [ ] `mypy --strict`, `ruff check`, `bandit`, `pip-audit` all clean in CI.
- [ ] Test coverage ≥ 85% overall, 100% in `auth/`, `ratelimit/`, `budget/`.
- [ ] `docs/THREAT_MODEL.md`, `docs/DEPLOYMENT.md`, `docs/OPERATIONS.md` present and accurate.
- [ ] Load test (locust): 100 concurrent users sustained 5 minutes, p99 gateway overhead < 25 ms, zero 5xx outside induced failures.
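For the token-count criterion in the checklist, the counts can be read straight from the final NDJSON chunk of a native stream. The following is a minimal sketch: `Usage` and `count_tokens` are hypothetical names, a missing `eval_count` is treated as 0 (as for embeddings, §13 item 1), and returning `None` for a stream that ends without a `done` chunk is an assumption about how truncated streams might be flagged.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Usage:
    tokens_in: int  # Ollama's prompt_eval_count
    tokens_out: int  # Ollama's eval_count


def count_tokens(ndjson_lines: Iterable[str]) -> Optional[Usage]:
    """Return the usage reported by the final chunk, or None if absent."""
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        if chunk.get("done"):
            return Usage(
                tokens_in=int(chunk.get("prompt_eval_count", 0)),
                tokens_out=int(chunk.get("eval_count", 0)),
            )
    # No final chunk was seen: the stream was cut short, so no exact
    # count is available and the caller has to decide how to account it.
    return None
```

Since the numbers are copied from the upstream response rather than estimated, the audit row can match Ollama's own report exactly whenever the stream completes.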
## 13. Open Questions (decide during build)
1. Embedding cost accounting — Ollama doesn't return `eval_count` for embeddings. Decision: charge based on `prompt_eval_count` only; document as such.
2. SSE vs NDJSON heuristic for OpenAI-compat — always SSE per OpenAI spec. NDJSON only on native `/api/*`.
3. Prometheus cardinality — do not label by `key_id` (too many series); label by `tenant_id` only; per-key data lives in Postgres.
4. **Model discovery source** — the live model list is `GET /api/tags` on the Ollama backend; there is no separate registry. Cached in Redis + in-process, refreshed every `MODEL_DISCOVERY_REFRESH_S`.
5. **Discovery failure is fail-closed** — empty/expired discovered set ⇒ no model resolves ⇒ deny. Discovery never opens access on error.
6. **No existence disclosure** — a model that is installed-but-unpermitted and a model that is not installed both return the same generic response, to prevent enumeration.
7. **`allow_all_models` precedence** — key-level `allow_all_models` (when non-NULL) overrides the tenant flag; otherwise the tenant flag applies. Same NULL-inherits-tenant rule as the other key limits.
## 14. References
- Ollama API: https://github.com/ollama/ollama/blob/main/docs/api.md
- OpenAI Chat Completions: https://platform.openai.com/docs/api-reference/chat
- Nibiru (sibling console project): https://nibiru-framework.com
- Argon2 RFC 9106