scaffold: project skeleton, schema, healthz/readyz, CI

Initial project structure for neuronetz-gateway per scope-docs/SPEC.md:

- Python 3.12 / FastAPI / SQLAlchemy 2.0 (async) / Redis / Postgres stack
  managed by uv. Multi-stage non-root Dockerfile, prod + dev compose files
  (ollama service is NEVER published in either), Caddyfile + systemd unit,
  justfile, GitHub Actions CI (ruff, mypy --strict, pytest, bandit, pip-audit).
- Pydantic-Settings config covering every env var from SPEC §7, including the
  MODEL_DISCOVERY_* keys for the dynamic-discovery feature (§4.6).
- Alembic 0001_initial creates the full gateway schema (8 tables, 3 enums,
  notify_key_revoked() trigger), incl. allow_all_models on tenant_limits and
  key_limits for the per-tenant auto-grant toggle.
- Working /healthz, /readyz (fail-closed when deps unreachable), and a
  Prometheus /metrics stub. Sanitizing error handlers that attach X-Request-ID
  to every response and never leak upstream internals.
- SPEC + AGENT_PROMPT included under scope-docs/ (source of truth).
Stephan Berbig
2026-05-26 20:50:35 +02:00
commit d79f17b3bb
32 changed files with 3610 additions and 0 deletions

scope-docs/AGENT_PROMPT.md Normal file

@@ -0,0 +1,121 @@
# Build Order: neuronetz-gateway v0.1.0
## Context
The Ollama instance at `https://api.neuronetz.ai` is currently exposed without authentication. This is a security incident waiting to happen. Your job is to build the gateway that closes that gap and forms the commercial API surface of the Neuronetz AI platform.
The full specification is in **`SPEC.md`** in this repository. Read it before writing any code. It is the source of truth; if anything below conflicts with it, SPEC.md wins.
## Mission
Implement `neuronetz-gateway` per SPEC.md to a state that satisfies **§12 Acceptance Criteria**. Nothing less ships.
## Non-Negotiables
These are hard constraints. Violating any of them is a build failure regardless of feature completeness.
1. **Fail closed, always.** If a security or budgeting check cannot be performed (Redis down, DB unreachable, ambiguous state), deny the request. Never default to allow.
2. **Ollama never reachable from outside the Docker internal network.** No `ports:` mapping for the ollama service in any compose file shipped with the project. Document this prominently.
3. **No secrets in code, no secrets in logs, no secrets in errors.** Argon2id for key storage. Constant-time comparison only. Keys printed exactly once at creation.
4. **No reflected upstream errors.** Ollama errors are sanitized at the gateway boundary. Map to generic 4xx/5xx with a request ID.
5. **Mutating Ollama endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`) are hard-blocked.** Not configurable. Not behind a feature flag. Blocked.
6. **Streaming integrity.** Token counting and audit writes happen **after** stream close, never on the hot path. Time-to-first-byte must not be degraded by gateway bookkeeping.
7. **`mypy --strict` and `ruff check` clean before any PR is opened.** No `# type: ignore` without an inline justification comment.
8. **Test coverage targets (§9) are a gate, not a goal.** 100% on `auth/`, `ratelimit/`, `budget/`. CI fails below threshold.
9. **Apache 2.0 license file present from commit one.** No GPL dependencies.
10. **The bootstrap CLI must work before the first manual `curl`.** No "I'll create a key by hand in the DB just to test it" — if the CLI can't create a key, fix the CLI first.
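Non-negotiable 6 (streaming integrity) could look roughly like the sketch below: upstream NDJSON lines are forwarded untouched, and the only JSON parsing happens once, after the stream is exhausted. The names `Usage` and `relay_ndjson` are hypothetical and not taken from SPEC.md; the sketch assumes Ollama's final chunk carries `done: true` together with `prompt_eval_count` and `eval_count`.

```python
from __future__ import annotations

import json
from collections.abc import Iterable, Iterator
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts filled in after the stream has closed (hypothetical type)."""

    tokens_in: int = 0
    tokens_out: int = 0


def relay_ndjson(lines: Iterable[bytes], usage: Usage) -> Iterator[bytes]:
    """Yield upstream NDJSON lines unchanged; count tokens only after the last one."""
    last = b""
    for line in lines:
        if line.strip():
            last = line  # remember the most recent non-empty line, do not parse it yet
        yield line  # forwarded immediately, before any bookkeeping
    # Reached only once the upstream iterator is exhausted (stream close).
    try:
        final = json.loads(last) if last else {}
    except json.JSONDecodeError:
        final = {}
    if isinstance(final, dict) and final.get("done"):
        usage.tokens_in = int(final.get("prompt_eval_count", 0))
        usage.tokens_out = int(final.get("eval_count", 0))
```

In a real handler the audit write and budget update would be triggered from the same post-stream point, so nothing in the per-chunk loop touches Redis or Postgres.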
## Phasing
Five phases. Each phase has an explicit exit criterion. **Do not start phase N+1 until phase N's exit criterion is verifiably met.** PM/Control: enforce this.
### Phase 1 — Scaffold
- Repo layout per SPEC §8
- `pyproject.toml`, `uv.lock`, Dockerfile, docker-compose.yml, docker-compose.dev.yml, .env.example, README, LICENSE
- Alembic configured; migration `0001_initial.py` creates schema `gateway` and all tables per SPEC §5
- `make` or `just` targets: `dev`, `test`, `lint`, `typecheck`, `migrate`, `compose-up`, `compose-down`
- CI workflow runs: ruff, mypy, pytest, bandit, pip-audit
- **Exit criterion:** `docker compose -f docker-compose.dev.yml up` brings up postgres + redis + a stub gateway that responds 200 on `/healthz` and 503 on `/readyz` (because no Ollama yet). Migrations apply cleanly. CI is green on an empty test suite.
### Phase 2 — Core proxy + auth
- Bootstrap CLI (`create-tenant`, `create-key`, `list-keys`, `revoke-key`) working end-to-end
- Argon2id hashing module with unit tests covering: hash, verify, constant-time behavior, rehash-on-parameter-change
- Auth middleware: Bearer extraction, prefix lookup, hash verify, Redis cache with TTL
- Ollama proxy for `/api/chat` and `/api/generate` — both streamed (NDJSON) and non-streamed
- Endpoint allowlist enforced
- **Model discovery (SPEC §4.6):** background poll of Ollama `/api/tags`, cached in Redis + in-process, fail-closed when unavailable
- Model allowlist enforced per-tenant via the **effective set** (allow_all → all discovered; else `allowed_models ∩ discovered`); key-level `allow_all_models` overrides tenant
- Error handler: sanitized responses, request ID in every error
- Audit log writer (buffered, async)
- Mock Ollama in `tests/integration/mock_ollama.py` (no real model required for CI)
- **Exit criterion:** A key created via CLI can call `/api/chat` and `/api/generate` through Caddy → gateway → mock Ollama, streaming works, audit rows land in Postgres with correct token counts, `/api/pull` returns 403, no-auth returns 401, wrong-key returns 401. Model discovery populates from the (mock) Ollama `/api/tags`; `/api/tags` returns the tenant's effective set; an `allow_all_models` tenant sees all discovered models, a default-deny tenant sees only `allowed ∩ discovered`, and a non-effective model returns 403; discovery-unavailable fails closed. Integration tests cover all of the above.
### Phase 3 — Rate limit + budget + OpenAI-compat
- Sliding window rate limit (Redis Lua script) — per-key RPM, per-tenant RPM, per-key TPM
- Concurrency semaphore (Redis-backed) with TTL guard
- Token budget counters in Redis with Postgres ledger reconciliation on period rollover
- OpenAI-compatibility layer: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models` with full SSE streaming and `data: [DONE]` terminator
- Schema translation tests with golden fixtures (request in OpenAI → expected Ollama request; response from Ollama → expected OpenAI response)
- Rate-limit and budget response headers per SPEC §6.5
- **Exit criterion:** Locust test (100 concurrent users, 5 min) shows correct 429 behavior at the limit, correct token accounting, p99 gateway overhead < 25 ms. OpenAI Python SDK pointed at `/v1` successfully completes streaming chat. Killing Redis mid-test produces 503 (fail closed), not 200.
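As an illustration of the sliding-window behavior Phase 3 asks for, here is an in-memory stand-in for the logic the Redis Lua script would run atomically. It is a sketch only: the class name `SlidingWindow` and the sliding-log approach are assumptions, and the real implementation is expected to live in Redis so that all workers share state.

```python
from __future__ import annotations

from collections import deque


class SlidingWindow:
    """In-memory sliding-window log; a stand-in for the Redis Lua script."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: deque[float] = deque()

    def allow(self, now: float) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for a request arriving at `now`."""
        cutoff = now - self.window_s
        # Drop timestamps that have slid out of the window.
        while self._hits and self._hits[0] <= cutoff:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            # The oldest hit leaves the window at hits[0] + window_s.
            return False, self._hits[0] + self.window_s - now
        self._hits.append(now)
        return True, 0.0
```

The second tuple element maps naturally onto the `Retry-After` header of a 429 response.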
### Phase 4 — Audit, prompt log, revocation
- Prompt log (opt-in per key, TTL) with daily sweeper task
- Audit log retention sweeper (TTL per tenant config)
- Buffered audit writer with ring-buffer overflow → deny-mode behavior
- Revocation flow: console (simulated via direct INSERT in tests) writes `gateway.revocations` → NOTIFY → gateway evicts Redis cache → next request with revoked key returns 401 within 1 second
- Prometheus `/metrics` (loopback only) with: `gateway_requests_total{tenant,model,status}`, `gateway_tokens_total{tenant,model,direction}`, `gateway_request_duration_seconds{tenant,model}` (histogram)
- `/readyz` checks DB + Redis + Ollama all reachable
- Circuit breaker on Ollama failures
- **Exit criterion:** Revocation E2E test green. Prompt log retention TTL works (use freeze-time to simulate). Metrics scrape returns valid Prometheus exposition. `/readyz` flips to 503 when any dependency is down.
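The ring-buffer overflow rule from this phase can be sketched as follows. `AuditBuffer` and its method names are hypothetical; the sketch assumes "deny mode" simply means the gateway stops accepting requests while the buffer is full and resumes once the writer has drained rows to Postgres.

```python
from __future__ import annotations

from collections import deque
from typing import Any


class AuditBuffer:
    """Bounded in-memory buffer for audit rows that could not be written yet."""

    def __init__(self, capacity: int = 1000) -> None:
        self.capacity = capacity
        self._rows: deque[dict[str, Any]] = deque()

    @property
    def deny_mode(self) -> bool:
        # A full buffer means further requests could not be audited: fail closed.
        return len(self._rows) >= self.capacity

    def push(self, row: dict[str, Any]) -> bool:
        """Buffer a row; return False (and drop nothing) when the buffer is full."""
        if self.deny_mode:
            return False
        self._rows.append(row)
        return True

    def drain(self, max_rows: int) -> list[dict[str, Any]]:
        """Remove and return up to `max_rows` rows, oldest first, for a DB flush."""
        out: list[dict[str, Any]] = []
        while self._rows and len(out) < max_rows:
            out.append(self._rows.popleft())
        return out
```

A request path would check `deny_mode` before proxying and answer 503 while it is set.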
### Phase 5 — Harden, document, release
- `docs/ARCHITECTURE.md`, `docs/DEPLOYMENT.md`, `docs/API.md`, `docs/THREAT_MODEL.md`, `docs/OPERATIONS.md` complete
- Caddyfile example with Let's Encrypt for `api.neuronetz.ai` and security headers (HSTS, X-Content-Type-Options, no Server header, no X-Powered-By)
- Systemd unit file for non-Compose deployments
- Multi-stage Dockerfile with non-root user, distroless or `python:3.12-slim` final stage, no build tools in final image
- `pip-audit` and `bandit` clean in CI
- Image scan (Trivy or Grype) clean of HIGH/CRITICAL
- Tag `v0.1.0`, build and push image, GitHub release with changelog
- **Exit criterion:** Every box in SPEC §12 checked, signed off by Control. Image runnable from a fresh host with only docker + a `.env`. README quickstart works for someone who has never seen the repo.
## Agent Role Assignments
For the multi-agent orchestrator (Fritz/UI-UX/DevOps/QA/Control/Timo/PM):
| Agent | Owns |
|---|---|
| **Backend / Fritz** | All Python code under `src/neuronetz_gateway/`, Alembic migrations, CLI. Primary author. |
| **DevOps** | Dockerfile, docker-compose.yml(s), Caddyfile, systemd unit, CI workflows, image scanning, release tagging. |
| **QA** | All tests under `tests/`. Owns coverage gate. Writes the locust scenarios. Verifies acceptance criteria at each phase exit. |
| **UI-UX** | Not active this project (no UI surface here). Console project will pick this up. |
| **Control / Timo** | Enforces phase gates. Refuses to advance a phase whose exit criterion isn't met. Runs the acceptance checklist at end of Phase 5. |
| **PM** | Tracks the phase progression, opens YouTrack tickets per phase, runs daily standups against this prompt, surfaces blockers. |
## Working Agreements
- **Branch per phase.** `phase-1-scaffold`, `phase-2-proxy-auth`, etc. Merge to `main` only after phase exit criterion is verified.
- **PRs are reviewed against SPEC.md.** "Does this match the spec? If not, is SPEC.md wrong or is the PR wrong?" — that's the review question.
- **SPEC changes are explicit.** If a phase reveals a spec mistake, amend SPEC.md in a separate PR before changing the implementation. Never drift silently.
- **Commit messages reference the section.** e.g. `auth: implement argon2id verify per SPEC §5, §9`.
- **No TODOs in main.** If something is deferred, it becomes a tracked issue, not a code comment.
- **Open questions (SPEC §13) are resolved in writing.** Decision goes in SPEC.md, not in a Slack message that gets lost.
## What "Done" Looks Like
A fresh clone, a fresh host, a domain pointing at it, and a `.env` file. `docker compose up`. Five minutes later, `curl -H "Authorization: Bearer nz_..." https://api.neuronetz.ai/v1/chat/completions -d '...'` streams a response. The Ollama port is not open. The audit log has a row. The budget counter decremented. The metrics endpoint shows the request. The locust suite passes. The threat model document explains every defense.
When all of that is true and SPEC §12 is fully ticked, ship v0.1.0.
## When You Get Stuck
- **Ambiguity in the spec → ask, don't guess.** Open a question in the PM channel; if resolved, amend SPEC.md.
- **Conflict between speed and correctness → correctness wins.** This is security infrastructure. We do not ship "good enough."
- **Conflict between scope creep and v0.1.0 → defer.** New ideas go in a follow-up issue. v0.1.0 ships per spec.
Start with Phase 1. Read SPEC.md first.

scope-docs/SPEC.md Normal file

@@ -0,0 +1,593 @@
# neuronetz-gateway — SPEC.md
**Project:** `neuronetz-gateway`
**Version:** 0.1.0 (target)
**Status:** Specification — not yet implemented
**License:** Apache 2.0
**Owner:** Stephan Berbig / Neuronetz
---
## 1. Purpose
A secure, multi-tenant API gateway in front of an Ollama instance currently exposed at `https://api.neuronetz.ai`. The Ollama endpoint must never be reachable directly from the public internet again. All access flows through this gateway.
The gateway is the **hot path** of the Neuronetz API. A separate service (`neuronetz-console`, built on the Nibiru PHP framework) handles administration, dashboards, and tenant self-service. This SPEC covers only the gateway.
## 2. Scope
### In scope (v0.1.0)
- Authentication via API keys (Bearer tokens)
- Multi-tenant data model (tenants → keys, with inheritance)
- Per-key and per-tenant rate limiting (RPM, TPM, concurrent)
- Per-key and per-tenant token budgets (daily, monthly, total)
- Streaming and non-streaming proxy to Ollama
- Dual API surface: native Ollama (`/api/*`) and OpenAI-compatible (`/v1/*`)
- Endpoint allowlist (block all model-mutating Ollama endpoints)
- **Dynamic model discovery** from the Ollama backend — the live set of installed models is queried, cached, and auto-refreshed; nothing about the model list is hand-maintained
- Model allowlist (per-tenant override), **default-deny, resolved against the live discovered set** (stale/typo'd entries never resolve)
- **Per-tenant `allow_all_models` toggle** — opt-in: a flagged tenant may use any currently-installed model, so models newly pulled into Ollama are auto-granted on the next discovery refresh
- Request size limits, response size limits, timeouts
- Token counting from Ollama responses (precise, not heuristic)
- Audit log (always-on metadata)
- Prompt log (opt-in per key, TTL'd retention)
- Bootstrap CLI: create tenants, keys, set budgets
- Health and readiness endpoints
- Docker Compose deployment (gateway + caddy + postgres + redis + ollama)
- Caddy as TLS terminator (Let's Encrypt for `api.neuronetz.ai`)
### Out of scope (v0.1.0, document as future)
- Web admin UI (lives in `neuronetz-console`, separate repo)
- Billing / Stripe integration (budgets only, no money yet)
- Multi-region / HA / k8s
- Content moderation / prompt-injection filtering
- Response caching
- Multi-backend routing (one Ollama; pluggable backend interface stays for later)
- Webhook notifications
- SSO / OAuth2 for admin
## 3. Threat Model (abbreviated)
| Threat | Mitigation |
|---|---|
| Internet scanners hitting Ollama directly | Ollama bound to internal Docker network; never published |
| Unauthenticated API abuse | Mandatory Bearer token; fail-closed on auth errors |
| API key brute force | Argon2id hashing; constant-time compare; rate limit on auth failures per source IP |
| GPU/token exhaustion (cost attack) | Per-key TPM + token budget; per-tenant ceiling; concurrent connection cap |
| Resource exhaustion via large payloads | Request body size limit (default 256 KiB); `num_predict` cap (default 4096) |
| Model enumeration / training-data exfil via uncommon models | Model allowlist; default-deny. `allow_all_models` is **opt-in per tenant and audited**. Discovery only ever exposes models actually installed on the backend; `/api/tags` and `/v1/models` never reveal models outside the tenant's effective set; "not allowed" and "doesn't exist" return the same generic response |
| Discovery backend unreachable | Fail-closed: an empty/stale-expired discovered set means no model resolves, so requests are denied — never "allow because we couldn't list models" |
| Ollama mutation (model pull/delete) by attacker | Endpoint allowlist; mutating endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`) hard-blocked at the gateway |
| Information disclosure via error messages | Sanitize upstream errors; never proxy Ollama internals to client |
| Audit log tampering | Append-only at app layer; DB role separation; optional WAL archiving |
| Prompt data leakage | Prompt logging off by default; opt-in per key; TTL'd; redaction hook |
| Redis outage causing "fail open" | Fail-closed: if rate-limit/budget backend is unavailable, deny |
| Compromised admin token | Admin token lives in `neuronetz-console`, not in gateway; gateway has no admin endpoints |
## 4. Architecture
### 4.1 Component diagram
```
                 Internet
                     │ TLS
          ┌──────────────────────┐
          │  Caddy (sidecar)     │  Let's Encrypt for api.neuronetz.ai
          │  - TLS termination   │  HSTS, security headers
          │  - HTTP/2, HTTP/3    │
          └──────────┬───────────┘
                     │ HTTP/1.1 internal
          ┌──────────▼───────────┐
          │  neuronetz-gateway   │  FastAPI + uvicorn
          │  - authn             │
          │  - rate limit        │
          │  - budget check      │
          │  - proxy + stream    │
          │  - token count       │
          │  - audit write       │
          └──┬────────┬──────┬───┘
             │        │      │
      ┌──────▼──┐  ┌──▼───┐  │
      │Postgres │  │Redis │  │
      │ schema: │  │ keys │  │
      │ gateway │  │bucket│  │
      └─────────┘  └──────┘  │
                             │ internal network only
                      ┌──────▼──────┐
                      │   Ollama    │
                      │  127.0.0.1  │
                      └─────────────┘
Same Compose stack also hosts (separate from this SPEC):
- neuronetz-console (PHP/Nibiru) → reads schema `console`, reads schema `gateway` (SELECT)
```
### 4.2 Database schemas
**Single Postgres instance, two schemas:**
- `gateway` — owned by the gateway service; gateway role has full DDL
- `console` — owned by `neuronetz-console` (out of scope here); console role has full DDL
- Both services connect with their own role. Cross-schema access is explicit GRANT.
**Console role gets `SELECT` on all `gateway.*` tables.** Console writes go only to `console.*` tables. If the console needs to mutate gateway state (e.g. revoke a key), it does so by writing to a `gateway.revocations` outbox table that the gateway tails (see §4.5).
### 4.3 Request lifecycle
1. Caddy terminates TLS, forwards to gateway on internal port.
2. Gateway middleware extracts `Authorization: Bearer <key>`.
3. Key prefix (first 12 chars) used as Redis cache key. On miss, lookup `gateway.api_keys` by prefix; verify full key with argon2id `verify`; cache resolved key metadata in Redis (TTL 60s).
4. Rate limit check (sliding window in Redis, Lua-atomic) — per-key RPM + per-tenant RPM.
5. Budget check (Redis counter for current period; Postgres ledger is source of truth on reset).
6. Concurrent-connection semaphore (Redis `INCR` with TTL).
7. Model allowlist check. Resolve the **effective model set** for the key:
`allow_all := key.allow_all_models ?? tenant.allow_all_models`;
`effective := discovered` if `allow_all` else `(key.allowed_models ?? tenant.allowed_models) ∩ discovered`,
where `discovered` is the cached live model set from discovery (§4.6). The request's
`model` must be in `effective`, else a generic 403 with no disclosure of whether the
model exists but is unpermitted vs. is not installed.
8. Endpoint allowlist check.
9. Request body validation (size, schema, `num_predict` cap).
10. If OpenAI-compat path, translate request to Ollama schema.
11. Open httpx async stream to Ollama.
12. Stream response back to client, accumulating final `prompt_eval_count` + `eval_count`.
13. On stream close: write `gateway.audit_log` row; decrement budget; release semaphore; if prompt logging enabled, write `gateway.prompt_log` row.
14. On any failure: sanitized error to client, audit row with status code, semaphore released.
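Step 7 of the lifecycle above can be sketched as a small pure function. The function names are hypothetical; the `None` checks model the `??` (inherit-from-tenant) semantics, so an explicitly empty key-level list means "no models" rather than "inherit".

```python
from __future__ import annotations

from typing import Optional


def effective_models(
    discovered: frozenset[str],
    tenant_allow_all: bool,
    tenant_allowed: list[str],
    key_allow_all: Optional[bool] = None,
    key_allowed: Optional[list[str]] = None,
) -> frozenset[str]:
    """Resolve the effective model set for one key."""
    # Key-level values win when set; None means "inherit from tenant".
    allow_all = tenant_allow_all if key_allow_all is None else key_allow_all
    if allow_all:
        return discovered
    allowed = tenant_allowed if key_allowed is None else key_allowed
    # Intersecting with `discovered` drops stale or mistyped allowlist entries.
    return frozenset(allowed) & discovered


def is_permitted(model: str, effective: frozenset[str]) -> bool:
    """True if the request may proceed; the caller maps False to a generic 403."""
    return model in effective
```

Because an empty `discovered` set yields an empty effective set in both branches, a discovery outage denies every model without any extra code path.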
### 4.4 Failure modes (fail-closed)
| Subsystem | If down | Behavior |
|---|---|---|
| Postgres (read) | Key lookup fails | 503 with retry-after; no requests proxied |
| Postgres (write) | Audit write fails | Request still succeeds, audit row buffered in-memory ring (max 1000), drained on recovery; if buffer fills, switch to deny mode |
| Redis | Rate limit / budget unavailable | 503 — fail closed. Never "allow because we can't check." |
| Ollama | Upstream unreachable | 502 with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30s |
| Caddy | Not a gateway concern | — |
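The Ollama row in the table above describes a circuit breaker that opens after 5 consecutive failures and goes half-open after 30 s. A minimal sketch under those numbers follows; the class and method names are hypothetical, and time is passed in explicitly to keep the logic testable. As a simplification, the half-open state here admits every request until a result is recorded, rather than a single probe.

```python
from __future__ import annotations


class CircuitBreaker:
    """Consecutive-failure breaker: closed -> open -> half-open -> closed."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self._failures = 0
        self._opened_at: float | None = None

    def allow_request(self, now: float) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: reject until the cooldown has elapsed, then allow probes (half-open).
        return now - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        # Any success closes the breaker and resets the failure streak.
        self._failures = 0
        self._opened_at = None

    def record_failure(self, now: float) -> None:
        self._failures += 1
        if self._failures >= self.threshold:
            # Opens on the threshold; a failed half-open probe restarts the cooldown.
            self._opened_at = now
```

While `allow_request` returns `False` the gateway would answer 502 with `Retry-After` without contacting Ollama at all.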
### 4.5 Cache invalidation (key revocation)
Console can revoke a key by inserting into `gateway.revocations(key_id, ts, reason)`. Gateway has a background task (`asyncio.create_task` in lifespan) that:
- LISTENs on Postgres channel `key_revoked` (gateway emits NOTIFY on its own write path; console emits via INSERT trigger)
- On notification, evicts the Redis cache entry for that key's prefix
- This makes revocation effectively immediate (bounded by NOTIFY delivery plus one Redis round trip) without cross-service HTTP
### 4.6 Model discovery
The set of usable models is **never hand-maintained**; it is extracted live from the
Ollama backend.
- A background task (started in lifespan, like the revocation listener) polls Ollama
`GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds.
- The parsed model set (names + sanitized metadata: family, parameter size, quantization,
size bytes, modified-at) is cached in Redis under `gateway:models:discovered` with TTL
`MODEL_DISCOVERY_CACHE_TTL_S`, and held in-process for hot reads on the request path.
- On startup an initial fetch runs; if Ollama is unreachable the discovered set is empty.
- **Fail-closed:** if the discovered set is empty or its cache has expired and cannot be
refreshed, no model resolves and requests are denied (consistent with default-deny).
Discovery never opens access on failure.
- "Auto-grant": because the effective set (§4.3 step 7) intersects with `discovered` (or
*is* `discovered` when `allow_all_models`), a model pulled into Ollama out-of-band
becomes usable to `allow_all` tenants on the next refresh — no per-tenant config change.
- Discovery is **read-only** against Ollama and uses only the allowlisted `/api/tags`
endpoint; it never triggers a model pull.
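The parsing and fail-closed expiry described in this section might be sketched like this. `ModelInfo`, `parse_tags`, and `DiscoveredSet` are hypothetical names; the field names read from the payload (`models`, `name`, `size`, `modified_at`, `details.family`, `details.parameter_size`, `details.quantization_level`) follow the shape of Ollama's `/api/tags` response as I understand it and should be checked against the Ollama version in use. Only the in-process side is shown; the Redis copy is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ModelInfo:
    name: str
    family: str | None
    parameter_size: str | None
    quantization: str | None
    size_bytes: int | None
    modified_at: str | None


def parse_tags(payload: Any) -> dict[str, ModelInfo]:
    """Extract sanitized metadata; malformed input yields an empty (deny-all) result."""
    models = payload.get("models") if isinstance(payload, dict) else None
    if not isinstance(models, list):
        return {}
    out: dict[str, ModelInfo] = {}
    for entry in models:
        if not isinstance(entry, dict) or not isinstance(entry.get("name"), str):
            continue  # skip entries without a usable name
        details = entry.get("details")
        details = details if isinstance(details, dict) else {}
        size = entry.get("size")
        out[entry["name"]] = ModelInfo(
            name=entry["name"],
            family=details.get("family"),
            parameter_size=details.get("parameter_size"),
            quantization=details.get("quantization_level"),
            size_bytes=size if isinstance(size, int) else None,
            modified_at=entry.get("modified_at"),
        )
    return out


class DiscoveredSet:
    """In-process copy of the discovered set with a hard expiry."""

    def __init__(self, ttl_s: float = 120.0) -> None:
        self._ttl_s = ttl_s
        self._models: dict[str, ModelInfo] = {}
        self._fetched_at: float | None = None

    def update(self, models: dict[str, ModelInfo], now: float) -> None:
        self._models = dict(models)
        self._fetched_at = now

    def current(self, now: float) -> frozenset[str]:
        # Never fetched, or older than the TTL: nothing resolves (fail closed).
        if self._fetched_at is None or now - self._fetched_at > self._ttl_s:
            return frozenset()
        return frozenset(self._models)
```

The background poller would call `update` after each successful fetch and simply skip the call on failure, letting the TTL expire the old set.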
## 5. Data Model (schema `gateway`)
```sql
CREATE SCHEMA gateway;
CREATE TYPE gateway.key_status AS ENUM ('active', 'disabled', 'revoked');
CREATE TYPE gateway.tenant_status AS ENUM ('active', 'suspended', 'closed');
CREATE TYPE gateway.budget_period AS ENUM ('day', 'month', 'total');
CREATE TABLE gateway.tenants (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
name text NOT NULL UNIQUE,
status gateway.tenant_status NOT NULL DEFAULT 'active',
created_at timestamptz NOT NULL DEFAULT now(),
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE TABLE gateway.tenant_limits (
tenant_id uuid PRIMARY KEY REFERENCES gateway.tenants(id) ON DELETE CASCADE,
rpm integer NOT NULL DEFAULT 60,
tpm integer NOT NULL DEFAULT 100000,
concurrent integer NOT NULL DEFAULT 8,
tokens_daily bigint,
tokens_monthly bigint,
tokens_total bigint,
allowed_models text[] NOT NULL DEFAULT '{}',
allow_all_models boolean NOT NULL DEFAULT false, -- opt-in: allow any installed model
log_prompts_default boolean NOT NULL DEFAULT false,
prompt_retention_days integer NOT NULL DEFAULT 30,
audit_retention_days integer NOT NULL DEFAULT 365
);
CREATE TABLE gateway.api_keys (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id uuid NOT NULL REFERENCES gateway.tenants(id) ON DELETE CASCADE,
prefix text NOT NULL UNIQUE, -- first 12 chars, indexed
key_hash text NOT NULL, -- argon2id
name text NOT NULL,
status gateway.key_status NOT NULL DEFAULT 'active',
scopes text[] NOT NULL DEFAULT '{chat,embeddings}',
created_at timestamptz NOT NULL DEFAULT now(),
last_used_at timestamptz,
expires_at timestamptz,
log_prompts boolean, -- NULL = inherit from tenant
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE INDEX idx_api_keys_prefix ON gateway.api_keys(prefix) WHERE status = 'active';
CREATE INDEX idx_api_keys_tenant ON gateway.api_keys(tenant_id);
CREATE TABLE gateway.key_limits (
key_id uuid PRIMARY KEY REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
rpm integer, -- NULL = inherit tenant
tpm integer,
concurrent integer,
tokens_daily bigint,
tokens_monthly bigint,
tokens_total bigint,
allowed_models text[], -- NULL = inherit tenant
allow_all_models boolean -- NULL = inherit tenant
);
CREATE TABLE gateway.budget_usage (
key_id uuid NOT NULL REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
period gateway.budget_period NOT NULL,
period_start timestamptz NOT NULL,
tokens_in bigint NOT NULL DEFAULT 0,
tokens_out bigint NOT NULL DEFAULT 0,
requests bigint NOT NULL DEFAULT 0,
PRIMARY KEY (key_id, period, period_start)
);
CREATE INDEX idx_budget_usage_period ON gateway.budget_usage(period, period_start);
CREATE TABLE gateway.audit_log (
id bigserial PRIMARY KEY,
ts timestamptz NOT NULL DEFAULT now(),
request_id uuid NOT NULL,
tenant_id uuid, -- nullable for auth-failed rows
key_id uuid,
key_prefix text, -- denormalized for forensic queries
method text NOT NULL,
path text NOT NULL,
model text,
tokens_in integer,
tokens_out integer,
latency_ms integer,
status integer NOT NULL,
client_ip inet,
user_agent text,
error_code text
);
CREATE INDEX idx_audit_ts ON gateway.audit_log(ts);
CREATE INDEX idx_audit_tenant_ts ON gateway.audit_log(tenant_id, ts);
CREATE INDEX idx_audit_key_ts ON gateway.audit_log(key_id, ts);
CREATE TABLE gateway.prompt_log (
id bigserial PRIMARY KEY,
audit_id bigint NOT NULL REFERENCES gateway.audit_log(id) ON DELETE CASCADE,
ts timestamptz NOT NULL DEFAULT now(),
key_id uuid NOT NULL,
request_body jsonb NOT NULL,
response_text text,
retention_until timestamptz NOT NULL
);
CREATE INDEX idx_prompt_log_retention ON gateway.prompt_log(retention_until);
CREATE TABLE gateway.revocations (
id bigserial PRIMARY KEY,
key_id uuid NOT NULL,
ts timestamptz NOT NULL DEFAULT now(),
reason text,
processed_at timestamptz
);
-- Trigger to NOTIFY on revocation insert
CREATE OR REPLACE FUNCTION gateway.notify_key_revoked() RETURNS trigger AS $$
BEGIN
PERFORM pg_notify('key_revoked', NEW.key_id::text);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_notify_key_revoked
AFTER INSERT ON gateway.revocations
FOR EACH ROW EXECUTE FUNCTION gateway.notify_key_revoked();
-- Grants for console role (created in console SPEC, referenced here)
-- GRANT USAGE ON SCHEMA gateway TO console_role;
-- GRANT SELECT ON ALL TABLES IN SCHEMA gateway TO console_role;
-- GRANT INSERT ON gateway.revocations TO console_role;
```
## 6. API Surface
### 6.1 Native Ollama passthrough (allowlisted)
| Path | Method | Notes |
|---|---|---|
| `/api/chat` | POST | Streamed (NDJSON) and non-streamed |
| `/api/generate` | POST | Streamed (NDJSON) and non-streamed |
| `/api/embeddings` | POST | Non-streamed |
| `/api/embed` | POST | Newer Ollama embeddings endpoint |
| `/api/tags` | GET | Returns the tenant's **effective** model set (live-discovered ∩ allowed, or *all* discovered when `allow_all_models`). Sourced from discovery (§4.6), never a static list |
| `/api/show` | POST | Allowed only for models in the tenant's effective set; returns sanitized model info (no system prompts, no template) |
| `/api/ps` | GET | **Blocked** — leaks loaded models |
| `/api/version` | GET | Returns gateway version, not Ollama version |
### 6.2 Hard-blocked Ollama endpoints (always 403)
`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`
### 6.3 OpenAI-compatible
| Path | Method | Maps to |
|---|---|---|
| `/v1/chat/completions` | POST | `/api/chat` |
| `/v1/completions` | POST | `/api/generate` |
| `/v1/embeddings` | POST | `/api/embed` |
| `/v1/models` | GET | `/api/tags` (the tenant's effective discovered set), in OpenAI model-list format |
Translation must preserve streaming. SSE (`data: {...}\n\n`) for OpenAI-compat; NDJSON for native.
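To make the NDJSON-to-SSE requirement concrete, here is a rough sketch of translating a streamed `/api/chat` response into OpenAI-style `chat.completion.chunk` events. The function name and the `completion_id` / `created` parameters are hypothetical, and the sketch assumes every finished stream maps to `finish_reason: "stop"`, which a real translator would refine (for example when generation stops on a length cap).

```python
from __future__ import annotations

import json
from collections.abc import Iterable, Iterator


def ollama_chat_to_sse(
    lines: Iterable[bytes], completion_id: str, created: int
) -> Iterator[bytes]:
    """Translate Ollama NDJSON chat chunks into OpenAI-style SSE events."""
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank keep-alive lines
        chunk = json.loads(raw)
        done = bool(chunk.get("done"))
        content = (chunk.get("message") or {}).get("content", "")
        event = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": chunk.get("model", ""),
            "choices": [
                {
                    "index": 0,
                    "delta": {"content": content} if content else {},
                    # Assumption: every completed stream is reported as "stop".
                    "finish_reason": "stop" if done else None,
                }
            ],
        }
        yield b"data: " + json.dumps(event).encode() + b"\n\n"
    # OpenAI clients expect this literal terminator after the last chunk.
    yield b"data: [DONE]\n\n"
```

Each input line produces exactly one output event, so the translation adds no buffering between Ollama and the client.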
### 6.4 Gateway endpoints
| Path | Method | Auth | Purpose |
|---|---|---|---|
| `/healthz` | GET | none | Liveness — process responsive |
| `/readyz` | GET | none | Readiness — DB + Redis + Ollama all reachable |
| `/metrics` | GET | none (loopback only) | Prometheus exposition (counters, histograms) |
No admin endpoints. Admin lives in `neuronetz-console`.
### 6.5 Response headers
Every proxied response carries:
- `X-Request-ID: <uuid>`
- `X-RateLimit-Limit-Requests: <n>`
- `X-RateLimit-Remaining-Requests: <n>`
- `X-RateLimit-Limit-Tokens: <n>`
- `X-RateLimit-Remaining-Tokens: <n>`
- `X-Budget-Period: day|month|total`
- `X-Budget-Tokens-Remaining: <n>`
429 responses additionally carry `Retry-After: <seconds>`.
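A small helper along these lines could assemble the header set listed above. The function name and parameters are hypothetical; clamping negative remainders to zero and rounding `Retry-After` up to at least one second are assumptions, not requirements stated in this SPEC.

```python
from __future__ import annotations

import math


def limit_headers(
    request_id: str,
    rpm_limit: int,
    rpm_remaining: int,
    tpm_limit: int,
    tpm_remaining: int,
    budget_period: str,
    budget_remaining: int,
    retry_after_s: float | None = None,
) -> dict[str, str]:
    """Build the per-response rate-limit and budget headers."""
    if budget_period not in {"day", "month", "total"}:
        raise ValueError(f"unknown budget period: {budget_period!r}")
    headers = {
        "X-Request-ID": request_id,
        "X-RateLimit-Limit-Requests": str(rpm_limit),
        "X-RateLimit-Remaining-Requests": str(max(0, rpm_remaining)),
        "X-RateLimit-Limit-Tokens": str(tpm_limit),
        "X-RateLimit-Remaining-Tokens": str(max(0, tpm_remaining)),
        "X-Budget-Period": budget_period,
        "X-Budget-Tokens-Remaining": str(max(0, budget_remaining)),
    }
    if retry_after_s is not None:
        # Only 429 responses pass a value here; whole seconds, never zero.
        headers["Retry-After"] = str(max(1, math.ceil(retry_after_s)))
    return headers
```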
## 7. Configuration
All via environment variables, validated by Pydantic Settings on boot. Boot fails loudly on invalid config.
```
# Service
GATEWAY_BIND_HOST=0.0.0.0
GATEWAY_BIND_PORT=8080
GATEWAY_LOG_LEVEL=INFO
GATEWAY_LOG_FORMAT=json # json|console
GATEWAY_REQUEST_ID_HEADER=X-Request-ID
GATEWAY_TRUSTED_PROXIES=127.0.0.1,caddy # for X-Forwarded-For
# Upstream
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_CONNECT_TIMEOUT_S=5
OLLAMA_READ_TIMEOUT_S=600
OLLAMA_MAX_CONNECTIONS=64
# Model discovery (§4.6)
MODEL_DISCOVERY_REFRESH_S=60 # how often to re-query Ollama /api/tags
MODEL_DISCOVERY_CACHE_TTL_S=120 # Redis cache TTL for the discovered model set
# Database
DATABASE_URL=postgresql+asyncpg://gateway:...@postgres:5432/neuronetz
DATABASE_POOL_SIZE=10
DATABASE_POOL_OVERFLOW=20
# Redis
REDIS_URL=redis://redis:6379/0
REDIS_KEY_CACHE_TTL_S=60
# Limits (defaults; per-tenant/key overrides in DB)
DEFAULT_RPM=60
DEFAULT_TPM=100000
DEFAULT_CONCURRENT=8
MAX_REQUEST_BODY_BYTES=262144
MAX_NUM_PREDICT=4096
# Security
ARGON2_TIME_COST=3
ARGON2_MEMORY_COST_KIB=65536
ARGON2_PARALLELISM=4
AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN=20
# Audit
AUDIT_BUFFER_SIZE=1000
PROMPT_LOG_DEFAULT_RETENTION_DAYS=30
AUDIT_LOG_DEFAULT_RETENTION_DAYS=365
```
## 8. Repository Layout
```
neuronetz-gateway/
├── pyproject.toml  # uv-managed, ruff, mypy --strict, pytest
├── README.md
├── LICENSE  # Apache 2.0
├── docker-compose.yml  # full stack incl. console placeholder
├── docker-compose.dev.yml  # without caddy, gateway exposed on localhost
├── Dockerfile  # multi-stage, python:3.12-slim base
├── .env.example
├── .dockerignore
├── .gitignore
├── alembic.ini
├── alembic/
│   ├── env.py
│   └── versions/
│       └── 0001_initial.py  # creates schema `gateway` and all tables
├── ops/
│   ├── caddy/
│   │   └── Caddyfile.example
│   └── systemd/
│       └── neuronetz-gateway.service
├── src/neuronetz_gateway/
│   ├── __init__.py
│   ├── __main__.py  # uvicorn entry
│   ├── app.py  # FastAPI factory
│   ├── config.py  # Pydantic Settings
│   ├── deps.py  # DI providers
│   ├── lifespan.py  # startup/shutdown, NOTIFY listener
│   ├── errors.py  # exception types, handlers, sanitization
│   ├── auth/
│   │   ├── __init__.py
│   │   ├── hashing.py  # argon2id wrapper
│   │   ├── keys.py  # key generation, prefix, verify
│   │   └── middleware.py
│   ├── ratelimit/
│   │   ├── __init__.py
│   │   ├── sliding_window.py  # Redis Lua script
│   │   └── concurrency.py  # semaphore via Redis
│   ├── budget/
│   │   ├── __init__.py
│   │   ├── counter.py  # Redis period counters
│   │   └── ledger.py  # Postgres reconciliation
│   ├── proxy/
│   │   ├── __init__.py
│   │   ├── ollama.py  # httpx streaming client
│   │   ├── translate.py  # OpenAI <-> Ollama schemas
│   │   ├── token_counter.py  # parse usage from stream
│   │   ├── discovery.py  # live model discovery from Ollama /api/tags (§4.6)
│   │   └── allowlist.py  # effective-set resolution (allow_all / allowed ∩ discovered)
│   ├── routes/
│   │   ├── __init__.py
│   │   ├── ollama_native.py
│   │   ├── openai_compat.py
│   │   └── health.py
│   ├── db/
│   │   ├── __init__.py
│   │   ├── session.py
│   │   ├── models.py  # SQLAlchemy 2.0
│   │   └── repositories.py
│   ├── audit/
│   │   ├── __init__.py
│   │   ├── writer.py  # buffered async writer
│   │   └── prompt_log.py
│   ├── observability/
│   │   ├── __init__.py
│   │   ├── logging.py  # structlog config
│   │   └── metrics.py  # prometheus
│   └── cli/
│       ├── __init__.py
│       └── manage.py  # typer: create-tenant, create-key, ...
├── tests/
│   ├── conftest.py  # testcontainers fixtures
│   ├── unit/
│   │   ├── test_hashing.py
│   │   ├── test_translate.py
│   │   ├── test_token_counter.py
│   │   ├── test_discovery.py
│   │   ├── test_allowlist.py
│   │   └── test_sliding_window.py
│   ├── integration/
│   │   ├── test_auth_flow.py
│   │   ├── test_rate_limit.py
│   │   ├── test_budget.py
│   │   ├── test_proxy_stream.py
│   │   ├── test_openai_compat.py
│   │   ├── test_revocation.py
│   │   └── mock_ollama.py  # FastAPI mock with NDJSON/SSE
│   └── load/
│       └── locustfile.py
└── docs/
    ├── ARCHITECTURE.md
    ├── DEPLOYMENT.md
    ├── API.md
    ├── THREAT_MODEL.md
    └── OPERATIONS.md  # runbook: revoke key, rotate, check usage
```
## 9. Non-Functional Requirements
- **Performance:** p50 overhead < 5 ms over direct Ollama call (auth + ratelimit + audit); p99 < 25 ms (excluding upstream latency)
- **Streaming:** Time-to-first-byte must not be degraded by gateway logic — audit write happens **after** stream close
- **Memory:** Steady-state RSS < 200 MiB per gateway worker under 100 concurrent streams
- **Concurrency:** Handle 200 concurrent connections per worker; 4 workers per instance default
- **Test coverage:** ≥ 85% line coverage on `src/neuronetz_gateway/` excluding `__main__` and CLI; 100% on `auth/`, `ratelimit/`, `budget/`
- **Security:** No `eval`, no `exec`, no shell-out, no `pickle`. Bandit clean. `pip-audit` clean on every CI run.
- **Type safety:** `mypy --strict` clean
- **Lint:** `ruff check` clean with project ruleset (E, F, I, B, UP, S, ASYNC)
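
The streaming requirement above (audit write only after stream close) can be sketched as an async generator that forwards chunks untouched and defers the audit call to a `finally` block. This is a minimal illustration, not the gateway's actual proxy code: `relay_with_deferred_audit` and the `write_audit` callback are hypothetical names, and the real writer is assumed to be the buffered async writer in `audit/writer.py`.

```python
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable


async def relay_with_deferred_audit(
    upstream: AsyncIterator[bytes],
    write_audit: Callable[[int], Awaitable[None]],  # hypothetical audit hook
) -> AsyncIterator[bytes]:
    """Yield upstream chunks as they arrive; write the audit entry afterwards."""
    bytes_sent = 0
    try:
        async for chunk in upstream:
            bytes_sent += len(chunk)
            # Forward immediately: no audit I/O sits between upstream and client.
            yield chunk
    finally:
        # Runs once the stream completes or the generator is closed.
        await write_audit(bytes_sent)
```

Because the audit call sits in `finally`, it adds nothing to time-to-first-byte. A production version would also need to decide how to handle cancellation while the audit write is in flight.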
## 10. Tooling
- Python 3.12
- `uv` for dependency management (pyproject.toml + uv.lock)
- FastAPI ≥ 0.115, uvicorn[standard], httpx ≥ 0.27, SQLAlchemy 2.0 (async), asyncpg, redis ≥ 5.0 (with hiredis), structlog, pydantic ≥ 2.9, pydantic-settings, argon2-cffi, typer, prometheus-client
- Test: pytest, pytest-asyncio, pytest-cov, testcontainers, httpx (test client), respx (mock), locust
- Lint/format: ruff, mypy --strict, bandit, pip-audit
- CI: GitHub Actions workflow (lint, type, test with coverage, build image, push on tag)
## 11. Bootstrap CLI (Typer)
```
neuronetz-gateway create-tenant --name "acme" [--rpm 60] [--tpm 100000]
neuronetz-gateway create-key --tenant acme --name "prod-server-1" [--scopes chat,embeddings]
neuronetz-gateway revoke-key --prefix nz_abc12345
neuronetz-gateway list-keys --tenant acme
neuronetz-gateway show-usage --tenant acme [--period day|month|total]
neuronetz-gateway set-budget --key nz_abc12345 --daily 1000000 --monthly 30000000
neuronetz-gateway set-models --tenant acme --models llama3.1:8b,mistral:7b
neuronetz-gateway set-models --tenant acme --allow-all # opt into allow_all_models
neuronetz-gateway set-models --tenant acme --no-allow-all # back to explicit allowlist
neuronetz-gateway list-models [--tenant acme] # show live-discovered models
# (and the tenant's effective set)
```
`create-tenant` accepts `--allow-all-models / --no-allow-all-models` (default off).
`list-models` reads the discovery cache (§4.6); with `--tenant` it also shows that tenant's
resolved effective set.
Key format: `nz_<12-char-prefix><32-char-random>`. Prefix is stored; full key is hashed (argon2id). On creation, the full key is printed exactly once.
## 12. Acceptance Criteria
The build is "done" when every box below is checked. The orchestrator must verify each before declaring v0.1.0.
- [ ] `docker compose up` from a clean checkout produces a running stack with TLS via Caddy (self-signed in dev, Let's Encrypt-ready in prod).
- [ ] CLI creates tenant and key; printed key successfully authenticates an `/api/chat` call.
- [ ] Unauthenticated request returns 401 with no Ollama details leaked.
- [ ] Request to `/api/pull` returns 403 with generic error message.
- [ ] Streaming `/api/chat` works end-to-end; the first byte arrives no later than Ollama's own TTFB plus 10 ms of gateway overhead.
- [ ] Streaming `/v1/chat/completions` returns valid SSE with `data: [DONE]` terminator.
- [ ] Token counts in audit log match Ollama's reported `prompt_eval_count` + `eval_count` exactly.
- [ ] `/api/tags` and `/v1/models` reflect the **live** Ollama model set (discovery, §4.6): an `allow_all_models` tenant sees every installed model and a newly-pulled model appears within one refresh interval; a default-deny tenant sees only `allowed_models ∩ discovered`; a request for a model outside the effective set returns a generic 403; with discovery unavailable, requests fail closed (deny), not open.
- [ ] Rate limit triggers at configured RPM with `Retry-After` header.
- [ ] Token budget is enforced: once the remaining budget reaches zero, requests are blocked with a descriptive error.
- [ ] Redis outage causes 503 (fail-closed), not 200.
- [ ] Revocation via `INSERT INTO gateway.revocations` evicts the revoked key from the Redis cache within 1 second.
- [ ] `mypy --strict`, `ruff check`, `bandit`, `pip-audit` all clean in CI.
- [ ] Test coverage ≥ 85% overall, 100% in `auth/`, `ratelimit/`, `budget/`.
- [ ] `docs/THREAT_MODEL.md`, `docs/DEPLOYMENT.md`, `docs/OPERATIONS.md` present and accurate.
- [ ] Load test (locust): 100 concurrent users sustained 5 minutes, p99 gateway overhead < 25 ms, zero 5xx outside induced failures.
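
The rate-limit criterion (trigger at the configured RPM and return `Retry-After`) can be illustrated with an in-memory sliding-window counter. This is only a sketch of the technique: the class name `SlidingWindowLimiter` is hypothetical, timestamps are passed in explicitly to keep it deterministic, and the real limiter is assumed to keep its state in Redis rather than in process memory.

```python
import math
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any `window_s`-second window."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.window_s = window_s
        self._hits: deque[float] = deque()

    def check(self, now: float) -> tuple[bool, int]:
        """Return (allowed, retry_after_seconds) for a request arriving at `now`."""
        # Drop timestamps that have left the window.
        while self._hits and now - self._hits[0] >= self.window_s:
            self._hits.popleft()
        if len(self._hits) < self.limit:
            self._hits.append(now)
            return True, 0
        # The oldest recorded hit leaves the window at hits[0] + window_s.
        retry_after = math.ceil(self._hits[0] + self.window_s - now)
        return False, max(retry_after, 1)
```

The second tuple element is what a 429 response would carry in its `Retry-After` header; rounding up avoids telling a client to retry a fraction of a second too early.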
## 13. Open Questions (decide during build)
1. Embedding cost accounting — Ollama doesn't return `eval_count` for embeddings. Decision: charge based on `prompt_eval_count` only; document as such.
2. SSE vs NDJSON heuristic for OpenAI-compat — always SSE per OpenAI spec. NDJSON only on native `/api/*`.
3. Prometheus cardinality — do not label by `key_id` (too many series); label by `tenant_id` only; per-key data lives in Postgres.
4. **Model discovery source** — the live model list is `GET /api/tags` on the Ollama backend; there is no separate registry. Cached in Redis + in-process, refreshed every `MODEL_DISCOVERY_REFRESH_S`.
5. **Discovery failure is fail-closed** — empty/expired discovered set ⇒ no model resolves ⇒ deny. Discovery never opens access on error.
6. **No existence disclosure** — a model that is installed-but-unpermitted and a model that is not installed both return the same generic response, to prevent enumeration.
7. **`allow_all_models` precedence** — key-level `allow_all_models` (when non-NULL) overrides the tenant flag; otherwise the tenant flag applies. Same NULL-inherits-tenant rule as the other key limits.
## 14. References
- Ollama API: https://github.com/ollama/ollama/blob/main/docs/api.md
- OpenAI Chat Completions: https://platform.openai.com/docs/api-reference/chat
- Nibiru (sibling console project): https://nibiru-framework.com
- Argon2 (RFC 9106): https://www.rfc-editor.org/rfc/rfc9106