scaffold: project skeleton, schema, healthz/readyz, CI
Initial project structure for neuronetz-gateway per scope-docs/SPEC.md:

- Python 3.12 / FastAPI / SQLAlchemy 2.0 (async) / Redis / Postgres stack managed by uv. Multi-stage non-root Dockerfile, prod + dev compose files (ollama service is NEVER published in either), Caddyfile + systemd unit, justfile, GitHub Actions CI (ruff, mypy --strict, pytest, bandit, pip-audit).
- Pydantic-Settings config covering every env var from SPEC §7, including the MODEL_DISCOVERY_* keys for the dynamic-discovery feature (§4.6).
- Alembic 0001_initial creates the full gateway schema (8 tables, 3 enums, notify_key_revoked() trigger), incl. allow_all_models on tenant_limits and key_limits for the per-tenant auto-grant toggle.
- Working /healthz, /readyz (fail-closed when deps unreachable), and a Prometheus /metrics stub. Sanitizing error handlers that attach X-Request-ID to every response and never leak upstream internals.
- SPEC + AGENT_PROMPT included under scope-docs/ (source of truth).
121
scope-docs/AGENT_PROMPT.md
Normal file
@@ -0,0 +1,121 @@
# Build Order: neuronetz-gateway v0.1.0

## Context

The Ollama instance at `https://api.neuronetz.ai` is currently exposed without authentication. This is a security incident in waiting. Your job is to build the gateway that closes that gap and forms the commercial API surface of the Neuronetz AI platform.

The full specification is in **`SPEC.md`** in this repository. Read it before writing any code. It is the source of truth; if anything below conflicts with it, SPEC.md wins.

## Mission

Implement `neuronetz-gateway` per SPEC.md to a state that satisfies **§12 Acceptance Criteria**. Nothing less ships.

## Non-Negotiables

These are hard constraints. Violating any of them is a build failure regardless of feature completeness.

1. **Fail closed, always.** If a security or budgeting check cannot be performed (Redis down, DB unreachable, ambiguous state), deny the request. Never default to allow.
2. **Ollama never reachable from outside the Docker internal network.** No `ports:` mapping for the ollama service in any compose file shipped with the project. Document this prominently.
3. **No secrets in code, no secrets in logs, no secrets in errors.** Argon2id for key storage. Constant-time comparison only. Keys printed exactly once at creation.
4. **No reflected upstream errors.** Ollama errors are sanitized at the gateway boundary. Map to generic 4xx/5xx with a request ID.
5. **Mutating Ollama endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`) are hard-blocked.** Not configurable. Not behind a feature flag. Blocked.
6. **Streaming integrity.** Token counting and audit writes happen **after** stream close, never on the hot path. Time-to-first-byte must not be degraded by gateway bookkeeping.
7. **`mypy --strict` and `ruff check` clean before any PR is opened.** No `# type: ignore` without an inline justification comment.
8. **Test coverage targets (§9) are a gate, not a goal.** 100% on `auth/`, `ratelimit/`, `budget/`. CI fails below threshold.
9. **Apache 2.0 license file present from commit one.** No GPL dependencies.
10. **The bootstrap CLI must work before the first manual `curl`.** No "I'll create a key by hand in the DB just to test it" — if the CLI can't create a key, fix the CLI first.

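Non-negotiable 1 can be illustrated with a small guard. This is a minimal sketch, not the gateway's implementation: the names `Decision` and `fail_closed` are hypothetical and do not come from SPEC.md. It shows the intended shape: only an explicit positive result allows a request, and every error path denies.

```python
from collections.abc import Callable
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"


def fail_closed(check: Callable[[], object]) -> Decision:
    """Run a security or budget check; anything but an explicit True denies."""
    try:
        # None, 0, or any other "ambiguous" result is treated as a denial.
        return Decision.ALLOW if check() is True else Decision.DENY
    except Exception:
        # Backend unreachable (Redis down, DB timeout, ...): deny, never allow.
        return Decision.DENY
```

A caller would map `Decision.DENY` caused by an exception to a 503 rather than a 401/403; the sketch leaves that distinction out.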
## Phasing

Five phases. Each phase has an explicit exit criterion. **Do not start phase N+1 until phase N's exit criterion is verifiably met.** PM/Control: enforce this.

### Phase 1 — Scaffold

- Repo layout per SPEC §8
- `pyproject.toml`, `uv.lock`, Dockerfile, docker-compose.yml, docker-compose.dev.yml, .env.example, README, LICENSE
- Alembic configured; migration `0001_initial.py` creates schema `gateway` and all tables per SPEC §5
- `make` or `just` targets: `dev`, `test`, `lint`, `typecheck`, `migrate`, `compose-up`, `compose-down`
- CI workflow runs: ruff, mypy, pytest, bandit, pip-audit
- **Exit criterion:** `docker compose -f docker-compose.dev.yml up` brings up postgres + redis + a stub gateway that responds 200 on `/healthz` and 503 on `/readyz` (because no Ollama yet). Migrations apply cleanly. CI is green on an empty test suite.

### Phase 2 — Core proxy + auth

- Bootstrap CLI (`create-tenant`, `create-key`, `list-keys`, `revoke-key`) working end-to-end
- Argon2id hashing module with unit tests covering: hash, verify, constant-time behavior, rehash-on-parameter-change
- Auth middleware: Bearer extraction, prefix lookup, hash verify, Redis cache with TTL
- Ollama proxy for `/api/chat` and `/api/generate` — both streamed (NDJSON) and non-streamed
- Endpoint allowlist enforced
- **Model discovery (SPEC §4.6):** background poll of Ollama `/api/tags`, cached in Redis + in-process, fail-closed when unavailable
- Model allowlist enforced per-tenant via the **effective set** (allow_all → all discovered; else `allowed_models ∩ discovered`); key-level `allow_all_models` overrides tenant
- Error handler: sanitized responses, request ID in every error
- Audit log writer (buffered, async)
- Mock Ollama in `tests/integration/mock_ollama.py` (no real model required for CI)
- **Exit criterion:** A key created via CLI can call `/api/chat` and `/api/generate` through Caddy → gateway → mock Ollama, streaming works, audit rows land in Postgres with correct token counts, `/api/pull` returns 403, no-auth returns 401, wrong-key returns 401. Model discovery populates from the (mock) Ollama `/api/tags`; `/api/tags` returns the tenant's effective set; an `allow_all_models` tenant sees all discovered models, a default-deny tenant sees only `allowed ∩ discovered`, and a non-effective model returns 403; discovery-unavailable fails closed. Integration tests cover all of the above.

### Phase 3 — Rate limit + budget + OpenAI-compat

- Sliding window rate limit (Redis Lua script) — per-key RPM, per-tenant RPM, per-key TPM
- Concurrency semaphore (Redis-backed) with TTL guard
- Token budget counters in Redis with Postgres ledger reconciliation on period rollover
- OpenAI-compatibility layer: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models` with full SSE streaming and `data: [DONE]` terminator
- Schema translation tests with golden fixtures (request in OpenAI → expected Ollama request; response from Ollama → expected OpenAI response)
- Rate-limit and budget response headers per SPEC §6.5
- **Exit criterion:** Locust test (100 concurrent users, 5 min) shows correct 429 behavior at the limit, correct token accounting, p99 gateway overhead < 25 ms. OpenAI Python SDK pointed at `/v1` successfully completes streaming chat. Killing Redis mid-test produces 503 (fail closed), not 200.

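The sliding-window limit named in Phase 3 can be sketched in plain Python to make the algorithm concrete. This is an in-memory stand-in for a single process, not the Redis Lua script the phase calls for; the class name `SlidingWindowLimiter` and its methods are hypothetical. The clock is passed in explicitly so the behaviour is easy to test.

```python
from collections import deque


class SlidingWindowLimiter:
    """In-memory sliding-window log: at most `limit` hits per `window_s` seconds."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: dict[str, deque[float]] = {}

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits.setdefault(key, deque())
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # the caller would answer 429 with Retry-After
        hits.append(now)
        return True

    def retry_after(self, key: str, now: float) -> float:
        """Seconds until the oldest hit leaves the window (0.0 if not limited)."""
        hits = self._hits.get(key)
        if not hits or len(hits) < self.limit:
            return 0.0
        return max(0.0, hits[0] + self.window_s - now)
```

In the real gateway the prune, count, and append steps would need to run atomically in Redis, which is why the phase specifies a Lua script.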
### Phase 4 — Audit, prompt log, revocation

- Prompt log (opt-in per key, TTL) with daily sweeper task
- Audit log retention sweeper (TTL per tenant config)
- Buffered audit writer with ring-buffer overflow → deny-mode behavior
- Revocation flow: console (simulated via direct INSERT in tests) writes `gateway.revocations` → NOTIFY → gateway evicts Redis cache → next request with revoked key returns 401 within 1 second
- Prometheus `/metrics` (loopback only) with: `gateway_requests_total{tenant,model,status}`, `gateway_tokens_total{tenant,model,direction}`, `gateway_request_duration_seconds{tenant,model}` (histogram)
- `/readyz` checks DB + Redis + Ollama all reachable
- Circuit breaker on Ollama failures
- **Exit criterion:** Revocation E2E test green. Prompt log retention TTL works (use freeze-time to simulate). Metrics scrape returns valid Prometheus exposition. `/readyz` flips to 503 when any dependency is down.

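The buffered audit writer with overflow into deny mode could look roughly like the following. It is a sketch under assumptions: `AuditBuffer`, `push`, and `drain` are hypothetical names, rows are plain dicts, and the `write` callable stands in for the Postgres insert. Unlike a classic ring buffer it never overwrites old rows, because the stated behaviour on overflow is to deny, not to drop.

```python
from collections import deque
from collections.abc import Callable
from typing import Any

Row = dict[str, Any]


class AuditBuffer:
    """Bounded in-memory buffer for audit rows while Postgres writes fail."""

    def __init__(self, capacity: int = 1000) -> None:
        self.capacity = capacity
        self._rows: deque[Row] = deque()

    @property
    def deny_mode(self) -> bool:
        # A full buffer means new requests would go unaudited, so refuse them.
        return len(self._rows) >= self.capacity

    def push(self, row: Row) -> bool:
        """Buffer a row; returns False (row not stored) when the buffer is full."""
        if self.deny_mode:
            return False
        self._rows.append(row)
        return True

    def drain(self, write: Callable[[Row], None]) -> int:
        """Flush rows oldest-first via `write`; stop at the first failure."""
        flushed = 0
        while self._rows:
            try:
                write(self._rows[0])
            except Exception:
                break  # still down; keep the remaining rows for the next attempt
            self._rows.popleft()
            flushed += 1
        return flushed
```

Request handling would check `deny_mode` before proxying, so the gateway stops accepting traffic once it can no longer guarantee an audit row.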
### Phase 5 — Harden, document, release

- `docs/ARCHITECTURE.md`, `docs/DEPLOYMENT.md`, `docs/API.md`, `docs/THREAT_MODEL.md`, `docs/OPERATIONS.md` complete
- Caddyfile example with Let's Encrypt for `api.neuronetz.ai` and security headers (HSTS, X-Content-Type-Options, no Server header, no X-Powered-By)
- Systemd unit file for non-Compose deployments
- Multi-stage Dockerfile with non-root user, distroless or `python:3.12-slim` final stage, no build tools in final image
- `pip-audit` and `bandit` clean in CI
- Image scan (Trivy or Grype) clean of HIGH/CRITICAL
- Tag `v0.1.0`, build and push image, GitHub release with changelog
- **Exit criterion:** Every box in SPEC §12 checked, signed off by Control. Image runnable from a fresh host with only docker + a `.env`. README quickstart works for someone who has never seen the repo.

## Agent Role Assignments

For the multi-agent orchestrator (Fritz/UI-UX/DevOps/QA/Control/Timo/PM):

| Agent | Owns |
|---|---|
| **Backend / Fritz** | All Python code under `src/neuronetz_gateway/`, Alembic migrations, CLI. Primary author. |
| **DevOps** | Dockerfile, docker-compose.yml(s), Caddyfile, systemd unit, CI workflows, image scanning, release tagging. |
| **QA** | All tests under `tests/`. Owns coverage gate. Writes the locust scenarios. Verifies acceptance criteria at each phase exit. |
| **UI-UX** | Not active this project (no UI surface here). Console project will pick this up. |
| **Control / Timo** | Enforces phase gates. Refuses to advance a phase whose exit criterion isn't met. Runs the acceptance checklist at end of Phase 5. |
| **PM** | Tracks the phase progression, opens YouTrack tickets per phase, runs daily standups against this prompt, surfaces blockers. |

## Working Agreements

- **Branch per phase.** `phase-1-scaffold`, `phase-2-proxy-auth`, etc. Merge to `main` only after phase exit criterion is verified.
- **PRs are reviewed against SPEC.md.** "Does this match the spec? If not, is SPEC.md wrong or is the PR wrong?" — that's the review question.
- **SPEC changes are explicit.** If a phase reveals a spec mistake, amend SPEC.md in a separate PR before changing the implementation. Never drift silently.
- **Commit messages reference the section.** e.g. `auth: implement argon2id verify per SPEC §5, §9`.
- **No TODOs in main.** If something is deferred, it becomes a tracked issue, not a code comment.
- **Open questions (SPEC §13) are resolved in writing.** Decision goes in SPEC.md, not in a Slack message that gets lost.

## What "Done" Looks Like

A fresh clone, a fresh host, a domain pointing at it, and a `.env` file. `docker compose up`. Five minutes later, `curl -H "Authorization: Bearer nz_..." https://api.neuronetz.ai/v1/chat/completions -d '...'` streams a response. The Ollama port is not open. The audit log has a row. The budget counter decremented. The metrics endpoint shows the request. The locust suite passes. The threat model document explains every defense.

When all of that is true and SPEC §12 is fully ticked, ship v0.1.0.

## When You Get Stuck

- **Ambiguity in the spec → ask, don't guess.** Open a question in the PM channel; if resolved, amend SPEC.md.
- **Conflict between speed and correctness → correctness wins.** This is security infrastructure. We do not ship "good enough."
- **Conflict between scope creep and v0.1.0 → defer.** New ideas go in a follow-up issue. v0.1.0 ships per spec.

Start with Phase 1. Read SPEC.md first.

593
scope-docs/SPEC.md
Normal file
@@ -0,0 +1,593 @@
# neuronetz-gateway — SPEC.md

**Project:** `neuronetz-gateway`
**Version:** 0.1.0 (target)
**Status:** Specification — not yet implemented
**License:** Apache 2.0
**Owner:** Stephan Berbig / Neuronetz

---

## 1. Purpose

A secure, multi-tenant API gateway in front of an Ollama instance currently exposed at `https://api.neuronetz.ai`. The Ollama endpoint must never be reachable directly from the public internet again. All access flows through this gateway.

The gateway is the **hot path** of the Neuronetz API. A separate service (`neuronetz-console`, built on the Nibiru PHP framework) handles administration, dashboards, and tenant self-service. This SPEC covers only the gateway.

## 2. Scope

### In scope (v0.1.0)

- Authentication via API keys (Bearer tokens)
- Multi-tenant data model (tenants → keys, with inheritance)
- Per-key and per-tenant rate limiting (RPM, TPM, concurrent)
- Per-key and per-tenant token budgets (daily, monthly, total)
- Streaming and non-streaming proxy to Ollama
- Dual API surface: native Ollama (`/api/*`) and OpenAI-compatible (`/v1/*`)
- Endpoint allowlist (block all model-mutating Ollama endpoints)
- **Dynamic model discovery** from the Ollama backend — the live set of installed models is queried, cached, and auto-refreshed; nothing about the model list is hand-maintained
- Model allowlist (per-tenant override), **default-deny, resolved against the live discovered set** (stale/typo'd entries never resolve)
- **Per-tenant `allow_all_models` toggle** — opt-in: a flagged tenant may use any currently-installed model, so models newly pulled into Ollama are auto-granted on the next discovery refresh
- Request size limits, response size limits, timeouts
- Token counting from Ollama responses (precise, not heuristic)
- Audit log (always-on metadata)
- Prompt log (opt-in per key, TTL'd retention)
- Bootstrap CLI: create tenants, keys, set budgets
- Health and readiness endpoints
- Docker Compose deployment (gateway + caddy + postgres + redis + ollama)
- Caddy as TLS terminator (Let's Encrypt for `api.neuronetz.ai`)

### Out of scope (v0.1.0, document as future)

- Web admin UI (lives in `neuronetz-console`, separate repo)
- Billing / Stripe integration (budgets only, no money yet)
- Multi-region / HA / k8s
- Content moderation / prompt-injection filtering
- Response caching
- Multi-backend routing (one Ollama; pluggable backend interface stays for later)
- Webhook notifications
- SSO / OAuth2 for admin

## 3. Threat Model (abbreviated)

| Threat | Mitigation |
|---|---|
| Internet scanners hitting Ollama directly | Ollama bound to internal Docker network; never published |
| Unauthenticated API abuse | Mandatory Bearer token; fail-closed on auth errors |
| API key brute force | Argon2id hashing; constant-time compare; rate limit on auth failures per source IP |
| GPU/token exhaustion (cost attack) | Per-key TPM + token budget; per-tenant ceiling; concurrent connection cap |
| Resource exhaustion via large payloads | Request body size limit (default 256 KiB); `num_predict` cap (default 4096) |
| Model enumeration / training-data exfil via uncommon models | Model allowlist; default-deny. `allow_all_models` is **opt-in per tenant and audited**. Discovery only ever exposes models actually installed on the backend; `/api/tags` and `/v1/models` never reveal models outside the tenant's effective set; "not allowed" and "doesn't exist" return the same generic response |
| Discovery backend unreachable | Fail-closed: an empty/stale-expired discovered set means no model resolves, so requests are denied — never "allow because we couldn't list models" |
| Ollama mutation (model pull/delete) by attacker | Endpoint allowlist; mutating endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`) hard-blocked at the gateway (see §6.2) |
| Information disclosure via error messages | Sanitize upstream errors; never proxy Ollama internals to client |
| Audit log tampering | Append-only at app layer; DB role separation; optional WAL archiving |
| Prompt data leakage | Prompt logging off by default; opt-in per key; TTL'd; redaction hook |
| Redis outage causing "fail open" | Fail-closed: if rate-limit/budget backend is unavailable, deny |
| Compromised admin token | Admin token lives in `neuronetz-console`, not in gateway; gateway has no admin endpoints |

## 4. Architecture

### 4.1 Component diagram

```
               Internet
                   │ TLS
                   ▼
        ┌──────────────────────┐
        │  Caddy (sidecar)     │  Let's Encrypt for api.neuronetz.ai
        │  - TLS termination   │  HSTS, security headers
        │  - HTTP/2, HTTP/3    │
        └──────────┬───────────┘
                   │ HTTP/1.1 internal
        ┌──────────▼───────────┐
        │  neuronetz-gateway   │  FastAPI + uvicorn
        │  - authn             │
        │  - rate limit        │
        │  - budget check      │
        │  - proxy + stream    │
        │  - token count       │
        │  - audit write       │
        └──┬────────┬──────┬───┘
           │        │      │
    ┌──────▼──┐  ┌──▼───┐  │
    │Postgres │  │Redis │  │
    │ schema: │  │ keys │  │
    │ gateway │  │bucket│  │
    └─────────┘  └──────┘  │
                           │ internal network only
                    ┌──────▼──────┐
                    │   Ollama    │
                    │  127.0.0.1  │
                    └─────────────┘

Same Compose stack also hosts (separate from this SPEC):
  - neuronetz-console (PHP/Nibiru) → reads schema `console`, reads schema `gateway` (SELECT)
```

### 4.2 Database schemas

**Single Postgres instance, two schemas:**

- `gateway` — owned by the gateway service; gateway role has full DDL
- `console` — owned by `neuronetz-console` (out of scope here); console role has full DDL
- Both services connect with their own role. Cross-schema access is explicit GRANT.

**Console role gets `SELECT` on all `gateway.*` tables.** Console writes go only to `console.*` tables. If the console needs to mutate gateway state (e.g. revoke a key), it does so by writing to a `gateway.revocations` outbox table that the gateway tails (see §4.5).

### 4.3 Request lifecycle

1. Caddy terminates TLS, forwards to gateway on internal port.
2. Gateway middleware extracts `Authorization: Bearer <key>`.
3. Key prefix (first 12 chars) used as Redis cache key. On miss, lookup `gateway.api_keys` by prefix; verify full key with argon2id `verify`; cache resolved key metadata in Redis (TTL 60s).
4. Rate limit check (sliding window in Redis, Lua-atomic) — per-key RPM + per-tenant RPM.
5. Budget check (Redis counter for current period; Postgres ledger is source of truth on reset).
6. Concurrent-connection semaphore (Redis `INCR` with TTL).
7. Model allowlist check. Resolve the **effective model set** for the key:
   `allow_all := key.allow_all_models ?? tenant.allow_all_models`;
   `effective := discovered` if `allow_all` else `(key.allowed_models ?? tenant.allowed_models) ∩ discovered`,
   where `discovered` is the cached live model set from discovery (§4.6). The request's
   `model` must be in `effective`, else a generic 403 with no disclosure of whether the
   model exists but is unpermitted vs. is not installed.
8. Endpoint allowlist check.
9. Request body validation (size, schema, `num_predict` cap).
10. If OpenAI-compat path, translate request to Ollama schema.
11. Open httpx async stream to Ollama.
12. Stream response back to client, accumulating final `prompt_eval_count` + `eval_count`.
13. On stream close: write `gateway.audit_log` row; decrement budget; release semaphore; if prompt logging enabled, write `gateway.prompt_log` row.
14. On any failure: sanitized error to client, audit row with status code, semaphore released.

### 4.4 Failure modes (fail-closed)

| Subsystem | If down | Behavior |
|---|---|---|
| Postgres (read) | Key lookup fails | 503 with retry-after; no requests proxied |
| Postgres (write) | Audit write fails | Request still succeeds, audit row buffered in-memory ring (max 1000), drained on recovery; if buffer fills, switch to deny mode |
| Redis | Rate limit / budget unavailable | 503 — fail closed. Never "allow because we can't check." |
| Ollama | Upstream unreachable | 502 with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30s |
| Caddy | Not a gateway concern | — |

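The Ollama row in the table above describes a circuit breaker that opens after 5 consecutive failures and goes half-open after 30 s. A minimal sketch of that state machine follows; the class and method names are hypothetical, and as a simplification the half-open state lets every request through rather than a single probe.

```python
class CircuitBreaker:
    """Consecutive-failure circuit breaker with a timed half-open state."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._failures = 0
        self._opened_at: float | None = None

    def state(self, now: float) -> str:
        if self._opened_at is None:
            return "closed"
        if now - self._opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def allow_request(self, now: float) -> bool:
        # Closed and half-open pass traffic; open short-circuits to a 502.
        return self.state(now) != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self, now: float) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # (Re)open; a failed half-open probe restarts the timer.
            self._opened_at = now
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the class deterministic under test.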
### 4.5 Cache invalidation (key revocation)

Console can revoke a key by inserting into `gateway.revocations(key_id, ts, reason)`. Gateway has a background task (`asyncio.create_task` in lifespan) that:

- LISTENs on Postgres channel `key_revoked` (gateway emits NOTIFY on its own write path; console emits via INSERT trigger)
- On notification, evicts the Redis cache entry for that key's prefix
- This makes revocation effectively immediate (≤ Redis RTT) without cross-service HTTP

### 4.6 Model discovery

The set of usable models is **never hand-maintained**; it is extracted live from the
Ollama backend.

- A background task (started in lifespan, like the revocation listener) polls Ollama
  `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds.
- The parsed model set (names + sanitized metadata: family, parameter size, quantization,
  size bytes, modified-at) is cached in Redis under `gateway:models:discovered` with TTL
  `MODEL_DISCOVERY_CACHE_TTL_S`, and held in-process for hot reads on the request path.
- On startup an initial fetch runs; if Ollama is unreachable the discovered set is empty.
- **Fail-closed:** if the discovered set is empty or its cache has expired and cannot be
  refreshed, no model resolves and requests are denied (consistent with default-deny).
  Discovery never opens access on failure.
- "Auto-grant": because the effective set (§4.3 step 7) intersects with `discovered` (or
  *is* `discovered` when `allow_all_models`), a model pulled into Ollama out-of-band
  becomes usable to `allow_all` tenants on the next refresh — no per-tenant config change.
- Discovery is **read-only** against Ollama and uses only the allowlisted `/api/tags`
  endpoint; it never triggers a model pull.

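The in-process half of the discovery cache described above can be sketched as follows. Names such as `DiscoveredModels` and `parse_tags` are hypothetical; the sketch keeps only model names (not the sanitized metadata), leaves out the Redis copy and the polling task, and assumes the `/api/tags` body carries a top-level `models` list whose entries have a `name` field.

```python
from typing import Any


def parse_tags(payload: dict[str, Any]) -> set[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return {m["name"] for m in payload.get("models", []) if "name" in m}


class DiscoveredModels:
    """In-process discovered set that fails closed once its TTL lapses."""

    def __init__(self, ttl_s: float = 120.0) -> None:
        self.ttl_s = ttl_s
        self._models: frozenset[str] = frozenset()
        self._fetched_at: float | None = None

    def update(self, models: set[str], now: float) -> None:
        """Record the result of a successful poll."""
        self._models = frozenset(models)
        self._fetched_at = now

    def current(self, now: float) -> frozenset[str]:
        """Return the live set, or an empty set if never fetched or expired."""
        if self._fetched_at is None or now - self._fetched_at > self.ttl_s:
            return frozenset()  # fail closed: no model resolves
        return self._models
```

A failed poll simply does not call `update`, so the old set keeps serving until the TTL runs out and then collapses to empty.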
## 5. Data Model (schema `gateway`)

```sql
CREATE SCHEMA gateway;

CREATE TYPE gateway.key_status AS ENUM ('active', 'disabled', 'revoked');
CREATE TYPE gateway.tenant_status AS ENUM ('active', 'suspended', 'closed');
CREATE TYPE gateway.budget_period AS ENUM ('day', 'month', 'total');

CREATE TABLE gateway.tenants (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    name text NOT NULL UNIQUE,
    status gateway.tenant_status NOT NULL DEFAULT 'active',
    created_at timestamptz NOT NULL DEFAULT now(),
    metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);

CREATE TABLE gateway.tenant_limits (
    tenant_id uuid PRIMARY KEY REFERENCES gateway.tenants(id) ON DELETE CASCADE,
    rpm integer NOT NULL DEFAULT 60,
    tpm integer NOT NULL DEFAULT 100000,
    concurrent integer NOT NULL DEFAULT 8,
    tokens_daily bigint,
    tokens_monthly bigint,
    tokens_total bigint,
    allowed_models text[] NOT NULL DEFAULT '{}',
    allow_all_models boolean NOT NULL DEFAULT false, -- opt-in: allow any installed model
    log_prompts_default boolean NOT NULL DEFAULT false,
    prompt_retention_days integer NOT NULL DEFAULT 30,
    audit_retention_days integer NOT NULL DEFAULT 365
);

CREATE TABLE gateway.api_keys (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id uuid NOT NULL REFERENCES gateway.tenants(id) ON DELETE CASCADE,
    prefix text NOT NULL UNIQUE, -- first 12 chars, indexed
    key_hash text NOT NULL, -- argon2id
    name text NOT NULL,
    status gateway.key_status NOT NULL DEFAULT 'active',
    scopes text[] NOT NULL DEFAULT '{chat,embeddings}',
    created_at timestamptz NOT NULL DEFAULT now(),
    last_used_at timestamptz,
    expires_at timestamptz,
    log_prompts boolean, -- NULL = inherit from tenant
    metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);

CREATE INDEX idx_api_keys_prefix ON gateway.api_keys(prefix) WHERE status = 'active';
CREATE INDEX idx_api_keys_tenant ON gateway.api_keys(tenant_id);

CREATE TABLE gateway.key_limits (
    key_id uuid PRIMARY KEY REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
    rpm integer, -- NULL = inherit tenant
    tpm integer,
    concurrent integer,
    tokens_daily bigint,
    tokens_monthly bigint,
    tokens_total bigint,
    allowed_models text[], -- NULL = inherit tenant
    allow_all_models boolean -- NULL = inherit tenant
);

CREATE TABLE gateway.budget_usage (
    key_id uuid NOT NULL REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
    period gateway.budget_period NOT NULL,
    period_start timestamptz NOT NULL,
    tokens_in bigint NOT NULL DEFAULT 0,
    tokens_out bigint NOT NULL DEFAULT 0,
    requests bigint NOT NULL DEFAULT 0,
    PRIMARY KEY (key_id, period, period_start)
);

CREATE INDEX idx_budget_usage_period ON gateway.budget_usage(period, period_start);

CREATE TABLE gateway.audit_log (
    id bigserial PRIMARY KEY,
    ts timestamptz NOT NULL DEFAULT now(),
    request_id uuid NOT NULL,
    tenant_id uuid, -- nullable for auth-failed rows
    key_id uuid,
    key_prefix text, -- denormalized for forensic queries
    method text NOT NULL,
    path text NOT NULL,
    model text,
    tokens_in integer,
    tokens_out integer,
    latency_ms integer,
    status integer NOT NULL,
    client_ip inet,
    user_agent text,
    error_code text
);

CREATE INDEX idx_audit_ts ON gateway.audit_log(ts);
CREATE INDEX idx_audit_tenant_ts ON gateway.audit_log(tenant_id, ts);
CREATE INDEX idx_audit_key_ts ON gateway.audit_log(key_id, ts);

CREATE TABLE gateway.prompt_log (
    id bigserial PRIMARY KEY,
    audit_id bigint NOT NULL REFERENCES gateway.audit_log(id) ON DELETE CASCADE,
    ts timestamptz NOT NULL DEFAULT now(),
    key_id uuid NOT NULL,
    request_body jsonb NOT NULL,
    response_text text,
    retention_until timestamptz NOT NULL
);

CREATE INDEX idx_prompt_log_retention ON gateway.prompt_log(retention_until);

CREATE TABLE gateway.revocations (
    id bigserial PRIMARY KEY,
    key_id uuid NOT NULL,
    ts timestamptz NOT NULL DEFAULT now(),
    reason text,
    processed_at timestamptz
);

-- Trigger to NOTIFY on revocation insert
CREATE OR REPLACE FUNCTION gateway.notify_key_revoked() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('key_revoked', NEW.key_id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_notify_key_revoked
    AFTER INSERT ON gateway.revocations
    FOR EACH ROW EXECUTE FUNCTION gateway.notify_key_revoked();

-- Grants for console role (created in console SPEC, referenced here)
-- GRANT USAGE ON SCHEMA gateway TO console_role;
-- GRANT SELECT ON ALL TABLES IN SCHEMA gateway TO console_role;
-- GRANT INSERT ON gateway.revocations TO console_role;
```

## 6. API Surface

### 6.1 Native Ollama passthrough (allowlisted)

| Path | Method | Notes |
|---|---|---|
| `/api/chat` | POST | Streamed (NDJSON) and non-streamed |
| `/api/generate` | POST | Streamed (NDJSON) and non-streamed |
| `/api/embeddings` | POST | Non-streamed |
| `/api/embed` | POST | Newer Ollama embeddings endpoint |
| `/api/tags` | GET | Returns the tenant's **effective** model set (live-discovered ∩ allowed, or *all* discovered when `allow_all_models`). Sourced from discovery (§4.6), never a static list |
| `/api/show` | POST | Allowed only for models in the tenant's effective set; returns sanitized model info (no system prompts, no template) |
| `/api/ps` | GET | **Blocked** — leaks loaded models |
| `/api/version` | GET | Returns gateway version, not Ollama version |

### 6.2 Hard-blocked Ollama endpoints (always 403)

`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`

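Taken together, §6.1 and §6.2 amount to a default-deny endpoint policy for the native surface. A minimal sketch, with hypothetical constant and function names, might look like this; it covers only `/api/*` paths and does exact-match comparison on an already-normalized path.

```python
# (method, path) pairs allowlisted in §6.1; /api/ps is deliberately absent.
ALLOWED_NATIVE = frozenset({
    ("POST", "/api/chat"),
    ("POST", "/api/generate"),
    ("POST", "/api/embeddings"),
    ("POST", "/api/embed"),
    ("GET", "/api/tags"),
    ("POST", "/api/show"),
    ("GET", "/api/version"),
})

# §6.2: never reachable, regardless of method or configuration.
HARD_BLOCKED = frozenset({
    "/api/pull", "/api/push", "/api/create", "/api/copy", "/api/delete",
})
HARD_BLOCKED_PREFIXES = ("/api/blobs/",)


def is_endpoint_allowed(method: str, path: str) -> bool:
    """Default-deny: only exact allowlist matches pass."""
    # Checked first so that no later allowlist edit can open these paths.
    if path in HARD_BLOCKED or path.startswith(HARD_BLOCKED_PREFIXES):
        return False
    return (method.upper(), path) in ALLOWED_NATIVE
```

Anything not listed, including `/api/ps` and unknown future Ollama routes, falls through to the final membership test and is denied.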
### 6.3 OpenAI-compatible

| Path | Method | Maps to |
|---|---|---|
| `/v1/chat/completions` | POST | `/api/chat` |
| `/v1/completions` | POST | `/api/generate` |
| `/v1/embeddings` | POST | `/api/embed` |
| `/v1/models` | GET | `/api/tags` (the tenant's effective discovered set), in OpenAI model-list format |

Translation must preserve streaming. SSE (`data: {...}\n\n`) for OpenAI-compat; NDJSON for native.

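The streaming translation for `/v1/chat/completions` can be sketched as a generator that turns Ollama's NDJSON chat chunks into SSE events. This is an illustration, not the gateway's translator: `ollama_chat_to_sse` and its parameters are hypothetical names, the chunk fields follow the OpenAI `chat.completion.chunk` shape as an assumption, and the finish reason is simplified to `"stop"`.

```python
import json
from collections.abc import Iterable, Iterator


def ollama_chat_to_sse(
    ndjson_lines: Iterable[str], completion_id: str, created: int
) -> Iterator[str]:
    """Translate Ollama /api/chat NDJSON lines into OpenAI-style SSE events."""
    for line in ndjson_lines:
        if not line.strip():
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        done = bool(chunk.get("done"))
        content = chunk.get("message", {}).get("content", "")
        event = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": chunk.get("model"),
            "choices": [{
                "index": 0,
                # The final chunk usually has no text; send an empty delta then.
                "delta": {} if done and not content else {"content": content},
                "finish_reason": "stop" if done else None,
            }],
        }
        yield f"data: {json.dumps(event)}\n\n"
        if done:
            break
    # Terminator that OpenAI-compatible clients wait for.
    yield "data: [DONE]\n\n"
```

Because it is a generator, each event is emitted as soon as its source line arrives, so the translation adds no buffering to the stream.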
### 6.4 Gateway endpoints

| Path | Method | Auth | Purpose |
|---|---|---|---|
| `/healthz` | GET | none | Liveness — process responsive |
| `/readyz` | GET | none | Readiness — DB + Redis + Ollama all reachable |
| `/metrics` | GET | none (loopback only) | Prometheus exposition (counters, histograms) |

No admin endpoints. Admin lives in `neuronetz-console`.

### 6.5 Response headers

Every proxied response carries:

- `X-Request-ID: <uuid>`
- `X-RateLimit-Limit-Requests: <n>`
- `X-RateLimit-Remaining-Requests: <n>`
- `X-RateLimit-Limit-Tokens: <n>`
- `X-RateLimit-Remaining-Tokens: <n>`
- `X-Budget-Period: day|month|total`
- `X-Budget-Tokens-Remaining: <n>`

429 responses additionally carry `Retry-After: <seconds>`.

## 7. Configuration

All via environment variables, validated by Pydantic Settings on boot. Boot fails loudly on invalid config.

```
# Service
GATEWAY_BIND_HOST=0.0.0.0
GATEWAY_BIND_PORT=8080
GATEWAY_LOG_LEVEL=INFO
GATEWAY_LOG_FORMAT=json  # json|console
GATEWAY_REQUEST_ID_HEADER=X-Request-ID
GATEWAY_TRUSTED_PROXIES=127.0.0.1,caddy  # for X-Forwarded-For

# Upstream
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_CONNECT_TIMEOUT_S=5
OLLAMA_READ_TIMEOUT_S=600
OLLAMA_MAX_CONNECTIONS=64

# Model discovery (§4.6)
MODEL_DISCOVERY_REFRESH_S=60  # how often to re-query Ollama /api/tags
MODEL_DISCOVERY_CACHE_TTL_S=120  # Redis cache TTL for the discovered model set

# Database
DATABASE_URL=postgresql+asyncpg://gateway:...@postgres:5432/neuronetz
DATABASE_POOL_SIZE=10
DATABASE_POOL_OVERFLOW=20

# Redis
REDIS_URL=redis://redis:6379/0
REDIS_KEY_CACHE_TTL_S=60

# Limits (defaults; per-tenant/key overrides in DB)
DEFAULT_RPM=60
DEFAULT_TPM=100000
DEFAULT_CONCURRENT=8
MAX_REQUEST_BODY_BYTES=262144
MAX_NUM_PREDICT=4096

# Security
ARGON2_TIME_COST=3
ARGON2_MEMORY_COST_KIB=65536
ARGON2_PARALLELISM=4
AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN=20

# Audit
AUDIT_BUFFER_SIZE=1000
PROMPT_LOG_DEFAULT_RETENTION_DAYS=30
AUDIT_LOG_DEFAULT_RETENTION_DAYS=365
```

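The "fail loudly on invalid config" contract can be illustrated without any dependencies. The sketch below is stdlib-only and is not the Pydantic Settings class the spec calls for; `ConfigError`, `LimitDefaults` and `load_limit_defaults` are hypothetical names, and it covers only the limit defaults listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


class ConfigError(ValueError):
    """Raised at boot when an environment variable is invalid."""


def _int_env(env: Mapping[str, str], name: str, default: int, minimum: int = 1) -> int:
    """Read an integer variable, falling back to the documented default."""
    raw = env.get(name, str(default))
    try:
        value = int(raw)
    except ValueError as exc:
        raise ConfigError(f"{name}={raw!r} is not an integer") from exc
    if value < minimum:
        raise ConfigError(f"{name}={value} must be >= {minimum}")
    return value


@dataclass(frozen=True)
class LimitDefaults:
    rpm: int
    tpm: int
    concurrent: int
    max_request_body_bytes: int
    max_num_predict: int


def load_limit_defaults(env: Optional[Mapping[str, str]] = None) -> LimitDefaults:
    """Validate the limit defaults once, before the app starts serving."""
    env = os.environ if env is None else env
    return LimitDefaults(
        rpm=_int_env(env, "DEFAULT_RPM", 60),
        tpm=_int_env(env, "DEFAULT_TPM", 100000),
        concurrent=_int_env(env, "DEFAULT_CONCURRENT", 8),
        max_request_body_bytes=_int_env(env, "MAX_REQUEST_BODY_BYTES", 262144),
        max_num_predict=_int_env(env, "MAX_NUM_PREDICT", 4096),
    )
```

Calling `load_limit_defaults()` during startup means a typo such as `DEFAULT_RPM=sixty` stops the process with a named variable in the error, instead of surfacing later as a rate-limit bug.
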
## 8. Repository Layout

```
neuronetz-gateway/
├── pyproject.toml  # uv-managed, ruff, mypy --strict, pytest
├── README.md
├── LICENSE  # Apache 2.0
├── docker-compose.yml  # full stack incl. console placeholder
├── docker-compose.dev.yml  # without caddy, gateway exposed on localhost
├── Dockerfile  # multi-stage, python:3.12-slim base
├── .env.example
├── .dockerignore
├── .gitignore
├── alembic.ini
├── alembic/
│   ├── env.py
│   └── versions/
│       └── 0001_initial.py  # creates schema `gateway` and all tables
├── ops/
│   ├── caddy/
│   │   └── Caddyfile.example
│   └── systemd/
│       └── neuronetz-gateway.service
├── src/neuronetz_gateway/
│   ├── __init__.py
│   ├── __main__.py  # uvicorn entry
│   ├── app.py  # FastAPI factory
│   ├── config.py  # Pydantic Settings
│   ├── deps.py  # DI providers
│   ├── lifespan.py  # startup/shutdown, NOTIFY listener
│   ├── errors.py  # exception types, handlers, sanitization
│   ├── auth/
│   │   ├── __init__.py
│   │   ├── hashing.py  # argon2id wrapper
│   │   ├── keys.py  # key generation, prefix, verify
│   │   └── middleware.py
│   ├── ratelimit/
│   │   ├── __init__.py
│   │   ├── sliding_window.py  # Redis Lua script
│   │   └── concurrency.py  # semaphore via Redis
│   ├── budget/
│   │   ├── __init__.py
│   │   ├── counter.py  # Redis period counters
│   │   └── ledger.py  # Postgres reconciliation
│   ├── proxy/
│   │   ├── __init__.py
│   │   ├── ollama.py  # httpx streaming client
│   │   ├── translate.py  # OpenAI <-> Ollama schemas
│   │   ├── token_counter.py  # parse usage from stream
│   │   ├── discovery.py  # live model discovery from Ollama /api/tags (§4.6)
│   │   └── allowlist.py  # effective-set resolution (allow_all / allowed ∩ discovered)
│   ├── routes/
│   │   ├── __init__.py
│   │   ├── ollama_native.py
│   │   ├── openai_compat.py
│   │   └── health.py
│   ├── db/
│   │   ├── __init__.py
│   │   ├── session.py
│   │   ├── models.py  # SQLAlchemy 2.0
│   │   └── repositories.py
│   ├── audit/
│   │   ├── __init__.py
│   │   ├── writer.py  # buffered async writer
│   │   └── prompt_log.py
│   ├── observability/
│   │   ├── __init__.py
│   │   ├── logging.py  # structlog config
│   │   └── metrics.py  # prometheus
│   └── cli/
│       ├── __init__.py
│       └── manage.py  # typer: create-tenant, create-key, ...
├── tests/
│   ├── conftest.py  # testcontainers fixtures
│   ├── unit/
│   │   ├── test_hashing.py
│   │   ├── test_translate.py
│   │   ├── test_token_counter.py
│   │   ├── test_discovery.py
│   │   ├── test_allowlist.py
│   │   └── test_sliding_window.py
│   ├── integration/
│   │   ├── test_auth_flow.py
│   │   ├── test_rate_limit.py
│   │   ├── test_budget.py
│   │   ├── test_proxy_stream.py
│   │   ├── test_openai_compat.py
│   │   ├── test_revocation.py
│   │   └── mock_ollama.py  # FastAPI mock with NDJSON/SSE
│   └── load/
│       └── locustfile.py
└── docs/
    ├── ARCHITECTURE.md
    ├── DEPLOYMENT.md
    ├── API.md
    ├── THREAT_MODEL.md
    └── OPERATIONS.md  # runbook: revoke key, rotate, check usage
```

## 9. Non-Functional Requirements

- **Performance:** p50 overhead < 5 ms over direct Ollama call (auth + ratelimit + audit); p99 < 25 ms (excluding upstream latency)
- **Streaming:** Time-to-first-byte must not be degraded by gateway logic — audit write happens **after** stream close
- **Memory:** Steady-state RSS < 200 MiB per gateway worker under 100 concurrent streams
- **Concurrency:** Handle 200 concurrent connections per worker; 4 workers per instance default
- **Test coverage:** ≥ 85% line coverage on `src/neuronetz_gateway/` excluding `__main__` and CLI; 100% on `auth/`, `ratelimit/`, `budget/`
- **Security:** No `eval`, no `exec`, no shell-out, no `pickle`. Bandit clean. `pip-audit` clean on every CI run.
- **Type safety:** `mypy --strict` clean
- **Lint:** `ruff check` clean with project ruleset (E, F, I, B, UP, S, ASYNC)

## 10. Tooling

- Python 3.12
- `uv` for dependency management (pyproject.toml + uv.lock)
- FastAPI ≥ 0.115, uvicorn[standard], httpx ≥ 0.27, SQLAlchemy 2.0 (async), asyncpg, redis ≥ 5.0 (with hiredis), structlog, pydantic ≥ 2.9, pydantic-settings, argon2-cffi, typer, prometheus-client
- Test: pytest, pytest-asyncio, pytest-cov, testcontainers, httpx (test client), respx (mock), locust
- Lint/format: ruff, mypy --strict, bandit, pip-audit
- CI: GitHub Actions workflow (lint, type, test with coverage, build image, push on tag)

## 11. Bootstrap CLI (Typer)

```
neuronetz-gateway create-tenant --name "acme" [--rpm 60] [--tpm 100000]
neuronetz-gateway create-key --tenant acme --name "prod-server-1" [--scopes chat,embeddings]
neuronetz-gateway revoke-key --prefix nz_abc12345
neuronetz-gateway list-keys --tenant acme
neuronetz-gateway show-usage --tenant acme [--period day|month|total]
neuronetz-gateway set-budget --key nz_abc12345 --daily 1000000 --monthly 30000000
neuronetz-gateway set-models --tenant acme --models llama3.1:8b,mistral:7b
neuronetz-gateway set-models --tenant acme --allow-all     # opt into allow_all_models
neuronetz-gateway set-models --tenant acme --no-allow-all  # back to explicit allowlist
neuronetz-gateway list-models [--tenant acme]              # show live-discovered models
                                                           # (and the tenant's effective set)
```

`create-tenant` accepts `--allow-all-models / --no-allow-all-models` (default off).
`list-models` reads the discovery cache (§4.6); with `--tenant` it also shows that tenant's
resolved effective set.

Key format: `nz_<12-char-prefix><32-char-random>`. Prefix is stored; full key is hashed (argon2id). On creation, the full key is printed exactly once.

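The key format above can be sketched with the standard library. This is an illustration under stated assumptions, not the gateway's `auth/keys.py`: the function names are hypothetical, the 12-character prefix is assumed to be counted after `nz_`, and the lowercase-alphanumeric alphabet is an assumption. Hashing the full key with argon2id is left out because it needs `argon2-cffi`.

```python
import secrets
import string

KEY_SCHEME = "nz_"
PREFIX_LEN = 12  # assumption: counted after the "nz_" scheme tag
SECRET_LEN = 32
_ALPHABET = string.ascii_lowercase + string.digits  # assumption: not fixed by the spec


def _random_chars(length: int) -> str:
    """Draw characters from a CSPRNG; never use the `random` module for keys."""
    return "".join(secrets.choice(_ALPHABET) for _ in range(length))


def generate_key() -> tuple[str, str]:
    """Return (stored_prefix, full_key); the caller prints full_key exactly once."""
    prefix = KEY_SCHEME + _random_chars(PREFIX_LEN)
    return prefix, prefix + _random_chars(SECRET_LEN)


def split_prefix(full_key: str) -> str:
    """Extract the lookup prefix from a presented key, rejecting malformed input."""
    expected_len = len(KEY_SCHEME) + PREFIX_LEN + SECRET_LEN
    if not full_key.startswith(KEY_SCHEME) or len(full_key) != expected_len:
        raise ValueError("malformed API key")
    return full_key[: len(KEY_SCHEME) + PREFIX_LEN]
```

Storing only the prefix in plain text lets the gateway find the candidate row with an indexed lookup and then run a single argon2id verification, rather than hashing against every stored key.
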
## 12. Acceptance Criteria

The build is "done" when every box below is checked. The orchestrator must verify each before declaring v0.1.0.

- [ ] `docker compose up` from a clean checkout produces a running stack with TLS via Caddy (self-signed in dev, Let's Encrypt-ready in prod).
- [ ] CLI creates tenant and key; printed key successfully authenticates an `/api/chat` call.
- [ ] Unauthenticated request returns 401 with no Ollama details leaked.
- [ ] Request to `/api/pull` returns 403 with generic error message.
- [ ] Streaming `/api/chat` works end-to-end; the first byte arrives no later than Ollama's own TTFB plus 10 ms of gateway overhead.
- [ ] Streaming `/v1/chat/completions` returns valid SSE with `data: [DONE]` terminator.
- [ ] Token counts in audit log match Ollama's reported `prompt_eval_count` + `eval_count` exactly.
- [ ] `/api/tags` and `/v1/models` reflect the **live** Ollama model set (discovery, §4.6): an `allow_all_models` tenant sees every installed model and a newly-pulled model appears within one refresh interval; a default-deny tenant sees only `allowed_models ∩ discovered`; a request for a model outside the effective set returns a generic 403; with discovery unavailable, requests fail closed (deny), not open.
- [ ] Rate limit triggers at configured RPM with `Retry-After` header.
- [ ] Token budget is enforced: requests are blocked at zero remaining with a descriptive error.
- [ ] Redis outage causes 503 (fail-closed), not 200.
- [ ] Revocation via `INSERT INTO gateway.revocations` evicts Redis cache within 1 second.
- [ ] `mypy --strict`, `ruff check`, `bandit`, `pip-audit` all clean in CI.
- [ ] Test coverage ≥ 85% overall, 100% in `auth/`, `ratelimit/`, `budget/`.
- [ ] `docs/THREAT_MODEL.md`, `docs/DEPLOYMENT.md`, `docs/OPERATIONS.md` present and accurate.
- [ ] Load test (locust): 100 concurrent users sustained for 5 minutes, p99 gateway overhead < 25 ms, zero 5xx outside induced failures.

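The model-visibility criterion above reduces to a small pure function. The following is a hedged sketch, not the contents of `proxy/allowlist.py`: the function names are hypothetical, `allowed` is assumed to be the already-resolved `allowed_models` set for the caller, and the key-over-tenant precedence follows §13 item 7.

```python
from typing import AbstractSet, Optional


def resolve_allow_all(tenant_flag: bool, key_flag: Optional[bool]) -> bool:
    """A non-NULL key-level flag overrides the tenant flag; NULL inherits it."""
    return tenant_flag if key_flag is None else key_flag


def effective_models(
    discovered: Optional[AbstractSet[str]],
    allowed: AbstractSet[str],
    tenant_allow_all: bool,
    key_allow_all: Optional[bool] = None,
) -> frozenset[str]:
    """Return the models a caller may see and use right now."""
    if not discovered:
        # Discovery unavailable, expired, or empty: fail closed, never open.
        return frozenset()
    if resolve_allow_all(tenant_allow_all, key_allow_all):
        return frozenset(discovered)
    return frozenset(allowed) & frozenset(discovered)


def is_permitted(model: str, effective: AbstractSet[str]) -> bool:
    """One check for both 'not installed' and 'not permitted' (same generic 403)."""
    return model in effective
```

Because `is_permitted` only consults the effective set, an installed-but-unpermitted model and a model that does not exist take the same branch, which is what keeps the 403 from disclosing which models are installed.
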
## 13. Open Questions (decide during build)

1. Embedding cost accounting — Ollama doesn't return `eval_count` for embeddings. Decision: charge based on `prompt_eval_count` only; document as such.
2. SSE vs NDJSON heuristic for OpenAI-compat — always SSE per OpenAI spec. NDJSON only on native `/api/*`.
3. Prometheus cardinality — do not label by `key_id` (too many series); label by `tenant_id` only; per-key data lives in Postgres.
4. **Model discovery source** — the live model list is `GET /api/tags` on the Ollama backend; there is no separate registry. Cached in Redis + in-process, refreshed every `MODEL_DISCOVERY_REFRESH_S`.
5. **Discovery failure is fail-closed** — empty/expired discovered set ⇒ no model resolves ⇒ deny. Discovery never opens access on error.
6. **No existence disclosure** — a model that is installed-but-unpermitted and a model that is not installed both return the same generic response, to prevent enumeration.
7. **`allow_all_models` precedence** — key-level `allow_all_models` (when non-NULL) overrides the tenant flag; otherwise the tenant flag applies. Same NULL-inherits-tenant rule as the other key limits.

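Item 1 and the token-count acceptance criterion can be sketched as two small parsers. This is an illustrative sketch with hypothetical function names, assuming the native NDJSON stream where Ollama marks the final chunk with `done: true`; treating a missing counter as 0 is an assumption, not a documented decision.

```python
import json
from typing import Any, Iterable, Mapping


def chat_usage(ndjson_lines: Iterable[str]) -> tuple[int, int]:
    """Return (prompt_tokens, completion_tokens) from the final stream chunk."""
    prompt = completion = 0
    for line in ndjson_lines:
        if not line.strip():
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        if chunk.get("done"):
            # Only the closing chunk carries the usage counters.
            prompt = int(chunk.get("prompt_eval_count", 0))
            completion = int(chunk.get("eval_count", 0))
    return prompt, completion


def embedding_usage(body: Mapping[str, Any]) -> int:
    """Embeddings have no eval_count, so only prompt_eval_count is charged."""
    return int(body.get("prompt_eval_count", 0))


stream = [
    '{"message": {"content": "Hi"}, "done": false}',
    '{"done": true, "prompt_eval_count": 26, "eval_count": 298}',
]
prompt_tokens, completion_tokens = chat_usage(stream)
total_charged = prompt_tokens + completion_tokens
```

Reading the counters only after the stream closes matches the streaming requirement in §9: nothing in the accounting path sits between the upstream bytes and the client.
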
## 14. References

- Ollama API: https://github.com/ollama/ollama/blob/main/docs/api.md
- OpenAI Chat Completions: https://platform.openai.com/docs/api-reference/chat
- Nibiru (sibling console project): https://nibiru-framework.com
- Argon2 RFC 9106