One-command demo so the gateway can be exercised end-to-end without a GPU or a real model download:

- demo/mock-ollama/: tiny FastAPI service emulating Ollama (/api/tags, /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count and eval_count on the final frame, /api/embed, /api/show, /api/version). Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml: postgres + redis + mock-ollama + gateway, with PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground. Mirrors the prod posture (mock-ollama is not exposed).
- demo.sh: brings the stack up, waits on /healthz, creates a demo tenant with allow_all_models and a fresh API key via the bootstrap CLI inside the container, then prints the key, the playground URL, and five ready-to-paste curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull). ./demo.sh --down tears everything down again, including volumes.
- playground/index.html: single-file dark-themed UI served same-origin by the gateway at /playground (so no CORS is involved). Per-endpoint About card with method/auth/streaming badges, a real description, a sample request body, a sample response, and a footer note. Live SSE/NDJSON rendering of the response. A live, copyable curl box that mirrors exactly what Run sends. Run and Refresh are visibly gated until an API key is in the field; the Base URL is force-pinned to location.origin three times to defeat browser autofill.
- docs/: API.md (full endpoint reference with curl, streaming formats, error model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule, pointing at a real Ollama backend, env reference), THREAT_MODEL.md (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md. mkdocs.yml (Material theme) wires them together.
neuronetz-gateway — Architecture
Distilled from scope-docs/SPEC.md §4. The SPEC is the source of truth.
The gateway is the hot path of the Neuronetz API: a secure, multi-tenant proxy in front
of an Ollama instance. The Ollama backend must never be reachable directly from the public
internet — all access flows through this gateway. Administration (dashboards, tenant
self-service) lives in a separate service, neuronetz-console, and is out of scope here.
Component diagram (SPEC §4.1)
```
           Internet
               │ TLS
               ▼
    ┌──────────────────────┐
    │   Caddy (sidecar)    │   Let's Encrypt for api.neuronetz.ai
    │  - TLS termination   │   HSTS, security headers
    │  - HTTP/2, HTTP/3    │
    └──────────┬───────────┘
               │ HTTP/1.1 internal
    ┌──────────▼───────────┐
    │  neuronetz-gateway   │   FastAPI + uvicorn
    │  - authn             │
    │  - rate limit        │
    │  - budget check      │
    │  - proxy + stream    │
    │  - token count       │
    │  - audit write       │
    └──┬────────┬──────┬───┘
       │        │      │
┌──────▼──┐  ┌──▼───┐  │
│Postgres │  │Redis │  │
│ schema: │  │ keys │  │
│ gateway │  │bucket│  │
└─────────┘  └──────┘  │
                       │ internal network only
                ┌──────▼──────┐
                │   Ollama    │
                │  127.0.0.1  │
                └─────────────┘

Same Compose stack also hosts (separate from this SPEC):
  - neuronetz-console (PHP/Nibiru) → reads schema `console`, reads schema `gateway` (SELECT)
```
Only Caddy publishes ports. Postgres, Redis and (critically) Ollama have no published ports and are reachable only on the internal Docker network.
Database schemas (SPEC §4.2)
A single Postgres instance with two schemas:
- `gateway`: owned by this service; full DDL. Tables: `tenants`, `tenant_limits`, `api_keys`, `key_limits`, `budget_usage`, `audit_log`, `prompt_log`, `revocations` (see SPEC §5 for the full DDL).
- `console`: owned by `neuronetz-console` (out of scope). The console role gets `SELECT` on all `gateway.*` tables and `INSERT` on `gateway.revocations` only.
If the console needs to mutate gateway state (e.g. revoke a key), it does so by inserting
into the `gateway.revocations` outbox table, which the gateway follows via Postgres `LISTEN`/`NOTIFY` (see Cache invalidation / key revocation below).
Limit inheritance: limits and budgets resolve key → tenant. A NULL key-level value
inherits the tenant value. For allow_all_models, a non-NULL key value overrides the
tenant flag; otherwise the tenant flag applies (SPEC §13.7).
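A minimal sketch of the key → tenant inheritance rule follows. It assumes hypothetical `TenantLimits`/`KeyLimits` records in which `None` stands for SQL `NULL`. The field names (`rpm`, `token_budget`) are illustrative and are not the actual SPEC §5 columns.

```python
from dataclasses import dataclass
from typing import Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class TenantLimits:
    # Hypothetical tenant-level columns; tenant values are always set.
    rpm: int
    token_budget: int
    allow_all_models: bool


@dataclass(frozen=True)
class KeyLimits:
    # Hypothetical key-level columns; None models SQL NULL ("inherit").
    rpm: Optional[int] = None
    token_budget: Optional[int] = None
    allow_all_models: Optional[bool] = None


def inherit(key_value: Optional[T], tenant_value: T) -> T:
    """A NULL key-level value inherits the tenant value."""
    return tenant_value if key_value is None else key_value


def resolve(key: KeyLimits, tenant: TenantLimits) -> TenantLimits:
    """Resolve the effective limits for one key, key first, then tenant."""
    return TenantLimits(
        rpm=inherit(key.rpm, tenant.rpm),
        token_budget=inherit(key.token_budget, tenant.token_budget),
        # A non-NULL key flag overrides the tenant flag, even when it is False.
        allow_all_models=inherit(key.allow_all_models, tenant.allow_all_models),
    )
```

The explicit `is None` test is deliberate. A key with `allow_all_models=False` must override a tenant with `True`, and so must a key budget of `0`. A truthiness shortcut such as `key or tenant` would silently get both cases wrong.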
Request lifecycle (SPEC §4.3)
- Caddy terminates TLS and forwards to the gateway on the internal port.
- Middleware extracts `Authorization: Bearer <key>`.
- The 12-char prefix is the Redis cache key. On a miss, look up `gateway.api_keys` by prefix, verify the full key with argon2id, and cache the resolved metadata in Redis (TTL 60 s).
- Rate limit check: sliding window in Redis (Lua-atomic), per-key RPM + per-tenant RPM.
- Budget check: Redis counter for the current period; the Postgres ledger is the source of truth on reset.
- Concurrency semaphore: Redis `INCR` with TTL.
- Model allowlist check: resolve the effective set (see below); the request `model` must be in it, else a generic `403`.
- Endpoint allowlist check: mutating endpoints are hard-blocked.
- Body validation: size, schema, `num_predict` cap.
- If an OpenAI-compat path, translate the request to the Ollama schema.
- Open an httpx async stream to Ollama.
- Stream the response back to the client, accumulating the final `prompt_eval_count` + `eval_count`.
- On stream close: write the `gateway.audit_log` row; decrement the budget; release the semaphore; if prompt logging is enabled, write `gateway.prompt_log`.
- On any failure: sanitized error to the client, audit row with the status code, semaphore released.
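The rate-limit step in the lifecycle above admits a request only if both the key window and the tenant window have headroom. The sketch below is an in-memory, single-process sliding-window log. It stands in for the Redis Lua script, which is what makes the real check atomic across gateway workers. The class name, method names and bucket naming are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """In-memory sliding-window log; a stand-in for the Redis Lua script."""

    def __init__(self, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_s = window_s
        self.clock = clock
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def _count(self, bucket: str, now: float) -> int:
        # Drop timestamps that have slid out of the window, then count the rest.
        q = self.hits[bucket]
        while q and q[0] <= now - self.window_s:
            q.popleft()
        return len(q)

    def allow(self, key_id: str, key_rpm: int,
              tenant_id: str, tenant_rpm: int) -> bool:
        now = self.clock()
        key_bucket, tenant_bucket = f"key:{key_id}", f"tenant:{tenant_id}"
        # Check both limits before recording anything, so a rejected request
        # consumes neither allowance (all-or-nothing, like the Lua script).
        if self._count(key_bucket, now) >= key_rpm:
            return False
        if self._count(tenant_bucket, now) >= tenant_rpm:
            return False
        self.hits[key_bucket].append(now)
        self.hits[tenant_bucket].append(now)
        return True
```

In Redis, the same algorithm is commonly expressed over a sorted set of timestamps (trim, count, add, expire) inside one script. That way no other worker can interleave between the check and the write.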
Streaming integrity: token counting and the audit write happen after stream close, never on the hot path — time-to-first-byte is not degraded by bookkeeping (SPEC §9).
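To show how bookkeeping can stay off the hot path, here is a sketch of the native NDJSON relay. Each upstream line is forwarded unchanged. The token counts are captured from Ollama's final frame (`"done": true`), and the bookkeeping callback runs only once the stream has ended.

Assumptions: `upstream_lines` stands in for httpx's `Response.aiter_lines()`, and `on_close` is a hypothetical hook that would write the audit row, decrement the budget and release the semaphore. The OpenAI-compat SSE path is not covered here.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable, Dict

Usage = Dict[str, int]


async def relay_ndjson(
    upstream_lines: AsyncIterator[str],
    on_close: Callable[[Usage, bool], Awaitable[None]],
) -> AsyncIterator[str]:
    """Forward Ollama NDJSON lines unchanged; book usage only after close."""
    usage: Usage = {"prompt_eval_count": 0, "eval_count": 0}
    completed = False
    try:
        async for line in upstream_lines:
            if line.strip():
                frame = json.loads(line)
                if frame.get("done"):
                    # Ollama reports token counts on the final frame.
                    usage["prompt_eval_count"] = int(frame.get("prompt_eval_count", 0))
                    usage["eval_count"] = int(frame.get("eval_count", 0))
                    completed = True
            # Forward immediately; no database or Redis work happens here.
            yield line + "\n"
    finally:
        # Runs on normal close, client disconnect, or upstream/parse error.
        await on_close(usage, completed)


async def _demo() -> None:
    async def upstream():
        yield '{"message":{"content":"Hi"},"done":false}'
        yield '{"done":true,"prompt_eval_count":12,"eval_count":34}'

    async def book(usage: Usage, completed: bool) -> None:
        print("after close:", usage, completed)

    async for chunk in relay_ndjson(upstream(), book):
        print("to client:", chunk, end="")


if __name__ == "__main__":
    asyncio.run(_demo())
```

The `completed` flag lets the audit row distinguish a full response from a stream that was cut off before the final frame arrived.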
Model discovery (SPEC §4.6)
The set of usable models is never hand-maintained; it is extracted live from Ollama.
- A background task (started in the app lifespan, alongside the revocation listener) polls Ollama `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds.
- The parsed set (names + sanitized metadata: family, parameter size, quantization, size, modified-at) is cached in Redis under `gateway:models:discovered` with TTL `MODEL_DISCOVERY_CACHE_TTL_S`, and held in-process for hot reads on the request path.
- An initial fetch runs at startup; if Ollama is unreachable, the discovered set is empty.
- Fail-closed: an empty or expired-and-unrefreshable discovered set means no model resolves and requests are denied. Discovery never opens access on failure.
- Auto-grant: because the effective set intersects with `discovered` (or is `discovered` when `allow_all_models` is set), a model pulled into Ollama out-of-band becomes usable to `allow_all` tenants on the next refresh, with no per-tenant config change.
- Discovery is read-only against Ollama and uses only the allowlisted `/api/tags` endpoint; it never triggers a model pull.
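A sketch of the in-process half of discovery follows. `parse_tags` reduces an Ollama `GET /api/tags` body (`{"models": [{"name", "size", "modified_at", "details": {...}}]}`) to names plus the sanitized fields listed above. `DiscoveredModels` applies the fail-closed expiry.

Assumptions: `fetch` is a hypothetical callable that performs the HTTP call and raises on error, `ttl_s` plays the role of `MODEL_DISCOVERY_CACHE_TTL_S`, and the Redis mirror is omitted.

```python
import time
from typing import Any, Callable, Dict, FrozenSet, Optional

# Only these `details` fields survive sanitization (digest etc. are dropped).
_DETAIL_FIELDS = ("family", "parameter_size", "quantization_level")


def parse_tags(payload: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Turn an Ollama /api/tags body into {name: sanitized metadata}."""
    models: Dict[str, Dict[str, Any]] = {}
    for entry in payload.get("models", []):
        name = entry.get("name")
        if not isinstance(name, str) or not name:
            continue  # skip malformed entries rather than guessing
        details = entry.get("details") or {}
        meta = {f: details.get(f) for f in _DETAIL_FIELDS}
        meta["size"] = entry.get("size")
        meta["modified_at"] = entry.get("modified_at")
        models[name] = meta
    return models


class DiscoveredModels:
    """In-process copy of the discovered set with fail-closed expiry."""

    def __init__(self, ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_s = ttl_s
        self.clock = clock
        self._models: Dict[str, Dict[str, Any]] = {}
        self._fetched_at: Optional[float] = None

    def refresh(self, fetch: Callable[[], Dict[str, Any]]) -> None:
        try:
            parsed = parse_tags(fetch())
        except Exception:
            return  # keep the previous set; it simply ages toward expiry
        self._models, self._fetched_at = parsed, self.clock()

    def names(self) -> FrozenSet[str]:
        # Never fetched, or expired and not refreshed: nothing resolves.
        if self._fetched_at is None or self.clock() - self._fetched_at > self.ttl_s:
            return frozenset()
        return frozenset(self._models)
```

A failed refresh never widens access. At worst it keeps a stale set alive until the TTL runs out, after which `names()` returns an empty set and every model lookup is denied.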
Effective-set resolution (SPEC §4.3 step 7)
```
allow_all := key.allow_all_models ?? tenant.allow_all_models
effective := discovered                                                  if allow_all
             (key.allowed_models ?? tenant.allowed_models) ∩ discovered  otherwise
```
`/api/tags` and `/v1/models` return exactly this effective set, so the listing never reveals
models outside the tenant's reach. A request for a model that is installed but not permitted and
a request for a model that is not installed both get the same generic 403, so responses disclose nothing about which models exist (SPEC §13.6).
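The resolution rule translates fairly directly into Python, with `??` read as "first non-NULL". The sketch below uses a hypothetical `Grants` record (where `None` is SQL `NULL`) and a hypothetical `Forbidden` exception. It also shows the single-error check that avoids existence disclosure.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Grants:
    # None models SQL NULL (inherit from tenant); field names are illustrative.
    allow_all_models: Optional[bool]
    allowed_models: Optional[FrozenSet[str]]


class Forbidden(Exception):
    """Generic 403; the message never names the model or says why."""

    status = 403


def effective_set(key: Grants, tenant: Grants,
                  discovered: FrozenSet[str]) -> FrozenSet[str]:
    allow_all = (key.allow_all_models if key.allow_all_models is not None
                 else bool(tenant.allow_all_models))
    if allow_all:
        return discovered
    allowed = (key.allowed_models if key.allowed_models is not None
               else tenant.allowed_models)
    # Intersect with discovered so listings never show uninstalled models.
    return frozenset(allowed or ()) & discovered


def authorize_model(model: str, key: Grants, tenant: Grants,
                    discovered: FrozenSet[str]) -> None:
    # One membership test, one error: "not installed" and "not permitted"
    # are indistinguishable to the caller.
    if model not in effective_set(key, tenant, discovered):
        raise Forbidden("forbidden")
```

An empty `discovered` set makes every effective set empty, which is how discovery failures turn into denials without any extra code path.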
Failure modes — fail-closed (SPEC §4.4)
| Subsystem | If down | Behavior |
|---|---|---|
| Postgres (read) | Key lookup fails | 503 with retry-after; nothing proxied. |
| Postgres (write) | Audit write fails | Request still succeeds; the audit row is buffered in an in-memory ring (max 1000 rows) and drained on recovery; if the buffer fills, the gateway switches to deny mode. |
| Redis | Rate limit / budget unavailable | 503 — fail closed. Never "allow because we can't check." |
| Ollama | Upstream unreachable | 502 with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30 s. |
| Caddy | Not a gateway concern | — |
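The Ollama row describes a circuit breaker that opens after 5 consecutive failures and goes half-open after 30 s. A minimal single-process sketch follows; the class and method names are hypothetical. In this sketch the half-open state lets requests through until one result is recorded. A stricter variant would admit only a single probe.

```python
import math
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        # While open, fail fast (502 + Retry-After) without calling Ollama.
        return self.state != "open"

    def retry_after_s(self) -> int:
        # Seconds until the breaker goes half-open, for the Retry-After header.
        if self.opened_at is None:
            return 0
        remaining = self.reset_after_s - (self.clock() - self.opened_at)
        return max(0, math.ceil(remaining))

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        if self.state == "half-open":
            # The trial request failed: re-open for another full cooldown.
            self.opened_at = self.clock()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```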
The governing rule (AGENT_PROMPT non-negotiable #1): if a security or budgeting check cannot be performed, deny. Never default to allow.
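One way to make the governing rule structural rather than a per-check habit is to run every security or budget check through a single wrapper. That wrapper treats "could not check" as a denial. The sketch below is illustrative: `Denied` and `enforce` are hypothetical names, and the default 503 and retry value follow the Redis row of the table.

```python
from typing import Callable, Optional


class Denied(Exception):
    def __init__(self, status: int, retry_after_s: Optional[int] = None) -> None:
        super().__init__(f"denied with {status}")
        self.status = status
        self.retry_after_s = retry_after_s


def enforce(check: Callable[[], bool], deny_status: int,
            unavailable_status: int = 503, retry_after_s: int = 1) -> None:
    """Run one security/budget check and fail closed.

    - check() returns True      -> proceed
    - check() returns non-True  -> deny with deny_status (e.g. 429, 403)
    - check() raises            -> the check could not be performed: deny with
                                   503, never "allow because we can't check"
    """
    try:
        ok = check()
    except Exception as exc:
        raise Denied(unavailable_status, retry_after_s) from exc
    # Only a literal True passes; a buggy check returning None is a denial.
    if ok is not True:
        raise Denied(deny_status)
```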
Cache invalidation / key revocation (SPEC §4.5)
The console revokes a key by inserting into gateway.revocations(key_id, ts, reason).
A background task in the gateway lifespan:

- `LISTEN`s on the Postgres channel `key_revoked` (the gateway emits `NOTIFY` on its own write path; the console's `INSERT` fires a trigger that emits it).
- On notification, evicts the Redis cache entry for that key's prefix.
This makes revocation effectively immediate (bounded by `NOTIFY` delivery plus one Redis round trip) with no cross-service HTTP.
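A sketch of the listener wiring follows. It is written against asyncpg's `Connection.add_listener(channel, callback)`, whose callback receives `(connection, pid, channel, payload)`. Two details are assumptions for illustration only. First, the `NOTIFY` payload is taken to be the 12-char key prefix; if it carries `key_id` instead, a lookup would be needed. Second, the Redis key layout (`cache_key`) is hypothetical. The cache only needs a `delete(*names)` method, as redis-py's client provides.

```python
from typing import Any, Callable, Protocol


class KeyCache(Protocol):
    def delete(self, *names: str) -> Any: ...


CHANNEL = "key_revoked"


def cache_key(prefix: str) -> str:
    # Hypothetical Redis key layout for resolved API-key metadata.
    return f"gateway:apikey:{prefix}"


def make_revocation_handler(cache: KeyCache) -> Callable[[Any, int, str, str], None]:
    """Build an asyncpg-style listener: callback(connection, pid, channel, payload)."""

    def on_notify(connection: Any, pid: int, channel: str, payload: str) -> None:
        prefix = payload.strip()  # assumed: payload is the key prefix
        if prefix:
            cache.delete(cache_key(prefix))

    return on_notify


async def listen_for_revocations(conn: Any, cache: KeyCache) -> None:
    # conn is assumed to be an asyncpg.Connection dedicated to LISTEN.
    await conn.add_listener(CHANNEL, make_revocation_handler(cache))
```

Postgres does not queue notifications for a session that is disconnected. A listener that reconnects may therefore have missed revocations, and the 60 s cache TTL is the backstop that bounds how long a stale entry can survive.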
Observability
- Structured logs (structlog), JSON in production. Secrets/keys are never logged.
- Prometheus `/metrics` (loopback only): `gateway_requests_total{tenant,model,status}`, `gateway_tokens_total{tenant,model,direction}`, `gateway_request_duration_seconds{tenant,model}` (histogram). Labelled by `tenant`, never by `key_id` (cardinality, SPEC §13.3); per-key data lives in Postgres.
- Audit log: always-on request metadata. Prompt log: opt-in per key, TTL'd.
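"Secrets/keys are never logged" is easiest to guarantee with a structlog processor, placed ahead of the renderer, that scrubs events before they are written. structlog processors take `(logger, method_name, event_dict)` and return the event dict. The sensitive field names and the Bearer pattern below are assumptions to adapt to the gateway's actual log fields.

```python
import re
from typing import Any, MutableMapping

# Hypothetical field names that may carry credentials; extend as needed.
SENSITIVE_FIELDS = frozenset({"authorization", "api_key", "key", "password", "token"})
# Bearer tokens embedded in free-text values (e.g. an echoed header).
BEARER_RE = re.compile(r"(?i)\bbearer\s+\S+")


def redact_secrets(logger: Any, method_name: str,
                   event_dict: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
    """structlog processor: scrub credentials before the event is rendered."""
    for field, value in list(event_dict.items()):
        if field.lower() in SENSITIVE_FIELDS:
            event_dict[field] = "[REDACTED]"
        elif isinstance(value, str):
            event_dict[field] = BEARER_RE.sub("Bearer [REDACTED]", value)
    return event_dict
```

It would be registered with something like `structlog.configure(processors=[..., redact_secrets, structlog.processors.JSONRenderer()])`. Keeping it last before the renderer means it also sees fields added by earlier processors.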