neuronetz-gateway/docs/ARCHITECTURE.md
Stephan Berbig b47a09db91 demo + playground + docs
One-command demo so the gateway can be exercised end-to-end without a GPU or a
real model download:

- demo/mock-ollama/ — tiny FastAPI service emulating Ollama (/api/tags,
  /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count
  and eval_count on the final frame, /api/embed, /api/show, /api/version).
  Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml — postgres + redis + mock-ollama + gateway, with
  PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground.
  Mirrors the prod posture (mock-ollama not exposed).
- demo.sh — brings the stack up, waits on /healthz, creates a demo tenant with
  allow_all_models and a fresh API key via the bootstrap CLI inside the
  container, then prints the key, the playground URL, and five ready-to-paste
  curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull).
  ./demo.sh --down tears everything back down with volumes.
- playground/index.html — single-file dark-themed UI served same-origin by
  the gateway at /playground (CORS-free). Per-endpoint About card with method/
  auth/streaming badges, a real description, sample request body, sample
  response, and a footer note. Live SSE/NDJSON rendering of the response.
  A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh
  are visibly gated until an API key is in the field; the Base URL is
  force-pinned to location.origin three times to defeat browser autofill.
- docs/ — API.md (full endpoint reference with curl, streaming formats, error
  model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery
  + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule,
  pointing at a real Ollama backend, env reference), THREAT_MODEL.md
  (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md
  (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md.
  mkdocs.yml (Material theme) wires them together.
2026-05-26 20:52:33 +02:00


neuronetz-gateway — Architecture

Distilled from scope-docs/SPEC.md §4. The SPEC is the source of truth.

The gateway is the hot path of the Neuronetz API: a secure, multi-tenant proxy in front of an Ollama instance. The Ollama backend must never be reachable directly from the public internet — all access flows through this gateway. Administration (dashboards, tenant self-service) lives in a separate service, neuronetz-console, and is out of scope here.


Component diagram (SPEC §4.1)

                          Internet
                              │ TLS
                              ▼
                  ┌──────────────────────┐
                  │ Caddy (sidecar)      │  Let's Encrypt for api.neuronetz.ai
                  │ - TLS termination    │  HSTS, security headers
                  │ - HTTP/2, HTTP/3     │
                  └──────────┬───────────┘
                             │ HTTP/1.1 internal
                  ┌──────────▼───────────┐
                  │ neuronetz-gateway    │  FastAPI + uvicorn
                  │  - authn             │
                  │  - rate limit        │
                  │  - budget check      │
                  │  - proxy + stream    │
                  │  - token count       │
                  │  - audit write       │
                  └──┬────────┬──────┬───┘
                     │        │      │
              ┌──────▼──┐  ┌──▼───┐  │
              │Postgres │  │Redis │  │
              │ schema: │  │ keys │  │
              │ gateway │  │bucket│  │
              └─────────┘  └──────┘  │
                                     │ internal network only
                              ┌──────▼──────┐
                              │   Ollama    │
                              │ 127.0.0.1   │
                              └─────────────┘

Same Compose stack also hosts (separate from this SPEC):
  - neuronetz-console (PHP/Nibiru) → reads schema `console`, reads schema `gateway` (SELECT)

Only Caddy publishes ports. Postgres, Redis and (critically) Ollama have no published ports and are reachable only on the internal Docker network.


Database schemas (SPEC §4.2)

A single Postgres instance with two schemas:

  • gateway — owned by this service; full DDL. Tables: tenants, tenant_limits, api_keys, key_limits, budget_usage, audit_log, prompt_log, revocations (see SPEC §5 for the full DDL).
  • console — owned by neuronetz-console (out of scope). The console role gets SELECT on all gateway.* tables and INSERT on gateway.revocations only.

If the console needs to mutate gateway state (e.g. revoke a key), it does so by inserting into the gateway.revocations outbox table, which the gateway tails (see Revocation below).

Limit inheritance: limits and budgets resolve key → tenant. A NULL key-level value inherits the tenant value. For allow_all_models, a non-NULL key value overrides the tenant flag; otherwise the tenant flag applies (SPEC §13.7).
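The inheritance rule can be illustrated with a minimal sketch. The `Limits` class and its field names are hypothetical, chosen for illustration rather than taken from the SPEC §5 DDL; `None` stands in for SQL NULL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    # Hypothetical field names; None models a NULL column.
    rpm: Optional[int] = None
    monthly_token_budget: Optional[int] = None
    allow_all_models: Optional[bool] = None


def resolve_limits(key: Limits, tenant: Limits) -> Limits:
    """Resolve key -> tenant: a None (NULL) key value inherits the tenant value."""

    def pick(key_value, tenant_value):
        # Compare against None explicitly so that a key-level False or 0
        # still counts as a real override.
        return tenant_value if key_value is None else key_value

    return Limits(
        rpm=pick(key.rpm, tenant.rpm),
        monthly_token_budget=pick(key.monthly_token_budget, tenant.monthly_token_budget),
        allow_all_models=pick(key.allow_all_models, tenant.allow_all_models),
    )


tenant = Limits(rpm=600, monthly_token_budget=1_000_000, allow_all_models=True)
key = Limits(rpm=60, allow_all_models=False)  # budget left NULL, so it inherits
effective = resolve_limits(key, tenant)
```

Here the key narrows the RPM and turns off allow_all_models, while the budget falls through to the tenant value.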


Request lifecycle (SPEC §4.3)

  1. Caddy terminates TLS and forwards to the gateway on the internal port.
  2. Middleware extracts Authorization: Bearer <key>.
  3. The 12-char prefix is the Redis cache key. On miss, look up gateway.api_keys by prefix, verify the full key with argon2id, and cache resolved metadata in Redis (TTL 60 s).
  4. Rate limit check — sliding window in Redis (Lua-atomic): per-key RPM + per-tenant RPM.
  5. Budget check — Redis counter for the current period; Postgres ledger is the source of truth on reset.
  6. Concurrency semaphore — Redis INCR with TTL.
  7. Model allowlist check — resolve the effective set (see below); the request model must be in it, else a generic 403.
  8. Endpoint allowlist check — mutating endpoints are hard-blocked.
  9. Body validation — size, schema, num_predict cap.
  10. If an OpenAI-compat path, translate the request to the Ollama schema.
  11. Open an httpx async stream to Ollama.
  12. Stream the response back to the client, accumulating the final prompt_eval_count + eval_count.
  13. On stream close: write the gateway.audit_log row; decrement the budget; release the semaphore; if prompt logging is enabled, write gateway.prompt_log.
  14. On any failure: sanitized error to the client, audit row with the status code, semaphore released.
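
Step 4 could look roughly like the sketch below. It is an in-memory stand-in, not the Redis/Lua implementation: the class name, the bucket names, and the check-all-then-record policy are assumptions made for illustration. In the gateway the equivalent check-and-record would run inside a single Lua script so that it stays atomic across workers.

```python
import time
from collections import defaultdict, deque
from typing import Dict, Optional


class SlidingWindowLimiter:
    """In-memory sliding-window log; a stand-in for a Redis/Lua version."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        self._hits = defaultdict(deque)  # bucket name -> timestamps in window

    def allow(self, limits: Dict[str, int], now: Optional[float] = None) -> bool:
        """Admit the request only if every bucket is under its limit.

        `limits` maps a bucket name (e.g. "key:k1", "tenant:t1") to its
        requests-per-window cap. Nothing is recorded when any bucket denies.
        """
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_s
        for name, limit in limits.items():
            hits = self._hits[name]
            while hits and hits[0] <= cutoff:
                hits.popleft()  # drop timestamps that left the window
            if len(hits) >= limit:
                return False
        for name in limits:
            self._hits[name].append(now)
        return True


limiter = SlidingWindowLimiter(window_s=60.0)
limits = {"key:k1": 2, "tenant:t1": 3}  # hypothetical per-key and per-tenant RPM
first = limiter.allow(limits, now=0.0)
second = limiter.allow(limits, now=1.0)
third = limiter.allow(limits, now=2.0)   # per-key cap of 2 is already used up
later = limiter.allow(limits, now=61.0)  # earlier hits have aged out
```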

Streaming integrity: token counting and the audit write happen after stream close, never on the hot path — time-to-first-byte is not degraded by bookkeeping (SPEC §9).
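
A minimal sketch of that ordering, assuming Ollama's NDJSON framing in which the final frame carries `done: true` together with `prompt_eval_count` and `eval_count`. The function name and the `on_close` callback are hypothetical, and the sketch is synchronous for brevity; the gateway itself streams through an async httpx client.

```python
import json
from typing import Callable, Iterable, Iterator


def relay_ndjson(
    upstream: Iterable[bytes],
    on_close: Callable[[int, int], None],
) -> Iterator[bytes]:
    """Forward NDJSON lines unchanged; report token counts once the stream ends."""
    prompt_tokens = 0
    completion_tokens = 0
    try:
        for line in upstream:
            yield line  # hand the chunk to the client before inspecting it
            try:
                frame = json.loads(line)
            except ValueError:
                continue  # not a JSON frame; nothing to count
            if isinstance(frame, dict) and frame.get("done"):
                prompt_tokens = frame.get("prompt_eval_count", 0)
                completion_tokens = frame.get("eval_count", 0)
    finally:
        # Runs on normal completion, upstream error, or early close by the
        # consumer; audit and budget bookkeeping would hang off this callback.
        on_close(prompt_tokens, completion_tokens)


frames = [
    b'{"message":{"content":"Hi"},"done":false}\n',
    b'{"done":true,"prompt_eval_count":12,"eval_count":5}\n',
]
recorded = []
relayed = list(relay_ndjson(frames, lambda p, c: recorded.append((p, c))))
```

The only per-chunk work is a JSON parse after the chunk has already been yielded; everything that touches Postgres or Redis is deferred to `on_close`.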


Model discovery (SPEC §4.6)

The set of usable models is never hand-maintained; it is extracted live from Ollama.

  • A background task (started in the app lifespan, alongside the revocation listener) polls Ollama GET /api/tags every MODEL_DISCOVERY_REFRESH_S seconds.
  • The parsed set (names + sanitized metadata: family, parameter size, quantization, size, modified-at) is cached in Redis under gateway:models:discovered with TTL MODEL_DISCOVERY_CACHE_TTL_S, and held in-process for hot reads on the request path.
  • An initial fetch runs at startup; if Ollama is unreachable the discovered set is empty.
  • Fail-closed: an empty or expired-and-unrefreshable discovered set means no model resolves and requests are denied. Discovery never opens access on failure.
  • Auto-grant: because the effective set is always intersected with the discovered set (or equals it when allow_all_models is set), a model pulled into Ollama out-of-band becomes usable to allow_all tenants on the next refresh, with no per-tenant config change.
  • Discovery is read-only against Ollama and uses only the allowlisted /api/tags endpoint; it never triggers a model pull.
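
The fail-closed behavior of the in-process copy can be sketched as follows. `DiscoveredModels`, its injectable `clock`, and the model names are hypothetical; the `fetch` callable stands in for the poll of GET /api/tags, and the Redis layer is left out.

```python
import time
from typing import Callable, FrozenSet, Iterable, Optional


class DiscoveredModels:
    """In-process view of the discovered set with a TTL; empty when stale."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._names: FrozenSet[str] = frozenset()
        self._fetched_at: Optional[float] = None

    def refresh(self, fetch: Callable[[], Iterable[str]]) -> bool:
        """Try one poll; on failure keep the old snapshot until its TTL runs out."""
        try:
            names = frozenset(fetch())
        except Exception:
            return False  # a failed poll never widens access
        self._names = names
        self._fetched_at = self._clock()
        return True

    def current(self) -> FrozenSet[str]:
        """Return the discovered set, or an empty set if no fresh snapshot exists."""
        if self._fetched_at is None:
            return frozenset()
        if self._clock() - self._fetched_at > self._ttl_s:
            return frozenset()  # expired and not refreshed: fail closed
        return self._names


def unreachable() -> Iterable[str]:
    raise ConnectionError("ollama down")


now = [0.0]  # fake clock so the example is deterministic
cache = DiscoveredModels(ttl_s=120.0, clock=lambda: now[0])
cache.refresh(unreachable)                       # startup fetch fails
at_startup = cache.current()                     # empty: every request is denied
cache.refresh(lambda: ["llama3:8b", "mistral:7b"])
now[0] = 60.0
cache.refresh(unreachable)                       # snapshot still within TTL
within_ttl = cache.current()
now[0] = 121.0
after_ttl = cache.current()                      # expired: empty again
```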

Effective-set resolution (SPEC §4.3 step 7)

allow_all := key.allow_all_models ?? tenant.allow_all_models
effective := discovered                                          if allow_all
             (key.allowed_models ?? tenant.allowed_models) ∩ discovered   otherwise

/api/tags and /v1/models return exactly this effective set, so the listing never reveals models outside the tenant's reach. A model that is installed-but-unpermitted and one that is not installed both return the same generic 403 — no existence disclosure (SPEC §13.6).
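
The resolution rule translates almost directly into code. This is a sketch under assumptions: the function and parameter names are hypothetical, `None` models a NULL key-level column, the tenant-level values are assumed non-NULL, and `check_model` returns a bare status code purely for illustration.

```python
from typing import FrozenSet, Optional


def effective_models(
    discovered: FrozenSet[str],
    tenant_allow_all: bool,
    tenant_allowed: FrozenSet[str],
    key_allow_all: Optional[bool] = None,
    key_allowed: Optional[FrozenSet[str]] = None,
) -> FrozenSet[str]:
    """Apply the key -> tenant fallback, then intersect with the discovered set."""
    allow_all = tenant_allow_all if key_allow_all is None else key_allow_all
    if allow_all:
        return discovered
    allowed = tenant_allowed if key_allowed is None else key_allowed
    return allowed & discovered


def check_model(model: str, effective: FrozenSet[str]) -> int:
    """Return 200 if the model is usable, else the same generic 403."""
    return 200 if model in effective else 403


discovered = frozenset({"llama3:8b", "mistral:7b"})  # hypothetical model names
tenant_allowed = frozenset({"llama3:8b", "phi3:mini"})

restricted = effective_models(discovered, False, tenant_allowed)
open_tenant = effective_models(discovered, True, tenant_allowed)
key_override = effective_models(discovered, True, tenant_allowed, key_allow_all=False)

installed_but_unpermitted = check_model("mistral:7b", restricted)
not_installed = check_model("phi3:mini", restricted)
```

Both `installed_but_unpermitted` and `not_installed` come out as 403, so a caller cannot tell the two cases apart.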


Failure modes — fail-closed (SPEC §4.4)

| Subsystem | If down | Behavior |
|---|---|---|
| Postgres (read) | Key lookup fails | 503 with retry-after; nothing proxied. |
| Postgres (write) | Audit write fails | Request still succeeds; audit row buffered in-memory ring (max 1000), drained on recovery; if the buffer fills, switch to deny mode. |
| Redis | Rate limit / budget unavailable | 503; fail closed. Never "allow because we can't check." |
| Ollama | Upstream unreachable | 502 with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30 s. |
| Caddy | | Not a gateway concern. |

The governing rule (AGENT_PROMPT non-negotiable #1): if a security or budgeting check cannot be performed, deny. Never default to allow.
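
The Ollama row's circuit breaker (open after 5 consecutive failures, half-open after 30 s) can be sketched like this. The class and method names are hypothetical, and the sketch simplifies the half-open state: once the cool-down has passed it admits every request until an outcome is recorded, rather than a single probe.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; admits traffic again
    once `cooldown_s` has elapsed since it opened."""

    def __init__(
        self,
        threshold: int = 5,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: deny until the cool-down has passed, then half-open.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the breaker

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.threshold:
            # Open, or re-open after a failed half-open attempt.
            self._opened_at = self._clock()


now = [0.0]  # fake clock so the example is deterministic
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
denied_while_open = not breaker.allow_request()
now[0] = 30.0
half_open = breaker.allow_request()
```

While `allow_request()` is false the gateway would answer 502 with retry-after without contacting Ollama at all.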


Cache invalidation / key revocation (SPEC §4.5)

The console revokes a key by inserting into gateway.revocations(key_id, ts, reason). A background task in the gateway lifespan:

  • LISTENs on the Postgres channel key_revoked (the gateway emits NOTIFY on its own write path; the console's INSERT fires a trigger that emits it).
  • On notification, evicts the Redis cache entry for that key's prefix.

This makes revocation effectively immediate (bounded by NOTIFY delivery plus one Redis round trip) with no cross-service HTTP.
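
The eviction step on notification might look like the sketch below. Everything specific here is an assumption: the payload shape `{"prefix": "..."}`, the `gateway:key:<prefix>` cache-key layout, and the plain dict standing in for Redis. The LISTEN wiring itself is omitted.

```python
import json
from typing import Dict, Optional

CACHE_KEY_FMT = "gateway:key:{prefix}"  # hypothetical cache-key layout


def handle_key_revoked(payload: str, cache: Dict[str, dict]) -> Optional[str]:
    """Evict the cached metadata named by a key_revoked payload.

    Returns the evicted cache key, or None if the payload was unusable or
    nothing was cached for that prefix.
    """
    try:
        prefix = json.loads(payload)["prefix"]
    except (ValueError, KeyError, TypeError):
        return None
    cache_key = CACHE_KEY_FMT.format(prefix=prefix)
    return cache_key if cache.pop(cache_key, None) is not None else None


cache = {"gateway:key:nk_live_abcd": {"tenant": "demo"}}  # dict stand-in for Redis
evicted = handle_key_revoked('{"prefix": "nk_live_abcd"}', cache)
```

After eviction the next request carrying that key misses the cache and falls back to the Postgres lookup in step 3 of the request lifecycle.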


Observability

  • Structured logs (structlog), JSON in production. Secrets/keys are never logged.
  • Prometheus /metrics (loopback only): gateway_requests_total{tenant,model,status}, gateway_tokens_total{tenant,model,direction}, gateway_request_duration_seconds{tenant,model} (histogram). Labelled by tenant, never by key_id (cardinality — SPEC §13.3); per-key data lives in Postgres.
  • Audit log — always-on request metadata. Prompt log — opt-in per key, TTL'd.