
Stephan Berbig b47a09db91 demo + playground + docs

One-command demo so the gateway can be exercised end-to-end without a GPU or a
real model download:

- demo/mock-ollama/ — tiny FastAPI service emulating Ollama (/api/tags,
  /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count
  and eval_count on the final frame, /api/embed, /api/show, /api/version).
  Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml — postgres + redis + mock-ollama + gateway, with
  PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground.
  Mirrors the prod posture (mock-ollama not exposed).
- demo.sh — brings the stack up, waits on /healthz, creates a demo tenant with
  allow_all_models and a fresh API key via the bootstrap CLI inside the
  container, then prints the key, the playground URL, and five ready-to-paste
  curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull).
  ./demo.sh --down tears everything back down with volumes.
- playground/index.html — single-file dark-themed UI served same-origin by
  the gateway at /playground (CORS-free). Per-endpoint About card with method/
  auth/streaming badges, a real description, sample request body, sample
  response, and a footer note. Live SSE/NDJSON rendering of the response.
  A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh
  are visibly gated until an API key is in the field; the Base URL is
  force-pinned to location.origin three times to defeat browser autofill.
- docs/ — API.md (full endpoint reference with curl, streaming formats, error
  model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery
  + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule,
  pointing at a real Ollama backend, env reference), THREAT_MODEL.md
  (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md
  (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md.
  mkdocs.yml (Material theme) wires them together.

2026-05-26 20:52:33 +02:00

4.6 KiB


neuronetz-gateway — Threat Model

From scope-docs/SPEC.md §3. The governing principle, in one line:

Fail closed, always. If a security or budgeting check cannot be performed (Redis down, DB unreachable, ambiguous state), deny the request. Never default to allow. (AGENT_PROMPT non-negotiable #1.)
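As a rough illustration of the principle above, the sketch below wraps an arbitrary check so that any failure to evaluate it becomes a denial. The function name `fail_closed` and the reason strings are hypothetical and do not come from the gateway's code.

```python
from typing import Callable, Tuple


def fail_closed(check: Callable[[], object]) -> Tuple[bool, str]:
    """Run a security or budget check; if it cannot be evaluated, deny.

    Returns (allowed, reason). Both the name and the reason strings are
    illustrative assumptions, not the gateway's real interface.
    """
    try:
        result = check()
    except Exception:
        # Redis down, DB unreachable, timeout, ambiguous state: deny.
        return False, "check_unavailable"
    # Only an explicit True allows; None or any other value is a denial.
    if result is True:
        return True, "ok"
    return False, "denied"
```

The design choice worth noting is that the allow branch is the narrowest one: an exception, a `None`, or any value other than `True` all fall through to a denial.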

The gateway exists because the Ollama instance at api.neuronetz.ai was exposed without authentication — a standing security incident. Every defense below traces back to closing that gap and keeping it closed.

Threats & mitigations (SPEC §3)

Threat	Mitigation
Internet scanners hitting Ollama directly	Ollama bound to the internal Docker network; never published. No `ports:` mapping in any shipped compose file.
Unauthenticated API abuse	Mandatory Bearer token; fail-closed on auth errors (401).
API key brute force	Argon2id hashing; constant-time compare; rate limit on auth failures per source IP (`AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN`).
GPU/token exhaustion (cost attack)	Per-key TPM + token budget; per-tenant ceiling; concurrent-connection cap.
Resource exhaustion via large payloads	Request body size limit (default 256 KiB); `num_predict` cap (default 4096).
Model enumeration / training-data exfil via uncommon models	Model allowlist, default-deny. Discovery only exposes models actually installed; `/api/tags` and `/v1/models` never reveal models outside the tenant's effective set; "not allowed" and "doesn't exist" return the same generic response.
Discovery backend unreachable	Fail-closed: an empty/stale-expired discovered set means no model resolves, so requests are denied — never "allow because we couldn't list models."
Ollama mutation (model pull/delete) by attacker	Endpoint allowlist; mutating endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/`) hard-blocked at the gateway, not configurable.
Information disclosure via error messages	Upstream errors sanitized at the boundary; Ollama internals never proxied to the client. Each error carries an `X-Request-ID` for correlation.
Audit log tampering	Append-only at the app layer; DB role separation; optional WAL archiving.
Prompt data leakage	Prompt logging off by default; opt-in per key; TTL'd retention; redaction hook.
Redis outage causing "fail open"	Fail-closed: if the rate-limit/budget backend is unavailable, deny (503), not allow.
Compromised admin token	There is no admin endpoint in the gateway. Admin lives in `neuronetz-console`; the gateway has nothing to compromise here.
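
The endpoint allowlist row in the table above can be sketched as a default-deny path check. The blocked prefixes are the ones listed in the table. The contents of `ALLOWED_PATHS` are an assumption assembled from endpoints mentioned elsewhere in this repository, and the function name is hypothetical.

```python
# Mutating Ollama endpoints named in the threat table; hard-coded on purpose.
BLOCKED_PREFIXES = (
    "/api/pull", "/api/push", "/api/create",
    "/api/copy", "/api/delete", "/api/blobs",
)

# Assumed allowlist for illustration; the gateway's real list may differ.
ALLOWED_PATHS = frozenset({
    "/api/tags", "/api/chat", "/api/generate",
    "/api/embed", "/api/show", "/api/version", "/v1/models",
})


def is_endpoint_allowed(path: str) -> bool:
    """Default-deny: a path passes only if it is explicitly allowlisted."""
    normalized = path.split("?", 1)[0].rstrip("/") or "/"
    for prefix in BLOCKED_PREFIXES:
        # Match the endpoint itself and anything nested under it.
        if normalized == prefix or normalized.startswith(prefix + "/"):
            return False
    return normalized in ALLOWED_PATHS
```

The blocklist is redundant with the allowlist in this sketch. It is kept as a second, non-configurable layer so that a later change to the allowlist cannot re-enable a mutating endpoint.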

Notes on selected defenses

`allow_all_models` is an audited opt-in

allow_all_models lets a tenant use any currently-installed model, so models newly pulled into Ollama are auto-granted on the next discovery refresh. This is convenient but widens the attack surface for that tenant, so it is:

- opt-in per tenant (default false), set explicitly via the CLI (`create-tenant --allow-all-models` or `set-models --allow-all`);
- overridable per key: a non-NULL key-level `allow_all_models` overrides the tenant flag; otherwise the tenant flag applies (SPEC §13.7);
- audited: every request records the model used in `gateway.audit_log`.

Default-deny tenants instead see only allowed_models ∩ discovered. Either way the effective set is always intersected with the live discovered set, so stale or typo'd allowlist entries never resolve.
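
The resolution rule described above can be sketched as a small pure function. The function and parameter names are hypothetical; only the precedence (key-level flag over tenant flag) and the intersection with the discovered set come from the text.

```python
from typing import Optional, Set


def effective_models(
    tenant_allow_all: bool,
    key_allow_all: Optional[bool],
    allowed_models: Set[str],
    discovered: Set[str],
) -> Set[str]:
    """Compute the set of models a request may resolve against."""
    # A non-NULL key-level flag overrides the tenant flag.
    allow_all = tenant_allow_all if key_allow_all is None else key_allow_all
    if allow_all:
        # "Any currently-installed model": exactly the discovered set.
        return set(discovered)
    # Default-deny: stale or mistyped allowlist entries drop out here.
    return allowed_models & discovered
```

With an empty discovered set (for example, when the discovery backend is unreachable), both branches return an empty set, so no model resolves and the request is denied.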

No existence disclosure

A model that is installed-but-unpermitted and a model that is not installed both return the same generic 403. An attacker cannot use the gateway to enumerate which models exist on the backend (SPEC §13.6).
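
A minimal sketch of why the two cases are indistinguishable: both fall out of the same membership test against the effective set, so there is only one denial path. The function name and the response body shape are assumptions for illustration. A real response would also carry a per-request ID, which is omitted here.

```python
from typing import Dict, Set, Tuple


def resolve_model(requested: str, effective: Set[str]) -> Tuple[int, Dict[str, str]]:
    """Return (status, body) for a model lookup against the effective set."""
    if requested in effective:
        return 200, {"model": requested}
    # Single denial path: the caller cannot tell "installed but not
    # permitted" apart from "not installed". The requested name is not echoed.
    return 403, {"error": "model not available"}
```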

Sanitized errors + request IDs

Clients never receive Ollama's error text, stack traces, or internal hostnames. Errors are mapped to generic 4xx/5xx JSON with a request_id. Operators correlate that ID with the audit log to investigate without leaking internals to callers (SPEC §4.3 step 14).
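
One way to picture the boundary is a function that takes the upstream status and body and returns only generic fields. The status mapping (upstream 4xx to 400, everything else to 502), the message strings, and the function name are assumptions, not the gateway's documented behavior.

```python
import uuid
from typing import Dict, Optional, Tuple


def sanitize_upstream_error(
    upstream_status: int,
    upstream_body: str,
    request_id: Optional[str] = None,
) -> Tuple[int, Dict[str, str], Dict[str, str]]:
    """Map an upstream failure to (status, headers, body) safe for clients."""
    rid = request_id or uuid.uuid4().hex
    # Assumed mapping: collapse upstream statuses into two generic classes.
    if 400 <= upstream_status < 500:
        status, message = 400, "bad request"
    else:
        status, message = 502, "upstream error"
    # upstream_body is deliberately never copied into the response; it
    # would go to operator-side logs only, keyed by the request ID.
    headers = {"X-Request-ID": rid}
    body = {"error": message, "request_id": rid}
    return status, headers, body
```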

Streaming integrity is also a safety property

Token counting and audit writes happen after stream close, never on the hot path. This keeps time-to-first-byte honest and ensures budget decrements and audit rows reflect the true final token counts reported by Ollama (prompt_eval_count + eval_count), not estimates.
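
The idea can be sketched for the NDJSON case as a generator that forwards each frame untouched and only parses the last one after the stream ends. The `relay_and_count` name and the `on_close` callback are hypothetical; the `done`, `prompt_eval_count`, and `eval_count` fields are those Ollama reports on its final frame.

```python
import json
from typing import Callable, Iterable, Iterator


def relay_and_count(
    frames: Iterable[bytes],
    on_close: Callable[[int, int], None],
) -> Iterator[bytes]:
    """Forward NDJSON frames unchanged; report token counts after close."""
    last = b""
    try:
        for line in frames:
            if line.strip():
                last = line  # remember only; no parsing while streaming
            yield line
    finally:
        prompt_tokens = eval_tokens = 0
        try:
            frame = json.loads(last) if last else {}
        except ValueError:
            frame = {}
        if isinstance(frame, dict) and frame.get("done"):
            prompt_tokens = int(frame.get("prompt_eval_count") or 0)
            eval_tokens = int(frame.get("eval_count") or 0)
        # Budget decrement and audit write would hang off this callback.
        on_close(prompt_tokens, eval_tokens)
```

If the client disconnects before the final frame arrives, this sketch reports zero counts. How the gateway accounts for interrupted streams is not covered here.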

Out of scope (v0.1.0)

Documented as future work, not mitigations present today: content moderation / prompt-injection filtering, response caching, multi-backend routing, billing, SSO/OAuth2 for admin, and any web admin UI (that lives in neuronetz-console).
