One-command demo so the gateway can be exercised end-to-end without a GPU or a real model download:

- demo/mock-ollama/: tiny FastAPI service emulating Ollama (/api/tags, /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count and eval_count on the final frame, /api/embed, /api/show, /api/version). Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml: postgres + redis + mock-ollama + gateway, with PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground. Mirrors the prod posture (mock-ollama not exposed).
- demo.sh: brings the stack up, waits on /healthz, creates a demo tenant with allow_all_models and a fresh API key via the bootstrap CLI inside the container, then prints the key, the playground URL, and five ready-to-paste curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull). ./demo.sh --down tears everything back down with volumes.
- playground/index.html: single-file dark-themed UI served same-origin by the gateway at /playground (CORS-free). Per-endpoint About card with method/auth/streaming badges, a real description, sample request body, sample response, and a footer note. Live SSE/NDJSON rendering of the response. A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh are visibly gated until an API key is in the field; the Base URL is force-pinned to location.origin three times to defeat browser autofill.
- docs/: API.md (full endpoint reference with curl, streaming formats, error model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule, pointing at a real Ollama backend, env reference), THREAT_MODEL.md (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md. mkdocs.yml (Material theme) wires them together.

# neuronetz-gateway: Threat Model

From [`scope-docs/SPEC.md`](../scope-docs/SPEC.md) §3. The governing principle, in one line:

> **Fail closed, always.** If a security or budgeting check cannot be performed (Redis down,
> DB unreachable, ambiguous state), **deny** the request. Never default to allow.
> (AGENT_PROMPT non-negotiable #1.)

The gateway exists because the Ollama instance at `api.neuronetz.ai` was exposed without
authentication, a standing security incident. Every defense below traces back to closing
that gap and keeping it closed.

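
The fail-closed principle quoted above can be illustrated with a small wrapper. This is a sketch under stated assumptions, not the gateway's actual code: `Decision`, `fail_closed`, and the `deny_status` default of 429 are hypothetical; only the "deny when the check cannot be performed" behavior and the 503 status come from this document.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Decision:
    """Outcome of one security or budgeting check (hypothetical type)."""
    allowed: bool
    status: int
    reason: str


def fail_closed(check: Callable[[], Optional[bool]], deny_status: int = 429) -> Decision:
    """Evaluate a check so that every non-affirmative outcome is a denial."""
    try:
        result = check()
    except Exception:
        # The check could not be performed (e.g. Redis down, DB unreachable): deny.
        return Decision(allowed=False, status=503, reason="check unavailable")
    if result is True:
        return Decision(allowed=True, status=200, reason="ok")
    if result is False:
        # The check ran and said no; deny_status is an assumed value for illustration.
        return Decision(allowed=False, status=deny_status, reason="denied")
    # Anything else (None, unexpected types) is ambiguous state: deny as well.
    return Decision(allowed=False, status=503, reason="ambiguous state")
```
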
---

## Threats & mitigations (SPEC §3)

| Threat | Mitigation |
|---|---|
| Internet scanners hitting Ollama directly | Ollama bound to the internal Docker network; **never published**. No `ports:` mapping in any shipped compose file. |
| Unauthenticated API abuse | Mandatory Bearer token; **fail-closed** on auth errors (401). |
| API key brute force | Argon2id hashing; constant-time compare; rate limit on auth failures per source IP (`AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN`). |
| GPU/token exhaustion (cost attack) | Per-key TPM + token budget; per-tenant ceiling; concurrent-connection cap. |
| Resource exhaustion via large payloads | Request body size limit (default 256 KiB); `num_predict` cap (default 4096). |
| Model enumeration / training-data exfil via uncommon models | Model allowlist, **default-deny**. Discovery only exposes models actually installed; `/api/tags` and `/v1/models` never reveal models outside the tenant's effective set; "not allowed" and "doesn't exist" return the **same** generic response. |
| Discovery backend unreachable | **Fail-closed:** an empty/stale-expired discovered set means no model resolves, so requests are denied; never "allow because we couldn't list models." |
| Ollama mutation (model pull/delete) by attacker | Endpoint allowlist; mutating endpoints (`/api/pull`, `/api/push`, `/api/create`, `/api/copy`, `/api/delete`, `/api/blobs/*`) **hard-blocked** at the gateway, not configurable. |
| Information disclosure via error messages | Upstream errors **sanitized** at the boundary; Ollama internals never proxied to the client. Each error carries an `X-Request-ID` for correlation. |
| Audit log tampering | Append-only at the app layer; DB role separation; optional WAL archiving. |
| Prompt data leakage | Prompt logging **off by default**; opt-in per key; TTL'd retention; redaction hook. |
| Redis outage causing "fail open" | **Fail-closed:** if the rate-limit/budget backend is unavailable, deny (503), not allow. |
| Compromised admin token | There is **no admin endpoint** in the gateway. Admin lives in `neuronetz-console`; the gateway has nothing to compromise here. |

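
The hard-block on mutating endpoints in the table above could look roughly like the following. It is a minimal sketch, not the gateway's implementation: the pattern list is copied from the table, while `is_hard_blocked` and the path normalization (dropping the query string and a trailing slash) are assumptions made for illustration. The separate endpoint allowlist is not modeled here.

```python
from fnmatch import fnmatchcase

# Patterns copied from the threat table. A module-level constant mirrors
# "not configurable": nothing reads this list from settings or the environment.
HARD_BLOCKED = (
    "/api/pull",
    "/api/push",
    "/api/create",
    "/api/copy",
    "/api/delete",
    "/api/blobs/*",
)


def is_hard_blocked(raw_path: str) -> bool:
    """Return True if the request path targets a mutating Ollama endpoint."""
    # Assumed normalization: ignore the query string and any trailing slash.
    path = raw_path.split("?", 1)[0].rstrip("/") or "/"
    return any(fnmatchcase(path, pattern) for pattern in HARD_BLOCKED)
```
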
---

## Notes on selected defenses

### `allow_all_models` is an audited opt-in

`allow_all_models` lets a tenant use any currently-installed model, so models newly pulled
into Ollama are auto-granted on the next discovery refresh. This is convenient but widens the
attack surface for *that tenant*, so it is:

- **opt-in per tenant** (default `false`), set explicitly via the CLI
  (`create-tenant --allow-all-models` or `set-models --allow-all`);
- **overridable per key**: a non-`NULL` key-level `allow_all_models` overrides the tenant
  flag; otherwise the tenant flag applies (SPEC §13.7);
- **audited**: every request records the model used in `gateway.audit_log`.

Default-deny tenants instead see only `allowed_models ∩ discovered`. Either way the effective
set is always intersected with the *live* discovered set, so stale or typo'd allowlist entries
never resolve.

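
The resolution rule described in this section can be sketched as a single function. The function name and parameter names are hypothetical, and the sketch assumes that a key-level `False` falls back to the tenant's `allowed_models`; the override order and the intersection with the discovered set follow the text above.

```python
from typing import FrozenSet, Iterable, Optional


def effective_models(
    discovered: Iterable[str],
    tenant_allow_all: bool,
    tenant_allowed: Iterable[str],
    key_allow_all: Optional[bool] = None,
) -> FrozenSet[str]:
    """Compute the set of models a request may resolve (illustrative only)."""
    live = frozenset(discovered)
    # A non-NULL (non-None) key-level flag overrides the tenant flag.
    allow_all = tenant_allow_all if key_allow_all is None else key_allow_all
    if allow_all:
        # "Any currently-installed model": exactly the live discovered set.
        return live
    # Default-deny: allowed_models ∩ discovered, so stale or typo'd entries drop out.
    return frozenset(tenant_allowed) & live
```

An empty discovered set yields an empty effective set in both branches, which is the fail-closed behavior the threat table requires when discovery is unreachable.
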
### No existence disclosure

A model that is installed-but-unpermitted and a model that is not installed both return the
**same** generic `403`. An attacker cannot use the gateway to enumerate which models exist on
the backend (SPEC §13.6).

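
One way to get this property is to make a single membership test against the effective set and never consult the installed set when building the error. The sketch below assumes a hypothetical `check_model` helper and a made-up error body; only the shared `403` comes from this document.

```python
from typing import Any, Dict, FrozenSet


def check_model(model: str, effective: FrozenSet[str]) -> Dict[str, Any]:
    """Authorize a model name against the tenant's effective set."""
    if model in effective:
        return {"status": 200}
    # One code path for "not installed" and "installed but not permitted",
    # so status and body cannot differ between the two cases.
    # The error text is a placeholder, not the gateway's real wording.
    return {"status": 403, "body": {"error": "model not available"}}
```

Because the denial branch never looks at which models are installed, there is no input an attacker can vary to tell the two cases apart from the response alone.
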
### Sanitized errors + request IDs

Clients never receive Ollama's error text, stack traces, or internal hostnames. Errors are
mapped to generic `4xx`/`5xx` JSON with a `request_id`. Operators correlate that ID with the
audit log to investigate without leaking internals to callers (SPEC §4.3 step 14).

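
A minimal sketch of that boundary, assuming a hypothetical `sanitize_upstream_error` helper: the specific status mapping (upstream `4xx` to `400`, everything else to `502`) and the message strings are illustrative assumptions, while the `request_id` field and the `X-Request-ID` header are the ones named in this document.

```python
import json
from typing import Dict, Tuple

# Placeholder messages; the gateway's real wording is not specified here.
GENERIC_MESSAGES = {400: "bad request", 502: "upstream error"}


def sanitize_upstream_error(
    upstream_status: int, upstream_text: str, request_id: str
) -> Tuple[int, Dict[str, str], str, Dict[str, object]]:
    """Split an upstream failure into a client-safe response and an operator record."""
    status = 400 if 400 <= upstream_status < 500 else 502
    # Client side: generic message plus the correlation ID, nothing from upstream.
    body = json.dumps({"error": GENERIC_MESSAGES[status], "request_id": request_id})
    headers = {"Content-Type": "application/json", "X-Request-ID": request_id}
    # Operator side: the raw upstream details stay in the server-side record.
    log_record = {
        "request_id": request_id,
        "upstream_status": upstream_status,
        "upstream_text": upstream_text,
    }
    return status, headers, body, log_record
```

The raw upstream text only ever flows into the operator-side record, so a caller holding the `request_id` learns nothing about Ollama internals.
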
### Streaming integrity is also a safety property

Token counting and audit writes happen **after** stream close, never on the hot path. This
keeps time-to-first-byte honest and ensures budget decrements and audit rows reflect the true
final token counts reported by Ollama (`prompt_eval_count` + `eval_count`), not estimates.

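
The ordering described above can be sketched with a pass-through generator for NDJSON streams. `relay_ndjson`, `final_counts`, and the `on_close` callback are hypothetical names; the sketch also assumes that a stream ending without a final `done` frame reports its counts as unknown (`None`), which this document does not specify.

```python
import json
from typing import Callable, Iterable, Iterator, Optional, Tuple

Counts = Tuple[Optional[int], Optional[int]]


def final_counts(last_line: bytes) -> Counts:
    """Read prompt_eval_count and eval_count from the final frame, if present."""
    try:
        frame = json.loads(last_line)
    except ValueError:
        return None, None
    if not isinstance(frame, dict) or frame.get("done") is not True:
        return None, None
    return frame.get("prompt_eval_count"), frame.get("eval_count")


def relay_ndjson(
    frames: Iterable[bytes],
    on_close: Callable[[Optional[int], Optional[int]], None],
) -> Iterator[bytes]:
    """Forward frames untouched; do accounting only once the stream has closed."""
    last = b""
    try:
        for line in frames:
            if line.strip():
                last = line  # remember the frame, but do not parse it yet
            yield line
    finally:
        # Runs after the last frame is sent or when the consumer disconnects.
        # Budget decrement and audit write would be triggered from on_close.
        on_close(*final_counts(last))
```
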
---

## Out of scope (v0.1.0)

Documented as future work, **not** mitigations present today: content moderation /
prompt-injection filtering, response caching, multi-backend routing, billing, SSO/OAuth2 for
admin, and any web admin UI (that lives in `neuronetz-console`).