
Stephan Berbig b47a09db91 demo + playground + docs

One-command demo so the gateway can be exercised end-to-end without a GPU or a
real model download:

- demo/mock-ollama/ — tiny FastAPI service emulating Ollama (/api/tags,
  /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count
  and eval_count on the final frame, /api/embed, /api/show, /api/version).
  Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml — postgres + redis + mock-ollama + gateway, with
  PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground.
  Mirrors the prod posture (mock-ollama not exposed).
- demo.sh — brings the stack up, waits on /healthz, creates a demo tenant with
  allow_all_models and a fresh API key via the bootstrap CLI inside the
  container, then prints the key, the playground URL, and five ready-to-paste
  curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull).
  ./demo.sh --down tears everything back down with volumes.
- playground/index.html — single-file dark-themed UI served same-origin by
  the gateway at /playground (CORS-free). Per-endpoint About card with method/
  auth/streaming badges, a real description, sample request body, sample
  response, and a footer note. Live SSE/NDJSON rendering of the response.
  A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh
  are visibly gated until an API key is in the field; the Base URL is
  force-pinned to location.origin three times to defeat browser autofill.
- docs/ — API.md (full endpoint reference with curl, streaming formats, error
  model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery
  + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule,
  pointing at a real Ollama backend, env reference), THREAT_MODEL.md
  (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md
  (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md.
  mkdocs.yml (Material theme) wires them together.

2026-05-26 20:52:33 +02:00


neuronetz-gateway — API Reference

The gateway exposes two compatible API surfaces in front of the Ollama backend:

- Native Ollama under /api/*: NDJSON streaming, identical request shapes to Ollama.
- OpenAI-compatible under /v1/*: SSE streaming, drop-in for the OpenAI SDKs.

Plus unauthenticated health endpoints. Everything else is blocked.

Source of truth: scope-docs/SPEC.md §6. Where this doc and the SPEC disagree, the SPEC wins.

Authentication

Every model endpoint requires an API key as a Bearer token:

Authorization: Bearer nz_<12-char-prefix><32-char-random>

- Key format: nz_ namespace + random base62 body. The first 12 characters (nz_ + entropy) are the prefix, stored in cleartext and indexed for O(1) lookup. The full key is argon2id-hashed; it is shown exactly once at creation (neuronetz-gateway create-key) and never stored or logged.
- Fail-closed: a missing, malformed, expired, disabled, or revoked key returns 401. No upstream/Ollama detail is ever leaked in the error.
- Health endpoints (/healthz, /readyz) require no auth.
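A client can check the shape of a key before sending it. The following is a minimal sketch that follows the prose definition above (nz_ namespace, base62 body, first 12 characters as the prefix). The helper names `split_key` and `auth_header` are hypothetical, and the total key length is deliberately not enforced because this page does not pin it down.

```python
import string

BASE62 = set(string.ascii_letters + string.digits)
PREFIX_LEN = 12  # "the first 12 characters (nz_ + entropy) are the prefix"


def split_key(key: str) -> tuple:
    """Return (prefix, remainder), or raise ValueError for a malformed key.

    Assumption: only the namespace, the alphabet, and a minimum length are
    checked; the exact total length is not asserted here.
    """
    if not key.startswith("nz_"):
        raise ValueError("missing nz_ namespace")
    body = key[3:]
    if len(key) <= PREFIX_LEN or not all(c in BASE62 for c in body):
        raise ValueError("malformed key body")
    return key[:PREFIX_LEN], key[PREFIX_LEN:]


def auth_header(key: str) -> dict:
    """Build the Bearer header after validating the key shape."""
    split_key(key)
    return {"Authorization": f"Bearer {key}"}
```

Only the prefix is safe to show in logs or dashboards; the remainder should be treated as a secret on the client side as well.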

The placeholder key nz_demoKEY... is used throughout this doc. ./demo.sh prints a real key for the local demo.

Response headers (SPEC §6.5)

Every proxied response carries:

| Header | Meaning |
| --- | --- |
| `X-Request-ID` | Correlates the response with the audit log row. Present on errors too. |
| `X-RateLimit-Limit-Requests` | Effective RPM limit for this key/tenant. |
| `X-RateLimit-Remaining-Requests` | Requests remaining in the current window. |
| `X-RateLimit-Limit-Tokens` | Effective TPM limit. |
| `X-RateLimit-Remaining-Tokens` | Tokens remaining in the current window. |
| `X-Budget-Period` | `day` \| `month` \| `total`: the binding budget period. |
| `X-Budget-Tokens-Remaining` | Tokens left in the binding budget period. |

429 Too Many Requests responses additionally carry Retry-After: <seconds>.
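A client can read these headers to pace itself. The sketch below assumes the headers arrive as a plain string dictionary and that Retry-After carries a number of seconds, as stated above. The function names and the one-second fallback are illustrative choices, not gateway behavior.

```python
from __future__ import annotations

# Numeric headers from the table above, keyed by a short client-side name.
NUMERIC_HEADERS = {
    "limit_requests": "X-RateLimit-Limit-Requests",
    "remaining_requests": "X-RateLimit-Remaining-Requests",
    "limit_tokens": "X-RateLimit-Limit-Tokens",
    "remaining_tokens": "X-RateLimit-Remaining-Tokens",
    "budget_tokens_remaining": "X-Budget-Tokens-Remaining",
}


def _lower(headers: dict) -> dict:
    """HTTP header names are case-insensitive, so compare in lower case."""
    return {k.lower(): v for k, v in headers.items()}


def rate_limit_state(headers: dict) -> dict:
    """Extract the limit headers; missing or non-numeric values become None."""
    h = _lower(headers)
    state = {}
    for field, name in NUMERIC_HEADERS.items():
        try:
            state[field] = int(h[name.lower()])
        except (KeyError, ValueError):
            state[field] = None
    state["budget_period"] = h.get("x-budget-period")
    return state


def retry_delay(status: int, headers: dict, default: float = 1.0) -> float | None:
    """Seconds to wait before retrying a 429, or None for any other status."""
    if status != 429:
        return None
    try:
        return max(0.0, float(_lower(headers)["retry-after"]))
    except (KeyError, ValueError):
        return default  # assumption: fall back to a short fixed delay
```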

Error model

Errors are sanitized at the gateway boundary — Ollama internals are never reflected. The body is a small generic JSON object and the X-Request-ID header ties it to the audit log.

{ "error": { "message": "forbidden", "type": "forbidden", "code": 403 }, "request_id": "b3f1…" }

| Status | When |
| --- | --- |
| `400` | Malformed body, schema violation, or `num_predict` over the cap. |
| `401` | Missing / invalid / expired / revoked key. |
| `403` | Endpoint hard-blocked, or model outside the tenant's effective set (no existence disclosure). |
| `413` | Request body over `MAX_REQUEST_BODY_BYTES` (default 256 KiB). |
| `429` | Rate limit or budget exceeded (carries `Retry-After`). |
| `502` | Ollama upstream unreachable / circuit breaker open. |
| `503` | A required subsystem (Postgres read, Redis) is down; fail-closed, never "allow". |

A model that is installed-but-unpermitted and a model that is not installed return the same generic 403, to prevent enumeration (SPEC §13.6).
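On the client side, the error body can be decoded into a small record. This is a sketch under the assumption that the body has the shape shown above. The `GatewayError` type and the set of statuses treated as retryable are hypothetical client policy, not something the gateway prescribes.

```python
from __future__ import annotations

import json
from dataclasses import dataclass

# Assumption: 429, 502 and 503 are worth retrying; 4xx auth/policy errors are not.
RETRYABLE = frozenset({429, 502, 503})


@dataclass(frozen=True)
class GatewayError:
    status: int
    type: str
    message: str
    request_id: str | None

    @property
    def retryable(self) -> bool:
        return self.status in RETRYABLE


def parse_error(status: int, body: str,
                header_request_id: str | None = None) -> GatewayError:
    """Decode an error body; tolerate non-JSON bodies from intermediaries."""
    try:
        doc = json.loads(body)
    except ValueError:
        doc = None
    if not isinstance(doc, dict):
        doc = {}
    err = doc.get("error")
    if not isinstance(err, dict):
        err = {}
    return GatewayError(
        status=status,
        type=str(err.get("type", "unknown")),
        message=str(err.get("message", "")),
        # Prefer the body field, fall back to the X-Request-ID header value.
        request_id=doc.get("request_id") or header_request_id,
    )
```

Logging `request_id` with every failure makes it possible to match a client-side error to the corresponding audit log row.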

Native Ollama endpoints (`/api/*`)

`POST /api/chat`

Streamed (NDJSON, default) or non-streamed chat completion.

curl -N http://localhost:8080/api/chat \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,
       "messages":[{"role":"user","content":"Say hello in one sentence."}]}'

Streaming response — Content-Type: application/x-ndjson, one JSON object per line:

{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":"Echo:"},"done":false}
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":" Say"},"done":false}
…
{"model":"llama3.1:8b","done":true,"done_reason":"stop",
 "prompt_eval_count":6,"eval_count":7,"total_duration":1234567890,"eval_duration":34567890}

The final object carries prompt_eval_count (tokens in) and eval_count (tokens out); the gateway uses these for precise token accounting (SPEC §4.3 step 12).

Non-streaming ("stream": false) returns a single JSON object of the same shape with "done": true.
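The NDJSON stream can be consumed line by line. The sketch below assumes the frames look like the sample above and takes any iterable of text lines (for example, the lines of an HTTP response body). The function name `consume_ndjson` is hypothetical.

```python
import json
from typing import Iterable


def consume_ndjson(lines: Iterable[str]) -> dict:
    """Concatenate streamed chat content and read token counts from the final frame."""
    parts = []
    result = {"done": False, "prompt_tokens": 0, "completion_tokens": 0}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        frame = json.loads(line)
        parts.append(frame.get("message", {}).get("content", ""))
        if frame.get("done"):
            # Only the final frame carries the token counters.
            result["done"] = True
            result["prompt_tokens"] = frame.get("prompt_eval_count", 0)
            result["completion_tokens"] = frame.get("eval_count", 0)
            break
    result["text"] = "".join(parts)
    return result
```

If the stream ends without a frame where `done` is true, `done` stays false and the counters stay at zero, which a caller can treat as a truncated response.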

`POST /api/generate`

Same semantics as /api/chat but uses a flat prompt string and returns response fields instead of message objects.

curl -N http://localhost:8080/api/generate \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,"prompt":"Write a haiku about routers."}'

`POST /api/embed` / `POST /api/embeddings`

Non-streamed embeddings. /api/embed is the newer batch endpoint (field embeddings, a list of vectors); /api/embeddings is the legacy single-vector endpoint (field embedding). Ollama returns no eval_count for embeddings; cost is charged on prompt_eval_count only (SPEC §13.1).

curl http://localhost:8080/api/embed \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed-text","input":["hello","world"]}'

{ "model": "nomic-embed-text", "embeddings": [[0.0, 0.1, …], [0.0, 0.1, …]], "prompt_eval_count": 2 }
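Because the two endpoints return different field names, client code that may talk to either one can normalize the result. A minimal sketch, assuming the response has already been decoded into a dictionary; `vectors_from` is a hypothetical helper name.

```python
def vectors_from(response: dict) -> list:
    """Return a list of vectors from either embeddings response shape."""
    if "embeddings" in response:   # /api/embed: a list of vectors
        return [list(v) for v in response["embeddings"]]
    if "embedding" in response:    # /api/embeddings: a single vector
        return [list(response["embedding"])]
    raise KeyError("response has neither 'embeddings' nor 'embedding'")
```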

`GET /api/tags`

Returns the tenant's effective model set — the live-discovered set intersected with the tenant's allowlist, or all discovered models when allow_all_models is on. Sourced from discovery (SPEC §4.6), never a static list.

curl http://localhost:8080/api/tags -H "Authorization: Bearer nz_demoKEY..."
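The effective-set rule can be written as a set intersection. This is an illustrative sketch of the rule as described here, not the gateway's actual code; the function names are hypothetical. It also shows why an unpermitted model and an uninstalled model are indistinguishable to the caller.

```python
def effective_models(discovered, allowlist, allow_all_models=False) -> set:
    """Discovered models intersected with the allowlist, or all discovered models."""
    if allow_all_models:
        return set(discovered)
    return set(discovered) & set(allowlist)


def status_for(model: str, effective: set) -> int:
    """Anything outside the effective set gets the same generic 403."""
    return 200 if model in effective else 403
```

An allowlist entry for a model that discovery has not found contributes nothing to the result, so it never appears in /api/tags.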

`POST /api/show`

Allowed only for models in the effective set; returns sanitized model info. The system prompt and template that Ollama returns are stripped by the gateway.
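The stripping step might look like the sketch below. The key names `system` and `template` are assumptions about the upstream response shape, and whether other fields are also removed is not stated on this page, so the example drops only those two.

```python
# Assumed key names for the fields described above; adjust to the real response.
STRIPPED_FIELDS = ("system", "template")


def sanitize_show(info: dict) -> dict:
    """Return a copy of the model info without the stripped fields."""
    return {k: v for k, v in info.items() if k not in STRIPPED_FIELDS}
```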

`GET /api/version`

Returns the gateway version, not the Ollama version.

{ "version": "0.1.0" }

Hard-blocked endpoints (always `403`)

These model-mutating endpoints are blocked at the gateway. Not configurable, not behind a flag (SPEC §6.2, AGENT_PROMPT non-negotiable #5):

/api/pull   /api/push   /api/create   /api/copy   /api/delete   /api/blobs/*

# Always 403, even with a valid key:
curl -i http://localhost:8080/api/pull \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" -d '{"model":"llama3.1:8b"}'

GET /api/ps is also blocked (it would leak which models are loaded).
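A path check equivalent to this list could be sketched as follows. The function name is hypothetical, and for simplicity the sketch blocks every HTTP method on these paths, which is stricter than the GET-only wording for /api/ps.

```python
BLOCKED_EXACT = frozenset({
    "/api/pull", "/api/push", "/api/create",
    "/api/copy", "/api/delete", "/api/ps",
})
BLOCKED_PREFIX = "/api/blobs/"  # covers /api/blobs/*


def is_hard_blocked(path: str) -> bool:
    """True when the request path must be rejected with 403 before any proxying."""
    clean = path.split("?", 1)[0]  # ignore the query string
    return clean.rstrip("/") in BLOCKED_EXACT or clean.startswith(BLOCKED_PREFIX)
```

Running this check before authentication-dependent logic keeps the block unconditional: no key, tenant setting, or flag can reach the upstream for these paths.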

OpenAI-compatible endpoints (`/v1/*`)

| Path | Method | Maps to |
| --- | --- | --- |
| `/v1/chat/completions` | POST | `/api/chat` |
| `/v1/completions` | POST | `/api/generate` |
| `/v1/embeddings` | POST | `/api/embed` |
| `/v1/models` | GET | `/api/tags` (effective set, OpenAI list format) |

Streaming uses SSE: data: {…}\n\n events terminated by a literal data: [DONE]\n\n.

`POST /v1/chat/completions`

curl -N http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,
       "messages":[{"role":"user","content":"Say hello in one sentence."}]}'

Streaming response — Content-Type: text/event-stream:

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Echo:"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" Say"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":6,"completion_tokens":7,"total_tokens":13}}

data: [DONE]

Works with the OpenAI Python SDK by pointing base_url at http://localhost:8080/v1.
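For clients that read the SSE stream directly instead of using an SDK, the parsing loop is short. The sketch below assumes one `data:` line per event, as in the sample above (the SSE format in general allows multi-line events, which this does not handle). The function name `consume_sse` is hypothetical.

```python
import json
from typing import Iterable


def consume_sse(lines: Iterable[str]) -> dict:
    """Concatenate delta content until the literal [DONE] event arrives."""
    parts = []
    usage = None
    finished = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            finished = True
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
        usage = chunk.get("usage") or usage  # present on the last content chunk
    return {"text": "".join(parts), "usage": usage, "finished": finished}
```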

`GET /v1/models`

curl http://localhost:8080/v1/models -H "Authorization: Bearer nz_demoKEY..."

{ "object": "list", "data": [
  { "id": "llama3.1:8b", "object": "model", "owned_by": "neuronetz" },
  { "id": "mistral:7b",  "object": "model", "owned_by": "neuronetz" }
] }
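The mapping from the native tags listing to the OpenAI list format can be sketched as below. It assumes the native listing is an object with a `models` array whose entries carry a `name`, as Ollama's /api/tags does, and it takes the `owned_by` value from the sample above. The function name is hypothetical.

```python
def tags_to_openai_list(tags: dict, owned_by: str = "neuronetz") -> dict:
    """Convert an /api/tags style listing into the OpenAI model list shape."""
    return {
        "object": "list",
        "data": [
            {"id": m["name"], "object": "model", "owned_by": owned_by}
            for m in tags.get("models", [])
        ],
    }
```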

Health endpoints

| Path | Method | Auth | Purpose |
| --- | --- | --- | --- |
| `/healthz` | GET | none | Liveness: process responsive (`200`). |
| `/readyz` | GET | none | Readiness: DB + Redis + Ollama reachable, else `503`. |
| `/metrics` | GET | none (loopback only) | Prometheus exposition. |

curl -i http://localhost:8080/healthz   # 200 {"status":"ok"}
curl -i http://localhost:8080/readyz    # 200 when all deps up; 503 otherwise
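A startup script can poll a health endpoint before sending traffic. The following is a minimal sketch using only the standard library; the attempt count, delay, and function names are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True when the URL answers 200; connection errors and 5xx count as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe, attempts: int = 30, delay: float = 1.0, sleep=time.sleep) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For the local demo this would be called as `wait_until(lambda: http_ok("http://localhost:8080/readyz"))`; passing the probe as a callable keeps the loop testable without a running gateway.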

Quick reference: streaming formats

| Surface | Content-Type | Frame | Terminator |
| --- | --- | --- | --- |
| Native `/api/*` | `application/x-ndjson` | one JSON object per `\n` | final object has `"done": true` |
| OpenAI `/v1/*` | `text/event-stream` | `data: {…}\n\n` | `data: [DONE]\n\n` |
