Stephan Berbig b47a09db91 demo + playground + docs
One-command demo so the gateway can be exercised end-to-end without a GPU or a
real model download:

- demo/mock-ollama/ — tiny FastAPI service emulating Ollama (/api/tags,
  /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count
  and eval_count on the final frame, /api/embed, /api/show, /api/version).
  Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml — postgres + redis + mock-ollama + gateway, with
  PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground.
  Mirrors the prod posture (mock-ollama not exposed).
- demo.sh — brings the stack up, waits on /healthz, creates a demo tenant with
  allow_all_models and a fresh API key via the bootstrap CLI inside the
  container, then prints the key, the playground URL, and five ready-to-paste
  curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull).
  ./demo.sh --down tears everything back down with volumes.
- playground/index.html — single-file dark-themed UI served same-origin by
  the gateway at /playground (CORS-free). Per-endpoint About card with method/
  auth/streaming badges, a real description, sample request body, sample
  response, and a footer note. Live SSE/NDJSON rendering of the response.
  A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh
  are visibly gated until an API key is in the field; the Base URL is
  force-pinned to location.origin three times to defeat browser autofill.
- docs/ — API.md (full endpoint reference with curl, streaming formats, error
  model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery
  + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule,
  pointing at a real Ollama backend, env reference), THREAT_MODEL.md
  (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md
  (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md.
  mkdocs.yml (Material theme) wires them together.
2026-05-26 20:52:33 +02:00


neuronetz-gateway — API Reference

The gateway exposes two compatible API surfaces in front of the Ollama backend:

  • Native Ollama under /api/* — NDJSON streaming, identical request shapes to Ollama.
  • OpenAI-compatible under /v1/* — SSE streaming, drop-in for the OpenAI SDKs.

Unauthenticated health endpoints are exposed alongside them. Everything else is blocked.

Source of truth: scope-docs/SPEC.md §6. Where this doc and the SPEC disagree, the SPEC wins.


Authentication

Every model endpoint requires an API key as a Bearer token:

Authorization: Bearer nz_<12-char-prefix><32-char-random>
  • Key format: nz_ namespace + random base62 body. The first 12 characters (nz_ + entropy) are the prefix, stored in cleartext and indexed for O(1) lookup. The full key is argon2id-hashed; it is shown exactly once at creation (neuronetz-gateway create-key) and never stored or logged.
  • Fail-closed: a missing, malformed, expired, disabled, or revoked key returns 401. No upstream/Ollama detail is ever leaked in the error.
  • Health endpoints (/healthz, /readyz) require no auth.

The placeholder key nz_demoKEY... is used throughout this doc. ./demo.sh prints a real key for the local demo.
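
As a rough client-side illustration of the key format described above, the sketch below extracts the cleartext lookup prefix and builds the Authorization header. The helper names (lookup_prefix, auth_header) are hypothetical, and the code assumes the 12-character prefix includes the nz_ namespace, as the bullet above states; it is not the gateway's own validation logic.

```python
import string

BASE62 = set(string.ascii_letters + string.digits)
NAMESPACE = "nz_"
PREFIX_LEN = 12  # assumption: the 12-character prefix includes the "nz_" namespace


def lookup_prefix(api_key: str) -> str:
    """Return the cleartext lookup prefix of a key, or raise ValueError if malformed."""
    if not api_key.startswith(NAMESPACE):
        raise ValueError("missing nz_ namespace")
    body = api_key[len(NAMESPACE):]
    # The body must be base62 and the key must be longer than the prefix itself.
    if len(api_key) <= PREFIX_LEN or not all(ch in BASE62 for ch in body):
        raise ValueError("malformed key body")
    return api_key[:PREFIX_LEN]


def auth_header(api_key: str) -> dict[str, str]:
    """Build the Bearer header that every model endpoint requires."""
    return {"Authorization": f"Bearer {api_key}"}
```

Only the prefix is safe to log or display; the full key should be treated like a password.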


Response headers (SPEC §6.5)

Every proxied response carries:

Header                          Meaning
X-Request-ID                    Correlates the response with the audit log row. Present on errors too.
X-RateLimit-Limit-Requests      Effective RPM limit for this key/tenant.
X-RateLimit-Remaining-Requests  Requests remaining in the current window.
X-RateLimit-Limit-Tokens        Effective TPM limit.
X-RateLimit-Remaining-Tokens    Tokens remaining in the current window.
X-Budget-Period                 day | month | total (the binding budget period).
X-Budget-Tokens-Remaining       Tokens left in the binding budget period.

429 Too Many Requests responses additionally carry Retry-After: <seconds>.
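
A client can read these headers to pace itself. The following is a minimal sketch, assuming the headers arrive as a plain name-to-value mapping; the Quota dataclass and parse_quota function are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Quota:
    request_id: Optional[str]
    requests_remaining: Optional[int]
    tokens_remaining: Optional[int]
    budget_period: Optional[str]
    budget_tokens_remaining: Optional[int]
    retry_after_s: Optional[int]


def _as_int(value: Optional[str]) -> Optional[int]:
    """Parse a non-negative integer header value; None if absent or not numeric."""
    if value is None or not value.strip().isdigit():
        return None
    return int(value)


def parse_quota(headers: Mapping[str, str]) -> Quota:
    """Collect the rate-limit and budget headers from one response."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    return Quota(
        request_id=h.get("x-request-id"),
        requests_remaining=_as_int(h.get("x-ratelimit-remaining-requests")),
        tokens_remaining=_as_int(h.get("x-ratelimit-remaining-tokens")),
        budget_period=h.get("x-budget-period"),
        budget_tokens_remaining=_as_int(h.get("x-budget-tokens-remaining")),
        retry_after_s=_as_int(h.get("retry-after")),  # only present on 429
    )
```

On a 429, sleeping for retry_after_s seconds before the next attempt is the obvious use; on other responses that field is simply None.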


Error model

Errors are sanitized at the gateway boundary — Ollama internals are never reflected. The body is a small generic JSON object and the X-Request-ID header ties it to the audit log.

{ "error": { "message": "forbidden", "type": "forbidden", "code": 403 }, "request_id": "b3f1…" }

Status  When
400     Malformed body, schema violation, or num_predict over the cap.
401     Missing / invalid / expired / revoked key.
403     Endpoint hard-blocked, or model outside the tenant's effective set (no existence disclosure).
413     Request body over MAX_REQUEST_BODY_BYTES (default 256 KiB).
429     Rate limit or budget exceeded (carries Retry-After).
502     Ollama upstream unreachable / circuit breaker open.
503     A required subsystem (Postgres read, Redis) is down (fail-closed, never "allow").

A model that is installed-but-unpermitted and a model that is not installed return the same generic 403, to prevent enumeration (SPEC §13.6).
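
One way a client might interpret the error body shown above is sketched here. The describe_error helper is hypothetical, and treating 429, 502, and 503 as retryable is a client-side policy assumed for this example, not something the gateway specifies.

```python
import json

# Assumption: a client-side retry policy chosen for this sketch.
RETRYABLE_STATUSES = {429, 502, 503}


def describe_error(status: int, body: str) -> dict:
    """Summarize a gateway error response without assuming the body is well-formed."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    error = payload.get("error")
    if not isinstance(error, dict):
        error = {}
    return {
        "status": status,
        "type": error.get("type"),
        "message": error.get("message"),
        "request_id": payload.get("request_id"),  # quote this when reporting a problem
        "retryable": status in RETRYABLE_STATUSES,
    }
```

Because a 403 is identical for unpermitted and non-installed models, a client cannot (and should not try to) distinguish the two from the body.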


Native Ollama endpoints (/api/*)

POST /api/chat

Streamed (NDJSON, default) or non-streamed chat completion.

curl -N http://localhost:8080/api/chat \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,
       "messages":[{"role":"user","content":"Say hello in one sentence."}]}'

Streaming response (Content-Type: application/x-ndjson), one JSON object per line:

{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":"Echo:"},"done":false}
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":" Say"},"done":false}
…
{"model":"llama3.1:8b","done":true,"done_reason":"stop",
 "prompt_eval_count":6,"eval_count":7,"total_duration":1234567890,"eval_duration":34567890}

The final object carries prompt_eval_count (tokens in) and eval_count (tokens out); the gateway uses these for precise token accounting (SPEC §4.3 step 12).

Non-streaming ("stream": false) returns a single JSON object of the same shape with "done": true.
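
The NDJSON stream above can be consumed line by line. The sketch below assumes the caller already has an iterable of decoded text lines (for example from an HTTP client's line iterator); consume_ndjson is a hypothetical helper name.

```python
import json
from typing import Iterable, Tuple


def consume_ndjson(lines: Iterable[str]) -> Tuple[str, int, int]:
    """Return (full_text, tokens_in, tokens_out) from an /api/chat NDJSON stream."""
    parts = []
    tokens_in = tokens_out = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        frame = json.loads(line)
        # Content frames carry a partial assistant message; the final frame may not.
        parts.append(frame.get("message", {}).get("content", ""))
        if frame.get("done"):
            tokens_in = frame.get("prompt_eval_count", 0)
            tokens_out = frame.get("eval_count", 0)
            break
    return "".join(parts), tokens_in, tokens_out
```

For /api/generate the same loop applies, reading the response field of each frame instead of message.content.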

POST /api/generate

Same semantics as /api/chat but uses a flat prompt string and returns response fields instead of message objects.

curl -N http://localhost:8080/api/generate \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,"prompt":"Write a haiku about routers."}'

POST /api/embed / POST /api/embeddings

Non-streamed embeddings. /api/embed is the newer batch endpoint (field embeddings, a list of vectors); /api/embeddings is the legacy single-vector endpoint (field embedding). Ollama returns no eval_count for embeddings; cost is charged on prompt_eval_count only (SPEC §13.1).

curl http://localhost:8080/api/embed \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed-text","input":["hello","world"]}'

{ "model": "nomic-embed-text", "embeddings": [[0.0, 0.1, …], [0.0, 0.1, …]], "prompt_eval_count": 2 }
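
Since the two embedding endpoints return differently named fields, a client that supports both may want to normalize them. This is a minimal sketch with a hypothetical helper name (vectors_from), assuming the response has already been decoded into a dict.

```python
def vectors_from(response: dict) -> list[list[float]]:
    """Return a list of vectors from either embedding response shape."""
    if "embeddings" in response:   # /api/embed: batch, a list of vectors
        return response["embeddings"]
    if "embedding" in response:    # /api/embeddings: legacy, a single vector
        return [response["embedding"]]
    raise KeyError("response contains neither 'embeddings' nor 'embedding'")
```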

GET /api/tags

Returns the tenant's effective model set — the live-discovered set intersected with the tenant's allowlist, or all discovered models when allow_all_models is on. Sourced from discovery (SPEC §4.6), never a static list.
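
The effective-set rule can be expressed as a small set operation. The function below is an illustrative sketch only; its name and parameters are assumptions and it does not reflect how the gateway stores tenants or discovery results.

```python
from typing import Iterable, Set


def effective_models(
    discovered: Iterable[str],
    allowlist: Iterable[str],
    allow_all_models: bool = False,
) -> Set[str]:
    """Models a tenant may see: discovered ∩ allowlist, or all discovered if opted in."""
    live = set(discovered)
    if allow_all_models:
        return live
    # Allowlisted models that are not currently discovered never appear.
    return live & set(allowlist)
```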

curl http://localhost:8080/api/tags -H "Authorization: Bearer nz_demoKEY..."

POST /api/show

Allowed only for models in the effective set; returns sanitized model info. The system prompt and template that Ollama returns are stripped by the gateway.

GET /api/version

Returns the gateway version, not the Ollama version.

{ "version": "0.1.0" }

Hard-blocked endpoints (always 403)

These model-mutating endpoints are blocked at the gateway. Not configurable, not behind a flag (SPEC §6.2, AGENT_PROMPT non-negotiable #5):

/api/pull   /api/push   /api/create   /api/copy   /api/delete   /api/blobs/*

# Always 403, even with a valid key:
curl -i http://localhost:8080/api/pull \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" -d '{"model":"llama3.1:8b"}'

GET /api/ps is also blocked (it would leak which models are loaded).
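
For readers writing tests against the gateway, the block list above can be captured in a small predicate. This is a sketch under stated assumptions: is_hard_blocked is a hypothetical helper, and treating a bare /api/blobs path the same as /api/blobs/* is a choice made for this example.

```python
BLOCKED_EXACT = {
    "/api/pull", "/api/push", "/api/create",
    "/api/copy", "/api/delete", "/api/ps",
}
BLOCKED_PREFIXES = ("/api/blobs/",)


def is_hard_blocked(path: str) -> bool:
    """True if the request path is one the gateway always answers with 403."""
    # Drop any query string and trailing slash before matching.
    path = path.split("?", 1)[0].rstrip("/") or "/"
    return (
        path in BLOCKED_EXACT
        or path == "/api/blobs"
        or path.startswith(BLOCKED_PREFIXES)
    )
```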


OpenAI-compatible endpoints (/v1/*)

Path                  Method  Maps to
/v1/chat/completions  POST    /api/chat
/v1/completions       POST    /api/generate
/v1/embeddings        POST    /api/embed
/v1/models            GET     /api/tags (effective set, OpenAI list format)

Streaming uses SSE: data: {…}\n\n events terminated by a literal data: [DONE]\n\n.

POST /v1/chat/completions

curl -N http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,
       "messages":[{"role":"user","content":"Say hello in one sentence."}]}'

Streaming response (Content-Type: text/event-stream):

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Echo:"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" Say"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":6,"completion_tokens":7,"total_tokens":13}}

data: [DONE]

Works with the OpenAI Python SDK by pointing base_url at http://localhost:8080/v1.
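
Without the SDK, the SSE frames shown above can be parsed by hand. The sketch below assumes an iterable of decoded text lines and uses a hypothetical helper name (consume_sse); it handles only the data: field, which is all the frames above use.

```python
import json
from typing import Iterable, Optional, Tuple


def consume_sse(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Return (full_text, usage) from a /v1/chat/completions SSE stream."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # literal terminator, not JSON
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
        # The usage object arrives on the chunk that carries finish_reason.
        usage = chunk.get("usage") or usage
    return "".join(parts), usage
```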

GET /v1/models

curl http://localhost:8080/v1/models -H "Authorization: Bearer nz_demoKEY..."

{ "object": "list", "data": [
  { "id": "llama3.1:8b", "object": "model", "owned_by": "neuronetz" },
  { "id": "mistral:7b",  "object": "model", "owned_by": "neuronetz" }
] }

Health endpoints

Path      Method  Auth                  Purpose
/healthz  GET     none                  Liveness: process responsive (200).
/readyz   GET     none                  Readiness: DB + Redis + Ollama reachable, else 503.
/metrics  GET     none (loopback only)  Prometheus exposition.

curl -i http://localhost:8080/healthz   # 200 {"status":"ok"}
curl -i http://localhost:8080/readyz    # 200 when all deps up; 503 otherwise

Quick reference: streaming formats

Surface         Content-Type           Frame                    Terminator
Native /api/*   application/x-ndjson   one JSON object per \n   final object has "done": true
OpenAI /v1/*    text/event-stream      data: {…}\n\n            data: [DONE]\n\n