One-command demo so the gateway can be exercised end-to-end without a GPU or a real model download:
- demo/mock-ollama/: tiny FastAPI service emulating Ollama (/api/tags, /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count and eval_count on the final frame, /api/embed, /api/show, /api/version). Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml: postgres + redis + mock-ollama + gateway, with PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground. Mirrors the prod posture (mock-ollama not exposed).
- demo.sh: brings the stack up, waits on /healthz, creates a demo tenant with allow_all_models and a fresh API key via the bootstrap CLI inside the container, then prints the key, the playground URL, and five ready-to-paste curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull). ./demo.sh --down tears everything back down with volumes.
- playground/index.html: single-file dark-themed UI served same-origin by the gateway at /playground (CORS-free). Per-endpoint About card with method/auth/streaming badges, a real description, sample request body, sample response, and a footer note. Live SSE/NDJSON rendering of the response. A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh are visibly gated until an API key is in the field; the Base URL is force-pinned to location.origin three times to defeat browser autofill.
- docs/: API.md (full endpoint reference with curl, streaming formats, error model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule, pointing at a real Ollama backend, env reference), THREAT_MODEL.md (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md. mkdocs.yml (Material theme) wires them together.
neuronetz-gateway — API Reference
The gateway exposes two compatible API surfaces in front of the Ollama backend:
- Native Ollama under /api/*: NDJSON streaming, identical request shapes to Ollama.
- OpenAI-compatible under /v1/*: SSE streaming, drop-in for the OpenAI SDKs.
Plus unauthenticated health endpoints. Everything else is blocked.
Source of truth: scope-docs/SPEC.md §6. Where this doc and the SPEC disagree, the SPEC wins.
Authentication
Every model endpoint requires an API key as a Bearer token:
Authorization: Bearer nz_<12-char-prefix><32-char-random>
- Key format: nz_ namespace + random base62 body. The first 12 characters (nz_ + entropy) are the prefix, stored in cleartext and indexed for O(1) lookup. The full key is argon2id-hashed; it is shown exactly once at creation (neuronetz-gateway create-key), and its plaintext is never stored or logged.
- Fail-closed: a missing, malformed, expired, disabled, or revoked key returns 401. No upstream/Ollama detail is ever leaked in the error.
- Health endpoints (/healthz, /readyz) require no auth.
The placeholder key nz_demoKEY... is used throughout this doc. ./demo.sh prints a
real key for the local demo.
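The key-format rules above can be illustrated with a minimal sketch of header parsing. It assumes the lookup prefix is literally the first 12 characters of the key and that the body is base62; the function name split_bearer_key is hypothetical and is not taken from the gateway's code. Hash verification with argon2id is deliberately left out.

```python
from typing import Optional, Tuple

KEY_NAMESPACE = "nz_"
PREFIX_LEN = 12  # assumption: "first 12 characters (nz_ + entropy)" is the lookup prefix


def split_bearer_key(authorization: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (lookup_prefix, full_key), or None if the header is unusable.

    A None result means the caller should answer 401 (fail-closed).
    """
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer":
        return None
    token = token.strip()
    # Reject keys outside the nz_ namespace or too short to hold a prefix.
    if not token.startswith(KEY_NAMESPACE) or len(token) <= PREFIX_LEN:
        return None
    # Assumption: the random body is base62, i.e. ASCII letters and digits only.
    body = token[len(KEY_NAMESPACE):]
    if not (body.isascii() and body.isalnum()):
        return None
    return token[:PREFIX_LEN], token
```

In a real lookup the prefix would select the candidate row and the full key would then be checked against the stored argon2id hash.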
Response headers (SPEC §6.5)
Every proxied response carries:
| Header | Meaning |
|---|---|
| X-Request-ID | Correlates the response with the audit log row. Present on errors too. |
| X-RateLimit-Limit-Requests | Effective RPM limit for this key/tenant. |
| X-RateLimit-Remaining-Requests | Requests remaining in the current window. |
| X-RateLimit-Limit-Tokens | Effective TPM limit. |
| X-RateLimit-Remaining-Tokens | Tokens remaining in the current window. |
| X-Budget-Period | One of day, month, or total: the binding budget period. |
| X-Budget-Tokens-Remaining | Tokens left in the binding budget period. |
429 Too Many Requests responses additionally carry Retry-After: <seconds>.
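A client can use these headers to throttle itself before it hits a 429. The following is a minimal client-side sketch, assuming the header values are plain integers; summarize_limits and the keys of the returned dict are hypothetical names chosen for illustration.

```python
from typing import Mapping, Optional


def _to_int(value: Optional[str]) -> Optional[int]:
    """Parse an integer header value, returning None when absent or malformed."""
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None


def summarize_limits(status: int, headers: Mapping[str, str]) -> dict:
    """Collect the rate-limit and budget headers of one response into a dict."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    summary = {
        "request_id": h.get("x-request-id"),
        "requests_remaining": _to_int(h.get("x-ratelimit-remaining-requests")),
        "tokens_remaining": _to_int(h.get("x-ratelimit-remaining-tokens")),
        "budget_period": h.get("x-budget-period"),
        "budget_tokens_remaining": _to_int(h.get("x-budget-tokens-remaining")),
        "retry_after_s": None,
    }
    # Retry-After is only documented for 429 responses.
    if status == 429:
        summary["retry_after_s"] = _to_int(h.get("retry-after"))
    return summary
```

A caller would typically sleep for retry_after_s seconds on a 429 and log request_id so the failure can be matched to the audit log.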
Error model
Errors are sanitized at the gateway boundary — Ollama internals are never reflected.
The body is a small generic JSON object and the X-Request-ID header ties it to the audit log.
{ "error": { "message": "forbidden", "type": "forbidden", "code": 403 }, "request_id": "b3f1…" }
| Status | When |
|---|---|
| 400 | Malformed body, schema violation, or num_predict over the cap. |
| 401 | Missing / invalid / expired / revoked key. |
| 403 | Endpoint hard-blocked, or model outside the tenant's effective set (no existence disclosure). |
| 413 | Request body over MAX_REQUEST_BODY_BYTES (default 256 KiB). |
| 429 | Rate limit or budget exceeded (carries Retry-After). |
| 502 | Ollama upstream unreachable / circuit breaker open. |
| 503 | A required subsystem (Postgres read, Redis) is down: fail-closed, never "allow". |
A model that is installed-but-unpermitted and a model that is not installed return the
same generic 403, to prevent enumeration (SPEC §13.6).
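On the client side, the error body and status table above can be folded into one exception type. This is a sketch under assumptions: the GatewayError class and parse_error helper are hypothetical, and treating 429, 502, and 503 as retryable is a client policy choice, not something the gateway prescribes.

```python
import json
from typing import Optional

# Assumption: only throttling and upstream/subsystem outages are worth retrying.
RETRYABLE = {429, 502, 503}


class GatewayError(Exception):
    """Client-side view of a sanitized gateway error response."""

    def __init__(self, status: int, message: str, request_id: Optional[str]):
        super().__init__(f"{status} {message} (request_id={request_id})")
        self.status = status
        self.message = message
        self.request_id = request_id
        self.retryable = status in RETRYABLE


def parse_error(status: int, body: str) -> GatewayError:
    """Build a GatewayError from the status code and the raw JSON body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    error = payload.get("error")
    if not isinstance(error, dict):
        error = {}
    return GatewayError(
        status,
        error.get("message", "unknown error"),
        payload.get("request_id"),
    )
```

Because the body is generic by design, request_id is the only field worth passing on to whoever operates the gateway.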
Native Ollama endpoints (/api/*)
POST /api/chat
Streamed (NDJSON, default) or non-streamed chat completion.
curl -N http://localhost:8080/api/chat \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"llama3.1:8b","stream":true,
"messages":[{"role":"user","content":"Say hello in one sentence."}]}'
Streaming response — Content-Type: application/x-ndjson, one JSON object per line:
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":"Echo:"},"done":false}
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":" Say"},"done":false}
…
{"model":"llama3.1:8b","done":true,"done_reason":"stop",
"prompt_eval_count":6,"eval_count":7,"total_duration":1234567890,"eval_duration":34567890}
The final object carries prompt_eval_count (tokens in) and eval_count (tokens out);
the gateway uses these for precise token accounting (SPEC §4.3 step 12).
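To make the NDJSON framing concrete, here is a small sketch of how a client might consume the stream: it concatenates the message.content fragments and reads the token counts from the final frame. The function name collect_chat_stream is hypothetical, and the input is any iterable of decoded text lines (for example, lines read from an HTTP response).

```python
import json
from typing import Iterable, Tuple


def collect_chat_stream(lines: Iterable[str]) -> Tuple[str, int, int]:
    """Return (text, prompt_tokens, completion_tokens) from /api/chat NDJSON lines."""
    parts = []
    prompt_tokens = completion_tokens = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between frames
        frame = json.loads(line)
        # The final frame in the sample carries no message, hence the defaults.
        parts.append(frame.get("message", {}).get("content", ""))
        if frame.get("done"):
            prompt_tokens = frame.get("prompt_eval_count", 0)
            completion_tokens = frame.get("eval_count", 0)
            break
    return "".join(parts), prompt_tokens, completion_tokens
```

Fed the three sample frames shown above, this yields the text "Echo: Say" with 6 prompt tokens and 7 completion tokens.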
Non-streaming ("stream": false) returns a single JSON object of the same shape with
"done": true.
POST /api/generate
Same semantics as /api/chat but uses a flat prompt string and returns response
fields instead of message objects.
curl -N http://localhost:8080/api/generate \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"llama3.1:8b","stream":true,"prompt":"Write a haiku about routers."}'
POST /api/embed / POST /api/embeddings
Non-streamed embeddings. /api/embed is the newer batch endpoint (field embeddings,
a list of vectors); /api/embeddings is the legacy single-vector endpoint (field
embedding). Ollama returns no eval_count for embeddings; cost is charged on
prompt_eval_count only (SPEC §13.1).
curl http://localhost:8080/api/embed \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"nomic-embed-text","input":["hello","world"]}'
{ "model": "nomic-embed-text", "embeddings": [[0.0, 0.1, …], [0.0, 0.1, …]], "prompt_eval_count": 2 }
GET /api/tags
Returns the tenant's effective model set — the live-discovered set intersected with the
tenant's allowlist, or all discovered models when allow_all_models is on. Sourced from
discovery (SPEC §4.6), never a static list.
curl http://localhost:8080/api/tags -H "Authorization: Bearer nz_demoKEY..."
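The effective-set rule reduces to a set intersection with one override. The sketch below illustrates only that rule; effective_models and its parameters are hypothetical names, and the gateway's real discovery and allowlist storage are not shown.

```python
from typing import Iterable, List


def effective_models(
    discovered: Iterable[str],
    allowlist: Iterable[str],
    allow_all_models: bool = False,
) -> List[str]:
    """Return the models a tenant may see, sorted for stable output."""
    discovered_set = set(discovered)
    if allow_all_models:
        # Opt-in: every live-discovered model is visible.
        return sorted(discovered_set)
    # Default: only models that are both installed and allowlisted.
    return sorted(discovered_set & set(allowlist))
```

Note that an allowlisted model that is not currently discovered never appears, which matches the "never a static list" statement above.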
POST /api/show
Allowed only for models in the effective set; returns sanitized model info. The system prompt and template that Ollama returns are stripped by the gateway.
GET /api/version
Returns the gateway version, not the Ollama version.
{ "version": "0.1.0" }
Hard-blocked endpoints (always 403)
These model-mutating endpoints are blocked at the gateway. Not configurable, not behind a flag (SPEC §6.2, AGENT_PROMPT non-negotiable #5):
/api/pull /api/push /api/create /api/copy /api/delete /api/blobs/*
# Always 403, even with a valid key:
curl -i http://localhost:8080/api/pull \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" -d '{"model":"llama3.1:8b"}'
GET /api/ps is also blocked (it would leak which models are loaded).
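A path check for the blocked set could look like the sketch below. It is an illustration rather than the gateway's actual routing code: is_hard_blocked is a hypothetical name, and stripping the query string and a trailing slash before matching is an assumption about normalisation.

```python
# Paths taken from the list above, plus /api/ps.
BLOCKED_EXACT = {
    "/api/pull",
    "/api/push",
    "/api/create",
    "/api/copy",
    "/api/delete",
    "/api/ps",
}
# /api/blobs/* is blocked as a whole subtree.
BLOCKED_SUBTREES = ("/api/blobs",)


def is_hard_blocked(path: str) -> bool:
    """Return True when the request path must always be answered with 403."""
    # Assumption: ignore the query string and a trailing slash before matching.
    path = path.split("?", 1)[0].rstrip("/")
    if path in BLOCKED_EXACT:
        return True
    return any(path == root or path.startswith(root + "/") for root in BLOCKED_SUBTREES)
```

Matching on exact paths rather than prefixes keeps an unrelated path such as /api/pullover from being caught by accident; since everything not explicitly allowed is blocked anyway, this check only decides which requests get the hard 403.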
OpenAI-compatible endpoints (/v1/*)
| Path | Method | Maps to |
|---|---|---|
| /v1/chat/completions | POST | /api/chat |
| /v1/completions | POST | /api/generate |
| /v1/embeddings | POST | /api/embed |
| /v1/models | GET | /api/tags (effective set, OpenAI list format) |
Streaming uses SSE: data: {…}\n\n events terminated by a literal data: [DONE]\n\n.
POST /v1/chat/completions
curl -N http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"llama3.1:8b","stream":true,
"messages":[{"role":"user","content":"Say hello in one sentence."}]}'
Streaming response — Content-Type: text/event-stream:
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Echo:"},"finish_reason":null}]}
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" Say"},"finish_reason":null}]}
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":6,"completion_tokens":7,"total_tokens":13}}
data: [DONE]
Works with the OpenAI Python SDK by pointing base_url at http://localhost:8080/v1.
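For clients that do not use an SDK, the SSE framing can be parsed by hand. The sketch below assumes the stream has already been split into text lines; collect_sse_chat is a hypothetical helper name. It gathers the delta.content fragments, keeps the last usage object it sees, and stops at the [DONE] sentinel.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_sse_chat(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Return (text, usage) from /v1/chat/completions SSE lines."""
    parts = []
    usage = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank event separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # literal terminator, not JSON
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(parts), usage
```

With the sample events above, the text comes out as "Echo: Say" and usage holds the 6/7/13 token counts from the final chunk.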
GET /v1/models
curl http://localhost:8080/v1/models -H "Authorization: Bearer nz_demoKEY..."
{ "object": "list", "data": [
{ "id": "llama3.1:8b", "object": "model", "owned_by": "neuronetz" },
{ "id": "mistral:7b", "object": "model", "owned_by": "neuronetz" }
] }
Health endpoints
| Path | Method | Auth | Purpose |
|---|---|---|---|
| /healthz | GET | none | Liveness: process responsive (200). |
| /readyz | GET | none | Readiness: DB + Redis + Ollama reachable, else 503. |
| /metrics | GET | none (loopback only) | Prometheus exposition. |
curl -i http://localhost:8080/healthz # 200 {"status":"ok"}
curl -i http://localhost:8080/readyz # 200 when all deps up; 503 otherwise
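A script that starts the stack usually needs to wait for one of these endpoints before sending traffic. The following is a minimal polling sketch using only the standard library; http_status and wait_until_ready are hypothetical helpers, and the timeout and interval defaults are arbitrary choices, not values taken from demo.sh.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status of a GET, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 from /readyz while a dependency is down


def wait_until_ready(
    probe: Callable[[], int],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns 200 or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # connection refused while the stack is still starting
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Typical use would be wait_until_ready(lambda: http_status("http://localhost:8080/readyz")); passing the probe as a callable keeps the loop testable without a running gateway.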
Quick reference: streaming formats
| Surface | Content-Type | Frame | Terminator |
|---|---|---|---|
| Native /api/* | application/x-ndjson | one JSON object per \n | final object has "done": true |
| OpenAI /v1/* | text/event-stream | data: {…}\n\n | data: [DONE]\n\n |