# neuronetz-gateway — API Reference

The gateway exposes two compatible API surfaces in front of the Ollama backend:

- **Native Ollama** under `/api/*` — NDJSON streaming, identical request shapes to Ollama.
- **OpenAI-compatible** under `/v1/*` — SSE streaming, drop-in for the OpenAI SDKs.

Plus unauthenticated health endpoints. Everything else is blocked.

> Source of truth: [`scope-docs/SPEC.md`](../scope-docs/SPEC.md) §6. Where this doc and the
> SPEC disagree, the SPEC wins.

---

## Authentication

Every model endpoint requires an API key as a Bearer token:

```
Authorization: Bearer nz_<12-char-prefix><32-char-random>
```

- **Key format:** `nz_` namespace + random base62 body. The first 12 characters (`nz_` + entropy)
  are the **prefix**, stored in cleartext and indexed for O(1) lookup. The full key is
  **argon2id**-hashed; it is shown **exactly once** at creation
  (`neuronetz-gateway create-key`) and never stored or logged.
- **Fail-closed:** a missing, malformed, expired, disabled, or revoked key returns **401**.
  No upstream/Ollama detail is ever leaked in the error.
- Health endpoints (`/healthz`, `/readyz`) require **no** auth.

The placeholder key `nz_demoKEY...` is used throughout this doc. `./demo.sh` prints a **real**
key for the local demo.

---

## Response headers (SPEC §6.5)

Every proxied response carries:

| Header | Meaning |
|---|---|
| `X-Request-ID` | Correlates the response with the audit log row. Present on errors too. |
| `X-RateLimit-Limit-Requests` | Effective RPM limit for this key/tenant. |
| `X-RateLimit-Remaining-Requests` | Requests remaining in the current window. |
| `X-RateLimit-Limit-Tokens` | Effective TPM limit. |
| `X-RateLimit-Remaining-Tokens` | Tokens remaining in the current window. |
| `X-Budget-Period` | `day` \| `month` \| `total` — the binding budget period. |
| `X-Budget-Tokens-Remaining` | Tokens left in the binding budget period. |

`429 Too Many Requests` responses additionally carry a `Retry-After` header.

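As a client-side illustration of the headers above, the following minimal sketch collects the quota-related headers of a response into one record. The names `QuotaSnapshot` and `read_quota` are hypothetical and not part of the gateway. The sketch also assumes that `Retry-After` carries a number of seconds, which this document does not state; check the SPEC before relying on it.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class QuotaSnapshot:
    """Hypothetical client-side view of the gateway's quota headers."""
    request_id: Optional[str]
    requests_remaining: Optional[int]
    tokens_remaining: Optional[int]
    budget_period: Optional[str]
    budget_tokens_remaining: Optional[int]
    retry_after_s: Optional[float]  # only populated for 429 responses


def _to_int(value: Optional[str]) -> Optional[int]:
    """Parse a header value as an int; return None if absent or malformed."""
    try:
        return int(value)  # int(None) raises TypeError, handled below
    except (TypeError, ValueError):
        return None


def read_quota(status: int, headers: Mapping[str, str]) -> QuotaSnapshot:
    """Extract the documented headers from a response's status and headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {name.lower(): value for name, value in headers.items()}

    retry_after = None
    if status == 429:
        # Assumption: Retry-After is expressed in seconds. Any other
        # format (or a missing header) is reported as None.
        try:
            retry_after = float(h.get("retry-after", ""))
        except ValueError:
            retry_after = None

    return QuotaSnapshot(
        request_id=h.get("x-request-id"),
        requests_remaining=_to_int(h.get("x-ratelimit-remaining-requests")),
        tokens_remaining=_to_int(h.get("x-ratelimit-remaining-tokens")),
        budget_period=h.get("x-budget-period"),
        budget_tokens_remaining=_to_int(h.get("x-budget-tokens-remaining")),
        retry_after_s=retry_after,
    )
```

A caller could log `request_id` alongside any error it reports and, on a `429`, sleep for `retry_after_s` (when it is not `None`) before retrying.
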
---

## Error model

Errors are **sanitized** at the gateway boundary — Ollama internals are never reflected. The
body is a small generic JSON object and the `X-Request-ID` header ties it to the audit log.

```json
{
  "error": { "message": "forbidden", "type": "forbidden", "code": 403 },
  "request_id": "b3f1…"
}
```

| Status | When |
|---|---|
| `400` | Malformed body, schema violation, or `num_predict` over the cap. |
| `401` | Missing / invalid / expired / revoked key. |
| `403` | Endpoint hard-blocked, or model outside the tenant's effective set (no existence disclosure). |
| `413` | Request body over `MAX_REQUEST_BODY_BYTES` (default 256 KiB). |
| `429` | Rate limit or budget exceeded (carries `Retry-After`). |
| `502` | Ollama upstream unreachable / circuit breaker open. |
| `503` | A required subsystem (Postgres read, Redis) is down — **fail-closed**, never "allow". |

A model that is *installed-but-unpermitted* and a model that is *not installed* return the
**same** generic `403`, to prevent enumeration (SPEC §13.6).

---

## Native Ollama endpoints (`/api/*`)

### `POST /api/chat`

Streamed (NDJSON, default) or non-streamed chat completion.

```bash
curl -N http://localhost:8080/api/chat \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,
       "messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```

**Streaming response** — `Content-Type: application/x-ndjson`, one JSON object per line:

```
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":"Echo:"},"done":false}
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":" Say"},"done":false}
…
{"model":"llama3.1:8b","done":true,"done_reason":"stop","prompt_eval_count":6,"eval_count":7,"total_duration":1234567890,"eval_duration":34567890}
```

The **final** object carries `prompt_eval_count` (tokens in) and `eval_count` (tokens out); the
gateway uses these for precise token accounting (SPEC §4.3 step 12).

**Non-streaming** (`"stream": false`) returns a single JSON object of the same shape with
`"done": true`.

### `POST /api/generate`

Same semantics as `/api/chat` but uses a flat `prompt` string and returns `response` fields
instead of `message` objects.

```bash
curl -N http://localhost:8080/api/generate \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,"prompt":"Write a haiku about routers."}'
```

### `POST /api/embed` / `POST /api/embeddings`

Non-streamed embeddings. `/api/embed` is the newer batch endpoint (field `embeddings`, a list
of vectors); `/api/embeddings` is the legacy single-vector endpoint (field `embedding`).
Ollama returns no `eval_count` for embeddings; cost is charged on `prompt_eval_count` only
(SPEC §13.1).

```bash
curl http://localhost:8080/api/embed \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed-text","input":["hello","world"]}'
```

```json
{
  "model": "nomic-embed-text",
  "embeddings": [[0.0, 0.1, …], [0.0, 0.1, …]],
  "prompt_eval_count": 2
}
```

### `GET /api/tags`

Returns the tenant's **effective** model set — the live-discovered set intersected with the
tenant's allowlist, or *all* discovered models when `allow_all_models` is on. Sourced from
discovery (SPEC §4.6), never a static list.

```bash
curl http://localhost:8080/api/tags -H "Authorization: Bearer nz_demoKEY..."
```

### `POST /api/show`

Allowed only for models in the effective set; returns **sanitized** model info. The system
prompt and template that Ollama returns are **stripped** by the gateway.

### `GET /api/version`

Returns the **gateway** version, not the Ollama version.

```json
{ "version": "0.1.0" }
```

---

## Hard-blocked endpoints (always `403`)

These model-mutating endpoints are blocked at the gateway. **Not configurable, not behind a
flag** (SPEC §6.2, AGENT_PROMPT non-negotiable #5):

```
/api/pull
/api/push
/api/create
/api/copy
/api/delete
/api/blobs/*
```

```bash
# Always 403, even with a valid key:
curl -i http://localhost:8080/api/pull \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" -d '{"model":"llama3.1:8b"}'
```

`GET /api/ps` is also blocked (it would leak which models are loaded).

---

## OpenAI-compatible endpoints (`/v1/*`)

| Path | Method | Maps to |
|---|---|---|
| `/v1/chat/completions` | POST | `/api/chat` |
| `/v1/completions` | POST | `/api/generate` |
| `/v1/embeddings` | POST | `/api/embed` |
| `/v1/models` | GET | `/api/tags` (effective set, OpenAI list format) |

Streaming uses **SSE**: `data: {…}\n\n` events terminated by a literal `data: [DONE]\n\n`.

### `POST /v1/chat/completions`

```bash
curl -N http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,
       "messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```

**Streaming response** — `Content-Type: text/event-stream`:

```
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Echo:"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" Say"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":6,"completion_tokens":7,"total_tokens":13}}

data: [DONE]
```

Works with the OpenAI Python SDK by pointing `base_url` at `http://localhost:8080/v1`.

### `GET /v1/models`

```bash
curl http://localhost:8080/v1/models -H "Authorization: Bearer nz_demoKEY..."
```

```json
{
  "object": "list",
  "data": [
    { "id": "llama3.1:8b", "object": "model", "owned_by": "neuronetz" },
    { "id": "mistral:7b", "object": "model", "owned_by": "neuronetz" }
  ]
}
```

---

## Health endpoints

| Path | Method | Auth | Purpose |
|---|---|---|---|
| `/healthz` | GET | none | Liveness — process responsive (`200`). |
| `/readyz` | GET | none | Readiness — DB + Redis + Ollama reachable, else `503`. |
| `/metrics` | GET | none (loopback only) | Prometheus exposition. |

```bash
curl -i http://localhost:8080/healthz   # 200 {"status":"ok"}
curl -i http://localhost:8080/readyz    # 200 when all deps up; 503 otherwise
```

---

## Quick reference: streaming formats

| Surface | Content-Type | Frame | Terminator |
|---|---|---|---|
| Native `/api/*` | `application/x-ndjson` | one JSON object per `\n` | final object has `"done": true` |
| OpenAI `/v1/*` | `text/event-stream` | `data: {…}\n\n` | `data: [DONE]\n\n` |

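
To make the two framings in the table concrete, here is a minimal client-side sketch that decodes each one from an iterable of text lines. The function names `iter_ndjson` and `iter_sse` are hypothetical, not part of the gateway or any SDK. The SSE half assumes every event carries a single `data:` line, as in the examples in this document; a general SSE client would also need to handle multi-line `data` fields.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield objects from a native /api/* stream, stopping after "done": true."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate stray blank lines
        obj = json.loads(line)
        yield obj
        if obj.get("done") is True:
            return  # the final object is the terminator


def iter_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield JSON payloads from a /v1/* SSE stream, stopping at data: [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skips blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # literal terminator, not JSON
        yield json.loads(payload)


if __name__ == "__main__":
    native = [
        '{"message":{"role":"assistant","content":"Echo:"},"done":false}',
        '{"done":true,"prompt_eval_count":6,"eval_count":7}',
    ]
    for obj in iter_ndjson(native):
        print(obj)

    openai_style = [
        'data: {"choices":[{"index":0,"delta":{"content":"Echo:"}}]}',
        "",
        "data: [DONE]",
    ]
    for chunk in iter_sse(openai_style):
        print(chunk)
```

Both helpers accept any iterable of decoded lines, so they can be fed from a file, a list in a test, or the line iterator of whichever HTTP client is in use.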