neuronetz-gateway/docs/API.md
Stephan Berbig b47a09db91 demo + playground + docs
One-command demo so the gateway can be exercised end-to-end without a GPU or a
real model download:

- demo/mock-ollama/ — tiny FastAPI service emulating Ollama (/api/tags,
  /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count
  and eval_count on the final frame, /api/embed, /api/show, /api/version).
  Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml — postgres + redis + mock-ollama + gateway, with
  PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground.
  Mirrors the prod posture (mock-ollama not exposed).
- demo.sh — brings the stack up, waits on /healthz, creates a demo tenant with
  allow_all_models and a fresh API key via the bootstrap CLI inside the
  container, then prints the key, the playground URL, and five ready-to-paste
  curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull).
  ./demo.sh --down tears everything back down with volumes.
- playground/index.html — single-file dark-themed UI served same-origin by
  the gateway at /playground (CORS-free). Per-endpoint About card with method/
  auth/streaming badges, a real description, sample request body, sample
  response, and a footer note. Live SSE/NDJSON rendering of the response.
  A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh
  are visibly gated until an API key is in the field; the Base URL is
  force-pinned to location.origin three times to defeat browser autofill.
- docs/ — API.md (full endpoint reference with curl, streaming formats, error
  model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery
  + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule,
  pointing at a real Ollama backend, env reference), THREAT_MODEL.md
  (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md
  (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md.
  mkdocs.yml (Material theme) wires them together.
2026-05-26 20:52:33 +02:00


# neuronetz-gateway — API Reference
The gateway exposes two compatible API surfaces in front of the Ollama backend:
- **Native Ollama** under `/api/*` — NDJSON streaming, identical request shapes to Ollama.
- **OpenAI-compatible** under `/v1/*` — SSE streaming, drop-in for the OpenAI SDKs.

In addition, the gateway serves unauthenticated health endpoints. Everything else is blocked.
> Source of truth: [`scope-docs/SPEC.md`](../scope-docs/SPEC.md) §6. Where this doc and the
> SPEC disagree, the SPEC wins.
---
## Authentication
Every model endpoint requires an API key as a Bearer token:
```
Authorization: Bearer nz_<12-char-prefix><32-char-random>
```
- **Key format:** `nz_` namespace + random base62 body. The first 12 characters
(`nz_` + entropy) are the **prefix**, stored in cleartext and indexed for O(1) lookup.
The full key is **argon2id**-hashed; it is shown **exactly once** at creation
(`neuronetz-gateway create-key`) and never stored or logged.
- **Fail-closed:** a missing, malformed, expired, disabled, or revoked key returns **401**.
No upstream/Ollama detail is ever leaked in the error.
- Health endpoints (`/healthz`, `/readyz`) require **no** auth.

The placeholder key `nz_demoKEY...` is used throughout this doc. `./demo.sh` prints a
**real** key for the local demo.

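
The prefix lookup described above can be sketched roughly as follows. This is an illustration only, not the gateway's code: `split_key` and `auth_header` are hypothetical helpers, and the 12-character prefix length is taken from the key-format bullet (the SPEC remains authoritative on the exact layout).

```python
from typing import Dict, Tuple

PREFIX_LEN = 12  # assumption from the text: first 12 characters, including "nz_"


def split_key(api_key: str) -> Tuple[str, str]:
    """Split a key into (cleartext prefix, remainder); reject obviously malformed keys."""
    body = api_key[3:]
    if not api_key.startswith("nz_") or len(api_key) <= PREFIX_LEN:
        raise ValueError("malformed key")
    # base62 means ASCII letters and digits only
    if not (body.isascii() and body.isalnum()):
        raise ValueError("malformed key")
    return api_key[:PREFIX_LEN], api_key[PREFIX_LEN:]


def auth_header(api_key: str) -> Dict[str, str]:
    """Build the Bearer header a client sends on every model endpoint."""
    return {"Authorization": f"Bearer {api_key}"}
```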
---
## Response headers (SPEC §6.5)
Every proxied response carries:
| Header | Meaning |
|---|---|
| `X-Request-ID` | Correlates the response with the audit log row. Present on errors too. |
| `X-RateLimit-Limit-Requests` | Effective RPM limit for this key/tenant. |
| `X-RateLimit-Remaining-Requests` | Requests remaining in the current window. |
| `X-RateLimit-Limit-Tokens` | Effective TPM limit. |
| `X-RateLimit-Remaining-Tokens` | Tokens remaining in the current window. |
| `X-Budget-Period` | `day` \| `month` \| `total` — the binding budget period. |
| `X-Budget-Tokens-Remaining` | Tokens left in the binding budget period. |

`429 Too Many Requests` responses additionally carry `Retry-After: <seconds>`.

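
A client might collect these headers into a small structure before deciding whether to back off. The sketch below assumes the header names listed above and integer values; `read_limits` is a hypothetical helper, not part of the gateway.

```python
from typing import Dict, Mapping, Optional, Union


def read_limits(headers: Mapping[str, str]) -> Dict[str, Union[str, int, None]]:
    """Collect the documented rate-limit and budget headers into a plain dict."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_int(name: str) -> Optional[int]:
        value = lowered.get(name.lower())
        if value is None or not value.strip().isdigit():
            return None  # header absent or not a plain integer
        return int(value)

    return {
        "request_id": lowered.get("x-request-id"),
        "requests_remaining": as_int("X-RateLimit-Remaining-Requests"),
        "tokens_remaining": as_int("X-RateLimit-Remaining-Tokens"),
        "budget_period": lowered.get("x-budget-period"),
        "budget_tokens_remaining": as_int("X-Budget-Tokens-Remaining"),
        "retry_after_s": as_int("Retry-After"),  # only expected on 429
    }
```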
---
## Error model
Errors are **sanitized** at the gateway boundary — Ollama internals are never reflected.
The body is a small generic JSON object and the `X-Request-ID` header ties it to the audit log.
```json
{ "error": { "message": "forbidden", "type": "forbidden", "code": 403 }, "request_id": "b3f1…" }
```
| Status | When |
|---|---|
| `400` | Malformed body, schema violation, or `num_predict` over the cap. |
| `401` | Missing / invalid / expired / disabled / revoked key. |
| `403` | Endpoint hard-blocked, or model outside the tenant's effective set (no existence disclosure). |
| `413` | Request body over `MAX_REQUEST_BODY_BYTES` (default 256 KiB). |
| `429` | Rate limit or budget exceeded (carries `Retry-After`). |
| `502` | Ollama upstream unreachable / circuit breaker open. |
| `503` | A required subsystem (Postgres read, Redis) is down — **fail-closed**, never "allow". |

A model that is *installed-but-unpermitted* and a model that is *not installed* return the
**same** generic `403`, to prevent enumeration (SPEC §13.6).

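
On the client side, the error body and status table above could be handled along these lines. This is a sketch under stated assumptions: `classify_error` is a hypothetical helper, and treating `429`, `502`, and `503` as retryable is a client policy chosen for illustration, not something the gateway prescribes.

```python
import json
from typing import Any, Dict

# Assumption: only rate-limit and upstream/subsystem failures are worth retrying.
RETRYABLE_STATUSES = frozenset({429, 502, 503})


def classify_error(status: int, body_text: str) -> Dict[str, Any]:
    """Extract the generic error fields and decide whether a retry makes sense."""
    try:
        payload = json.loads(body_text)
    except ValueError:
        payload = {}  # body was not JSON; fall back to the status code alone
    if not isinstance(payload, dict):
        payload = {}
    error = payload.get("error")
    if not isinstance(error, dict):
        error = {}
    return {
        "status": status,
        "type": error.get("type"),
        "message": error.get("message"),
        "request_id": payload.get("request_id"),
        "retryable": status in RETRYABLE_STATUSES,
    }
```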
---
## Native Ollama endpoints (`/api/*`)
### `POST /api/chat`
Streamed (NDJSON, default) or non-streamed chat completion.
```bash
curl -N http://localhost:8080/api/chat \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"llama3.1:8b","stream":true,
"messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```
**Streaming response** (`Content-Type: application/x-ndjson`), one JSON object per line:
```
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":"Echo:"},"done":false}
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":" Say"},"done":false}
{"model":"llama3.1:8b","done":true,"done_reason":"stop","prompt_eval_count":6,"eval_count":7,"total_duration":1234567890,"eval_duration":34567890}
```
The **final** object carries `prompt_eval_count` (tokens in) and `eval_count` (tokens out);
the gateway uses these for precise token accounting (SPEC §4.3 step 12).

**Non-streaming** (`"stream": false`) returns a single JSON object of the same shape with
`"done": true`.
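
A client can consume the NDJSON stream line by line and read the token counts off the final frame. The following is a minimal sketch, assuming the frame shape shown above; `consume_ndjson` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable, Optional, Tuple, Union


def consume_ndjson(
    lines: Iterable[Union[str, bytes]],
) -> Tuple[str, Optional[Dict[str, int]]]:
    """Concatenate streamed message content and return token counts from the final frame."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        frame = json.loads(line)
        message = frame.get("message") or {}
        parts.append(message.get("content", ""))
        if frame.get("done"):
            # Only the final frame carries the token counts.
            usage = {
                "prompt_tokens": frame.get("prompt_eval_count", 0),
                "completion_tokens": frame.get("eval_count", 0),
            }
            break
    return "".join(parts), usage
```

Any iterable of lines works as input, for example a response object from `urllib.request.urlopen`, which yields one `bytes` line per iteration.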
### `POST /api/generate`
Same semantics as `/api/chat` but uses a flat `prompt` string and returns `response`
fields instead of `message` objects.
```bash
curl -N http://localhost:8080/api/generate \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"llama3.1:8b","stream":true,"prompt":"Write a haiku about routers."}'
```
### `POST /api/embed` / `POST /api/embeddings`
Non-streamed embeddings. `/api/embed` is the newer batch endpoint (field `embeddings`,
a list of vectors); `/api/embeddings` is the legacy single-vector endpoint (field
`embedding`). Ollama returns no `eval_count` for embeddings; cost is charged on
`prompt_eval_count` only (SPEC §13.1).
```bash
curl http://localhost:8080/api/embed \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"nomic-embed-text","input":["hello","world"]}'
```
```json
{ "model": "nomic-embed-text", "embeddings": [[0.0, 0.1, …], [0.0, 0.1, …]], "prompt_eval_count": 2 }
```
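
Because the two endpoints return differently named fields, a client that may talk to either one could normalise the result. A small sketch, assuming only the `embeddings` and `embedding` fields described above; `extract_vectors` is a hypothetical helper.

```python
from typing import Any, List, Mapping


def extract_vectors(response: Mapping[str, Any]) -> List[List[float]]:
    """Return a list of vectors for both the batch and the legacy response shape."""
    if "embeddings" in response:  # /api/embed: a list of vectors
        return [list(vector) for vector in response["embeddings"]]
    if "embedding" in response:  # /api/embeddings: a single vector
        return [list(response["embedding"])]
    raise KeyError("response carries neither 'embeddings' nor 'embedding'")
```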
### `GET /api/tags`
Returns the tenant's **effective** model set — the live-discovered set intersected with the
tenant's allowlist, or *all* discovered models when `allow_all_models` is on. Sourced from
discovery (SPEC §4.6), never a static list.
```bash
curl http://localhost:8080/api/tags -H "Authorization: Bearer nz_demoKEY..."
```
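
The effective-set rule can be expressed as a set intersection. The sketch below only illustrates the rule as described here; `effective_models` and its parameters are hypothetical names, and the gateway's real discovery logic is defined in SPEC §4.6.

```python
from typing import Iterable, List


def effective_models(
    discovered: Iterable[str],
    allowlist: Iterable[str],
    allow_all_models: bool = False,
) -> List[str]:
    """Return the models a tenant may see, sorted for stable output."""
    live = set(discovered)
    if allow_all_models:
        return sorted(live)  # opt-in: every discovered model
    # Allowlisted names that are not currently discovered drop out here.
    return sorted(live & set(allowlist))
```

For example, with `llama3.1:8b` and `mistral:7b` discovered and an allowlist of `mistral:7b` plus a model that is not installed, only `mistral:7b` is returned.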
### `POST /api/show`
Allowed only for models in the effective set; returns **sanitized** model info.
The system prompt and template that Ollama returns are **stripped** by the gateway.
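
As a rough illustration of the stripping step, the sketch below removes two top-level fields from a response dict. The field names `system` and `template` are assumptions based on Ollama's `/api/show` response, `sanitize_show` is a hypothetical helper, and other fields that may embed the same text (such as a `modelfile`) are not handled here; the gateway's actual rules live in the SPEC.

```python
from typing import Any, Dict, Mapping

# Assumption: these are the top-level fields carrying the system prompt and template.
STRIPPED_FIELDS = frozenset({"system", "template"})


def sanitize_show(upstream: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of the upstream response without the stripped fields."""
    return {key: value for key, value in upstream.items() if key not in STRIPPED_FIELDS}
```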
### `GET /api/version`
Returns the **gateway** version, not the Ollama version.
```json
{ "version": "0.1.0" }
```
---
## Hard-blocked endpoints (always `403`)
These model-mutating endpoints are blocked at the gateway. **Not configurable, not behind a
flag** (SPEC §6.2, AGENT_PROMPT non-negotiable #5):
```
/api/pull /api/push /api/create /api/copy /api/delete /api/blobs/*
```
```bash
# Always 403, even with a valid key:
curl -i http://localhost:8080/api/pull \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" -d '{"model":"llama3.1:8b"}'
```
`GET /api/ps` is also blocked (it would leak which models are loaded).

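
A path check for the block list could look roughly like this. It is a sketch, not the gateway's router: `is_hard_blocked` is a hypothetical helper, and the handling of query strings and trailing slashes is an assumption made for the example.

```python
# Paths from the list above, plus /api/ps.
BLOCKED_EXACT = frozenset(
    {"/api/pull", "/api/push", "/api/create", "/api/copy", "/api/delete", "/api/ps"}
)
BLOCKED_PREFIX = "/api/blobs/"


def is_hard_blocked(path: str) -> bool:
    """Return True when the request path must be answered with 403."""
    clean = path.split("?", 1)[0]  # ignore any query string
    trimmed = clean.rstrip("/") or "/"  # treat "/api/pull/" like "/api/pull"
    if trimmed in BLOCKED_EXACT:
        return True
    # Covers /api/blobs itself and everything underneath it.
    return trimmed == "/api/blobs" or clean.startswith(BLOCKED_PREFIX)
```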
---
## OpenAI-compatible endpoints (`/v1/*`)
| Path | Method | Maps to |
|---|---|---|
| `/v1/chat/completions` | POST | `/api/chat` |
| `/v1/completions` | POST | `/api/generate` |
| `/v1/embeddings` | POST | `/api/embed` |
| `/v1/models` | GET | `/api/tags` (effective set, OpenAI list format) |

Streaming uses **SSE**: `data: {…}\n\n` events terminated by a literal `data: [DONE]\n\n`.
### `POST /v1/chat/completions`
```bash
curl -N http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer nz_demoKEY..." \
-H "Content-Type: application/json" \
-d '{"model":"llama3.1:8b","stream":true,
"messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```
**Streaming response** (`Content-Type: text/event-stream`):
```
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Echo:"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" Say"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":6,"completion_tokens":7,"total_tokens":13}}

data: [DONE]
```
The endpoint works with the OpenAI Python SDK when `base_url` points at `http://localhost:8080/v1`.
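
For clients that do not use an SDK, the SSE stream can be parsed with the standard library alone. The following is a minimal sketch, assuming the chunk shape shown above and one `data:` line per event; `consume_sse` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Iterable, Optional, Tuple, Union


def consume_sse(
    lines: Iterable[Union[str, bytes]],
) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Concatenate delta content until [DONE]; return the text and the usage object, if any."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and any non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # literal terminator, not JSON
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            parts.append(delta.get("content") or "")
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(parts), usage
```

As with the native surface, any iterable of lines can be passed in, including an open HTTP response object that yields `bytes` lines.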
### `GET /v1/models`
```bash
curl http://localhost:8080/v1/models -H "Authorization: Bearer nz_demoKEY..."
```
```json
{ "object": "list", "data": [
{ "id": "llama3.1:8b", "object": "model", "owned_by": "neuronetz" },
{ "id": "mistral:7b", "object": "model", "owned_by": "neuronetz" }
] }
```
---
## Health endpoints
| Path | Method | Auth | Purpose |
|---|---|---|---|
| `/healthz` | GET | none | Liveness — process responsive (`200`). |
| `/readyz` | GET | none | Readiness — DB + Redis + Ollama reachable, else `503`. |
| `/metrics` | GET | none (loopback only) | Prometheus exposition. |
```bash
curl -i http://localhost:8080/healthz # 200 {"status":"ok"}
curl -i http://localhost:8080/readyz # 200 when all deps up; 503 otherwise
```
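
A script that needs to wait for the stack before sending traffic could poll one of these endpoints. The sketch below is an assumption-laden illustration: `http_status` and `wait_until_ready` are hypothetical helpers, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status for a GET on url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 from /readyz while a dependency is down
    except (urllib.error.URLError, OSError):
        return None


def wait_until_ready(
    probe: Callable[[], Optional[int]],
    attempts: int = 30,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Example (requires a running stack):
# wait_until_ready(lambda: http_status("http://localhost:8080/readyz"))
```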
---
## Quick reference: streaming formats
| Surface | Content-Type | Frame | Terminator |
|---|---|---|---|
| Native `/api/*` | `application/x-ndjson` | one JSON object per `\n` | final object has `"done": true` |
| OpenAI `/v1/*` | `text/event-stream` | `data: {…}\n\n` | `data: [DONE]\n\n` |