One-command demo so the gateway can be exercised end-to-end without a GPU or a real model download:

- demo/mock-ollama/ — tiny FastAPI service emulating Ollama (/api/tags, /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count and eval_count on the final frame, /api/embed, /api/show, /api/version). Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml — postgres + redis + mock-ollama + gateway, with PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground. Mirrors the prod posture (mock-ollama not exposed).
- demo.sh — brings the stack up, waits on /healthz, creates a demo tenant with allow_all_models and a fresh API key via the bootstrap CLI inside the container, then prints the key, the playground URL, and five ready-to-paste curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull). ./demo.sh --down tears everything back down with volumes.
- playground/index.html — single-file dark-themed UI served same-origin by the gateway at /playground (CORS-free). Per-endpoint About card with method/auth/streaming badges, a real description, sample request body, sample response, and a footer note. Live SSE/NDJSON rendering of the response. A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh are visibly gated until an API key is in the field; the Base URL is force-pinned to location.origin three times to defeat browser autofill.
- docs/ — API.md (full endpoint reference with curl, streaming formats, error model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule, pointing at a real Ollama backend, env reference), THREAT_MODEL.md (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md. mkdocs.yml (Material theme) wires them together.

# neuronetz-gateway — API Reference

The gateway exposes two compatible API surfaces in front of the Ollama backend:

- **Native Ollama** under `/api/*` — NDJSON streaming, identical request shapes to Ollama.
- **OpenAI-compatible** under `/v1/*` — SSE streaming, drop-in for the OpenAI SDKs.

Plus unauthenticated health endpoints. Everything else is blocked.

> Source of truth: [`scope-docs/SPEC.md`](../scope-docs/SPEC.md) §6. Where this doc and the
> SPEC disagree, the SPEC wins.

---

## Authentication

Every model endpoint requires an API key as a Bearer token:

```
Authorization: Bearer nz_<12-char-prefix><32-char-random>
```

- **Key format:** `nz_` namespace + random base62 body. The first 12 characters
  (`nz_` + entropy) are the **prefix**, stored in cleartext and indexed for O(1) lookup.
  Only the **argon2id** hash of the full key is stored; the plaintext key is shown
  **exactly once** at creation (`neuronetz-gateway create-key`) and is never stored or logged.
- **Fail-closed:** a missing, malformed, expired, disabled, or revoked key returns **401**.
  No upstream/Ollama detail is ever leaked in the error.
- Health endpoints (`/healthz`, `/readyz`) require **no** auth.

The placeholder key `nz_demoKEY...` is used throughout this doc. `./demo.sh` prints a
**real** key for the local demo.

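A minimal client-side sketch of the key layout described above, assuming the prefix is literally the first 12 characters of the key (including `nz_`), as the bullet states. `split_key` and `auth_headers` are hypothetical helper names used for illustration only; the gateway's own lookup and argon2id verification are not shown here.

```python
KEY_NAMESPACE = "nz_"
PREFIX_LEN = 12  # assumption taken from the text: first 12 characters, "nz_" included


def split_key(api_key: str) -> tuple[str, str]:
    """Return (prefix, remainder); raise ValueError on an obviously malformed key."""
    if not api_key.startswith(KEY_NAMESPACE) or len(api_key) <= PREFIX_LEN:
        raise ValueError("malformed key")
    return api_key[:PREFIX_LEN], api_key[PREFIX_LEN:]


def auth_headers(api_key: str) -> dict[str, str]:
    """Build the Authorization header in the Bearer form shown above."""
    return {"Authorization": f"Bearer {api_key}"}
```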

---

## Response headers (SPEC §6.5)

Every proxied response carries:

| Header | Meaning |
|---|---|
| `X-Request-ID` | Correlates the response with the audit log row. Present on errors too. |
| `X-RateLimit-Limit-Requests` | Effective RPM limit for this key/tenant. |
| `X-RateLimit-Remaining-Requests` | Requests remaining in the current window. |
| `X-RateLimit-Limit-Tokens` | Effective TPM limit. |
| `X-RateLimit-Remaining-Tokens` | Tokens remaining in the current window. |
| `X-Budget-Period` | `day` \| `month` \| `total` — the binding budget period. |
| `X-Budget-Tokens-Remaining` | Tokens left in the binding budget period. |

`429 Too Many Requests` responses additionally carry `Retry-After: <seconds>`.

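One way a client might read these headers is sketched below. The `Quota` dataclass and `read_quota` function are hypothetical names, and the sketch assumes every numeric header holds a plain integer (including `Retry-After`, which the line above gives in seconds). Missing or non-numeric values become `None` rather than raising.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Quota:
    request_id: Optional[str]
    requests_remaining: Optional[int]
    tokens_remaining: Optional[int]
    budget_period: Optional[str]
    budget_tokens_remaining: Optional[int]
    retry_after_s: Optional[int]


def _as_int(value: Optional[str]) -> Optional[int]:
    """Parse a header value as an int; None if absent or not numeric."""
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None


def read_quota(headers: Mapping[str, str]) -> Quota:
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    return Quota(
        request_id=h.get("x-request-id"),
        requests_remaining=_as_int(h.get("x-ratelimit-remaining-requests")),
        tokens_remaining=_as_int(h.get("x-ratelimit-remaining-tokens")),
        budget_period=h.get("x-budget-period"),
        budget_tokens_remaining=_as_int(h.get("x-budget-tokens-remaining")),
        retry_after_s=_as_int(h.get("retry-after")),
    )
```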

---

## Error model

Errors are **sanitized** at the gateway boundary — Ollama internals are never reflected.
The body is a small generic JSON object and the `X-Request-ID` header ties it to the audit log.

```json
{ "error": { "message": "forbidden", "type": "forbidden", "code": 403 }, "request_id": "b3f1…" }
```

| Status | When |
|---|---|
| `400` | Malformed body, schema violation, or `num_predict` over the cap. |
| `401` | Missing / malformed / invalid / expired / disabled / revoked key. |
| `403` | Endpoint hard-blocked, or model outside the tenant's effective set (no existence disclosure). |
| `413` | Request body over `MAX_REQUEST_BODY_BYTES` (default 256 KiB). |
| `429` | Rate limit or budget exceeded (carries `Retry-After`). |
| `502` | Ollama upstream unreachable / circuit breaker open. |
| `503` | A required subsystem (Postgres read, Redis) is down — **fail-closed**, never "allow". |

A model that is *installed-but-unpermitted* and a model that is *not installed* return the
**same** generic `403`, to prevent enumeration (SPEC §13.6).

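A client can fold the error body and the status table into one small record, roughly as sketched here. `parse_error` is a hypothetical helper, and treating `429`, `502`, and `503` as retryable is a client-side assumption, not something the SPEC is quoted as requiring. The function tolerates a non-JSON body and falls back to the `X-Request-ID` header for correlation.

```python
import json
from typing import Mapping

# Assumption: these statuses describe transient conditions worth retrying.
RETRYABLE_STATUSES = frozenset({429, 502, 503})


def parse_error(status: int, body: str, headers: Mapping[str, str]) -> dict:
    """Summarize a gateway error response without trusting the body's shape."""
    try:
        payload = json.loads(body)
    except ValueError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    error = payload.get("error")
    if not isinstance(error, dict):
        error = {}
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "status": status,
        "type": error.get("type"),
        "message": error.get("message"),
        "request_id": payload.get("request_id") or lowered.get("x-request-id"),
        "retryable": status in RETRYABLE_STATUSES,
    }
```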

---

## Native Ollama endpoints (`/api/*`)

### `POST /api/chat`

Streamed (NDJSON, default) or non-streamed chat completion.

```bash
curl -N http://localhost:8080/api/chat \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,
       "messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```

**Streaming response** — `Content-Type: application/x-ndjson`, one JSON object per line:

```
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":"Echo:"},"done":false}
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":" Say"},"done":false}
…
{"model":"llama3.1:8b","done":true,"done_reason":"stop",
 "prompt_eval_count":6,"eval_count":7,"total_duration":1234567890,"eval_duration":34567890}
```

The **final** object carries `prompt_eval_count` (tokens in) and `eval_count` (tokens out);
the gateway uses these for precise token accounting (SPEC §4.3 step 12).

**Non-streaming** (`"stream": false`) returns a single JSON object of the same shape with
`"done": true`.

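The NDJSON stream above can be consumed line by line. The sketch below assumes each frame arrives as one complete line of text (for example from an HTTP client's line iterator) and that text lives in `message.content`, as in the sample frames. `consume_ndjson` is a hypothetical name.

```python
import json
from typing import Iterable


def consume_ndjson(lines: Iterable[str]) -> tuple[str, dict]:
    """Concatenate streamed content and return it with the final frame's token counts."""
    parts: list[str] = []
    usage: dict = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        frame = json.loads(line)
        message = frame.get("message") or {}
        parts.append(message.get("content", ""))
        if frame.get("done"):
            # Only the final frame carries the token counters.
            usage = {
                "prompt_tokens": frame.get("prompt_eval_count", 0),
                "completion_tokens": frame.get("eval_count", 0),
            }
            break
    return "".join(parts), usage
```

If the stream ends without a `"done": true` frame, `usage` stays empty, which a caller can treat as a truncated response.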
### `POST /api/generate`

Same semantics as `/api/chat` but uses a flat `prompt` string and returns `response`
fields instead of `message` objects.

```bash
curl -N http://localhost:8080/api/generate \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,"prompt":"Write a haiku about routers."}'
```

### `POST /api/embed` / `POST /api/embeddings`

Non-streamed embeddings. `/api/embed` is the newer batch endpoint (field `embeddings`,
a list of vectors); `/api/embeddings` is the legacy single-vector endpoint (field
`embedding`). Ollama returns no `eval_count` for embeddings; cost is charged on
`prompt_eval_count` only (SPEC §13.1).

```bash
curl http://localhost:8080/api/embed \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed-text","input":["hello","world"]}'
```

```json
{ "model": "nomic-embed-text", "embeddings": [[0.0, 0.1, …], [0.0, 0.1, …]], "prompt_eval_count": 2 }
```

### `GET /api/tags`

Returns the tenant's **effective** model set — the live-discovered set intersected with the
tenant's allowlist, or *all* discovered models when `allow_all_models` is on. Sourced from
discovery (SPEC §4.6), never a static list.

```bash
curl http://localhost:8080/api/tags -H "Authorization: Bearer nz_demoKEY..."
```

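The effective-set rule reads naturally as a set intersection. This is a sketch of the rule as stated above, not the gateway's code; `effective_models` and its parameter names are hypothetical, and sorting is added only to make the output stable.

```python
from typing import Iterable


def effective_models(
    discovered: Iterable[str],
    allowlist: Iterable[str],
    allow_all_models: bool = False,
) -> list[str]:
    """Models a tenant may see: discovered ∩ allowlist, or everything discovered."""
    live = set(discovered)
    if allow_all_models:
        return sorted(live)
    # A model on the allowlist that is not currently discovered is not offered.
    return sorted(live & set(allowlist))
```

With `discovered = ["llama3.1:8b", "mistral:7b"]` and an allowlist of `["mistral:7b", "phi3"]`, only `mistral:7b` is returned, because `phi3` was never discovered.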
### `POST /api/show`

Allowed only for models in the effective set; returns **sanitized** model info.
The system prompt and template that Ollama returns are **stripped** by the gateway.

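As an illustration of the stripping step, the sketch below drops top-level keys from the upstream payload. The key names `system` and `template` are assumptions about where Ollama places that text; which fields the gateway actually removes is defined by the SPEC, not by this example. `sanitize_show` is a hypothetical name.

```python
# Assumed key names for the system prompt and the prompt template.
STRIPPED_FIELDS = frozenset({"system", "template"})


def sanitize_show(payload: dict) -> dict:
    """Return a copy of the upstream /api/show payload without the stripped fields."""
    return {key: value for key, value in payload.items() if key not in STRIPPED_FIELDS}
```

The function returns a new dict and leaves the upstream payload untouched, so the unsanitized original never has to be mutated in place.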
### `GET /api/version`

Returns the **gateway** version, not the Ollama version.

```json
{ "version": "0.1.0" }
```


---

## Hard-blocked endpoints (always `403`)

These model-mutating endpoints are blocked at the gateway. **Not configurable, not behind a
flag** (SPEC §6.2, AGENT_PROMPT non-negotiable #5):

```
/api/pull   /api/push   /api/create   /api/copy   /api/delete   /api/blobs/*
```

```bash
# Always 403, even with a valid key:
curl -i http://localhost:8080/api/pull \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" -d '{"model":"llama3.1:8b"}'
```

`GET /api/ps` is also blocked (it would leak which models are loaded).

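A path check for the list above might look like the following sketch. It is illustrative only: `is_hard_blocked` is a hypothetical name, blocking bare `/api/blobs` alongside `/api/blobs/*` is an assumption, and since everything outside the documented surfaces is blocked anyway, a real gateway would more likely route by allowlist than by this denylist.

```python
BLOCKED_EXACT = frozenset({
    "/api/pull", "/api/push", "/api/create", "/api/copy", "/api/delete",
    "/api/ps",  # read-only, but would leak which models are loaded
})
BLOCKED_PREFIX = "/api/blobs/"


def is_hard_blocked(path: str) -> bool:
    """True if the request path is one of the always-403 endpoints."""
    # Drop any query string and trailing slashes before comparing.
    normalized = path.split("?", 1)[0].rstrip("/") or "/"
    if normalized in BLOCKED_EXACT:
        return True
    return normalized == "/api/blobs" or normalized.startswith(BLOCKED_PREFIX)
```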

---

## OpenAI-compatible endpoints (`/v1/*`)

| Path | Method | Maps to |
|---|---|---|
| `/v1/chat/completions` | POST | `/api/chat` |
| `/v1/completions` | POST | `/api/generate` |
| `/v1/embeddings` | POST | `/api/embed` |
| `/v1/models` | GET | `/api/tags` (effective set, OpenAI list format) |

Streaming uses **SSE**: `data: {…}\n\n` events terminated by a literal `data: [DONE]\n\n`.

### `POST /v1/chat/completions`

```bash
curl -N http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,
       "messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```

**Streaming response** — `Content-Type: text/event-stream`:

```
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Echo:"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" Say"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":6,"completion_tokens":7,"total_tokens":13}}

data: [DONE]
```

Works with the OpenAI Python SDK by pointing `base_url` at `http://localhost:8080/v1`.

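For clients that do not use an SDK, the SSE stream above can be parsed with a few lines. The sketch assumes one `data:` line per event, as in the sample, and does not handle multi-line `data:` fields from the general SSE format. `iter_sse_chunks` and `collect_chat` are hypothetical names.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each data event until [DONE]."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_chat(lines: Iterable[str]) -> tuple[str, Optional[dict]]:
    """Join the streamed deltas and return them with the usage block, if any."""
    parts: list[str] = []
    usage: Optional[dict] = None
    for chunk in iter_sse_chunks(lines):
        for choice in chunk.get("choices", []):
            parts.append((choice.get("delta") or {}).get("content") or "")
        usage = chunk.get("usage", usage)
    return "".join(parts), usage
```

Anything after `data: [DONE]` is ignored, which matches the terminator semantics described above.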
### `GET /v1/models`

```bash
curl http://localhost:8080/v1/models -H "Authorization: Bearer nz_demoKEY..."
```

```json
{ "object": "list", "data": [
    { "id": "llama3.1:8b", "object": "model", "owned_by": "neuronetz" },
    { "id": "mistral:7b",  "object": "model", "owned_by": "neuronetz" }
] }
```

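The mapping from the native listing to the OpenAI list format can be sketched as below. It assumes the native payload has the Ollama shape `{"models": [{"name": ...}, ...]}` and reproduces only the three fields shown in the sample response. `to_openai_models` is a hypothetical name.

```python
def to_openai_models(tags_response: dict, owned_by: str = "neuronetz") -> dict:
    """Convert an /api/tags-style payload into the /v1/models list format."""
    return {
        "object": "list",
        "data": [
            {"id": model["name"], "object": "model", "owned_by": owned_by}
            for model in tags_response.get("models", [])
        ],
    }
```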

---

## Health endpoints

| Path | Method | Auth | Purpose |
|---|---|---|---|
| `/healthz` | GET | none | Liveness — process responsive (`200`). |
| `/readyz` | GET | none | Readiness — DB + Redis + Ollama reachable, else `503`. |
| `/metrics` | GET | none (loopback only) | Prometheus exposition. |

```bash
curl -i http://localhost:8080/healthz   # 200 {"status":"ok"}
curl -i http://localhost:8080/readyz    # 200 when all deps up; 503 otherwise
```

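Scripts that wait for the stack to come up can poll `/readyz` until it returns `200`. The sketch below uses only the standard library; `http_status` and `wait_until_ready` are hypothetical names, and the timeout and interval defaults are arbitrary choices. The probe is passed in as a callable so the loop can be exercised without a running gateway, for example `wait_until_ready(lambda: http_status("http://localhost:8080/readyz"))`.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout_s: float = 2.0) -> int:
    """Return the HTTP status of a GET, including error statuses such as 503."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code


def wait_until_ready(
    probe: Callable[[], int],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it returns 200 or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # connection refused or similar: the process is not up yet
        if clock() >= deadline:
            return False
        sleep(interval_s)
```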

---

## Quick reference: streaming formats

| Surface | Content-Type | Frame | Terminator |
|---|---|---|---|
| Native `/api/*` | `application/x-ndjson` | one JSON object per `\n` | final object has `"done": true` |
| OpenAI `/v1/*` | `text/event-stream` | `data: {…}\n\n` | `data: [DONE]\n\n` |