# neuronetz-gateway — API Reference

The gateway exposes two compatible API surfaces in front of the Ollama backend:

- **Native Ollama** under `/api/*` — NDJSON streaming, identical request shapes to Ollama.
- **OpenAI-compatible** under `/v1/*` — SSE streaming, drop-in for the OpenAI SDKs.

It also exposes unauthenticated health endpoints. Every other path is blocked.

> Source of truth: [`scope-docs/SPEC.md`](../scope-docs/SPEC.md) §6. Where this doc and the
> SPEC disagree, the SPEC wins.

---

## Authentication

Every model endpoint requires an API key as a Bearer token:

```
Authorization: Bearer nz_<12-char-prefix><32-char-random>
```

- **Key format:** `nz_` namespace + random base62 body. The first 12 characters
  (`nz_` + entropy) are the **prefix**, stored in cleartext and indexed for O(1) lookup.
  The full key is **argon2id**-hashed; it is shown **exactly once** at creation
  (`neuronetz-gateway create-key`) and never stored or logged.
- **Fail-closed:** a missing, malformed, expired, disabled, or revoked key returns **401**.
  No upstream/Ollama detail is ever leaked in the error.
- Health endpoints (`/healthz`, `/readyz`) require **no** auth.

The placeholder key `nz_demoKEY...` is used throughout this doc. `./demo.sh` prints a
**real** key for the local demo.
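As a rough illustration of the key format above, the following Python sketch splits a key into its lookup prefix and builds the Bearer header. It assumes the prefix is the first 12 characters of the whole key (including `nz_`), as the key-format bullet states. The names `split_key` and `auth_header` are hypothetical and not part of the gateway.

```python
import string

BASE62 = set(string.ascii_letters + string.digits)
PREFIX_LEN = 12  # assumption: "nz_" plus the first 9 body characters


def split_key(api_key: str) -> tuple[str, str]:
    """Return (lookup_prefix, full_key), or raise ValueError if malformed."""
    if not api_key.startswith("nz_"):
        raise ValueError("missing nz_ namespace")
    body = api_key[3:]
    # The key must be longer than its prefix and the body must be base62.
    if len(api_key) <= PREFIX_LEN or not set(body) <= BASE62:
        raise ValueError("malformed key body")
    return api_key[:PREFIX_LEN], api_key


def auth_header(api_key: str) -> dict[str, str]:
    """Build the header a client sends with every model request."""
    return {"Authorization": f"Bearer {api_key}"}
```

On the server side only the prefix would be used to find the candidate row; the full key would then be verified against the stored argon2id hash, which this sketch does not attempt.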

---

## Response headers (SPEC §6.5)

Every proxied response carries:

| Header | Meaning |
|---|---|
| `X-Request-ID` | Correlates the response with the audit log row. Present on errors too. |
| `X-RateLimit-Limit-Requests` | Effective RPM limit for this key/tenant. |
| `X-RateLimit-Remaining-Requests` | Requests remaining in the current window. |
| `X-RateLimit-Limit-Tokens` | Effective TPM limit. |
| `X-RateLimit-Remaining-Tokens` | Tokens remaining in the current window. |
| `X-Budget-Period` | `day` \| `month` \| `total` — the binding budget period. |
| `X-Budget-Tokens-Remaining` | Tokens left in the binding budget period. |

`429 Too Many Requests` responses additionally carry `Retry-After: <seconds>`.
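
A client can read these headers to pace itself. The sketch below is one possible way to collect them into a small record; the `Quota` dataclass and `read_quota` helper are hypothetical names, and the header values are assumed to be plain decimal integers.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Quota:
    requests_remaining: Optional[int]
    tokens_remaining: Optional[int]
    budget_period: Optional[str]
    budget_tokens_remaining: Optional[int]
    retry_after_s: Optional[int]  # only expected on 429 responses


def _int(headers: Mapping[str, str], name: str) -> Optional[int]:
    """Parse a non-negative integer header, or return None if absent/invalid."""
    value = headers.get(name)
    if value is not None and value.strip().isdigit():
        return int(value)
    return None


def read_quota(headers: Mapping[str, str]) -> Quota:
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    return Quota(
        requests_remaining=_int(h, "x-ratelimit-remaining-requests"),
        tokens_remaining=_int(h, "x-ratelimit-remaining-tokens"),
        budget_period=h.get("x-budget-period"),
        budget_tokens_remaining=_int(h, "x-budget-tokens-remaining"),
        retry_after_s=_int(h, "retry-after"),
    )
```

On a `429`, a caller might sleep for `retry_after_s` seconds before retrying.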

---

## Error model

Errors are **sanitized** at the gateway boundary — Ollama internals are never reflected.
The body is a small generic JSON object and the `X-Request-ID` header ties it to the audit log.

```json
{ "error": { "message": "forbidden", "type": "forbidden", "code": 403 }, "request_id": "b3f1…" }
```

| Status | When |
|---|---|
| `400` | Malformed body, schema violation, or `num_predict` over the cap. |
| `401` | Missing, malformed, expired, disabled, or revoked key. |
| `403` | Endpoint hard-blocked, or model outside the tenant's effective set (no existence disclosure). |
| `413` | Request body over `MAX_REQUEST_BODY_BYTES` (default 256 KiB). |
| `429` | Rate limit or budget exceeded (carries `Retry-After`). |
| `502` | Ollama upstream unreachable / circuit breaker open. |
| `503` | A required subsystem (Postgres read, Redis) is down — **fail-closed**, never "allow". |

A model that is *installed-but-unpermitted* and a model that is *not installed* return the
**same** generic `403`, to prevent enumeration (SPEC §13.6).
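
The sketch below shows one way a client might turn an error response into a loggable record, based on the body shape and status table above. `describe_error` is a hypothetical helper, and treating `429`, `502`, and `503` as retryable is an assumed client policy, not something the gateway mandates.

```python
import json

# Assumption: a client-side retry policy, not gateway behaviour.
RETRYABLE = {429, 502, 503}


def describe_error(status: int, body: str, headers: dict) -> dict:
    """Summarise a sanitized gateway error for logging or retry decisions."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    err = payload.get("error")
    if not isinstance(err, dict):
        err = {}
    return {
        "status": status,
        "type": err.get("type"),
        "message": err.get("message"),
        # Prefer the header; fall back to the body field if it is present.
        "request_id": headers.get("X-Request-ID") or payload.get("request_id"),
        "retryable": status in RETRYABLE,
    }
```

Logging the `request_id` makes it possible to ask an operator to look up the matching audit log row.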

---

## Native Ollama endpoints (`/api/*`)

### `POST /api/chat`

Streamed (NDJSON, default) or non-streamed chat completion.

```bash
curl -N http://localhost:8080/api/chat \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,
       "messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```

**Streaming response** — `Content-Type: application/x-ndjson`, one JSON object per line:

```
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":"Echo:"},"done":false}
{"model":"llama3.1:8b","created_at":"…","message":{"role":"assistant","content":" Say"},"done":false}
…
{"model":"llama3.1:8b","done":true,"done_reason":"stop",
 "prompt_eval_count":6,"eval_count":7,"total_duration":1234567890,"eval_duration":34567890}
```

The **final** object carries `prompt_eval_count` (tokens in) and `eval_count` (tokens out);
the gateway uses these for precise token accounting (SPEC §4.3 step 12).
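
A minimal client-side sketch of consuming this stream follows: it concatenates the `content` fragments and reads the two counters from the final object. `consume_chat_stream` is a hypothetical name, and the input is assumed to be an iterable of already-decoded text lines.

```python
import json
from typing import Iterable


def consume_chat_stream(lines: Iterable[str]) -> dict:
    """Accumulate streamed chat text and pick up token counts from the last object."""
    parts = []
    tokens_in = tokens_out = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            tokens_in = chunk.get("prompt_eval_count")
            tokens_out = chunk.get("eval_count")
            break
    return {"text": "".join(parts), "tokens_in": tokens_in, "tokens_out": tokens_out}
```

With the `requests` library, the lines could come from `response.iter_lines(decode_unicode=True)` on a request made with `stream=True`.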

**Non-streaming** (`"stream": false`) returns a single JSON object of the same shape with
`"done": true`.

### `POST /api/generate`

Same semantics as `/api/chat`, but the request takes a flat `prompt` string and each
response object carries a `response` field instead of a `message` object.

```bash
curl -N http://localhost:8080/api/generate \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,"prompt":"Write a haiku about routers."}'
```

### `POST /api/embed` / `POST /api/embeddings`

Non-streamed embeddings. `/api/embed` is the newer batch endpoint (field `embeddings`,
a list of vectors); `/api/embeddings` is the legacy single-vector endpoint (field
`embedding`). Ollama returns no `eval_count` for embeddings; cost is charged on
`prompt_eval_count` only (SPEC §13.1).

```bash
curl http://localhost:8080/api/embed \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed-text","input":["hello","world"]}'
```

```json
{ "model": "nomic-embed-text", "embeddings": [[0.0, 0.1, …], [0.0, 0.1, …]], "prompt_eval_count": 2 }
```
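
Because the two endpoints name their result field differently, a client that supports both may want to normalise the response. The sketch below is one way to do that; `vectors_from` and `billable_tokens` are hypothetical helpers, and defaulting a missing `prompt_eval_count` to zero is an assumption.

```python
def vectors_from(response: dict) -> list[list[float]]:
    """Return a list of vectors from either embeddings response shape."""
    if "embeddings" in response:  # /api/embed: batch of vectors
        return response["embeddings"]
    if "embedding" in response:  # /api/embeddings: single legacy vector
        return [response["embedding"]]
    raise KeyError("no embedding field in response")


def billable_tokens(response: dict) -> int:
    """Embeddings have no eval_count, so only input tokens are counted."""
    return int(response.get("prompt_eval_count", 0))
```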

### `GET /api/tags`

Returns the tenant's **effective** model set — the live-discovered set intersected with the
tenant's allowlist, or *all* discovered models when `allow_all_models` is on. Sourced from
discovery (SPEC §4.6), never a static list.

```bash
curl http://localhost:8080/api/tags -H "Authorization: Bearer nz_demoKEY..."
```
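
The effective-set rule described above amounts to a set intersection with an override. A minimal sketch, assuming the discovered models and the allowlist are available as sets of model names (`effective_models` is a hypothetical function, not the gateway's code):

```python
def effective_models(
    discovered: set[str], allowlist: set[str], allow_all_models: bool
) -> list[str]:
    """Models a tenant may see: discovered ∩ allowlist, or all discovered."""
    models = discovered if allow_all_models else discovered & allowlist
    return sorted(models)
```

An allowlisted model that is not currently discovered never appears in the result, which matches the description of a live-sourced list.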

### `POST /api/show`

Allowed only for models in the effective set; returns **sanitized** model info.
The system prompt and template that Ollama returns are **stripped** by the gateway.
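
As a rough sketch of the stripping step, the function below drops two top-level fields from an upstream response before it is returned. The field names `system` and `template` are assumptions made for illustration; this document does not list the exact fields the gateway removes.

```python
# Assumption: hypothetical names for the fields that carry the
# system prompt and the prompt template in the upstream response.
HIDDEN_FIELDS = {"system", "template"}


def sanitize_show(info: dict) -> dict:
    """Return a copy of the model info without the hidden fields."""
    return {k: v for k, v in info.items() if k not in HIDDEN_FIELDS}
```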

### `GET /api/version`

Returns the **gateway** version, not the Ollama version.

```json
{ "version": "0.1.0" }
```

---

## Hard-blocked endpoints (always `403`)

These model-mutating endpoints are blocked at the gateway. **Not configurable, not behind a
flag** (SPEC §6.2, AGENT_PROMPT non-negotiable #5):

```
/api/pull   /api/push   /api/create   /api/copy   /api/delete   /api/blobs/*
```

```bash
# Always 403, even with a valid key:
curl -i http://localhost:8080/api/pull \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" -d '{"model":"llama3.1:8b"}'
```

`GET /api/ps` is also blocked (it would leak which models are loaded).
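
The blocked paths above can be expressed as a small matcher. This is a minimal sketch, not the gateway's routing code: `is_hard_blocked` is a hypothetical name, and it ignores concerns such as percent-encoding and path normalisation that a real router would need to handle.

```python
BLOCKED_EXACT = {
    "/api/pull", "/api/push", "/api/create",
    "/api/copy", "/api/delete", "/api/ps",
}
BLOCKED_PREFIXES = ("/api/blobs/",)


def is_hard_blocked(path: str) -> bool:
    """True if the request path is one of the always-403 endpoints."""
    # Drop any query string and trailing slash before matching.
    path = path.split("?", 1)[0].rstrip("/") or "/"
    if path in BLOCKED_EXACT:
        return True
    # Appending "/" lets "/api/blobs" itself match the "/api/blobs/" prefix.
    return (path + "/").startswith(BLOCKED_PREFIXES)
```

Since every path outside the documented surfaces is blocked anyway, a check like this would sit on top of a default-deny router rather than replace it.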

---

## OpenAI-compatible endpoints (`/v1/*`)

| Path | Method | Maps to |
|---|---|---|
| `/v1/chat/completions` | POST | `/api/chat` |
| `/v1/completions` | POST | `/api/generate` |
| `/v1/embeddings` | POST | `/api/embed` |
| `/v1/models` | GET | `/api/tags` (effective set, OpenAI list format) |

Streaming uses **SSE**: `data: {…}\n\n` events terminated by a literal `data: [DONE]\n\n`.

### `POST /v1/chat/completions`

```bash
curl -N http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer nz_demoKEY..." \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,
       "messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```

**Streaming response** — `Content-Type: text/event-stream`:

```
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Echo:"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" Say"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":6,"completion_tokens":7,"total_tokens":13}}

data: [DONE]
```

The endpoint works with the OpenAI Python SDK when `base_url` points at `http://localhost:8080/v1`.
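
Without an SDK, the SSE framing shown above can be parsed by hand. The sketch below collects the `delta.content` fragments and the `usage` object until it sees `[DONE]`. `consume_sse` is a hypothetical name, and the input is assumed to be an iterable of decoded text lines.

```python
import json
from typing import Iterable


def consume_sse(lines: Iterable[str]) -> dict:
    """Accumulate streamed chat text and the usage block from SSE lines."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        event = json.loads(data)
        for choice in event.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
        usage = event.get("usage") or usage
    return {"text": "".join(parts), "usage": usage}
```

In the sample stream above, `usage` arrives on the last chunk before `[DONE]`, so a client that stops reading early will not see the token counts.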

### `GET /v1/models`

```bash
curl http://localhost:8080/v1/models -H "Authorization: Bearer nz_demoKEY..."
```

```json
{ "object": "list", "data": [
  { "id": "llama3.1:8b", "object": "model", "owned_by": "neuronetz" },
  { "id": "mistral:7b",  "object": "model", "owned_by": "neuronetz" }
] }
```

---

## Health endpoints

| Path | Method | Auth | Purpose |
|---|---|---|---|
| `/healthz` | GET | none | Liveness — process responsive (`200`). |
| `/readyz` | GET | none | Readiness — DB + Redis + Ollama reachable, else `503`. |
| `/metrics` | GET | none (loopback only) | Prometheus exposition. |

```bash
curl -i http://localhost:8080/healthz   # 200 {"status":"ok"}
curl -i http://localhost:8080/readyz    # 200 when all deps up; 503 otherwise
```

---

## Quick reference: streaming formats

| Surface | Content-Type | Frame | Terminator |
|---|---|---|---|
| Native `/api/*` | `application/x-ndjson` | one JSON object per `\n` | final object has `"done": true` |
| OpenAI `/v1/*` | `text/event-stream` | `data: {…}\n\n` | `data: [DONE]\n\n` |