One-command demo so the gateway can be exercised end-to-end without a GPU or a real model download:

- demo/mock-ollama/: tiny FastAPI service emulating Ollama (/api/tags, /api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count and eval_count on the final frame, /api/embed, /api/show, /api/version). Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml: postgres + redis + mock-ollama + gateway, with PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground. Mirrors the prod posture (mock-ollama not exposed).
- demo.sh: brings the stack up, waits on /healthz, creates a demo tenant with allow_all_models and a fresh API key via the bootstrap CLI inside the container, then prints the key, the playground URL, and five ready-to-paste curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull). ./demo.sh --down tears everything back down with volumes.
- playground/index.html: single-file dark-themed UI served same-origin by the gateway at /playground (CORS-free). Per-endpoint About card with method/auth/streaming badges, a real description, sample request body, sample response, and a footer note. Live SSE/NDJSON rendering of the response. A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh are visibly gated until an API key is in the field; the Base URL is force-pinned to location.origin three times to defeat browser autofill.
- docs/: API.md (full endpoint reference with curl, streaming formats, error model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery + the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule, pointing at a real Ollama backend, env reference), THREAT_MODEL.md (SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md (key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md. mkdocs.yml (Material theme) wires them together.

# neuronetz-gateway: Deployment

Production deployment is a single Docker Compose stack:
**Caddy + gateway + Postgres + Redis + Ollama**. Caddy is the only public-facing component; it
terminates TLS via Let's Encrypt for `api.neuronetz.ai` and reverse-proxies to the
internal-only gateway.

> For the local, no-GPU demo (mock Ollama + playground), see [`PLAYGROUND.md`](PLAYGROUND.md)
> and run `./demo.sh`. This document is the **production** path.

---

## The one rule that must never break

> ## ⛔ Ollama is NEVER exposed to the host or the internet.
>
> The `ollama` service in `docker-compose.yml` has **no `ports:` mapping** and must never
> get one. Ollama is reachable only on the internal Docker network as `ollama:11434`.
> Publishing it would re-open the exact unauthenticated exposure this whole project exists
> to close (SPEC §1, §3; AGENT_PROMPT non-negotiable #2).

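
A rule like this is easy to regress silently, so it can be worth checking mechanically. The sketch below is one way to do that, assuming the rendered config from `docker compose config --format json` has been parsed into a dict. The function name and the sample data are illustrative and not part of the repository. It flags every service outside an allow-list (here only `caddy`) that declares `ports:`.

```python
import json


def services_publishing_ports(compose_config, allowed=frozenset({"caddy"})):
    """Return the sorted names of services outside `allowed` that declare `ports:`."""
    offenders = []
    for name, service in compose_config.get("services", {}).items():
        # A missing key or an empty list both mean "nothing published".
        if name not in allowed and service.get("ports"):
            offenders.append(name)
    return sorted(offenders)


# Hypothetical, trimmed-down output of `docker compose config --format json`
# for a stack where someone added a port mapping to ollama by mistake.
SAMPLE = json.loads("""
{
  "services": {
    "caddy":   {"ports": [{"target": 443, "published": "443", "protocol": "tcp"}]},
    "gateway": {},
    "ollama":  {"ports": [{"target": 11434, "published": "11434", "protocol": "tcp"}]}
  }
}
""")

print(services_publishing_ports(SAMPLE))  # ['ollama']
```

In CI, a non-empty result would fail the build before the stack is ever deployed.
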
The same posture applies to **Postgres** and **Redis** in the production compose file: neither
publishes ports. Only **Caddy** binds host ports (80/tcp and 443/tcp, plus 443/udp for HTTP/3).

---

## Prerequisites

- A host with Docker and Docker Compose (the commands below use `docker compose`).
- DNS: `api.neuronetz.ai` → the host's public IP (for Let's Encrypt).
- Ports 80 and 443 reachable from the internet (ACME HTTP-01/TLS-ALPN-01 challenges and serving).

---

## Steps

```bash
git clone <repo> neuronetz-gateway && cd neuronetz-gateway

# 1. Configure. Copy the example env and change EVERY secret.
cp .env.example .env
#    - POSTGRES_PASSWORD: a strong, unique value
#    - DATABASE_URL: must match the POSTGRES_* values
#    - GATEWAY_LOG_FORMAT=json for production

# 2. Configure Caddy for your domain + ACME email.
cp ops/caddy/Caddyfile.example ops/caddy/Caddyfile   # then edit the site + email
#    (docker-compose.yml mounts Caddyfile.example by default; point it at your edited file
#    or edit in place.)

# 3. Bring up the full stack. The gateway runs `alembic upgrade head`, then serves.
docker compose up -d --build

# 4. Bootstrap a tenant + key (CLI runs inside the gateway container).
docker compose exec gateway neuronetz-gateway create-tenant --name acme --rpm 120 --tpm 200000
docker compose exec gateway neuronetz-gateway create-key --tenant acme --name prod-server-1
#    ^ prints the full key ONCE; store it in your secret manager now.

# 5. Smoke test (through Caddy / TLS).
curl https://api.neuronetz.ai/healthz
curl -N https://api.neuronetz.ai/v1/chat/completions \
  -H "Authorization: Bearer nz_…" -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","stream":true,"messages":[{"role":"user","content":"hi"}]}'
```

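
The streaming smoke test in step 5 prints raw server-sent events. As a rough illustration of what a client does with them, the sketch below joins the text deltas from a list of SSE lines. It assumes the OpenAI-style chunk shape (`choices[0].delta.content`) and the `data: [DONE]` terminator, which the `/v1/chat/completions` path suggests but this document does not confirm; `API.md` is the reference for the actual streaming format. The function name and the sample lines are made up for the example.

```python
import json


def collect_sse_text(lines):
    """Concatenate the content deltas found in an iterable of SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


# Hypothetical stream, shaped like an OpenAI-compatible response.
SAMPLE_STREAM = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]

print(collect_sse_text(SAMPLE_STREAM))  # Hello
```
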
Caddy obtains and renews the certificate automatically. For local testing without a public
domain, use the `localhost { tls internal … }` block documented in `Caddyfile.example`
(trust Caddy's local CA or pass `-k` to curl).

---

## Pointing at a real Ollama backend

The gateway reaches Ollama via `OLLAMA_BASE_URL`. In the bundled stack this is the in-stack
`ollama` service: `OLLAMA_BASE_URL=http://ollama:11434`.

To use an **existing/external** Ollama host instead:

1. Remove the `ollama` service from `docker-compose.yml` (or leave it; it just won't be used).
2. Set `OLLAMA_BASE_URL` to the backend address reachable from the gateway container, e.g.
   `http://10.0.0.5:11434` or an internal DNS name.
3. Ensure that backend is itself **not** exposed to the internet: the gateway is the only
   thing that should ever reach it. Use a private network / firewall rule, not a public port.
4. Pull the models you want available on that backend. They appear in tenants' effective sets
   automatically on the next discovery refresh (SPEC §4.6), with no gateway config change for
   `allow_all_models` tenants.

Discovery polls `OLLAMA_BASE_URL/api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds. If the
backend is unreachable, the discovered set is empty and requests **fail closed**.

---

## Environment reference (SPEC §7)

All configuration is via environment variables, validated by Pydantic Settings on boot. Boot
**fails loudly** on invalid config. See [`.env.example`](../.env.example) for a copyable file.

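
To make "fails loudly" concrete, here is a minimal sketch of boot-time validation. It is a standard-library stand-in, not the gateway's Pydantic Settings model: `Settings`, `ConfigError`, and `load_settings` are hypothetical names, only three of the variables below are covered, and treating `json` and `console` as the only valid log formats is an assumption drawn from the table notes.

```python
from dataclasses import dataclass


class ConfigError(ValueError):
    """Raised at boot when one or more settings are invalid."""


@dataclass(frozen=True)
class Settings:
    bind_port: int
    log_format: str
    ollama_base_url: str


def load_settings(env):
    """Validate a mapping of environment variables, reporting every error at once."""
    errors = []

    raw_port = env.get("GATEWAY_BIND_PORT", "8080")
    try:
        bind_port = int(raw_port)
    except ValueError:
        bind_port = 0  # forces the range check below to fail
    if not 1 <= bind_port <= 65535:
        errors.append(f"GATEWAY_BIND_PORT: invalid port {raw_port!r}")

    log_format = env.get("GATEWAY_LOG_FORMAT", "json")
    if log_format not in ("json", "console"):
        errors.append(f"GATEWAY_LOG_FORMAT: expected json or console, got {log_format!r}")

    base_url = env.get("OLLAMA_BASE_URL", "http://ollama:11434")
    if not base_url.startswith(("http://", "https://")):
        errors.append(f"OLLAMA_BASE_URL: not an http(s) URL: {base_url!r}")

    if errors:
        # Refuse to start rather than run with a half-valid configuration.
        raise ConfigError("invalid configuration: " + "; ".join(errors))
    return Settings(bind_port, log_format, base_url)


print(load_settings({}))  # all defaults
```

At startup this would be called as `load_settings(os.environ)`; an uncaught `ConfigError` stops the process before it binds a port.
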
### Service
| Var | Default | Notes |
|---|---|---|
| `GATEWAY_BIND_HOST` | `0.0.0.0` | Bind-all inside the container. |
| `GATEWAY_BIND_PORT` | `8080` | Internal port; never published directly in prod. |
| `GATEWAY_LOG_LEVEL` | `INFO` | |
| `GATEWAY_LOG_FORMAT` | `json` | `json` in prod, `console` for local dev. |
| `GATEWAY_REQUEST_ID_HEADER` | `X-Request-ID` | |
| `GATEWAY_TRUSTED_PROXIES` | `127.0.0.1,caddy` | Sources trusted for `X-Forwarded-For`. |

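
`GATEWAY_TRUSTED_PROXIES` matters because `X-Forwarded-For` is client-controlled unless the request arrived through a trusted hop. The sketch below shows one common way to resolve the client address (walk the header right to left and stop at the first untrusted entry). It is an assumption about the approach, not the gateway's actual code: `client_ip` is a hypothetical helper, and it expects trusted entries that are already IP addresses, so a hostname such as `caddy` would have to be resolved first.

```python
import ipaddress


def client_ip(peer_ip, x_forwarded_for, trusted):
    """Return the address to attribute a request to, honoring trusted proxies only."""
    trusted_ips = {ipaddress.ip_address(t) for t in trusted}
    if ipaddress.ip_address(peer_ip) not in trusted_ips:
        return peer_ip  # direct peer is not a trusted proxy: ignore the header

    hops = [h.strip() for h in (x_forwarded_for or "").split(",") if h.strip()]
    for hop in reversed(hops):
        try:
            addr = ipaddress.ip_address(hop)
        except ValueError:
            return peer_ip  # malformed header: fall back to the socket peer
        if addr not in trusted_ips:
            return str(addr)  # first hop that a trusted proxy saw but we do not trust
    return peer_ip


# 172.18.0.5 stands in for the resolved address of the caddy container.
TRUSTED = ["127.0.0.1", "172.18.0.5"]

# The leftmost entry is whatever the client claimed; the rightmost was appended by Caddy.
print(client_ip("172.18.0.5", "10.9.9.9, 198.51.100.7", TRUSTED))  # 198.51.100.7
```
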
### Upstream (Ollama)
| Var | Default | Notes |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://ollama:11434` | Internal address of the backend. |
| `OLLAMA_CONNECT_TIMEOUT_S` | `5` | |
| `OLLAMA_READ_TIMEOUT_S` | `600` | Long, for slow generations. |
| `OLLAMA_MAX_CONNECTIONS` | `64` | httpx pool size. |

### Model discovery (§4.6)
| Var | Default | Notes |
|---|---|---|
| `MODEL_DISCOVERY_REFRESH_S` | `60` | How often to re-query `/api/tags`. |
| `MODEL_DISCOVERY_CACHE_TTL_S` | `120` | Redis TTL for the discovered set. |

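
The sketch below illustrates how a discovered set can drive a fail-closed model check. The `/api/tags` response shape (`{"models": [{"name": ...}]}`) is Ollama's; everything else is an assumption for illustration: the function names are hypothetical, and intersecting a per-tenant allow-list with the discovered set is a guess at what SPEC §4.6 means by a tenant's effective set.

```python
def parse_tags(payload):
    """Extract model names from an Ollama /api/tags response body."""
    return {m["name"] for m in payload.get("models", []) if "name" in m}


def effective_models(discovered, allow_all_models, allowlist=frozenset()):
    """Models a tenant may use right now; never more than the backend actually has."""
    if allow_all_models:
        return set(discovered)
    return set(discovered) & set(allowlist)


def is_allowed(model, discovered, allow_all_models, allowlist=frozenset()):
    """Fail closed: an empty discovered set denies every model for every tenant."""
    return model in effective_models(discovered, allow_all_models, allowlist)


# Hypothetical /api/tags body, trimmed to the field used here.
TAGS = {"models": [{"name": "llama3.1:8b"}, {"name": "nomic-embed-text:latest"}]}
discovered = parse_tags(TAGS)

print(is_allowed("llama3.1:8b", discovered, allow_all_models=True))   # True
print(is_allowed("llama3.1:8b", set(), allow_all_models=True))        # False
```

With the defaults above, the cache TTL (120 s) is twice the refresh interval (60 s), so the cached set would outlive a single missed poll before expiring; whether the gateway relies on that is not stated here.
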
### Database
| Var | Default | Notes |
|---|---|---|
| `DATABASE_URL` | `postgresql+asyncpg://…` | asyncpg driver. |
| `DATABASE_POOL_SIZE` | `10` | |
| `DATABASE_POOL_OVERFLOW` | `20` | |

### Redis
| Var | Default | Notes |
|---|---|---|
| `REDIS_URL` | `redis://redis:6379/0` | |
| `REDIS_KEY_CACHE_TTL_S` | `60` | Resolved-key cache TTL. |

### Limits (defaults; per-tenant/key DB overrides win)
| Var | Default | Notes |
|---|---|---|
| `DEFAULT_RPM` | `60` | |
| `DEFAULT_TPM` | `100000` | |
| `DEFAULT_CONCURRENT` | `8` | |
| `MAX_REQUEST_BODY_BYTES` | `262144` | 256 KiB request cap. |
| `MAX_NUM_PREDICT` | `4096` | Hard cap on requested completion tokens. |

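
The precedence in the heading ("per-tenant/key DB overrides win") can be pictured as a simple fallback chain. The sketch below is illustrative only: `resolve_limits` is a hypothetical helper, and letting a key override beat a tenant override is an assumption, since this document only says that database overrides beat the environment defaults.

```python
# Environment defaults from the table above.
DEFAULTS = {"rpm": 60, "tpm": 100_000, "concurrent": 8}
MAX_REQUEST_BODY_BYTES = 256 * 1024  # 262144 bytes


def resolve_limits(tenant=None, key=None):
    """Pick each limit from the key, then the tenant, then the env default."""
    tenant, key = tenant or {}, key or {}
    resolved = {}
    for name, default in DEFAULTS.items():
        value = key.get(name)
        if value is None:          # None means "no override stored"
            value = tenant.get(name)
        if value is None:
            value = default
        resolved[name] = value
    return resolved


# The tenant created in the Steps section: --rpm 120 --tpm 200000.
print(resolve_limits(tenant={"rpm": 120, "tpm": 200_000}))
# {'rpm': 120, 'tpm': 200000, 'concurrent': 8}
```
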
### Security
| Var | Default | Notes |
|---|---|---|
| `ARGON2_TIME_COST` | `3` | |
| `ARGON2_MEMORY_COST_KIB` | `65536` | 64 MiB. |
| `ARGON2_PARALLELISM` | `4` | |
| `AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN` | `20` | Throttles auth brute-force per source IP. |

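
For intuition about `AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN`, here is a minimal in-memory sliding-window counter. It is a sketch under assumptions, not the gateway's implementation: the class name is hypothetical, the real limiter may well live in Redis so that it is shared across workers, and whether the gateway uses a sliding or a fixed window is not stated here. Timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque


class AuthFailureLimiter:
    """Block a source IP once it has `limit` failed auth attempts within `window_s`."""

    def __init__(self, limit=20, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._failures = {}  # ip -> deque of failure timestamps, oldest first

    def _prune(self, timestamps, now):
        # Drop failures that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_s:
            timestamps.popleft()

    def record_failure(self, ip, now):
        timestamps = self._failures.setdefault(ip, deque())
        self._prune(timestamps, now)
        timestamps.append(now)

    def is_blocked(self, ip, now):
        timestamps = self._failures.get(ip)
        if not timestamps:
            return False
        self._prune(timestamps, now)
        return len(timestamps) >= self.limit


limiter = AuthFailureLimiter()
for second in range(20):
    limiter.record_failure("203.0.113.9", now=float(second))

print(limiter.is_blocked("203.0.113.9", now=20.0))   # True
print(limiter.is_blocked("198.51.100.7", now=20.0))  # False
print(limiter.is_blocked("203.0.113.9", now=80.0))   # False
```
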
### Audit
| Var | Default | Notes |
|---|---|---|
| `AUDIT_BUFFER_SIZE` | `1000` | Ring buffer; full ⇒ deny mode. |
| `PROMPT_LOG_DEFAULT_RETENTION_DAYS` | `30` | |
| `AUDIT_LOG_DEFAULT_RETENTION_DAYS` | `365` | |

---

## TLS & security headers (Caddy)

`ops/caddy/Caddyfile.example` already sets:

- **HSTS** `max-age=63072000; includeSubDomains; preload`
- `X-Content-Type-Options: nosniff`
- `X-Frame-Options: DENY`
- `Referrer-Policy: no-referrer`
- strips `Server` and `X-Powered-By`

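
After editing the Caddyfile it can be useful to confirm that these headers still come back. The sketch below compares a dict of response headers (for example, parsed from `curl -sI` output) against the list above. `header_problems` and the sample dicts are hypothetical; the expected values are copied from the list, so they need updating if the Caddyfile changes.

```python
EXPECTED = {
    "strict-transport-security": "max-age=63072000; includeSubDomains; preload",
    "x-content-type-options": "nosniff",
    "x-frame-options": "DENY",
    "referrer-policy": "no-referrer",
}
STRIPPED = ("server", "x-powered-by")


def header_problems(headers):
    """Return a list of human-readable deviations from the expected header set."""
    got = {name.lower(): value for name, value in headers.items()}  # names are case-insensitive
    problems = []
    for name, want in EXPECTED.items():
        if got.get(name) != want:
            problems.append(f"{name}: expected {want!r}, got {got.get(name)!r}")
    for name in STRIPPED:
        if name in got:
            problems.append(f"{name}: should be stripped")
    return problems


# Hypothetical response that leaks a Server header and lacks Referrer-Policy.
SAMPLE_HEADERS = {
    "Strict-Transport-Security": "max-age=63072000; includeSubDomains; preload",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Server": "Caddy",
}

for problem in header_problems(SAMPLE_HEADERS):
    print(problem)
```
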
Edit the site address and ACME `email` before deploying.

---

## Non-Compose (systemd)

A systemd unit is provided for hosts that run the image directly (`ops/systemd/`). The
gateway still requires reachable Postgres, Redis, and Ollama, and the same environment
variables. TLS in that topology is handled by whatever fronts the host (Caddy, nginx, a load
balancer). **Ollama still must not be publicly reachable.**

---

## Upgrades & migrations

The gateway runs `alembic upgrade head` on container start, so a normal
`docker compose up -d --build` after pulling a new version applies pending migrations. For
zero-downtime upgrades, run migrations as a one-off
(`docker compose run --rm gateway alembic upgrade head`) before rolling the service.