Production deployment now matches the host setup that already runs
neuronetz.ai / neuro-landing: the gateway sits behind the jwilder
nginx-proxy + acme-companion already on the host, instead of bundling
its own Caddy sidecar.
- docker-compose.yml: drop the Caddy service entirely. The gateway joins
an external `proxy` Docker network (the same one neuronetz-web /
neuronetz-www use) and advertises itself with VIRTUAL_HOST /
VIRTUAL_PORT / LETSENCRYPT_HOST / LETSENCRYPT_EMAIL. nginx-proxy
routes TLS-terminated traffic to it on the shared network;
acme-companion handles Let's Encrypt issuance + renewal for
api.neuronetz.ai automatically. NO host ports are published in this
compose file anywhere — gateway, postgres, redis, ollama all stay
unreachable from the host. Pinned container_names
(neuronetz-gateway / -postgres / -redis / -ollama) for stable
identification by nginx-proxy and ops scripts.
- .env.example: add GATEWAY_VIRTUAL_HOST + LETSENCRYPT_EMAIL; flip the
default GATEWAY_TRUSTED_PROXIES to `127.0.0.1,nginx-proxy`.
- docs/DEPLOYMENT.md: the canonical path is now jwilder-proxy.
Reorganized prerequisites + steps around it; documented adding HSTS
and the other security headers via the nginx-proxy custom-config
mechanism (/etc/nginx/vhost.d/<host>). The Caddy sidecar lives on as
a documented alternative for hosts without jwilder-proxy
(ops/caddy/Caddyfile.example is kept).
The Ollama-never-exposed non-negotiable is unchanged.
One-command demo so the gateway can be exercised end-to-end without a GPU or a
real model download:
- demo/mock-ollama/ — tiny FastAPI service emulating Ollama (/api/tags,
/api/chat + /api/generate NDJSON streaming with realistic prompt_eval_count
and eval_count on the final frame, /api/embed, /api/show, /api/version).
Non-root multi-stage Dockerfile, never published (internal network only).
- docker-compose.demo.yml — postgres + redis + mock-ollama + gateway, with
PLAYGROUND_ENABLED=true and ./playground mounted read-only at /app/playground.
Mirrors the prod posture (mock-ollama not exposed).
- demo.sh — brings the stack up, waits on /healthz, creates a demo tenant with
allow_all_models and a fresh API key via the bootstrap CLI inside the
container, then prints the key, the playground URL, and five ready-to-paste
curl commands (SSE chat, NDJSON chat, /v1/models, a 401, a 403 /api/pull).
./demo.sh --down tears everything back down with volumes.
- playground/index.html — single-file dark-themed UI served same-origin by
the gateway at /playground (CORS-free). Per-endpoint About card with method/
auth/streaming badges, a real description, sample request body, sample
response, and a footer note. Live SSE/NDJSON rendering of the response.
A live, copyable curl box that mirrors exactly what Run sends. Run + Refresh
are visibly gated until an API key is in the field; the Base URL is
force-pinned to location.origin three times to defeat browser autofill.
- docs/ — API.md (full endpoint reference with curl, streaming formats, error
model, SPEC §6.5 response headers), ARCHITECTURE.md (incl. §4.6 discovery
+ the request lifecycle), DEPLOYMENT.md (Ollama-never-exposed rule,
pointing at a real Ollama backend, env reference), THREAT_MODEL.md
(SPEC §3 table + the allow_all_models opt-in notes), OPERATIONS.md
(key/budget/model/usage runbook + fail-closed table), PLAYGROUND.md.
mkdocs.yml (Material theme) wires them together.
The hot path. A single Pipeline class owns enforcement so the eight
non-negotiables can be reviewed in one place.
- Native /api/chat, /api/generate (NDJSON streaming + non-stream), /api/tags,
/api/show (system-prompt + template stripped), /api/embed(dings), /api/version
(returns gateway version, not Ollama's). Endpoint catch-all returns the same
generic 403 for hard-blocked and unknown /api/* paths so attackers cannot
enumerate which mutating endpoints exist.
- OpenAI-compat /v1/chat/completions, /v1/completions, /v1/embeddings,
/v1/models with SSE (`data: {...}` + final `data: [DONE]`); preserves
streaming end-to-end.
- Model discovery (SPEC §4.6): background poller against Ollama /api/tags;
Redis + in-process cache (TTL = MODEL_DISCOVERY_CACHE_TTL_S, refresh =
MODEL_DISCOVERY_REFRESH_S); fail-closed when the discovered set is empty.
- Effective-set resolution in proxy/allowlist.py:
allow_all = key.allow_all_models ?? tenant.allow_all_models
effective = discovered if allow_all
else (key.allowed_models ?? tenant.allowed_models) ∩ discovered
A non-effective model returns the same generic 403 whether it's installed-
but-unpermitted or doesn't exist at all (no enumeration leak).
- Sliding-window rate limit (Redis Lua, single round-trip) for per-key +
per-tenant RPM and per-key TPM. Redis-INCR/DECR concurrency semaphore with
TTL guard. Token-budget counters per (key, period) with a Postgres ledger
for reconciliation across resets. Headers per SPEC §6.5 on every response;
429 carries Retry-After; Redis outage → 503 (fail closed, never 200).
- Token counting from the FINAL stream object (NDJSON `done` or the SSE chunk
carrying `usage`); the audit row is written AFTER stream close so TTFB is
never degraded by bookkeeping.
- Audit writer: asyncio.Queue + bounded ring buffer; deny-mode flip on overflow.
Optional prompt log per key (TTL'd).
- Revocation listener: asyncpg LISTEN on key_revoked → evict the Redis cache
entry within ~1s of the console writing to gateway.revocations.
- Prometheus counters/histograms labeled by tenant only (per SPEC §13.3).
- argon2id hash/verify/needs_rehash; constant-time path; parameters from config.
- Key format nz_<prefix><secret> (12-char stored prefix incl. nz_, 32-char
random secret); the full key is generated with secrets, hashed argon2id, and
printed exactly once at creation — never persisted, never logged.
- Bearer auth middleware: extract → resolve prefix → Redis cache (TTL from
REDIS_KEY_CACHE_TTL_S) → DB → argon2 verify → cache the resolved Principal.
Fail-closed; uniform sanitized 401 with X-Request-ID; per-IP auth-failure
counter to slow brute force. Exempt paths: /healthz /readyz /metrics /, and
/playground when enabled.
- Bootstrap CLI (Typer) per SPEC §11: create-tenant (with --allow-all-models),
create-key, list-keys, revoke-key, set-budget, set-models (--models or
--allow-all / --no-allow-all), show-usage, list-models.
- Async repositories for tenants, api_keys, key_limits, budget_usage,
revocations, audit_log — including the join+inheritance flatten that
produces a Principal with effective rpm/tpm/concurrent/allowed_models/
allow_all_models for the auth cache.
Initial project structure for neuronetz-gateway per scope-docs/SPEC.md:
- Python 3.12 / FastAPI / SQLAlchemy 2.0 (async) / Redis / Postgres stack
managed by uv. Multi-stage non-root Dockerfile, prod + dev compose files
(ollama service is NEVER published in either), Caddyfile + systemd unit,
justfile, GitHub Actions CI (ruff, mypy --strict, pytest, bandit, pip-audit).
- Pydantic-Settings config covering every env var from SPEC §7, including the
MODEL_DISCOVERY_* keys for the dynamic-discovery feature (§4.6).
- Alembic 0001_initial creates the full gateway schema (8 tables, 3 enums,
notify_key_revoked() trigger), incl. allow_all_models on tenant_limits and
key_limits for the per-tenant auto-grant toggle.
- Working /healthz, /readyz (fail-closed when deps unreachable), and a
Prometheus /metrics stub. Sanitizing error handlers that attach X-Request-ID
to every response and never leak upstream internals.
- SPEC + AGENT_PROMPT included under scope-docs/ (source of truth).