Stephan Berbig 653e03bf29
proxy: multi-backend Ollama aggregation with per-model routing + failover
The gateway can now aggregate models across SEVERAL Ollama backends and
route each request to the correct one. Opt-in via OLLAMA_BACKENDS in .env
— single-backend deployments are unaffected (effective_backends()
synthesizes a single "default" backend from the legacy OLLAMA_BASE_URL /
OLLAMA_AUTH_TOKEN fields when the list is empty).
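A minimal sketch of that fallback, using plain dataclasses rather than the project's actual settings model. Only BackendSpec, ollama_backends, effective_backends() and the "default" backend name come from the description above; the field names (name, base_url, auth_token) and the default URL are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class BackendSpec:
    # Field names are assumed; the real model may differ.
    name: str
    base_url: str
    auth_token: str | None = None


@dataclass
class Settings:
    # Legacy single-backend fields (OLLAMA_BASE_URL / OLLAMA_AUTH_TOKEN).
    ollama_base_url: str = "http://ollama:11434"  # assumed default
    ollama_auth_token: str | None = None
    # New opt-in list (OLLAMA_BACKENDS); empty means "not configured".
    ollama_backends: list[BackendSpec] = field(default_factory=list)

    def effective_backends(self) -> list[BackendSpec]:
        """Return the configured backends, or one synthesized "default"."""
        if self.ollama_backends:
            return list(self.ollama_backends)
        return [
            BackendSpec(
                name="default",
                base_url=self.ollama_base_url,
                auth_token=self.ollama_auth_token,
            )
        ]
```

With an empty list, callers always see exactly one backend, so the rest of the code can treat single-backend and multi-backend deployments the same way.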

Behavior:
- Discovery polls EVERY configured backend in parallel each tick; the
  cache stores per-backend model lists plus a model → backends priority
  list (config order = priority order).
- /api/tags and /v1/models surface the DEDUPLICATED UNION of all
  backends' models.
- A request's model is looked up in the priority list and proxied to the
  FIRST backend that hosts it. If that backend errors on the request, the
  pipeline transparently fails over to the next backend that has the
  same model (the streaming failover path probes the first chunk before
  releasing the response, so partial bytes from a dead backend are never
  served).
- No existence disclosure: a model not hosted by any backend yields the
  same generic 403 as "model not allowed" (SPEC §13.6 preserved).

Components:
- config.py: new BackendSpec model + ollama_backends list field + an
  effective_backends() helper.
- proxy/router.py (new): BackendRouter (clients_for_with_failover),
  build_http_clients() builds one httpx client per backend with its own
  auth headers, build_backend_headers() exposes the per-backend header
  composition for the CLI probe.
- proxy/discovery.py: DiscoveryCache.set_per_backend() + backends_for(),
  refresh_all_backends() polls all in parallel, discovery_loop_multi()
  replaces the single-backend loop in production; the legacy single-
  backend functions are kept for the dependency-override tests.
- proxy/pipeline.py: Pipeline accepts an optional router; the four proxy
  methods now retry against each candidate backend in priority order on
  transport error.
- lifespan.py: constructs the per-backend client dict, stores the router
  on app.state, launches discovery_loop_multi.
- deps.py: get_backend_router provider + BackendRouterDep type alias;
  get_pipeline passes the router into Pipeline.
- cli/manage.py: probe-ollama iterates every backend and reports per-
  backend status; list-models groups its output by backend and prints
  the union count + Redis cache size for sanity.
- .env.example + docker-compose.yml: document and pass through
  OLLAMA_BACKENDS with a real example.

Verified: ruff check (clean), mypy --strict src/ + tests/ (clean,
66 source files), pytest (60 passed + 39 skipped — same baseline as
before this change; integration tests are Docker-socket-gated).
2026-05-27 22:30:26 +02:00

neuronetz-gateway

A secure, multi-tenant API gateway in front of an Ollama instance. It is the hot path of the Neuronetz API: every request to the models flows through here, authenticated, rate-limited, budgeted, and audited.

The Ollama backend is never reachable from the public internet. It is bound to an internal Docker network with no published ports. All access is via this gateway, behind TLS terminated by Caddy.

Status: v0.1.0 — in development. See scope-docs/SPEC.md for the full specification and scope-docs/AGENT_PROMPT.md for the phased build plan. SPEC.md is the source of truth.

What it does

  • Auth — API keys as Bearer tokens, stored as Argon2id hashes, verified in constant time.
  • Multi-tenant — tenants own keys; limits and budgets inherit tenant → key.
  • Rate limiting — per-key and per-tenant RPM / TPM / concurrent connections.
  • Budgets — daily / monthly / total token budgets, enforced fail-closed.
  • Dual API surface — native Ollama (/api/*) and OpenAI-compatible (/v1/*), both streaming.
  • Hard-blocked mutations: /api/pull, /api/push, /api/create, /api/copy, /api/delete, /api/blobs/* always return 403. Not configurable.
  • Audit log — always-on request metadata; opt-in, TTL'd prompt logging per key.
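As an illustration of the hard-blocked mutation rule in the list above, a path check could look roughly like this. The constant and function names are hypothetical and the trailing-slash handling is an assumption; the actual gateway may implement the block differently (for example at the routing layer).

```python
# Paths taken from the feature list above; everything else is illustrative.
BLOCKED_EXACT = frozenset(
    {"/api/pull", "/api/push", "/api/create", "/api/copy", "/api/delete"}
)
BLOCKED_PREFIXES = ("/api/blobs/",)


def is_blocked_mutation(path: str) -> bool:
    """Return True if the request path must be rejected with 403."""
    # Assumption: a trailing slash does not turn a blocked path into an allowed one.
    normalized = path.rstrip("/") or "/"
    return normalized in BLOCKED_EXACT or path.startswith(BLOCKED_PREFIXES)
```

Because the set is hard-coded rather than read from configuration, no deployment setting can re-enable these endpoints, which matches the "not configurable" requirement.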

Administration (dashboards, tenant self-service) lives in a separate service, neuronetz-console; it is not part of this repository.

Architecture

Internet ──TLS──> Caddy ──HTTP──> gateway ──┬──> Postgres   (keys, budgets, audit)
                                            ├──> Redis      (key cache, rate limits)
                                            └──> Ollama     (internal network only)

Quickstart (dev)

Requires Docker + Docker Compose. The dev stack runs Postgres, Redis, and the gateway — no Caddy and no Ollama (so /readyz reports 503 until a real Ollama backend is wired in; that is expected).

git clone <repo> neuronetz-gateway && cd neuronetz-gateway
cp .env.example .env          # adjust if you like; defaults work for local dev
docker compose -f docker-compose.dev.yml up --build

The gateway runs alembic upgrade head on startup, then serves on http://localhost:8080.

curl -i http://localhost:8080/healthz   # -> 200  {"status":"ok"}
curl -i http://localhost:8080/readyz    # -> 503  (no Ollama backend in the dev stack)

Production

docker-compose.yml brings up the full stack — Caddy (TLS via Let's Encrypt for api.neuronetz.ai), the gateway, Postgres, Redis, and Ollama. The ollama service has no ports: mapping and is reachable only on the internal Docker network. See docs/DEPLOYMENT.md (added in a later phase) and ops/caddy/Caddyfile.example.

Managing tenants and keys

Use the bootstrap CLI (Typer). Keys have the form nz_<prefix><secret>; the full key is printed exactly once at creation and only its Argon2id hash is stored.

neuronetz-gateway create-tenant --name acme
neuronetz-gateway create-key   --tenant acme --name prod-server-1
neuronetz-gateway list-keys    --tenant acme
neuronetz-gateway revoke-key   --prefix nz_abc12345
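To make the nz_<prefix><secret> format concrete, here is a hedged sketch of generating a key and recovering its lookup prefix. The prefix length (8 characters, inferred from the nz_abc12345 example), the alphabets, and the function names are assumptions. Argon2id hashing of the full key is deliberately left out, since it needs a third-party library.

```python
from __future__ import annotations

import secrets

KEY_SCHEME = "nz_"
PREFIX_LEN = 8  # assumption, inferred from the nz_abc12345 example


def generate_key() -> tuple[str, str]:
    """Return (full_key, lookup_prefix). The full key is shown only once."""
    prefix = secrets.token_hex(PREFIX_LEN // 2)  # 8 hex characters
    secret = secrets.token_urlsafe(32)  # the part that is never stored in clear
    return f"{KEY_SCHEME}{prefix}{secret}", f"{KEY_SCHEME}{prefix}"


def lookup_prefix(key: str) -> str | None:
    """Extract the non-secret prefix used to find the stored hash."""
    head_len = len(KEY_SCHEME) + PREFIX_LEN
    if not key.startswith(KEY_SCHEME) or len(key) <= head_len:
        return None  # malformed key: wrong scheme or no secret part
    return key[:head_len]
```

The prefix is what commands such as revoke-key take as input: it identifies a key without revealing the secret, while verification itself would compare the presented key against the stored Argon2id hash.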

Development

just dev          # run the dev stack
just test         # pytest + coverage
just lint         # ruff
just typecheck    # mypy --strict
just migrate      # alembic upgrade head

Tooling: Python 3.12, uv, FastAPI + uvicorn, SQLAlchemy 2.0 (async) + asyncpg, Redis, httpx, structlog, Pydantic. Lint/type/security gates: ruff, mypy --strict, bandit, pip-audit.

License

Apache 2.0 — see LICENSE. Owner: Stephan Berbig / Neuronetz.
