The gateway can now aggregate models across SEVERAL Ollama backends and
route each request to the correct one. Opt-in via OLLAMA_BACKENDS in .env;
single-backend deployments are unaffected (effective_backends() synthesizes
a single "default" backend from the legacy OLLAMA_BASE_URL /
OLLAMA_AUTH_TOKEN fields when the list is empty).

Behavior:
- Discovery polls EVERY configured backend in parallel each tick; the cache
  stores per-backend model lists plus a model → backends priority list
  (config order = priority order).
- /api/tags and /v1/models surface the DEDUPLICATED UNION of all backends'
  models.
- A request's model is looked up in the priority list and proxied to the
  FIRST backend that hosts it. If that backend errors on the request, the
  pipeline transparently fails over to the next backend that has the same
  model (the streaming-failover probes the first chunk before releasing the
  response, so we never serve partial bytes from a dead backend).
- No existence disclosure: a model not hosted by any backend yields the
  same generic 403 as "model not allowed" (SPEC §13.6 preserved).

Components:
- config.py: new BackendSpec model + ollama_backends list field + an
  effective_backends() helper.
- proxy/router.py (new): BackendRouter (clients_for_with_failover),
  build_http_clients() builds one httpx client per backend with its own
  auth headers, build_backend_headers() exposes the per-backend header
  composition for the CLI probe.
- proxy/discovery.py: DiscoveryCache.set_per_backend() + backends_for(),
  refresh_all_backends() polls all in parallel, discovery_loop_multi()
  replaces the single-backend loop in production; the legacy single-backend
  functions are kept for the dependency-override tests.
- proxy/pipeline.py: Pipeline accepts an optional router; the four proxy
  methods now retry against each candidate backend in priority order on
  transport error.
- lifespan.py: constructs the per-backend client dict, stores the router on
  app.state, launches discovery_loop_multi.
- deps.py: get_backend_router provider + BackendRouterDep type alias;
  get_pipeline passes the router into Pipeline.
- cli/manage.py: probe-ollama iterates every backend and reports
  per-backend status; list-models groups its output by backend and prints
  the union count + Redis cache size for sanity.
- .env.example + docker-compose.yml: document and pass through
  OLLAMA_BACKENDS with a real example.

Verified: ruff check (clean), mypy --strict src/ + tests/ (clean, 66 source
files), pytest (60 passed + 39 skipped, same baseline as before this
change; integration tests are Docker-socket-gated).
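
The routing behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in proxy/router.py or proxy/pipeline.py: build_priority_map, union_models, proxy_with_failover, ModelNotAvailable, and the send callable are hypothetical names, the model names in the usage note are placeholders, and a plain ConnectionError stands in for whatever transport error the real httpx clients raise.

```python
from collections.abc import Callable


class ModelNotAvailable(Exception):
    """Stand-in for the generic 403: same error for 'unknown' and 'not allowed'."""


def build_priority_map(per_backend: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each model to the backends hosting it, in config (= priority) order."""
    priority: dict[str, list[str]] = {}
    # dicts preserve insertion order, so iterating follows config order.
    for backend, models in per_backend.items():
        for model in models:
            hosts = priority.setdefault(model, [])
            if backend not in hosts:
                hosts.append(backend)
    return priority


def union_models(per_backend: dict[str, list[str]]) -> list[str]:
    """Deduplicated union of every backend's models, in first-seen order."""
    return list(build_priority_map(per_backend))


def proxy_with_failover(
    model: str,
    priority: dict[str, list[str]],
    send: Callable[[str], str],
) -> str:
    """Try each backend hosting `model` in priority order until one succeeds."""
    candidates = priority.get(model, [])
    if not candidates:
        # No existence disclosure: do not reveal whether the model exists.
        raise ModelNotAvailable("model not allowed")
    last_error: Exception | None = None
    for backend in candidates:
        try:
            return send(backend)
        except ConnectionError as exc:  # assumed stand-in for a transport error
            last_error = exc
    assert last_error is not None
    raise last_error
```

With per_backend = {"embedded": ["llama3", "mistral"], "public": ["mistral", "qwen"]}, a request for "mistral" is sent to "embedded" first and reaches "public" only if the first attempt raises. The sketch omits the first-chunk probe that the streaming path performs before releasing a response.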
108 lines
5.4 KiB
Plaintext
# neuronetz-gateway: environment configuration (SPEC §7).
#
# Copy to `.env` and adjust. `.env` is gitignored and MUST NOT be committed.
# All values here are SAFE EXAMPLES; change every secret before any real deploy.

# ──────────────────────────── Service ────────────────────────────
GATEWAY_BIND_HOST=0.0.0.0
GATEWAY_BIND_PORT=8080
GATEWAY_LOG_LEVEL=INFO
GATEWAY_LOG_FORMAT=json # json|console
GATEWAY_REQUEST_ID_HEADER=X-Request-ID
GATEWAY_TRUSTED_PROXIES=127.0.0.1,nginx-proxy # for X-Forwarded-For

# ──────────── Public hostname (jwilder-proxy / acme-companion) ───────
# These are consumed by docker-compose.yml's gateway service so that the
# host's nginx-proxy stack routes TLS-terminated traffic for your domain.
# Mirrors the pattern used by neuro-landing.
GATEWAY_VIRTUAL_HOST=api.neuronetz.ai
LETSENCRYPT_EMAIL=admin@neuronetz.ai

# ──────────────────────── Volume adoption ────────────────────────
# Override the Docker volume names if an EXISTING volume on the host holds
# data this stack should adopt (e.g. models pulled by a previous Ollama
# deployment). Leave unset to use the default per-project names.
#
# Example (matches the neuronetz-ai-01 host):
# OLLAMA_DATA_VOLUME=neuro-ollama_ollama-data
# POSTGRES_DATA_VOLUME=neuro-gateway_postgres_data
OLLAMA_DATA_VOLUME=
POSTGRES_DATA_VOLUME=

# ──────────────────────────── Upstream ───────────────────────────
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_CONNECT_TIMEOUT_S=5
OLLAMA_READ_TIMEOUT_S=600
OLLAMA_MAX_CONNECTIONS=64
# If you front Ollama with an auth proxy (e.g. an external host like
# https://ollama.neuronetz.ai requiring a Bearer token), set the token here.
# The value never appears in logs/errors; it is wrapped in pydantic SecretStr.
# Leave empty to send no Authorization header (the default for an in-stack
# ollama service on the private Docker network).
OLLAMA_AUTH_TOKEN=
# Override only if your auth proxy expects a non-standard header. For
# Authorization the scheme prefix (default: Bearer) is included; for any other
# header name the raw token is sent.
OLLAMA_AUTH_HEADER=Authorization
OLLAMA_AUTH_SCHEME=Bearer

# ─────────────────────── Multi-backend (opt-in) ──────────────────
# Aggregate models across SEVERAL Ollama backends. When set (non-empty JSON
# list), this REPLACES the single-backend config above; include each backend
# explicitly, in priority order. Requests for a given model route to the
# FIRST backend that hosts it; on transport errors the gateway transparently
# fails over to the next backend that has the same model.
#
# Each entry: {name, base_url, [auth_token], [auth_header], [auth_scheme]}
# Example (embedded GPU container + publicly-fronted auth-protected Ollama):
# OLLAMA_BACKENDS='[
# {"name":"embedded","base_url":"http://ollama:11434"},
# {"name":"public","base_url":"https://ollama.neuronetz.ai","auth_token":"YOUR_TOKEN"}
# ]'
OLLAMA_BACKENDS=

# ──────────────────────── Model discovery (§4.6) ─────────────────
MODEL_DISCOVERY_REFRESH_S=60
MODEL_DISCOVERY_CACHE_TTL_S=120

# ──────────────────────────── Database ───────────────────────────
# Compose builds DATABASE_URL from the POSTGRES_* parts below, but the gateway
# also accepts a full DATABASE_URL directly.
DATABASE_URL=postgresql+asyncpg://gateway:changeme@postgres:5432/neuronetz
DATABASE_POOL_SIZE=10
DATABASE_POOL_OVERFLOW=20

# Postgres container credentials (consumed by docker-compose).
POSTGRES_USER=gateway
POSTGRES_PASSWORD=changeme
POSTGRES_DB=neuronetz

# ──────────────────────────── Redis ──────────────────────────────
REDIS_URL=redis://redis:6379/0
REDIS_KEY_CACHE_TTL_S=60

# ────────────────── Limits (defaults; DB overrides) ──────────────
DEFAULT_RPM=60
DEFAULT_TPM=100000
DEFAULT_CONCURRENT=8
MAX_REQUEST_BODY_BYTES=262144
MAX_NUM_PREDICT=4096

# ──────────────────────────── Security ───────────────────────────
ARGON2_TIME_COST=3
ARGON2_MEMORY_COST_KIB=65536
ARGON2_PARALLELISM=4
AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN=20

# ──────────────────────────── Audit ──────────────────────────────
AUDIT_BUFFER_SIZE=1000
PROMPT_LOG_DEFAULT_RETENTION_DAYS=30
AUDIT_LOG_DEFAULT_RETENTION_DAYS=365

# ──────────────── Playground / API docs (prod-safe: OFF) ─────────
# Serve the playground HTML (owned by the docs agent) at /playground.
PLAYGROUND_ENABLED=false
PLAYGROUND_FILE=/app/playground/index.html
# Enable FastAPI's /docs + /openapi.json (default off in production).
DOCS_ENABLED=false
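
As a rough illustration of how the Upstream and Multi-backend settings in this file could be consumed, the sketch below resolves OLLAMA_BACKENDS into a backend list (falling back to a single "default" backend built from the legacy fields) and composes the per-backend auth header. It is a stand-in under stated assumptions, not the gateway's config.py or proxy/router.py: SketchBackend, resolve_backends, and compose_auth_headers are hypothetical names, a plain dataclass replaces the pydantic BackendSpec model, tokens are plain strings rather than SecretStr, and carrying OLLAMA_AUTH_HEADER / OLLAMA_AUTH_SCHEME into the default backend is assumed.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchBackend:
    """Hypothetical stand-in for one OLLAMA_BACKENDS entry."""
    name: str
    base_url: str
    auth_token: str = ""
    auth_header: str = "Authorization"
    auth_scheme: str = "Bearer"


def resolve_backends(env: dict[str, str]) -> list[SketchBackend]:
    """Return the configured backends in priority order.

    A non-empty OLLAMA_BACKENDS JSON list replaces the single-backend
    settings; otherwise one "default" backend is built from them.
    """
    raw = env.get("OLLAMA_BACKENDS", "").strip()
    entries = json.loads(raw) if raw else []
    if entries:
        return [SketchBackend(**entry) for entry in entries]
    return [
        SketchBackend(
            name="default",
            base_url=env.get("OLLAMA_BASE_URL", "http://ollama:11434"),
            auth_token=env.get("OLLAMA_AUTH_TOKEN", ""),
            # Assumption: the legacy header/scheme fields apply to "default".
            auth_header=env.get("OLLAMA_AUTH_HEADER", "Authorization"),
            auth_scheme=env.get("OLLAMA_AUTH_SCHEME", "Bearer"),
        )
    ]


def compose_auth_headers(backend: SketchBackend) -> dict[str, str]:
    """Build the auth header for one backend, or no header without a token."""
    if not backend.auth_token:
        return {}
    if backend.auth_header.lower() == "authorization":
        # Authorization gets the scheme prefix, e.g. "Bearer <token>".
        parts = [p for p in (backend.auth_scheme, backend.auth_token) if p]
        return {backend.auth_header: " ".join(parts)}
    # Any other header name carries the raw token.
    return {backend.auth_header: backend.auth_token}
```

Passing the two-entry example from the Multi-backend section yields the backends "embedded" and "public" in that order; "embedded" gets no auth header, while "public" gets an Authorization header with the Bearer prefix.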