proxy: multi-backend Ollama aggregation with per-model routing + failover

The gateway can now aggregate models across SEVERAL Ollama backends and
route each request to the correct one. Opt-in via OLLAMA_BACKENDS in .env;
single-backend deployments are unaffected, because effective_backends()
synthesizes a single "default" backend from the legacy OLLAMA_BASE_URL /
OLLAMA_AUTH_TOKEN fields when the list is empty.
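
The fallback described above can be sketched roughly as follows. This is a
minimal illustration, not the code in config.py: the field names on
BackendSpec and Settings (name, base_url, auth_token, ollama_base_url,
ollama_auth_token) and the use of dataclasses are assumptions made for the
example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class BackendSpec:
    # Hypothetical field names; the real BackendSpec may differ.
    name: str
    base_url: str
    auth_token: str | None = None


@dataclass
class Settings:
    # Legacy single-backend fields (assumed names and defaults).
    ollama_base_url: str = "http://localhost:11434"
    ollama_auth_token: str | None = None
    # New opt-in list; empty means "legacy single-backend mode".
    ollama_backends: list[BackendSpec] = field(default_factory=list)

    def effective_backends(self) -> list[BackendSpec]:
        """Return the configured backends, or one synthesized "default"."""
        if self.ollama_backends:
            return list(self.ollama_backends)
        return [
            BackendSpec(
                name="default",
                base_url=self.ollama_base_url,
                auth_token=self.ollama_auth_token,
            )
        ]
```

With this shape, callers can always iterate effective_backends() and never
need to branch on whether multi-backend mode is enabled.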

Behavior:
- Discovery polls EVERY configured backend in parallel each tick; the
  cache stores per-backend model lists plus a model → backends priority
  list (config order = priority order).
- /api/tags and /v1/models surface the DEDUPLICATED UNION of all
  backends' models.
- A request's model is looked up in the priority list and proxied to the
  FIRST backend that hosts it. If that backend fails with a transport
  error, the pipeline transparently fails over to the next backend that
  hosts the same model. The streaming failover path probes the first
  chunk before releasing the response, so we never serve partial bytes
  from a dead backend.
- No existence disclosure: a model not hosted by any backend yields the
  same generic 403 as "model not allowed" (SPEC §13.6 preserved).
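
To illustrate the discovery and lookup behavior listed above, here is a
small sketch of how a model → backends priority list and the deduplicated
union could be derived from per-backend model lists. The function names
build_priority_map and union_models are hypothetical helpers for this
example (only backends_for is named in this commit, and its real signature
may differ); the backend and model names are made up.

```python
def build_priority_map(per_backend: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each model to the backends hosting it, in config order.

    ``per_backend`` is assumed to be keyed by backend name in config order;
    Python dicts preserve insertion order, so config order = priority order.
    """
    priority: dict[str, list[str]] = {}
    for backend, models in per_backend.items():
        for model in models:
            hosts = priority.setdefault(model, [])
            if backend not in hosts:  # tolerate duplicate entries from one backend
                hosts.append(backend)
    return priority


def union_models(per_backend: dict[str, list[str]]) -> list[str]:
    """Deduplicated union of all models, in first-seen order."""
    return list(build_priority_map(per_backend))


def backends_for(priority: dict[str, list[str]], model: str) -> list[str]:
    """Candidate backends for a model; empty if no backend hosts it."""
    return list(priority.get(model, []))


per_backend = {
    "gpu-a": ["llama3", "mistral"],
    "gpu-b": ["llama3", "qwen2"],
}
priority = build_priority_map(per_backend)
print(priority["llama3"])          # ['gpu-a', 'gpu-b']
print(union_models(per_backend))   # ['llama3', 'mistral', 'qwen2']
print(backends_for(priority, "unknown-model"))  # []
```

An empty candidate list is the case where the gateway would answer with the
generic 403, so the caller cannot tell "not hosted" from "not allowed".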

Components:
- config.py: new BackendSpec model + ollama_backends list field + an
  effective_backends() helper.
- proxy/router.py (new): BackendRouter (clients_for_with_failover);
  build_http_clients() builds one httpx client per backend with its own
  auth headers; build_backend_headers() exposes the per-backend header
  composition for the CLI probe.
- proxy/discovery.py: DiscoveryCache.set_per_backend() + backends_for();
  refresh_all_backends() polls all backends in parallel;
  discovery_loop_multi() replaces the single-backend loop in production.
  The legacy single-backend functions are kept for the
  dependency-override tests.
- proxy/pipeline.py: Pipeline accepts an optional router; the four proxy
  methods now retry against each candidate backend in priority order on
  transport error.
- lifespan.py: constructs the per-backend client dict, stores the router
  on app.state, launches discovery_loop_multi.
- deps.py: get_backend_router provider + BackendRouterDep type alias;
  get_pipeline passes the router into Pipeline.
- cli/manage.py: probe-ollama iterates every backend and reports per-
  backend status; list-models groups its output by backend and prints
  the union count + Redis cache size for sanity.
- .env.example + docker-compose.yml: document and pass through
  OLLAMA_BACKENDS with a real example.
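
The retry-in-priority-order behavior of the pipeline, including the
first-chunk probe for streaming responses, might look roughly like the
sketch below. It is not the code in proxy/pipeline.py: TransportError,
stream_with_failover, fake_open_stream and collect are hypothetical names
invented for this example, and the real pipeline works with httpx clients
rather than a plain callable.

```python
import asyncio
from collections.abc import AsyncIterator, Callable


class TransportError(Exception):
    """Stand-in for a transport-level failure (e.g. connection refused)."""


async def stream_with_failover(
    candidates: list[str],
    open_stream: Callable[[str], AsyncIterator[bytes]],
) -> AsyncIterator[bytes]:
    """Yield chunks from the first candidate backend that produces one."""
    last_error: TransportError | None = None
    for backend in candidates:
        stream = open_stream(backend)
        try:
            # Probe: pull the first chunk BEFORE yielding anything, so a
            # dead backend is detected while failover is still possible.
            first = await stream.__anext__()
        except TransportError as exc:
            last_error = exc
            continue  # try the next backend in priority order
        except StopAsyncIteration:
            return  # backend answered with an empty stream
        yield first
        # Past this point bytes have been released; no further failover.
        async for chunk in stream:
            yield chunk
        return
    raise last_error or TransportError("no candidate backends")


async def fake_open_stream(backend: str) -> AsyncIterator[bytes]:
    """Hypothetical upstream: the backend named "dead" fails immediately."""
    if backend == "dead":
        raise TransportError(f"{backend} unreachable")
    yield b"hello "
    yield b"world"


async def collect(candidates: list[str]) -> list[bytes]:
    return [c async for c in stream_with_failover(candidates, fake_open_stream)]


print(asyncio.run(collect(["dead", "live"])))  # [b'hello ', b'world']
```

The design point is that failover is only safe until the first byte reaches
the client; an error after that has to surface as a broken stream.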

Verified: ruff check (clean), mypy --strict src/ + tests/ (clean,
66 source files), pytest (60 passed + 39 skipped — same baseline as
before this change; integration tests are Docker-socket-gated).
Stephan Berbig
2026-05-27 22:30:26 +02:00
parent 5044a44a17
commit 653e03bf29
9 changed files with 607 additions and 61 deletions


@@ -40,6 +40,7 @@ from neuronetz_gateway.errors import AuthenticationError, DependencyUnavailableE
 from neuronetz_gateway.proxy.discovery import DiscoveryCache
 from neuronetz_gateway.proxy.ollama import OllamaClient
 from neuronetz_gateway.proxy.pipeline import Pipeline
+from neuronetz_gateway.proxy.router import BackendRouter
 from neuronetz_gateway.ratelimit.concurrency import ConcurrencyLimiter
 from neuronetz_gateway.ratelimit.sliding_window import SlidingWindowLimiter
@@ -66,10 +67,24 @@ def get_http_client(request: Request) -> httpx.AsyncClient:
 def get_ollama_client(request: Request) -> OllamaClient:
-    """Provide the upstream Ollama proxy client (override target for tests)."""
+    """Provide the upstream Ollama proxy client (override target for tests).
+
+    In multi-backend mode this returns the FIRST backend's client (priority
+    order = list order). The pipeline uses :func:`get_backend_router` for
+    per-model routing + failover; this provider is kept for tests and for code
+    paths that don't need routing.
+    """
     return OllamaClient(get_http_client(request))


+def get_backend_router(request: Request) -> BackendRouter:
+    """Provide the multi-backend router (one client per configured backend)."""
+    router: BackendRouter | None = getattr(request.app.state, "backend_router", None)
+    if router is None:
+        raise DependencyUnavailableError(internal_detail="backend router not initialised")
+    return router
+
+
 def get_discovery_cache(request: Request) -> DiscoveryCache:
     """Provide the in-process discovery cache; fail closed if absent."""
     cache: DiscoveryCache | None = getattr(request.app.state, "discovery_cache", None)
@@ -112,10 +127,17 @@ def get_pipeline(
     The pipeline owns all hot-path checks (rate limit, budget, concurrency,
     model/endpoint allowlist) and the streaming-with-bookkeeping contract.
     Audit deny-mode flips this to fail closed at the route layer.
+
+    In multi-backend deployments the per-request backend selection is done by
+    the pipeline using the :class:`BackendRouter` on ``app.state``; the
+    ``ollama`` argument here is the fallback single-backend client (used when
+    the router has no entry for a model, and as the override target for tests
+    that don't care about routing).
     """
     sessionmaker: async_sessionmaker[AsyncSession] | None = getattr(
         request.app.state, "db_sessionmaker", None
     )
+    router: BackendRouter | None = getattr(request.app.state, "backend_router", None)
     return Pipeline(
         request=request,
         principal=principal,
@@ -127,6 +149,7 @@ def get_pipeline(
         budget=BudgetCounter(redis_client),
         audit=audit,
         sessionmaker=sessionmaker,
+        router=router,
     )
@@ -151,6 +174,7 @@ ConfigDep = Annotated[Settings, Depends(get_config)]
 RedisDep = Annotated[redis.Redis, Depends(get_redis)]
 HttpClientDep = Annotated[httpx.AsyncClient, Depends(get_http_client)]
 OllamaClientDep = Annotated[OllamaClient, Depends(get_ollama_client)]
+BackendRouterDep = Annotated[BackendRouter, Depends(get_backend_router)]
 DiscoveryCacheDep = Annotated[DiscoveryCache, Depends(get_discovery_cache)]
 PrincipalDep = Annotated[Principal, Depends(get_principal)]
 AuditWriterDep = Annotated[AuditWriter, Depends(get_audit_writer)]
@@ -160,6 +184,7 @@ DbSessionDep = Annotated[AsyncSession, Depends(get_db_session)]
 __all__ = [
     "AuditWriterDep",
+    "BackendRouterDep",
     "ConfigDep",
     "DbSessionDep",
     "DiscoveryCacheDep",
@@ -169,6 +194,7 @@ __all__ = [
     "PrincipalDep",
     "RedisDep",
     "get_audit_writer",
+    "get_backend_router",
     "get_config",
     "get_db_session",
     "get_discovery_cache",