Commit Graph

1 Commits

Author SHA1 Message Date
Stephan Berbig
6a92bc8ce9 proxy: streaming, discovery, OpenAI-compat, rate-limit, budget, audit
The hot path. A single Pipeline class owns enforcement so the eight
non-negotiables can be reviewed in one place.

- Native /api/chat, /api/generate (NDJSON streaming + non-stream), /api/tags,
  /api/show (system-prompt + template stripped), /api/embed(dings), /api/version
  (returns gateway version, not Ollama's). Endpoint catch-all returns the same
  generic 403 for hard-blocked and unknown /api/* paths so attackers cannot
  enumerate which mutating endpoints exist.
- OpenAI-compat /v1/chat/completions, /v1/completions, /v1/embeddings,
  /v1/models with SSE (`data: {...}` + final `data: [DONE]`); preserves
  streaming end-to-end.
- Model discovery (SPEC §4.6): background poller against Ollama /api/tags;
  Redis + in-process cache (TTL = MODEL_DISCOVERY_CACHE_TTL_S, refresh =
  MODEL_DISCOVERY_REFRESH_S); fail-closed when the discovered set is empty.
- Effective-set resolution in proxy/allowlist.py:
    allow_all = key.allow_all_models ?? tenant.allow_all_models
    effective = discovered if allow_all
                else (key.allowed_models ?? tenant.allowed_models) ∩ discovered
  A non-effective model returns the same generic 403 whether it's installed-
  but-unpermitted or doesn't exist at all (no enumeration leak).
- Sliding-window rate limit (Redis Lua, single round-trip) for per-key +
  per-tenant RPM and per-key TPM. Redis-INCR/DECR concurrency semaphore with
  TTL guard. Token-budget counters per (key, period) with a Postgres ledger
  for reconciliation across resets. Headers per SPEC §6.5 on every response;
  429 carries Retry-After; Redis outage → 503 (fail closed, never 200).
- Token counting from the FINAL stream object (NDJSON `done` or the SSE chunk
  carrying `usage`); the audit row is written AFTER stream close so TTFB is
  never degraded by bookkeeping.
- Audit writer: asyncio.Queue + bounded ring buffer; deny-mode flip on overflow.
  Optional prompt log per key (TTL'd).
- Revocation listener: asyncpg LISTEN on key_revoked → evict the Redis cache
  entry within ~1s of the console writing to gateway.revocations.
- Prometheus counters/histograms labeled by tenant only (per SPEC §13.3).
2026-05-26 20:52:33 +02:00