proxy: streaming, discovery, OpenAI-compat, rate-limit, budget, audit

The hot path. A single Pipeline class owns enforcement so the eight
non-negotiables can be reviewed in one place.

- Native /api/chat, /api/generate (NDJSON streaming + non-stream), /api/tags,
  /api/show (system-prompt + template stripped), /api/embed(dings), /api/version
  (returns gateway version, not Ollama's). Endpoint catch-all returns the same
  generic 403 for hard-blocked and unknown /api/* paths so attackers cannot
  enumerate which mutating endpoints exist.
- OpenAI-compat /v1/chat/completions, /v1/completions, /v1/embeddings,
  /v1/models with SSE (`data: {...}` + final `data: [DONE]`); preserves
  streaming end-to-end.
- Model discovery (SPEC §4.6): background poller against Ollama /api/tags;
  Redis + in-process cache (TTL = MODEL_DISCOVERY_CACHE_TTL_S, refresh =
  MODEL_DISCOVERY_REFRESH_S); fail-closed when the discovered set is empty.
- Effective-set resolution in proxy/allowlist.py:
    allow_all = key.allow_all_models ?? tenant.allow_all_models
    effective = discovered if allow_all
                else (key.allowed_models ?? tenant.allowed_models) ∩ discovered
  A non-effective model returns the same generic 403 whether it's installed-
  but-unpermitted or doesn't exist at all (no enumeration leak).
- Sliding-window rate limit (Redis Lua, single round-trip) for per-key +
  per-tenant RPM and per-key TPM. Redis-INCR/DECR concurrency semaphore with
  TTL guard. Token-budget counters per (key, period) with a Postgres ledger
  for reconciliation across resets. Headers per SPEC §6.5 on every response;
  429 carries Retry-After; Redis outage → 503 (fail closed, never 200).
- Token counting from the FINAL stream object (NDJSON `done` or the SSE chunk
  carrying `usage`); the audit row is written AFTER stream close so TTFB is
  never degraded by bookkeeping.
- Audit writer: asyncio.Queue + bounded ring buffer; deny-mode flip on overflow.
  Optional prompt log per key (TTL'd).
- Revocation listener: asyncpg LISTEN on key_revoked → evict the Redis cache
  entry within ~1s of the console writing to gateway.revocations.
- Prometheus counters/histograms identify callers by tenant only, never by
  key_id (per SPEC §13.3).
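
The effective-set rule above can be sketched in Python. This is only an illustration, not the contents of proxy/allowlist.py: the `Scope` class and the function names are hypothetical, and `??` is read as "first value that is not None", so an explicitly empty key-level allowlist is assumed to override the tenant's list.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Scope:
    """Hypothetical stand-in for the model settings on a key or tenant row."""

    allow_all_models: Optional[bool] = None
    allowed_models: Optional[FrozenSet[str]] = None


def effective_models(
    key: Scope, tenant: Scope, discovered: FrozenSet[str]
) -> FrozenSet[str]:
    """Resolve the set of models this key may call right now."""
    # Key-level settings win whenever they are set (not None).
    allow_all = (
        key.allow_all_models
        if key.allow_all_models is not None
        else tenant.allow_all_models
    )
    if allow_all:
        return discovered
    allowed = (
        key.allowed_models
        if key.allowed_models is not None
        else tenant.allowed_models
    )
    # Intersect with the discovered set: a permitted but uninstalled model is
    # not effective, and an empty discovered set yields an empty result.
    return (allowed or frozenset()) & discovered


def is_permitted(
    model: str, key: Scope, tenant: Scope, discovered: FrozenSet[str]
) -> bool:
    """Single yes/no answer, so both denial cases look the same to the caller."""
    return model in effective_models(key, tenant, discovered)
```

Because the caller only ever sees a boolean, an installed-but-unpermitted model and a nonexistent model take the same code path, which is what lets the gateway answer both with the same generic 403.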
Stephan Berbig
2026-05-26 20:52:33 +02:00
parent 6431b2f72c
commit 6a92bc8ce9
20 changed files with 2139 additions and 0 deletions

@@ -0,0 +1,67 @@
"""Prometheus metrics.

Phase 1 declares the metric objects and the exposition helper. Instrumentation
(incrementing counters / observing histograms on the request path) is wired in
later phases. Per SPEC §13.3 we label by ``tenant`` only, never by ``key_id``.
"""

from __future__ import annotations

from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

REGISTRY = CollectorRegistry()

REQUESTS_TOTAL = Counter(
    "gateway_requests_total",
    "Total proxied requests.",
    labelnames=("tenant", "model", "status"),
    registry=REGISTRY,
)

TOKENS_TOTAL = Counter(
    "gateway_tokens_total",
    "Total tokens accounted, by direction (in|out).",
    labelnames=("tenant", "model", "direction"),
    registry=REGISTRY,
)

REQUEST_DURATION_SECONDS = Histogram(
    "gateway_request_duration_seconds",
    "Gateway-side request duration in seconds.",
    labelnames=("tenant", "model"),
    registry=REGISTRY,
)


def record_request(tenant: str, model: str, status: int, duration_s: float) -> None:
    """Increment the request counter and observe its duration (tenant-labeled)."""
    REQUESTS_TOTAL.labels(tenant=tenant, model=model, status=str(status)).inc()
    REQUEST_DURATION_SECONDS.labels(tenant=tenant, model=model).observe(duration_s)


def record_tokens(tenant: str, model: str, tokens_in: int, tokens_out: int) -> None:
    """Add input/output token counts to the tokens counter."""
    if tokens_in:
        TOKENS_TOTAL.labels(tenant=tenant, model=model, direction="in").inc(tokens_in)
    if tokens_out:
        TOKENS_TOTAL.labels(tenant=tenant, model=model, direction="out").inc(tokens_out)


def render_latest() -> bytes:
    """Return the current metrics in Prometheus text exposition format."""
    payload: bytes = generate_latest(REGISTRY)
    return payload


CONTENT_TYPE_LATEST = "text/plain; version=0.0.4; charset=utf-8"

__all__ = [
    "CONTENT_TYPE_LATEST",
    "REGISTRY",
    "REQUESTS_TOTAL",
    "REQUEST_DURATION_SECONDS",
    "TOKENS_TOTAL",
    "record_request",
    "record_tokens",
    "render_latest",
]
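
The module docstring defers request-path instrumentation to later phases. As a rough sketch of what that wiring might look like, the standard-library-only context manager below times a request and reports it through any callable with the same signature as `record_request`. The names `timed_request` and `RequestOutcome` are hypothetical and do not appear in this commit.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator

# Same positional signature as record_request(tenant, model, status, duration_s).
Recorder = Callable[[str, str, int, float], None]


class RequestOutcome:
    """Mutable holder so the handler can report the final HTTP status."""

    def __init__(self) -> None:
        # Assume failure until the handler sets a status explicitly.
        self.status = 500


@contextmanager
def timed_request(
    record: Recorder, tenant: str, model: str
) -> Iterator[RequestOutcome]:
    """Time the enclosed block and record exactly one observation for it."""
    outcome = RequestOutcome()
    start = time.perf_counter()
    try:
        yield outcome
    finally:
        # Runs on success and on exceptions, so failed requests are counted too.
        record(tenant, model, outcome.status, time.perf_counter() - start)
```

A handler could then wrap its work in `with timed_request(record_request, tenant, model) as outcome:` and set `outcome.status` before leaving the block. Passing the recorder in as an argument keeps the sketch testable without a Prometheus registry.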