proxy: streaming, discovery, OpenAI-compat, rate-limit, budget, audit

The hot path. A single Pipeline class owns enforcement so the eight
non-negotiables can be reviewed in one place.

- Native /api/chat, /api/generate (NDJSON streaming + non-stream),
  /api/tags, /api/show (system-prompt + template stripped),
  /api/embed(dings), /api/version (returns gateway version, not
  Ollama's). Endpoint catch-all returns the same generic 403 for
  hard-blocked and unknown /api/* paths so attackers cannot enumerate
  which mutating endpoints exist.
- OpenAI-compat /v1/chat/completions, /v1/completions, /v1/embeddings,
  /v1/models with SSE (`data: {...}` + final `data: [DONE]`); preserves
  streaming end-to-end.
- Model discovery (SPEC §4.6): background poller against Ollama
  /api/tags; Redis + in-process cache (TTL = MODEL_DISCOVERY_CACHE_TTL_S,
  refresh = MODEL_DISCOVERY_REFRESH_S); fail-closed when the discovered
  set is empty.
- Effective-set resolution in proxy/allowlist.py:

      allow_all = key.allow_all_models ?? tenant.allow_all_models
      effective = discovered if allow_all
                  else (key.allowed_models ?? tenant.allowed_models) ∩ discovered

  A non-effective model returns the same generic 403 whether it's
  installed-but-unpermitted or doesn't exist at all (no enumeration leak).
- Sliding-window rate limit (Redis Lua, single round-trip) for per-key +
  per-tenant RPM and per-key TPM. Redis-INCR/DECR concurrency semaphore
  with TTL guard. Token-budget counters per (key, period) with a Postgres
  ledger for reconciliation across resets. Headers per SPEC §6.5 on every
  response; 429 carries Retry-After; Redis outage → 503 (fail closed,
  never 200).
- Token counting from the FINAL stream object (NDJSON `done` or the SSE
  chunk carrying `usage`); the audit row is written AFTER stream close so
  TTFB is never degraded by bookkeeping.
- Audit writer: asyncio.Queue + bounded ring buffer; deny-mode flip on
  overflow. Optional prompt log per key (TTL'd).
- Revocation listener: asyncpg LISTEN on key_revoked → evict the Redis
  cache entry within ~1s of the console writing to gateway.revocations.
- Prometheus counters/histograms labeled by tenant only (per SPEC §13.3).
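
The effective-set rule from the commit message can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the contents of proxy/allowlist.py: the `Policy` dataclass and the `effective_models` function are hypothetical names, `??` is read as null-coalescing (only an unset value falls back to the tenant), and a missing allowlist on both key and tenant is assumed to mean "no models".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical stand-in for a key or tenant row; None means 'not set'."""

    allow_all_models: bool | None = None
    allowed_models: frozenset[str] | None = None


def effective_models(
    key: Policy, tenant: Policy, discovered: frozenset[str]
) -> frozenset[str]:
    """Resolve the models a key may use, following the rule in the message."""
    # `??` is null-coalescing: only None falls through to the tenant value,
    # so an explicit False or an explicit empty set on the key still wins.
    allow_all = (
        key.allow_all_models
        if key.allow_all_models is not None
        else tenant.allow_all_models
    )
    if allow_all:
        return discovered
    allowed = (
        key.allowed_models
        if key.allowed_models is not None
        else tenant.allowed_models
    )
    # Assumption: nothing configured on either level means an empty allowlist.
    return (allowed or frozenset()) & discovered


discovered = frozenset({"llama3", "mistral"})
tenant = Policy(allow_all_models=False, allowed_models=frozenset({"llama3", "phi3"}))

inherits = effective_models(Policy(), tenant, discovered)
overrides = effective_models(Policy(allow_all_models=True), tenant, discovered)
```

Here `inherits` contains only "llama3": "phi3" is permitted but not installed, and "mistral" is installed but not permitted, so both fall outside the effective set and would receive the same generic 403. Because every branch intersects with or returns `discovered`, an empty discovered set always yields an empty effective set, which matches the fail-closed behavior described for model discovery.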
2026-05-26 20:52:33 +02:00
parent 6431b2f72c
commit 6a92bc8ce9
20 changed files with 2139 additions and 0 deletions
--- a/src/neuronetz_gateway/observability/metrics.py
+++ b/src/neuronetz_gateway/observability/metrics.py
@@ -0,0 +1,67 @@
+"""Prometheus metrics.
+
+Phase 1 declares the metric objects and the exposition helper. Instrumentation
+(incrementing counters / observing histograms on the request path) is wired in
+later phases. Per SPEC §13.3 we label by ``tenant`` only, never by ``key_id``.
+"""
+
+from __future__ import annotations
+
+from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest
+
+REGISTRY = CollectorRegistry()
+
+REQUESTS_TOTAL = Counter(
+    "gateway_requests_total",
+    "Total proxied requests.",
+    labelnames=("tenant", "model", "status"),
+    registry=REGISTRY,
+)
+
+TOKENS_TOTAL = Counter(
+    "gateway_tokens_total",
+    "Total tokens accounted, by direction (in|out).",
+    labelnames=("tenant", "model", "direction"),
+    registry=REGISTRY,
+)
+
+REQUEST_DURATION_SECONDS = Histogram(
+    "gateway_request_duration_seconds",
+    "Gateway-side request duration in seconds.",
+    labelnames=("tenant", "model"),
+    registry=REGISTRY,
+)
+
+
+def record_request(tenant: str, model: str, status: int, duration_s: float) -> None:
+    """Increment the request counter and observe its duration (tenant-labeled)."""
+    REQUESTS_TOTAL.labels(tenant=tenant, model=model, status=str(status)).inc()
+    REQUEST_DURATION_SECONDS.labels(tenant=tenant, model=model).observe(duration_s)
+
+
+def record_tokens(tenant: str, model: str, tokens_in: int, tokens_out: int) -> None:
+    """Add input/output token counts to the tokens counter."""
+    if tokens_in:
+        TOKENS_TOTAL.labels(tenant=tenant, model=model, direction="in").inc(tokens_in)
+    if tokens_out:
+        TOKENS_TOTAL.labels(tenant=tenant, model=model, direction="out").inc(tokens_out)
+
+
+def render_latest() -> bytes:
+    """Return the current metrics in Prometheus text exposition format."""
+    payload: bytes = generate_latest(REGISTRY)
+    return payload
+
+
+CONTENT_TYPE_LATEST = "text/plain; version=0.0.4; charset=utf-8"
+
+__all__ = [
+    "CONTENT_TYPE_LATEST",
+    "REGISTRY",
+    "REQUESTS_TOTAL",
+    "REQUEST_DURATION_SECONDS",
+    "TOKENS_TOTAL",
+    "record_request",
+    "record_tokens",
+    "render_latest",
+]
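
A small sketch of how the request counter in the diff above behaves once it is incremented. It does not import the gateway module; instead it re-declares one counter with the same name and labels on a throwaway registry, and the `record` helper, tenant name, and model name are made up for illustration. It requires the `prometheus_client` package.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Stand-in registry and counter mirroring the declaration in the diff.
registry = CollectorRegistry()
requests_total = Counter(
    "gateway_requests_total",
    "Total proxied requests.",
    labelnames=("tenant", "model", "status"),
    registry=registry,
)


def record(tenant: str, model: str, status: int) -> None:
    """Hypothetical helper: count one request under its label combination."""
    # Label values must be strings, hence str(status) as in record_request.
    requests_total.labels(tenant=tenant, model=model, status=str(status)).inc()


record("acme", "llama3", 200)
record("acme", "llama3", 200)
record("acme", "llama3", 429)

# Each distinct label combination is tracked as its own time series.
ok = registry.get_sample_value(
    "gateway_requests_total",
    {"tenant": "acme", "model": "llama3", "status": "200"},
)
throttled = registry.get_sample_value(
    "gateway_requests_total",
    {"tenant": "acme", "model": "llama3", "status": "429"},
)

# Text exposition, as a /metrics endpoint would serve it.
exposition = generate_latest(registry).decode("utf-8")
```

Since every new label value creates another series, adding a `key_id` label would multiply the series count by the number of keys, which is one practical reason to stop at the tenant level as the module docstring describes.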