proxy: streaming, discovery, OpenAI-compat, rate-limit, budget, audit
The hot path. A single Pipeline class owns enforcement so the eight
non-negotiables can be reviewed in one place.
- Native /api/chat, /api/generate (NDJSON streaming + non-stream), /api/tags,
/api/show (system-prompt + template stripped), /api/embed(dings), /api/version
(returns gateway version, not Ollama's). Endpoint catch-all returns the same
generic 403 for hard-blocked and unknown /api/* paths so attackers cannot
enumerate which mutating endpoints exist.
- OpenAI-compat /v1/chat/completions, /v1/completions, /v1/embeddings,
/v1/models with SSE (`data: {...}` + final `data: [DONE]`); preserves
streaming end-to-end.
- Model discovery (SPEC §4.6): background poller against Ollama /api/tags;
Redis + in-process cache (TTL = MODEL_DISCOVERY_CACHE_TTL_S, refresh =
MODEL_DISCOVERY_REFRESH_S); fail-closed when the discovered set is empty.
- Effective-set resolution in proxy/allowlist.py:
    allow_all = key.allow_all_models ?? tenant.allow_all_models
    effective = discovered if allow_all
                else (key.allowed_models ?? tenant.allowed_models) ∩ discovered
A non-effective model returns the same generic 403 whether it's
installed-but-unpermitted or doesn't exist at all (no enumeration leak).
- Sliding-window rate limit (Redis Lua, single round-trip) for per-key +
per-tenant RPM and per-key TPM. Redis-INCR/DECR concurrency semaphore with
TTL guard. Token-budget counters per (key, period) with a Postgres ledger
for reconciliation across resets. Headers per SPEC §6.5 on every response;
429 carries Retry-After; Redis outage → 503 (fail closed, never 200).
- Token counting from the FINAL stream object (NDJSON `done` or the SSE chunk
carrying `usage`); the audit row is written AFTER stream close so TTFB is
never degraded by bookkeeping.
- Audit writer: asyncio.Queue + bounded ring buffer; deny-mode flip on overflow.
Optional prompt log per key (TTL'd).
- Revocation listener: asyncpg LISTEN on key_revoked → evict the Redis cache
entry within ~1s of the console writing to gateway.revocations.
- Prometheus counters/histograms labeled by tenant, never by key_id (per
SPEC §13.3); the metrics in this commit also carry model, status, and
direction labels.
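
The effective-set rule from the allowlist bullet can be sketched as follows. This is a minimal illustration, not the contents of proxy/allowlist.py: the `Policy` dataclass and both function names are hypothetical, and treating an allowlist that is unset on both key and tenant as "permit nothing" is an assumption in the fail-closed spirit of the commit.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    """Hypothetical stand-in for a key or tenant row; None means 'not set'."""

    allow_all_models: Optional[bool] = None
    allowed_models: Optional[frozenset[str]] = None


def effective_models(
    key: Policy, tenant: Policy, discovered: frozenset[str]
) -> frozenset[str]:
    """Resolve the models a key may use; key settings override tenant settings."""
    # `??` in the commit message is null-coalescing: use the tenant value
    # only when the key value is unset (None), not when it is merely falsy.
    allow_all = (
        key.allow_all_models
        if key.allow_all_models is not None
        else tenant.allow_all_models
    )
    if allow_all:
        return discovered
    allowed = (
        key.allowed_models
        if key.allowed_models is not None
        else tenant.allowed_models
    )
    # Assumption: an allowlist unset at both levels permits nothing.
    return (allowed or frozenset()) & discovered


def is_permitted(
    model: str, key: Policy, tenant: Policy, discovered: frozenset[str]
) -> bool:
    """True only for models that are both allowed and actually discovered."""
    return model in effective_models(key, tenant, discovered)
```

Because the check is a single membership test against the intersection, a model that is installed but not permitted and a model that does not exist are indistinguishable to the caller, which is what lets both cases share one generic 403.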
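
Token counting from the final NDJSON object might look roughly like the sketch below. The `UsageTap` class is hypothetical and is not taken from this commit; the field names `prompt_eval_count` and `eval_count` are assumed to match the final object of Ollama's native streaming responses. Parsing every line with `json.loads` is a simplification that a real hot path may want to avoid.

```python
import json
from typing import Iterable, Iterator


class UsageTap:
    """Relays NDJSON lines unchanged and remembers counts from the final object."""

    def __init__(self) -> None:
        self.tokens_in = 0
        self.tokens_out = 0
        self.complete = False  # True once an object with `done` set was seen

    def relay(self, lines: Iterable[bytes]) -> Iterator[bytes]:
        """Yield every upstream line as-is while inspecting it for usage."""
        for raw in lines:
            self._inspect(raw)
            yield raw

    def _inspect(self, raw: bytes) -> None:
        text = raw.strip()
        if not text:
            return  # tolerate blank lines
        try:
            obj = json.loads(text)
        except ValueError:
            return  # malformed line: still relayed, but no counts taken
        if isinstance(obj, dict) and obj.get("done"):
            # Assumed field names for prompt and completion token counts.
            self.tokens_in = int(obj.get("prompt_eval_count", 0))
            self.tokens_out = int(obj.get("eval_count", 0))
            self.complete = True
```

A caller would wrap the upstream line iterator in `tap.relay(...)`, stream the result to the client, and only after the stream closes read `tap.tokens_in` and `tap.tokens_out` for the audit row and budget counters. When `tap.complete` is still false at that point, the stream ended without a final object and the counts should not be trusted.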
src/neuronetz_gateway/observability/metrics.py
@@ -0,0 +1,67 @@
"""Prometheus metrics.

Phase 1 declares the metric objects and the exposition helper. Instrumentation
(incrementing counters / observing histograms on the request path) is wired in
later phases. Per SPEC §13.3 we label by ``tenant`` only, never by ``key_id``.
"""

from __future__ import annotations

from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

REGISTRY = CollectorRegistry()

REQUESTS_TOTAL = Counter(
    "gateway_requests_total",
    "Total proxied requests.",
    labelnames=("tenant", "model", "status"),
    registry=REGISTRY,
)

TOKENS_TOTAL = Counter(
    "gateway_tokens_total",
    "Total tokens accounted, by direction (in|out).",
    labelnames=("tenant", "model", "direction"),
    registry=REGISTRY,
)

REQUEST_DURATION_SECONDS = Histogram(
    "gateway_request_duration_seconds",
    "Gateway-side request duration in seconds.",
    labelnames=("tenant", "model"),
    registry=REGISTRY,
)


def record_request(tenant: str, model: str, status: int, duration_s: float) -> None:
    """Increment the request counter and observe its duration (tenant-labeled)."""
    REQUESTS_TOTAL.labels(tenant=tenant, model=model, status=str(status)).inc()
    REQUEST_DURATION_SECONDS.labels(tenant=tenant, model=model).observe(duration_s)


def record_tokens(tenant: str, model: str, tokens_in: int, tokens_out: int) -> None:
    """Add input/output token counts to the tokens counter."""
    if tokens_in:
        TOKENS_TOTAL.labels(tenant=tenant, model=model, direction="in").inc(tokens_in)
    if tokens_out:
        TOKENS_TOTAL.labels(tenant=tenant, model=model, direction="out").inc(tokens_out)


def render_latest() -> bytes:
    """Return the current metrics in Prometheus text exposition format."""
    payload: bytes = generate_latest(REGISTRY)
    return payload


CONTENT_TYPE_LATEST = "text/plain; version=0.0.4; charset=utf-8"

__all__ = [
    "CONTENT_TYPE_LATEST",
    "REGISTRY",
    "REQUESTS_TOTAL",
    "REQUEST_DURATION_SECONDS",
    "TOKENS_TOTAL",
    "record_request",
    "record_tokens",
    "render_latest",
]