Stephan Berbig d79f17b3bb scaffold: project skeleton, schema, healthz/readyz, CI
Initial project structure for neuronetz-gateway per scope-docs/SPEC.md:

- Python 3.12 / FastAPI / SQLAlchemy 2.0 (async) / Redis / Postgres stack
  managed by uv. Multi-stage non-root Dockerfile, prod + dev compose files
  (ollama service is NEVER published in either), Caddyfile + systemd unit,
  justfile, GitHub Actions CI (ruff, mypy --strict, pytest, bandit, pip-audit).
- Pydantic-Settings config covering every env var from SPEC §7, including the
  MODEL_DISCOVERY_* keys for the dynamic-discovery feature (§4.6).
- Alembic 0001_initial creates the full gateway schema (8 tables, 3 enums,
  notify_key_revoked() trigger), incl. allow_all_models on tenant_limits and
  key_limits for the per-tenant auto-grant toggle.
- Working /healthz, /readyz (fail-closed when deps unreachable), and a
  Prometheus /metrics stub. Sanitizing error handlers that attach X-Request-ID
  to every response and never leak upstream internals.
- SPEC + AGENT_PROMPT included under scope-docs/ (source of truth).
2026-05-26 20:50:35 +02:00


neuronetz-gateway — SPEC.md

Project: neuronetz-gateway
Version: 0.1.0 (target)
Status: Specification (not yet implemented)
License: Apache 2.0
Owner: Stephan Berbig / Neuronetz


1. Purpose

A secure, multi-tenant API gateway in front of an Ollama instance currently exposed at https://api.neuronetz.ai. The Ollama endpoint must never be reachable directly from the public internet again. All access flows through this gateway.

The gateway is the hot path of the Neuronetz API. A separate service (neuronetz-console, built on the Nibiru PHP framework) handles administration, dashboards, and tenant self-service. This SPEC covers only the gateway.

2. Scope

In scope (v0.1.0)

  • Authentication via API keys (Bearer tokens)
  • Multi-tenant data model (tenants → keys, with inheritance)
  • Per-key and per-tenant rate limiting (RPM, TPM, concurrent)
  • Per-key and per-tenant token budgets (daily, monthly, total)
  • Streaming and non-streaming proxy to Ollama
  • Dual API surface: native Ollama (/api/*) and OpenAI-compatible (/v1/*)
  • Endpoint allowlist (block all model-mutating Ollama endpoints)
  • Dynamic model discovery from the Ollama backend — the live set of installed models is queried, cached, and auto-refreshed; nothing about the model list is hand-maintained
  • Model allowlist (per-tenant override), default-deny, resolved against the live discovered set (stale/typo'd entries never resolve)
  • Per-tenant allow_all_models toggle — opt-in: a flagged tenant may use any currently-installed model, so models newly pulled into Ollama are auto-granted on the next discovery refresh
  • Request size limits, response size limits, timeouts
  • Token counting from Ollama responses (precise, not heuristic)
  • Audit log (always-on metadata)
  • Prompt log (opt-in per key, TTL'd retention)
  • Bootstrap CLI: create tenants, keys, set budgets
  • Health and readiness endpoints
  • Docker Compose deployment (gateway + caddy + postgres + redis + ollama)
  • Caddy as TLS terminator (Let's Encrypt for api.neuronetz.ai)

Out of scope (v0.1.0, document as future)

  • Web admin UI (lives in neuronetz-console, separate repo)
  • Billing / Stripe integration (budgets only, no money yet)
  • Multi-region / HA / k8s
  • Content moderation / prompt-injection filtering
  • Response caching
  • Multi-backend routing (one Ollama; pluggable backend interface stays for later)
  • Webhook notifications
  • SSO / OAuth2 for admin

3. Threat Model (abbreviated)

Threat | Mitigation
Internet scanners hitting Ollama directly | Ollama bound to internal Docker network; never published
Unauthenticated API abuse | Mandatory Bearer token; fail-closed on auth errors
API key brute force | Argon2id hashing; constant-time compare; rate limit on auth failures per source IP
GPU/token exhaustion (cost attack) | Per-key TPM + token budget; per-tenant ceiling; concurrent connection cap
Resource exhaustion via large payloads | Request body size limit (default 256 KiB); num_predict cap (default 4096)
Model enumeration / training-data exfil via uncommon models | Model allowlist; default-deny. allow_all_models is opt-in per tenant and audited. Discovery only ever exposes models actually installed on the backend; /api/tags and /v1/models never reveal models outside the tenant's effective set; "not allowed" and "doesn't exist" return the same generic response
Discovery backend unreachable | Fail-closed: an empty/stale-expired discovered set means no model resolves, so requests are denied; never "allow because we couldn't list models"
Ollama mutation (model pull/delete) by attacker | Endpoint allowlist; mutating endpoints (/api/pull, /api/push, /api/create, /api/copy, /api/delete) hard-blocked at the gateway
Information disclosure via error messages | Sanitize upstream errors; never proxy Ollama internals to client
Audit log tampering | Append-only at app layer; DB role separation; optional WAL archiving
Prompt data leakage | Prompt logging off by default; opt-in per key; TTL'd; redaction hook
Redis outage causing "fail open" | Fail-closed: if rate-limit/budget backend is unavailable, deny
Compromised admin token | Admin token lives in neuronetz-console, not in gateway; gateway has no admin endpoints

4. Architecture

4.1 Component diagram

                          Internet
                              │ TLS
                              ▼
                  ┌──────────────────────┐
                  │ Caddy (sidecar)      │  Let's Encrypt for api.neuronetz.ai
                  │ - TLS termination    │  HSTS, security headers
                  │ - HTTP/2, HTTP/3     │
                  └──────────┬───────────┘
                             │ HTTP/1.1 internal
                  ┌──────────▼───────────┐
                  │ neuronetz-gateway    │  FastAPI + uvicorn
                  │  - authn             │
                  │  - rate limit        │
                  │  - budget check      │
                  │  - proxy + stream    │
                  │  - token count       │
                  │  - audit write       │
                  └──┬────────┬──────┬───┘
                     │        │      │
              ┌──────▼──┐  ┌──▼───┐  │
              │Postgres │  │Redis │  │
              │ schema: │  │ keys │  │
              │ gateway │  │bucket│  │
              └─────────┘  └──────┘  │
                                     │ internal network only
                              ┌──────▼──────┐
                              │   Ollama    │
                              │ 127.0.0.1   │
                              └─────────────┘

Same Compose stack also hosts (separate from this SPEC):
  - neuronetz-console (PHP/Nibiru) → owns schema `console`; reads schema `gateway` (SELECT); INSERT only on gateway.revocations

4.2 Database schemas

Single Postgres instance, two schemas:

  • gateway — owned by the gateway service; gateway role has full DDL
  • console — owned by neuronetz-console (out of scope here); console role has full DDL
  • Both services connect with their own role. Cross-schema access is explicit GRANT.

The console role gets SELECT on all gateway.* tables. Console writes go to console.* tables, with one exception: to mutate gateway state (e.g. revoke a key), the console inserts into the gateway.revocations outbox table, which the gateway consumes (see §4.5).

4.3 Request lifecycle

  1. Caddy terminates TLS, forwards to gateway on internal port.
  2. Gateway middleware extracts Authorization: Bearer <key>.
  3. The key prefix (first 12 chars) is used as the Redis cache key. On a miss, look up gateway.api_keys by prefix, verify the full key with Argon2id, and cache the resolved key metadata in Redis (TTL 60s).
  4. Rate limit check (sliding window in Redis, Lua-atomic) — per-key RPM + per-tenant RPM.
  5. Budget check (Redis counter for current period; Postgres ledger is source of truth on reset).
  6. Concurrent-connection semaphore (Redis INCR with TTL).
  7. Model allowlist check. Resolve the effective model set for the key: allow_all := key.allow_all_models ?? tenant.allow_all_models; effective := discovered if allow_all else (key.allowed_models ?? tenant.allowed_models) ∩ discovered, where discovered is the cached live model set from discovery (§4.6). The request's model must be in effective, else a generic 403 with no disclosure of whether the model exists but is unpermitted vs. is not installed.
  8. Endpoint allowlist check.
  9. Request body validation (size, schema, num_predict cap).
  10. If OpenAI-compat path, translate request to Ollama schema.
  11. Open httpx async stream to Ollama.
  12. Stream response back to client, accumulating final prompt_eval_count + eval_count.
  13. On stream close: write gateway.audit_log row; decrement budget; release semaphore; if prompt logging enabled, write gateway.prompt_log row.
  14. On any failure: sanitized error to client, audit row with status code, semaphore released.
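
The effective-set rule in step 7 can be sketched as a pure function. This is a minimal illustration, not the gateway's actual allowlist module: the function and parameter names are hypothetical, and None stands in for a NULL key-level value that inherits from the tenant.

```python
from __future__ import annotations


def resolve_effective_models(
    discovered: frozenset[str],
    tenant_allow_all: bool,
    tenant_allowed: list[str],
    key_allow_all: bool | None = None,
    key_allowed: list[str] | None = None,
) -> frozenset[str]:
    """Return the set of models a key may use (lifecycle step 7)."""
    # A NULL key-level value inherits the tenant value (the "??" in step 7).
    allow_all = tenant_allow_all if key_allow_all is None else key_allow_all
    if allow_all:
        return discovered
    allowed = tenant_allowed if key_allowed is None else key_allowed
    # Intersecting with the live set drops stale or typo'd allowlist entries.
    return frozenset(allowed) & discovered


def is_model_permitted(model: str, effective: frozenset[str]) -> bool:
    # The caller maps False to one generic 403, whether the model is
    # installed-but-unpermitted or not installed at all.
    return model in effective
```

Because an empty discovered set yields an empty effective set in both branches, the fail-closed behaviour of §4.6 follows without a separate check.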

4.4 Failure modes (fail-closed)

Subsystem | If down | Behavior
Postgres (read) | Key lookup fails | 503 with retry-after; no requests proxied
Postgres (write) | Audit write fails | Request still succeeds, audit row buffered in-memory ring (max 1000), drained on recovery; if buffer fills, switch to deny mode
Redis | Rate limit / budget unavailable | 503, fail closed. Never "allow because we can't check."
Ollama | Upstream unreachable | 502 with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30s
Caddy | n/a | Not a gateway concern
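
The Ollama row describes a circuit breaker that opens after 5 consecutive failures and goes half-open after 30 s. A minimal single-process sketch of that state machine follows. The class name and the injectable clock are assumptions for illustration, and the sketch does not limit how many probe requests pass while half-open.

```python
from __future__ import annotations

import time
from typing import Callable, Optional


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: once the cool-down has elapsed, let a probe through.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self) -> None:
        # Any success closes the breaker and resets the failure streak.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            # Open the breaker, or re-open it after a failed probe,
            # which restarts the cool-down.
            self._opened_at = self._clock()
```

While allow_request() returns False, the gateway would answer 502 with retry-after without contacting Ollama.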

4.5 Cache invalidation (key revocation)

Console can revoke a key by inserting into gateway.revocations(key_id, ts, reason). Gateway has a background task (asyncio.create_task in lifespan) that:

  • LISTENs on Postgres channel key_revoked (gateway emits NOTIFY on its own write path; console emits via INSERT trigger)
  • On notification, evicts the Redis cache entry for that key's prefix
  • This makes revocation effectively immediate (bounded by NOTIFY delivery plus one Redis round trip) without cross-service HTTP
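
The eviction step can be sketched independently of the transport. In the sketch below, a plain mapping stands in for Redis, the key-id-to-prefix lookup is injected as a callable, and the cache key template gateway:key:{prefix} is a hypothetical naming choice, not something this SPEC defines. The payload is the key_id text that notify_key_revoked() sends.

```python
from typing import Callable, MutableMapping, Optional


def handle_key_revoked(
    payload: str,
    lookup_prefix: Callable[[str], Optional[str]],
    cache: MutableMapping[str, dict],
    cache_key_template: str = "gateway:key:{prefix}",
) -> bool:
    """Evict cached metadata for a revoked key; return True if an entry was removed."""
    # The cache is keyed by prefix, but the notification carries the key id,
    # so the id has to be resolved to its prefix first.
    prefix = lookup_prefix(payload)
    if prefix is None:
        return False  # unknown key id: nothing can be cached under it
    cache_key = cache_key_template.format(prefix=prefix)
    return cache.pop(cache_key, None) is not None
```

In the running service this function would be called from the LISTEN callback, for example one registered with asyncpg's Connection.add_listener, with a real Redis delete in place of the mapping.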

4.6 Model discovery

The set of usable models is never hand-maintained; it is extracted live from the Ollama backend.

  • A background task (started in lifespan, like the revocation listener) polls Ollama GET /api/tags every MODEL_DISCOVERY_REFRESH_S seconds.
  • The parsed model set (names + sanitized metadata: family, parameter size, quantization, size bytes, modified-at) is cached in Redis under gateway:models:discovered with TTL MODEL_DISCOVERY_CACHE_TTL_S, and held in-process for hot reads on the request path.
  • On startup an initial fetch runs; if Ollama is unreachable the discovered set is empty.
  • Fail-closed: if the discovered set is empty or its cache has expired and cannot be refreshed, no model resolves and requests are denied (consistent with default-deny). Discovery never opens access on failure.
  • "Auto-grant": because the effective set (§4.3 step 7) intersects with discovered (or is discovered when allow_all_models), a model pulled into Ollama out-of-band becomes usable to allow_all tenants on the next refresh — no per-tenant config change.
  • Discovery is read-only against Ollama and uses only the allowlisted /api/tags endpoint; it never triggers a model pull.
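
A minimal sketch of the parsing and fail-closed freshness logic described above. It assumes the usual shape of an Ollama /api/tags body (a models array whose entries carry name, size, modified_at and a details object). ModelInfo, parse_tags and DiscoveredSet are hypothetical names, and the HTTP fetch and Redis write are left out.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ModelInfo:
    name: str
    family: Optional[str]
    parameter_size: Optional[str]
    quantization: Optional[str]
    size_bytes: Optional[int]
    modified_at: Optional[str]


def parse_tags(body: bytes) -> dict[str, ModelInfo]:
    """Parse a GET /api/tags body, keeping only the sanitized fields."""
    models: dict[str, ModelInfo] = {}
    for entry in json.loads(body).get("models", []):
        details = entry.get("details") or {}
        models[entry["name"]] = ModelInfo(
            name=entry["name"],
            family=details.get("family"),
            parameter_size=details.get("parameter_size"),
            quantization=details.get("quantization_level"),
            size_bytes=entry.get("size"),
            modified_at=entry.get("modified_at"),
        )
    return models


@dataclass
class DiscoveredSet:
    ttl_s: float  # corresponds to MODEL_DISCOVERY_CACHE_TTL_S
    models: dict[str, ModelInfo] = field(default_factory=dict)
    fetched_at: Optional[float] = None

    def update(self, models: dict[str, ModelInfo], now: float) -> None:
        self.models = models
        self.fetched_at = now

    def names(self, now: float) -> frozenset[str]:
        # Fail-closed: a never-fetched or expired set resolves no model.
        if self.fetched_at is None or now - self.fetched_at > self.ttl_s:
            return frozenset()
        return frozenset(self.models)
```

A failed refresh simply skips update(), so the set ages out after ttl_s and requests start being denied rather than served from stale data.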

5. Data Model (schema gateway)

CREATE SCHEMA gateway;

CREATE TYPE gateway.key_status AS ENUM ('active', 'disabled', 'revoked');
CREATE TYPE gateway.tenant_status AS ENUM ('active', 'suspended', 'closed');
CREATE TYPE gateway.budget_period AS ENUM ('day', 'month', 'total');

CREATE TABLE gateway.tenants (
    id              uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    name            text NOT NULL UNIQUE,
    status          gateway.tenant_status NOT NULL DEFAULT 'active',
    created_at      timestamptz NOT NULL DEFAULT now(),
    metadata        jsonb NOT NULL DEFAULT '{}'::jsonb
);

CREATE TABLE gateway.tenant_limits (
    tenant_id           uuid PRIMARY KEY REFERENCES gateway.tenants(id) ON DELETE CASCADE,
    rpm                 integer NOT NULL DEFAULT 60,
    tpm                 integer NOT NULL DEFAULT 100000,
    concurrent          integer NOT NULL DEFAULT 8,
    tokens_daily        bigint,
    tokens_monthly      bigint,
    tokens_total        bigint,
    allowed_models      text[] NOT NULL DEFAULT '{}',
    allow_all_models    boolean NOT NULL DEFAULT false,  -- opt-in: allow any installed model
    log_prompts_default boolean NOT NULL DEFAULT false,
    prompt_retention_days integer NOT NULL DEFAULT 30,
    audit_retention_days  integer NOT NULL DEFAULT 365
);

CREATE TABLE gateway.api_keys (
    id              uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       uuid NOT NULL REFERENCES gateway.tenants(id) ON DELETE CASCADE,
    prefix          text NOT NULL UNIQUE,          -- first 12 chars, indexed
    key_hash        text NOT NULL,                  -- argon2id
    name            text NOT NULL,
    status          gateway.key_status NOT NULL DEFAULT 'active',
    scopes          text[] NOT NULL DEFAULT '{chat,embeddings}',
    created_at      timestamptz NOT NULL DEFAULT now(),
    last_used_at    timestamptz,
    expires_at      timestamptz,
    log_prompts     boolean,                        -- NULL = inherit from tenant
    metadata        jsonb NOT NULL DEFAULT '{}'::jsonb
);

CREATE INDEX idx_api_keys_prefix ON gateway.api_keys(prefix) WHERE status = 'active';
CREATE INDEX idx_api_keys_tenant ON gateway.api_keys(tenant_id);

CREATE TABLE gateway.key_limits (
    key_id              uuid PRIMARY KEY REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
    rpm                 integer,            -- NULL = inherit tenant
    tpm                 integer,
    concurrent          integer,
    tokens_daily        bigint,
    tokens_monthly      bigint,
    tokens_total        bigint,
    allowed_models      text[],             -- NULL = inherit tenant
    allow_all_models    boolean             -- NULL = inherit tenant
);

CREATE TABLE gateway.budget_usage (
    key_id          uuid NOT NULL REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
    period          gateway.budget_period NOT NULL,
    period_start    timestamptz NOT NULL,
    tokens_in       bigint NOT NULL DEFAULT 0,
    tokens_out      bigint NOT NULL DEFAULT 0,
    requests        bigint NOT NULL DEFAULT 0,
    PRIMARY KEY (key_id, period, period_start)
);

CREATE INDEX idx_budget_usage_period ON gateway.budget_usage(period, period_start);

CREATE TABLE gateway.audit_log (
    id              bigserial PRIMARY KEY,
    ts              timestamptz NOT NULL DEFAULT now(),
    request_id      uuid NOT NULL,
    tenant_id       uuid,                          -- nullable for auth-failed rows
    key_id          uuid,
    key_prefix      text,                          -- denormalized for forensic queries
    method          text NOT NULL,
    path            text NOT NULL,
    model           text,
    tokens_in       integer,
    tokens_out      integer,
    latency_ms      integer,
    status          integer NOT NULL,
    client_ip       inet,
    user_agent      text,
    error_code      text
);

CREATE INDEX idx_audit_ts ON gateway.audit_log(ts);
CREATE INDEX idx_audit_tenant_ts ON gateway.audit_log(tenant_id, ts);
CREATE INDEX idx_audit_key_ts ON gateway.audit_log(key_id, ts);

CREATE TABLE gateway.prompt_log (
    id              bigserial PRIMARY KEY,
    audit_id        bigint NOT NULL REFERENCES gateway.audit_log(id) ON DELETE CASCADE,
    ts              timestamptz NOT NULL DEFAULT now(),
    key_id          uuid NOT NULL,
    request_body    jsonb NOT NULL,
    response_text   text,
    retention_until timestamptz NOT NULL
);

CREATE INDEX idx_prompt_log_retention ON gateway.prompt_log(retention_until);

CREATE TABLE gateway.revocations (
    id              bigserial PRIMARY KEY,
    key_id          uuid NOT NULL,
    ts              timestamptz NOT NULL DEFAULT now(),
    reason          text,
    processed_at    timestamptz
);

-- Trigger to NOTIFY on revocation insert
CREATE OR REPLACE FUNCTION gateway.notify_key_revoked() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('key_revoked', NEW.key_id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_notify_key_revoked
    AFTER INSERT ON gateway.revocations
    FOR EACH ROW EXECUTE FUNCTION gateway.notify_key_revoked();

-- Grants for console role (created in console SPEC, referenced here)
-- GRANT USAGE ON SCHEMA gateway TO console_role;
-- GRANT SELECT ON ALL TABLES IN SCHEMA gateway TO console_role;
-- GRANT INSERT ON gateway.revocations TO console_role;

6. API Surface

6.1 Native Ollama passthrough (allowlisted)

Path | Method | Notes
/api/chat | POST | Streamed (NDJSON) and non-streamed
/api/generate | POST | Streamed (NDJSON) and non-streamed
/api/embeddings | POST | Non-streamed
/api/embed | POST | Newer Ollama embeddings endpoint
/api/tags | GET | Returns the tenant's effective model set (live-discovered ∩ allowed, or all discovered when allow_all_models). Sourced from discovery (§4.6), never a static list
/api/show | POST | Allowed only for models in the tenant's effective set; returns sanitized model info (no system prompts, no template)
/api/ps | GET | Blocked: leaks loaded models
/api/version | GET | Returns gateway version, not Ollama version

6.2 Hard-blocked Ollama endpoints (always 403)

/api/pull, /api/push, /api/create, /api/copy, /api/delete, /api/blobs/*

6.3 OpenAI-compatible

Path | Method | Maps to
/v1/chat/completions | POST | /api/chat
/v1/completions | POST | /api/generate
/v1/embeddings | POST | /api/embed
/v1/models | GET | /api/tags (the tenant's effective discovered set), in OpenAI model-list format

Translation must preserve streaming. SSE (data: {...}\n\n) for OpenAI-compat; NDJSON for native.
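
A minimal sketch of the streaming translation for /v1/chat/completions. It re-emits each Ollama /api/chat NDJSON line as an OpenAI-style chat.completion.chunk SSE event and ends with data: [DONE]. The function name is hypothetical. The finish reason is hard-coded to "stop" as a simplification, and the caller supplies the completion id and timestamp.

```python
import json
from typing import Iterable, Iterator


def ollama_chat_to_openai_sse(
    ndjson_lines: Iterable[bytes], completion_id: str, created: int
) -> Iterator[str]:
    """Translate an Ollama /api/chat NDJSON stream into OpenAI-style SSE events."""
    for raw in ndjson_lines:
        if not raw.strip():
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(raw)
        done = bool(chunk.get("done"))
        content = (chunk.get("message") or {}).get("content", "")
        payload = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": chunk.get("model", ""),
            "choices": [
                {
                    "index": 0,
                    "delta": {"content": content} if content else {},
                    "finish_reason": "stop" if done else None,
                }
            ],
        }
        # Each SSE event is one "data:" line followed by a blank line.
        yield f"data: {json.dumps(payload)}\n\n"
        if done:
            break
    yield "data: [DONE]\n\n"
```

Because it is a generator, each event can be forwarded as soon as the corresponding upstream line arrives, which keeps the stream incremental.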

6.4 Gateway endpoints

Path | Method | Auth | Purpose
/healthz | GET | none | Liveness: process responsive
/readyz | GET | none | Readiness: DB + Redis + Ollama all reachable
/metrics | GET | none (loopback only) | Prometheus exposition (counters, histograms)

No admin endpoints. Admin lives in neuronetz-console.

6.5 Response headers

Every proxied response carries:

  • X-Request-ID: <uuid>
  • X-RateLimit-Limit-Requests: <n>
  • X-RateLimit-Remaining-Requests: <n>
  • X-RateLimit-Limit-Tokens: <n>
  • X-RateLimit-Remaining-Tokens: <n>
  • X-Budget-Period: day|month|total
  • X-Budget-Tokens-Remaining: <n>

429 responses additionally carry Retry-After: <seconds>.
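
To illustrate how the request-limit headers and Retry-After relate to a sliding window, here is an in-process sketch. It swaps the Redis Lua script named in §4.3 for a plain deque, so it is not atomic across workers. The class and function names are hypothetical.

```python
from __future__ import annotations

import math
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` hits in any `window_s` interval."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: deque[float] = deque()

    def check(self, now: float) -> tuple[bool, int, int]:
        """Return (allowed, remaining, retry_after_s) and record the hit if allowed."""
        cutoff = now - self.window_s
        while self._hits and self._hits[0] <= cutoff:
            self._hits.popleft()  # drop hits that left the window
        if len(self._hits) >= self.limit:
            # The oldest hit leaves the window at hits[0] + window_s.
            retry_after = math.ceil(self._hits[0] + self.window_s - now)
            return False, 0, max(retry_after, 1)
        self._hits.append(now)
        return True, self.limit - len(self._hits), 0


def rate_limit_headers(limit: int, remaining: int, retry_after_s: int) -> dict[str, str]:
    headers = {
        "X-RateLimit-Limit-Requests": str(limit),
        "X-RateLimit-Remaining-Requests": str(remaining),
    }
    if retry_after_s > 0:
        headers["Retry-After"] = str(retry_after_s)  # only on 429 responses
    return headers
```

The token (TPM) headers would follow the same pattern with a weighted window, which this sketch leaves out.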

7. Configuration

All via environment variables, validated by Pydantic Settings on boot. Boot fails loudly on invalid config.

# Service
GATEWAY_BIND_HOST=0.0.0.0
GATEWAY_BIND_PORT=8080
GATEWAY_LOG_LEVEL=INFO
GATEWAY_LOG_FORMAT=json                  # json|console
GATEWAY_REQUEST_ID_HEADER=X-Request-ID
GATEWAY_TRUSTED_PROXIES=127.0.0.1,caddy  # for X-Forwarded-For

# Upstream
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_CONNECT_TIMEOUT_S=5
OLLAMA_READ_TIMEOUT_S=600
OLLAMA_MAX_CONNECTIONS=64

# Model discovery (§4.6)
MODEL_DISCOVERY_REFRESH_S=60             # how often to re-query Ollama /api/tags
MODEL_DISCOVERY_CACHE_TTL_S=120          # Redis cache TTL for the discovered model set

# Database
DATABASE_URL=postgresql+asyncpg://gateway:...@postgres:5432/neuronetz
DATABASE_POOL_SIZE=10
DATABASE_POOL_OVERFLOW=20

# Redis
REDIS_URL=redis://redis:6379/0
REDIS_KEY_CACHE_TTL_S=60

# Limits (defaults; per-tenant/key overrides in DB)
DEFAULT_RPM=60
DEFAULT_TPM=100000
DEFAULT_CONCURRENT=8
MAX_REQUEST_BODY_BYTES=262144
MAX_NUM_PREDICT=4096

# Security
ARGON2_TIME_COST=3
ARGON2_MEMORY_COST_KIB=65536
ARGON2_PARALLELISM=4
AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN=20

# Audit
AUDIT_BUFFER_SIZE=1000
PROMPT_LOG_DEFAULT_RETENTION_DAYS=30
AUDIT_LOG_DEFAULT_RETENTION_DAYS=365

8. Repository Layout

neuronetz-gateway/
├── pyproject.toml                # uv-managed, ruff, mypy --strict, pytest
├── README.md
├── LICENSE                       # Apache 2.0
├── docker-compose.yml            # full stack incl. console placeholder
├── docker-compose.dev.yml        # without caddy, gateway exposed on localhost
├── Dockerfile                    # multi-stage, python:3.12-slim base
├── .env.example
├── .dockerignore
├── .gitignore
├── alembic.ini
├── alembic/
│   ├── env.py
│   └── versions/
│       └── 0001_initial.py       # creates schema `gateway` and all tables
├── ops/
│   ├── caddy/
│   │   └── Caddyfile.example
│   └── systemd/
│       └── neuronetz-gateway.service
├── src/neuronetz_gateway/
│   ├── __init__.py
│   ├── __main__.py               # uvicorn entry
│   ├── app.py                    # FastAPI factory
│   ├── config.py                 # Pydantic Settings
│   ├── deps.py                   # DI providers
│   ├── lifespan.py               # startup/shutdown, NOTIFY listener
│   ├── errors.py                 # exception types, handlers, sanitization
│   ├── auth/
│   │   ├── __init__.py
│   │   ├── hashing.py            # argon2id wrapper
│   │   ├── keys.py               # key generation, prefix, verify
│   │   └── middleware.py
│   ├── ratelimit/
│   │   ├── __init__.py
│   │   ├── sliding_window.py     # Redis Lua script
│   │   └── concurrency.py        # semaphore via Redis
│   ├── budget/
│   │   ├── __init__.py
│   │   ├── counter.py            # Redis period counters
│   │   └── ledger.py             # Postgres reconciliation
│   ├── proxy/
│   │   ├── __init__.py
│   │   ├── ollama.py             # httpx streaming client
│   │   ├── translate.py          # OpenAI <-> Ollama schemas
│   │   ├── token_counter.py      # parse usage from stream
│   │   ├── discovery.py          # live model discovery from Ollama /api/tags (§4.6)
│   │   └── allowlist.py          # effective-set resolution (allow_all / allowed ∩ discovered)
│   ├── routes/
│   │   ├── __init__.py
│   │   ├── ollama_native.py
│   │   ├── openai_compat.py
│   │   └── health.py
│   ├── db/
│   │   ├── __init__.py
│   │   ├── session.py
│   │   ├── models.py             # SQLAlchemy 2.0
│   │   └── repositories.py
│   ├── audit/
│   │   ├── __init__.py
│   │   ├── writer.py             # buffered async writer
│   │   └── prompt_log.py
│   ├── observability/
│   │   ├── __init__.py
│   │   ├── logging.py            # structlog config
│   │   └── metrics.py            # prometheus
│   └── cli/
│       ├── __init__.py
│       └── manage.py             # typer: create-tenant, create-key, ...
├── tests/
│   ├── conftest.py               # testcontainers fixtures
│   ├── unit/
│   │   ├── test_hashing.py
│   │   ├── test_translate.py
│   │   ├── test_token_counter.py
│   │   ├── test_discovery.py
│   │   ├── test_allowlist.py
│   │   └── test_sliding_window.py
│   ├── integration/
│   │   ├── test_auth_flow.py
│   │   ├── test_rate_limit.py
│   │   ├── test_budget.py
│   │   ├── test_proxy_stream.py
│   │   ├── test_openai_compat.py
│   │   ├── test_revocation.py
│   │   └── mock_ollama.py        # FastAPI mock with NDJSON/SSE
│   └── load/
│       └── locustfile.py
└── docs/
    ├── ARCHITECTURE.md
    ├── DEPLOYMENT.md
    ├── API.md
    ├── THREAT_MODEL.md
    └── OPERATIONS.md              # runbook: revoke key, rotate, check usage

9. Non-Functional Requirements

  • Performance: p50 overhead < 5 ms over direct Ollama call (auth + ratelimit + audit); p99 < 25 ms (excluding upstream latency)
  • Streaming: Time-to-first-byte must not be degraded by gateway logic — audit write happens after stream close
  • Memory: Steady-state RSS < 200 MiB per gateway worker under 100 concurrent streams
  • Concurrency: Handle 200 concurrent connections per worker; 4 workers per instance default
  • Test coverage: ≥ 85% line coverage on src/neuronetz_gateway/ excluding __main__ and CLI; 100% on auth/, ratelimit/, budget/
  • Security: No eval, no exec, no shell-out, no pickle. Bandit clean. pip-audit clean on every CI run.
  • Type safety: mypy --strict clean
  • Lint: ruff check clean with project ruleset (E, F, I, B, UP, S, ASYNC)

10. Tooling

  • Python 3.12
  • uv for dependency management (pyproject.toml + uv.lock)
  • FastAPI ≥ 0.115, uvicorn[standard], httpx ≥ 0.27, SQLAlchemy 2.0 (async), asyncpg, redis ≥ 5.0 (with hiredis), structlog, pydantic ≥ 2.9, pydantic-settings, argon2-cffi, typer, prometheus-client
  • Test: pytest, pytest-asyncio, pytest-cov, testcontainers, httpx (test client), respx (mock), locust
  • Lint/format: ruff, mypy --strict, bandit, pip-audit
  • CI: GitHub Actions workflow (lint, type, test with coverage, build image, push on tag)

11. Bootstrap CLI (Typer)

neuronetz-gateway create-tenant --name "acme" [--rpm 60] [--tpm 100000]
neuronetz-gateway create-key --tenant acme --name "prod-server-1" [--scopes chat,embeddings]
neuronetz-gateway revoke-key --prefix nz_abc12345
neuronetz-gateway list-keys --tenant acme
neuronetz-gateway show-usage --tenant acme [--period day|month|total]
neuronetz-gateway set-budget --key nz_abc12345 --daily 1000000 --monthly 30000000
neuronetz-gateway set-models --tenant acme --models llama3.1:8b,mistral:7b
neuronetz-gateway set-models --tenant acme --allow-all          # opt into allow_all_models
neuronetz-gateway set-models --tenant acme --no-allow-all       # back to explicit allowlist
neuronetz-gateway list-models [--tenant acme]                   # show live-discovered models
                                                                # (and the tenant's effective set)

create-tenant accepts --allow-all-models / --no-allow-all-models (default off). list-models reads the discovery cache (§4.6); with --tenant it also shows that tenant's resolved effective set.

Key format: nz_<12-char-prefix><32-char-random>. Prefix is stored; full key is hashed (argon2id). On creation, the full key is printed exactly once.
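
A minimal sketch of generating and parsing keys in this format. Two points are assumptions, not decisions from this SPEC: the alphabet (ASCII letters and digits), and that the stored prefix is the 12 characters after the nz_ tag. Argon2id hashing of the full key is omitted.

```python
from __future__ import annotations

import secrets
import string
from typing import Optional

KEY_TAG = "nz_"
PREFIX_LEN = 12
SECRET_LEN = 32
ALPHABET = string.ascii_letters + string.digits  # assumed alphabet


def generate_key() -> tuple[str, str]:
    """Return (full_key, prefix). The full key is shown to the user exactly once."""
    prefix = "".join(secrets.choice(ALPHABET) for _ in range(PREFIX_LEN))
    secret = "".join(secrets.choice(ALPHABET) for _ in range(SECRET_LEN))
    return KEY_TAG + prefix + secret, prefix


def extract_prefix(full_key: str) -> Optional[str]:
    """Return the lookup prefix, or None if the key is malformed."""
    expected_len = len(KEY_TAG) + PREFIX_LEN + SECRET_LEN
    if not full_key.startswith(KEY_TAG) or len(full_key) != expected_len:
        return None
    body = full_key[len(KEY_TAG):]
    if any(ch not in ALPHABET for ch in body):
        return None
    return body[:PREFIX_LEN]
```

Rejecting malformed keys before any lookup keeps obviously invalid tokens away from both Redis and the Argon2id verifier.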

12. Acceptance Criteria

The build is "done" when every box below is checked. The orchestrator must verify each before declaring v0.1.0.

  • docker compose up from a clean checkout produces a running stack with TLS via Caddy (self-signed in dev, Let's Encrypt-ready in prod).
  • CLI creates tenant and key; printed key successfully authenticates an /api/chat call.
  • Unauthenticated request returns 401 with no Ollama details leaked.
  • Request to /api/pull returns 403 with generic error message.
  • Streaming /api/chat works end-to-end; the first byte arrives within Ollama's own TTFB plus less than 10 ms of gateway overhead.
  • Streaming /v1/chat/completions returns valid SSE with data: [DONE] terminator.
  • Token counts in audit log match Ollama's reported prompt_eval_count + eval_count exactly.
  • /api/tags and /v1/models reflect the live Ollama model set (discovery, §4.6): an allow_all_models tenant sees every installed model and a newly-pulled model appears within one refresh interval; a default-deny tenant sees only allowed_models ∩ discovered; a request for a model outside the effective set returns a generic 403; with discovery unavailable, requests fail closed (deny), not open.
  • Rate limit triggers at configured RPM with Retry-After header.
  • Token budget enforces and blocks at zero remaining with descriptive error.
  • Redis outage causes 503 (fail-closed), not 200.
  • Revocation via INSERT INTO gateway.revocations evicts Redis cache within 1 second.
  • mypy --strict, ruff check, bandit, pip-audit all clean in CI.
  • Test coverage ≥ 85% overall, 100% in auth/, ratelimit/, budget/.
  • docs/THREAT_MODEL.md, docs/DEPLOYMENT.md, docs/OPERATIONS.md present and accurate.
  • Load test (locust): 100 concurrent users sustained 5 minutes, p99 gateway overhead < 25 ms, zero 5xx outside induced failures.

13. Open Questions (decide during build)

  1. Embedding cost accounting — Ollama doesn't return eval_count for embeddings. Decision: charge based on prompt_eval_count only; document as such.
  2. SSE vs NDJSON heuristic for OpenAI-compat — always SSE per OpenAI spec. NDJSON only on native /api/*.
  3. Prometheus cardinality — do not label by key_id (too many series); label by tenant_id only; per-key data lives in Postgres.
  4. Model discovery source — the live model list is GET /api/tags on the Ollama backend; there is no separate registry. Cached in Redis + in-process, refreshed every MODEL_DISCOVERY_REFRESH_S.
  5. Discovery failure is fail-closed — empty/expired discovered set ⇒ no model resolves ⇒ deny. Discovery never opens access on error.
  6. No existence disclosure — a model that is installed-but-unpermitted and a model that is not installed both return the same generic response, to prevent enumeration.
  7. allow_all_models precedence — key-level allow_all_models (when non-NULL) overrides the tenant flag; otherwise the tenant flag applies. Same NULL-inherits-tenant rule as the other key limits.
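
Item 1 can be illustrated with a small token-accounting sketch. It reads prompt_eval_count and eval_count from the final (done) NDJSON chunk of a chat or generate stream, and charges embeddings on prompt_eval_count only. The Usage type and function names are hypothetical, and missing counters are treated as zero, which is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Usage:
    tokens_in: int = 0
    tokens_out: int = 0

    @property
    def total(self) -> int:
        return self.tokens_in + self.tokens_out


def count_stream_tokens(ndjson_lines: Iterable[bytes]) -> Usage:
    """Read the counters Ollama reports on the final chunk of a stream."""
    usage = Usage()
    for raw in ndjson_lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        if chunk.get("done"):
            usage.tokens_in = int(chunk.get("prompt_eval_count", 0))
            usage.tokens_out = int(chunk.get("eval_count", 0))
    return usage


def count_embedding_tokens(body: bytes) -> Usage:
    """Embeddings have no eval_count, so only input tokens are charged."""
    data = json.loads(body)
    return Usage(tokens_in=int(data.get("prompt_eval_count", 0)), tokens_out=0)
```

The resulting tokens_in and tokens_out map onto the columns of the same name in gateway.audit_log and gateway.budget_usage.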

14. References