Initial project structure for neuronetz-gateway per scope-docs/SPEC.md:
- Python 3.12 / FastAPI / SQLAlchemy 2.0 (async) / Redis / Postgres stack managed by uv. Multi-stage non-root Dockerfile, prod + dev compose files (ollama service is NEVER published in either), Caddyfile + systemd unit, justfile, GitHub Actions CI (ruff, mypy --strict, pytest, bandit, pip-audit).
- Pydantic-Settings config covering every env var from SPEC §7, including the MODEL_DISCOVERY_* keys for the dynamic-discovery feature (§4.6).
- Alembic 0001_initial creates the full gateway schema (8 tables, 3 enums, notify_key_revoked() trigger), incl. allow_all_models on tenant_limits and key_limits for the per-tenant auto-grant toggle.
- Working /healthz, /readyz (fail-closed when deps unreachable), and a Prometheus /metrics stub. Sanitizing error handlers that attach X-Request-ID to every response and never leak upstream internals.
- SPEC + AGENT_PROMPT included under scope-docs/ (source of truth).
neuronetz-gateway — SPEC.md
Project: neuronetz-gateway
Version: 0.1.0 (target)
Status: Specification — not yet implemented
License: Apache 2.0
Owner: Stephan Berbig / Neuronetz
1. Purpose
A secure, multi-tenant API gateway in front of an Ollama instance currently exposed at https://api.neuronetz.ai. The Ollama endpoint must never be reachable directly from the public internet again. All access flows through this gateway.
The gateway is the hot path of the Neuronetz API. A separate service (neuronetz-console, built on the Nibiru PHP framework) handles administration, dashboards, and tenant self-service. This SPEC covers only the gateway.
2. Scope
In scope (v0.1.0)
- Authentication via API keys (Bearer tokens)
- Multi-tenant data model (tenants → keys, with inheritance)
- Per-key and per-tenant rate limiting (RPM, TPM, concurrent)
- Per-key and per-tenant token budgets (daily, monthly, total)
- Streaming and non-streaming proxy to Ollama
- Dual API surface: native Ollama (`/api/*`) and OpenAI-compatible (`/v1/*`)
- Endpoint allowlist (block all model-mutating Ollama endpoints)
- Dynamic model discovery from the Ollama backend — the live set of installed models is queried, cached, and auto-refreshed; nothing about the model list is hand-maintained
- Model allowlist (per-tenant override), default-deny, resolved against the live discovered set (stale/typo'd entries never resolve)
- Per-tenant `allow_all_models` toggle (opt-in): a flagged tenant may use any currently-installed model, so models newly pulled into Ollama are auto-granted on the next discovery refresh
- Request size limits, response size limits, timeouts
- Token counting from Ollama responses (precise, not heuristic)
- Audit log (always-on metadata)
- Prompt log (opt-in per key, TTL'd retention)
- Bootstrap CLI: create tenants, keys, set budgets
- Health and readiness endpoints
- Docker Compose deployment (gateway + caddy + postgres + redis + ollama)
- Caddy as TLS terminator (Let's Encrypt for `api.neuronetz.ai`)
Out of scope (v0.1.0, document as future)
- Web admin UI (lives in `neuronetz-console`, separate repo)
- Billing / Stripe integration (budgets only, no money yet)
- Multi-region / HA / k8s
- Content moderation / prompt-injection filtering
- Response caching
- Multi-backend routing (one Ollama; pluggable backend interface stays for later)
- Webhook notifications
- SSO / OAuth2 for admin
3. Threat Model (abbreviated)
| Threat | Mitigation |
|---|---|
| Internet scanners hitting Ollama directly | Ollama bound to internal Docker network; never published |
| Unauthenticated API abuse | Mandatory Bearer token; fail-closed on auth errors |
| API key brute force | Argon2id hashing; constant-time compare; rate limit on auth failures per source IP |
| GPU/token exhaustion (cost attack) | Per-key TPM + token budget; per-tenant ceiling; concurrent connection cap |
| Resource exhaustion via large payloads | Request body size limit (default 256 KiB); num_predict cap (default 4096) |
| Model enumeration / training-data exfil via uncommon models | Model allowlist; default-deny. allow_all_models is opt-in per tenant and audited. Discovery only ever exposes models actually installed on the backend; /api/tags and /v1/models never reveal models outside the tenant's effective set; "not allowed" and "doesn't exist" return the same generic response |
| Discovery backend unreachable | Fail-closed: an empty/stale-expired discovered set means no model resolves, so requests are denied — never "allow because we couldn't list models" |
| Ollama mutation (model pull/delete) by attacker | Endpoint allowlist; mutating endpoints (/api/pull, /api/push, /api/create, /api/copy, /api/delete) hard-blocked at the gateway |
| Information disclosure via error messages | Sanitize upstream errors; never proxy Ollama internals to client |
| Audit log tampering | Append-only at app layer; DB role separation; optional WAL archiving |
| Prompt data leakage | Prompt logging off by default; opt-in per key; TTL'd; redaction hook |
| Redis outage causing "fail open" | Fail-closed: if rate-limit/budget backend is unavailable, deny |
| Compromised admin token | Admin token lives in neuronetz-console, not in gateway; gateway has no admin endpoints |
4. Architecture
4.1 Component diagram
               Internet
                   │ TLS
                   ▼
        ┌──────────────────────┐
        │  Caddy (sidecar)     │  Let's Encrypt for api.neuronetz.ai
        │  - TLS termination   │  HSTS, security headers
        │  - HTTP/2, HTTP/3    │
        └──────────┬───────────┘
                   │ HTTP/1.1 internal
        ┌──────────▼───────────┐
        │  neuronetz-gateway   │  FastAPI + uvicorn
        │  - authn             │
        │  - rate limit        │
        │  - budget check      │
        │  - proxy + stream    │
        │  - token count       │
        │  - audit write       │
        └──┬────────┬──────┬───┘
           │        │      │
    ┌──────▼──┐  ┌──▼───┐  │
    │Postgres │  │Redis │  │
    │ schema: │  │ keys │  │
    │ gateway │  │bucket│  │
    └─────────┘  └──────┘  │
                           │ internal network only
                    ┌──────▼──────┐
                    │   Ollama    │
                    │  127.0.0.1  │
                    └─────────────┘
Same Compose stack also hosts (separate from this SPEC):
- neuronetz-console (PHP/Nibiru) → reads schema `console`, reads schema `gateway` (SELECT)
4.2 Database schemas
Single Postgres instance, two schemas:
- `gateway`: owned by the gateway service; gateway role has full DDL
- `console`: owned by `neuronetz-console` (out of scope here); console role has full DDL
- Both services connect with their own role. Cross-schema access is explicit GRANT.
Console role gets SELECT on all gateway.* tables. Console writes go only to console.* tables. If the console needs to mutate gateway state (e.g. revoke a key), it does so by writing to a gateway.revocations outbox table that the gateway tails (see §4.5).
4.3 Request lifecycle
1. Caddy terminates TLS, forwards to gateway on internal port.
2. Gateway middleware extracts `Authorization: Bearer <key>`.
3. Key prefix (first 12 chars) used as Redis cache key. On miss, lookup `gateway.api_keys` by prefix; verify full key with argon2id `verify`; cache resolved key metadata in Redis (TTL 60s).
4. Rate limit check (sliding window in Redis, Lua-atomic): per-key RPM + per-tenant RPM.
5. Budget check (Redis counter for current period; Postgres ledger is source of truth on reset).
6. Concurrent-connection semaphore (Redis `INCR` with TTL).
7. Model allowlist check. Resolve the effective model set for the key: `allow_all := key.allow_all_models ?? tenant.allow_all_models`; `effective := discovered` if `allow_all` else `(key.allowed_models ?? tenant.allowed_models) ∩ discovered`, where `discovered` is the cached live model set from discovery (§4.6). The request's `model` must be in `effective`, else a generic 403 with no disclosure of whether the model exists but is unpermitted vs. is not installed.
8. Endpoint allowlist check.
9. Request body validation (size, schema, `num_predict` cap).
10. If OpenAI-compat path, translate request to Ollama schema.
11. Open httpx async stream to Ollama.
12. Stream response back to client, accumulating final `prompt_eval_count` + `eval_count`.
13. On stream close: write `gateway.audit_log` row; decrement budget; release semaphore; if prompt logging enabled, write `gateway.prompt_log` row.
14. On any failure: sanitized error to client, audit row with status code, semaphore released.
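The model allowlist check in the lifecycle above can be sketched as a pure function. This is a minimal illustration, not the gateway's actual `allowlist.py`: the `ModelPolicy` container and the function names are hypothetical, and `None` stands in for SQL NULL ("inherit from tenant").

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ModelPolicy:
    """Hypothetical holder for the allowlist columns of one limits row."""
    allowed_models: Optional[FrozenSet[str]]  # None = inherit (key rows only)
    allow_all_models: Optional[bool]          # None = inherit (key rows only)


def effective_models(key: ModelPolicy, tenant: ModelPolicy,
                     discovered: FrozenSet[str]) -> FrozenSet[str]:
    """Resolve the set of models a key may use right now."""
    # Key-level value wins when it is set; otherwise the tenant value applies.
    if key.allow_all_models is not None:
        allow_all = key.allow_all_models
    else:
        allow_all = bool(tenant.allow_all_models)
    if allow_all:
        return discovered
    if key.allowed_models is not None:
        allowed = key.allowed_models
    else:
        allowed = tenant.allowed_models or frozenset()
    # Intersecting with the live set drops stale or mistyped entries.
    return allowed & discovered


def is_model_permitted(model: str, key: ModelPolicy, tenant: ModelPolicy,
                       discovered: FrozenSet[str]) -> bool:
    """False covers both 'not installed' and 'not permitted' (same 403)."""
    return model in effective_models(key, tenant, discovered)
```

Because an empty `discovered` set makes every intersection empty, the same function also gives the fail-closed behavior when discovery is unavailable.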
4.4 Failure modes (fail-closed)
| Subsystem | If down | Behavior |
|---|---|---|
| Postgres (read) | Key lookup fails | 503 with retry-after; no requests proxied |
| Postgres (write) | Audit write fails | Request still succeeds, audit row buffered in-memory ring (max 1000), drained on recovery; if buffer fills, switch to deny mode |
| Redis | Rate limit / budget unavailable | 503 — fail closed. Never "allow because we can't check." |
| Ollama | Upstream unreachable | 502 with retry-after; circuit breaker opens after 5 consecutive failures, half-open after 30s |
| Caddy | Not a gateway concern | — |
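The circuit breaker named in the Ollama row could look roughly like the following. It is a single-process sketch under stated assumptions: the class name, method names, and the injectable `clock` are hypothetical, and it ignores the question of how many trial requests a half-open breaker lets through.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures, half-opens after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_after_s:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        # Closed and half-open both let a request through; open rejects.
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        if self.state == "half-open":
            # The trial request failed: restart the cool-down.
            self._opened_at = self._clock()
            return
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()
```

A caller would check `allow_request()` before opening the upstream stream and answer 502 with a retry hint when it returns `False`.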
4.5 Cache invalidation (key revocation)
Console can revoke a key by inserting into gateway.revocations(key_id, ts, reason). Gateway has a background task (asyncio.create_task in lifespan) that:
- LISTENs on Postgres channel `key_revoked` (gateway emits NOTIFY on its own write path; console emits via INSERT trigger)
- On notification, evicts the Redis cache entry for that key's prefix
- This makes revocation effectively immediate (≤ Redis RTT) without cross-service HTTP
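One detail the steps above imply: the NOTIFY payload is the `key_id`, while the cache is keyed by prefix, so the listener needs a way to map one to the other. The sketch below assumes a reverse index kept next to the cache. `KeyCache` is an in-memory stand-in for Redis and every name in it is hypothetical.

```python
from typing import Dict, Optional


class KeyCache:
    """In-memory stand-in for the Redis key cache (illustrative only)."""

    def __init__(self) -> None:
        self._by_prefix: Dict[str, dict] = {}
        self._prefix_by_key_id: Dict[str, str] = {}  # assumed reverse index

    def put(self, prefix: str, metadata: dict) -> None:
        self._by_prefix[prefix] = metadata
        self._prefix_by_key_id[metadata["key_id"]] = prefix

    def get(self, prefix: str) -> Optional[dict]:
        return self._by_prefix.get(prefix)

    def evict_key_id(self, key_id: str) -> bool:
        prefix = self._prefix_by_key_id.pop(key_id, None)
        if prefix is None:
            return False  # nothing cached for this key
        self._by_prefix.pop(prefix, None)
        return True


def on_key_revoked(cache: KeyCache, payload: str) -> bool:
    """Handle one NOTIFY payload; the trigger sends NEW.key_id as text."""
    return cache.evict_key_id(payload.strip())
```

If the listener is built on asyncpg, a callback registered with `Connection.add_listener("key_revoked", ...)` could forward its `payload` argument to `on_key_revoked`; that wiring is an assumption, not something the spec prescribes.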
4.6 Model discovery
The set of usable models is never hand-maintained; it is extracted live from the Ollama backend.
- A background task (started in lifespan, like the revocation listener) polls Ollama `GET /api/tags` every `MODEL_DISCOVERY_REFRESH_S` seconds.
- The parsed model set (names + sanitized metadata: family, parameter size, quantization, size bytes, modified-at) is cached in Redis under `gateway:models:discovered` with TTL `MODEL_DISCOVERY_CACHE_TTL_S`, and held in-process for hot reads on the request path.
- On startup an initial fetch runs; if Ollama is unreachable the discovered set is empty.
- Fail-closed: if the discovered set is empty or its cache has expired and cannot be refreshed, no model resolves and requests are denied (consistent with default-deny). Discovery never opens access on failure.
- "Auto-grant": because the effective set (§4.3 step 7) intersects with `discovered` (or is `discovered` when `allow_all_models`), a model pulled into Ollama out-of-band becomes usable to `allow_all` tenants on the next refresh, with no per-tenant config change.
- Discovery is read-only against Ollama and uses only the allowlisted `/api/tags` endpoint; it never triggers a model pull.
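The in-process half of discovery can be sketched as a small holder that treats an expired snapshot as empty. This assumes the `/api/tags` body has the shape `{"models": [{"name": ...}, ...]}`. The class and function names are hypothetical, and the Redis copy and the polling task are left out.

```python
import time
from typing import Any, Callable, FrozenSet, Mapping, Optional


def parse_tags(payload: Mapping[str, Any]) -> FrozenSet[str]:
    """Extract model names from an /api/tags response body."""
    models = payload.get("models") or []
    return frozenset(
        m["name"] for m in models if isinstance(m, dict) and m.get("name")
    )


class DiscoveredModels:
    """Holds the last snapshot; an expired snapshot reads as empty."""

    def __init__(self, ttl_s: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._models: FrozenSet[str] = frozenset()
        self._fetched_at: Optional[float] = None

    def update(self, payload: Mapping[str, Any]) -> None:
        # Called by the refresh task after each successful poll.
        self._models = parse_tags(payload)
        self._fetched_at = self._clock()

    def current(self) -> FrozenSet[str]:
        # Fail closed: no snapshot, or a stale one, means no model resolves.
        if self._fetched_at is None:
            return frozenset()
        if self._clock() - self._fetched_at > self._ttl_s:
            return frozenset()
        return self._models
```

With the defaults from §7 (refresh every 60 s, TTL 120 s), a snapshot survives one missed poll before requests start being denied.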
5. Data Model (schema gateway)
CREATE SCHEMA gateway;
CREATE TYPE gateway.key_status AS ENUM ('active', 'disabled', 'revoked');
CREATE TYPE gateway.tenant_status AS ENUM ('active', 'suspended', 'closed');
CREATE TYPE gateway.budget_period AS ENUM ('day', 'month', 'total');
CREATE TABLE gateway.tenants (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
name text NOT NULL UNIQUE,
status gateway.tenant_status NOT NULL DEFAULT 'active',
created_at timestamptz NOT NULL DEFAULT now(),
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE TABLE gateway.tenant_limits (
tenant_id uuid PRIMARY KEY REFERENCES gateway.tenants(id) ON DELETE CASCADE,
rpm integer NOT NULL DEFAULT 60,
tpm integer NOT NULL DEFAULT 100000,
concurrent integer NOT NULL DEFAULT 8,
tokens_daily bigint,
tokens_monthly bigint,
tokens_total bigint,
allowed_models text[] NOT NULL DEFAULT '{}',
allow_all_models boolean NOT NULL DEFAULT false, -- opt-in: allow any installed model
log_prompts_default boolean NOT NULL DEFAULT false,
prompt_retention_days integer NOT NULL DEFAULT 30,
audit_retention_days integer NOT NULL DEFAULT 365
);
CREATE TABLE gateway.api_keys (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id uuid NOT NULL REFERENCES gateway.tenants(id) ON DELETE CASCADE,
prefix text NOT NULL UNIQUE, -- first 12 chars, indexed
key_hash text NOT NULL, -- argon2id
name text NOT NULL,
status gateway.key_status NOT NULL DEFAULT 'active',
scopes text[] NOT NULL DEFAULT '{chat,embeddings}',
created_at timestamptz NOT NULL DEFAULT now(),
last_used_at timestamptz,
expires_at timestamptz,
log_prompts boolean, -- NULL = inherit from tenant
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE INDEX idx_api_keys_prefix ON gateway.api_keys(prefix) WHERE status = 'active';
CREATE INDEX idx_api_keys_tenant ON gateway.api_keys(tenant_id);
CREATE TABLE gateway.key_limits (
key_id uuid PRIMARY KEY REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
rpm integer, -- NULL = inherit tenant
tpm integer,
concurrent integer,
tokens_daily bigint,
tokens_monthly bigint,
tokens_total bigint,
allowed_models text[], -- NULL = inherit tenant
allow_all_models boolean -- NULL = inherit tenant
);
CREATE TABLE gateway.budget_usage (
key_id uuid NOT NULL REFERENCES gateway.api_keys(id) ON DELETE CASCADE,
period gateway.budget_period NOT NULL,
period_start timestamptz NOT NULL,
tokens_in bigint NOT NULL DEFAULT 0,
tokens_out bigint NOT NULL DEFAULT 0,
requests bigint NOT NULL DEFAULT 0,
PRIMARY KEY (key_id, period, period_start)
);
CREATE INDEX idx_budget_usage_period ON gateway.budget_usage(period, period_start);
CREATE TABLE gateway.audit_log (
id bigserial PRIMARY KEY,
ts timestamptz NOT NULL DEFAULT now(),
request_id uuid NOT NULL,
tenant_id uuid, -- nullable for auth-failed rows
key_id uuid,
key_prefix text, -- denormalized for forensic queries
method text NOT NULL,
path text NOT NULL,
model text,
tokens_in integer,
tokens_out integer,
latency_ms integer,
status integer NOT NULL,
client_ip inet,
user_agent text,
error_code text
);
CREATE INDEX idx_audit_ts ON gateway.audit_log(ts);
CREATE INDEX idx_audit_tenant_ts ON gateway.audit_log(tenant_id, ts);
CREATE INDEX idx_audit_key_ts ON gateway.audit_log(key_id, ts);
CREATE TABLE gateway.prompt_log (
id bigserial PRIMARY KEY,
audit_id bigint NOT NULL REFERENCES gateway.audit_log(id) ON DELETE CASCADE,
ts timestamptz NOT NULL DEFAULT now(),
key_id uuid NOT NULL,
request_body jsonb NOT NULL,
response_text text,
retention_until timestamptz NOT NULL
);
CREATE INDEX idx_prompt_log_retention ON gateway.prompt_log(retention_until);
CREATE TABLE gateway.revocations (
id bigserial PRIMARY KEY,
key_id uuid NOT NULL,
ts timestamptz NOT NULL DEFAULT now(),
reason text,
processed_at timestamptz
);
-- Trigger to NOTIFY on revocation insert
CREATE OR REPLACE FUNCTION gateway.notify_key_revoked() RETURNS trigger AS $$
BEGIN
PERFORM pg_notify('key_revoked', NEW.key_id::text);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_notify_key_revoked
AFTER INSERT ON gateway.revocations
FOR EACH ROW EXECUTE FUNCTION gateway.notify_key_revoked();
-- Grants for console role (created in console SPEC, referenced here)
-- GRANT USAGE ON SCHEMA gateway TO console_role;
-- GRANT SELECT ON ALL TABLES IN SCHEMA gateway TO console_role;
-- GRANT INSERT ON gateway.revocations TO console_role;
6. API Surface
6.1 Native Ollama passthrough (allowlisted)
| Path | Method | Notes |
|---|---|---|
| `/api/chat` | POST | Streamed (NDJSON) and non-streamed |
| `/api/generate` | POST | Streamed (NDJSON) and non-streamed |
| `/api/embeddings` | POST | Non-streamed |
| `/api/embed` | POST | Newer Ollama embeddings endpoint |
| `/api/tags` | GET | Returns the tenant's effective model set (live-discovered ∩ allowed, or all discovered when allow_all_models). Sourced from discovery (§4.6), never a static list |
| `/api/show` | POST | Allowed only for models in the tenant's effective set; returns sanitized model info (no system prompts, no template) |
| `/api/ps` | GET | Blocked: leaks loaded models |
| `/api/version` | GET | Returns gateway version, not Ollama version |
6.2 Hard-blocked Ollama endpoints (always 403)
/api/pull, /api/push, /api/create, /api/copy, /api/delete, /api/blobs/*
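A default-deny check makes the hard-block list fall out for free: anything not explicitly allowlisted is refused, so the mutating endpoints never need their own rule. The sketch below uses the method/path pairs from §6.1. The constant and function names are hypothetical, and it does not decide which status code a non-listed path receives.

```python
from typing import FrozenSet, Tuple

# (method, path) pairs taken from the §6.1 table; /api/ps is left out
# because the table marks it as blocked.
ALLOWED_ENDPOINTS: FrozenSet[Tuple[str, str]] = frozenset({
    ("POST", "/api/chat"),
    ("POST", "/api/generate"),
    ("POST", "/api/embeddings"),
    ("POST", "/api/embed"),
    ("GET", "/api/tags"),
    ("POST", "/api/show"),
    ("GET", "/api/version"),
})


def is_endpoint_allowed(method: str, path: str) -> bool:
    """Exact-match allowlist; everything else is denied."""
    return (method.upper(), path) in ALLOWED_ENDPOINTS
```

Exact matching means `/api/blobs/<digest>` and any future Ollama endpoint are denied without a code change.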
6.3 OpenAI-compatible
| Path | Method | Maps to |
|---|---|---|
| `/v1/chat/completions` | POST | `/api/chat` |
| `/v1/completions` | POST | `/api/generate` |
| `/v1/embeddings` | POST | `/api/embed` |
| `/v1/models` | GET | `/api/tags` (the tenant's effective discovered set), in OpenAI model-list format |
Translation must preserve streaming. SSE (data: {...}\n\n) for OpenAI-compat; NDJSON for native.
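For the chat path, the stream translation can be sketched as a generator that turns each Ollama NDJSON line into one OpenAI-style SSE event. It assumes Ollama chunks carry `message.content`, `done`, and optionally `done_reason`. The function name and the caller-supplied `completion_id` and `created` values are hypothetical, and only the fields needed for a basic client are emitted.

```python
import json
from typing import Iterable, Iterator


def ollama_ndjson_to_openai_sse(lines: Iterable[str], completion_id: str,
                                created: int) -> Iterator[str]:
    """Yield SSE events for /v1/chat/completions from /api/chat NDJSON."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        done = bool(chunk.get("done"))
        content = (chunk.get("message") or {}).get("content", "")
        finish_reason = (chunk.get("done_reason") or "stop") if done else None
        event = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": chunk.get("model", ""),
            "choices": [{
                "index": 0,
                "delta": {"content": content} if content else {},
                "finish_reason": finish_reason,
            }],
        }
        payload = json.dumps(event, separators=(",", ":"))
        yield "data: " + payload + "\n\n"
        if done:
            break
    # OpenAI-style streams end with a literal [DONE] sentinel.
    yield "data: [DONE]\n\n"
```

Because it is a generator, each event can be written to the client as soon as its source line arrives, which keeps time-to-first-byte tied to the upstream.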
6.4 Gateway endpoints
| Path | Method | Auth | Purpose |
|---|---|---|---|
| `/healthz` | GET | none | Liveness: process responsive |
| `/readyz` | GET | none | Readiness: DB + Redis + Ollama all reachable |
| `/metrics` | GET | none (loopback only) | Prometheus exposition (counters, histograms) |
No admin endpoints. Admin lives in neuronetz-console.
6.5 Response headers
Every proxied response carries:
- `X-Request-ID: <uuid>`
- `X-RateLimit-Limit-Requests: <n>`
- `X-RateLimit-Remaining-Requests: <n>`
- `X-RateLimit-Limit-Tokens: <n>`
- `X-RateLimit-Remaining-Tokens: <n>`
- `X-Budget-Period: day|month|total`
- `X-Budget-Tokens-Remaining: <n>`
429 responses additionally carry Retry-After: <seconds>.
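To show where the request-limit headers and `Retry-After` values could come from, here is an in-memory sliding-window log for a single counter. It is a stand-in for the Redis Lua script, not a port of it: the class and function names are hypothetical, and a real deployment would run the same logic atomically in Redis so that all workers share the window.

```python
import math
import time
from collections import deque
from typing import Callable, Deque, Dict, Tuple


class SlidingWindowLimiter:
    """Sliding-window log for one counter, e.g. one key's RPM."""

    def __init__(self, limit: int, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self._window_s = window_s
        self._clock = clock
        self._hits: Deque[float] = deque()

    def hit(self) -> Tuple[bool, int, int]:
        """Return (allowed, remaining, retry_after_s)."""
        now = self._clock()
        # Drop timestamps that have slid out of the window.
        while self._hits and now - self._hits[0] >= self._window_s:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            # The oldest hit leaving the window frees the next slot.
            wait = math.ceil(self._hits[0] + self._window_s - now)
            return False, 0, max(1, wait)
        self._hits.append(now)
        return True, self.limit - len(self._hits), 0


def rate_limit_headers(limit: int, remaining: int,
                       retry_after_s: int) -> Dict[str, str]:
    headers = {
        "X-RateLimit-Limit-Requests": str(limit),
        "X-RateLimit-Remaining-Requests": str(remaining),
    }
    if retry_after_s > 0:
        headers["Retry-After"] = str(retry_after_s)  # only on 429
    return headers
```

The token headers would follow the same pattern with a TPM counter that is charged by token count instead of by one per request.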
7. Configuration
All via environment variables, validated by Pydantic Settings on boot. Boot fails loudly on invalid config.
# Service
GATEWAY_BIND_HOST=0.0.0.0
GATEWAY_BIND_PORT=8080
GATEWAY_LOG_LEVEL=INFO
GATEWAY_LOG_FORMAT=json # json|console
GATEWAY_REQUEST_ID_HEADER=X-Request-ID
GATEWAY_TRUSTED_PROXIES=127.0.0.1,caddy # for X-Forwarded-For
# Upstream
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_CONNECT_TIMEOUT_S=5
OLLAMA_READ_TIMEOUT_S=600
OLLAMA_MAX_CONNECTIONS=64
# Model discovery (§4.6)
MODEL_DISCOVERY_REFRESH_S=60 # how often to re-query Ollama /api/tags
MODEL_DISCOVERY_CACHE_TTL_S=120 # Redis cache TTL for the discovered model set
# Database
DATABASE_URL=postgresql+asyncpg://gateway:...@postgres:5432/neuronetz
DATABASE_POOL_SIZE=10
DATABASE_POOL_OVERFLOW=20
# Redis
REDIS_URL=redis://redis:6379/0
REDIS_KEY_CACHE_TTL_S=60
# Limits (defaults; per-tenant/key overrides in DB)
DEFAULT_RPM=60
DEFAULT_TPM=100000
DEFAULT_CONCURRENT=8
MAX_REQUEST_BODY_BYTES=262144
MAX_NUM_PREDICT=4096
# Security
ARGON2_TIME_COST=3
ARGON2_MEMORY_COST_KIB=65536
ARGON2_PARALLELISM=4
AUTH_FAILURE_RATE_LIMIT_PER_IP_PER_MIN=20
# Audit
AUDIT_BUFFER_SIZE=1000
PROMPT_LOG_DEFAULT_RETENTION_DAYS=30
AUDIT_LOG_DEFAULT_RETENTION_DAYS=365
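`AUDIT_BUFFER_SIZE` bounds the in-memory audit buffer described in §4.4. A minimal sketch of that behavior follows, assuming rows are plain dicts and the database write is passed in as a callable. `AuditBuffer` and its method names are hypothetical. Although §4.4 calls it a ring, this sketch never overwrites old rows: a full buffer trips deny mode instead.

```python
from collections import deque
from typing import Callable, Deque


class AuditBuffer:
    """Bounded queue for audit rows that could not be written."""

    def __init__(self, max_size: int = 1000) -> None:
        self._max_size = max_size
        self._rows: Deque[dict] = deque()

    @property
    def deny_mode(self) -> bool:
        # Once full, the gateway should stop proxying new requests.
        return len(self._rows) >= self._max_size

    def buffer(self, row: dict) -> bool:
        """Queue a row after a failed write; False means it did not fit."""
        if self.deny_mode:
            return False
        self._rows.append(row)
        return True

    def drain(self, write: Callable[[dict], None]) -> int:
        """Replay rows oldest-first; stop at the first failing write."""
        written = 0
        while self._rows:
            try:
                write(self._rows[0])
            except Exception:
                break  # database still down; keep the remaining rows
            self._rows.popleft()
            written += 1
        return written
```

Peeking at the head before removing it means a row is only dropped from the buffer after its write succeeded.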
8. Repository Layout
neuronetz-gateway/
├── pyproject.toml # uv-managed, ruff, mypy --strict, pytest
├── README.md
├── LICENSE # Apache 2.0
├── docker-compose.yml # full stack incl. console placeholder
├── docker-compose.dev.yml # without caddy, gateway exposed on localhost
├── Dockerfile # multi-stage, python:3.12-slim base
├── .env.example
├── .dockerignore
├── .gitignore
├── alembic.ini
├── alembic/
│ ├── env.py
│ └── versions/
│ └── 0001_initial.py # creates schema `gateway` and all tables
├── ops/
│ ├── caddy/
│ │ └── Caddyfile.example
│ └── systemd/
│ └── neuronetz-gateway.service
├── src/neuronetz_gateway/
│ ├── __init__.py
│ ├── __main__.py # uvicorn entry
│ ├── app.py # FastAPI factory
│ ├── config.py # Pydantic Settings
│ ├── deps.py # DI providers
│ ├── lifespan.py # startup/shutdown, NOTIFY listener
│ ├── errors.py # exception types, handlers, sanitization
│ ├── auth/
│ │ ├── __init__.py
│ │ ├── hashing.py # argon2id wrapper
│ │ ├── keys.py # key generation, prefix, verify
│ │ └── middleware.py
│ ├── ratelimit/
│ │ ├── __init__.py
│ │ ├── sliding_window.py # Redis Lua script
│ │ └── concurrency.py # semaphore via Redis
│ ├── budget/
│ │ ├── __init__.py
│ │ ├── counter.py # Redis period counters
│ │ └── ledger.py # Postgres reconciliation
│ ├── proxy/
│ │ ├── __init__.py
│ │ ├── ollama.py # httpx streaming client
│ │ ├── translate.py # OpenAI <-> Ollama schemas
│ │ ├── token_counter.py # parse usage from stream
│ │ ├── discovery.py # live model discovery from Ollama /api/tags (§4.6)
│ │ └── allowlist.py # effective-set resolution (allow_all / allowed ∩ discovered)
│ ├── routes/
│ │ ├── __init__.py
│ │ ├── ollama_native.py
│ │ ├── openai_compat.py
│ │ └── health.py
│ ├── db/
│ │ ├── __init__.py
│ │ ├── session.py
│ │ ├── models.py # SQLAlchemy 2.0
│ │ └── repositories.py
│ ├── audit/
│ │ ├── __init__.py
│ │ ├── writer.py # buffered async writer
│ │ └── prompt_log.py
│ ├── observability/
│ │ ├── __init__.py
│ │ ├── logging.py # structlog config
│ │ └── metrics.py # prometheus
│ └── cli/
│ ├── __init__.py
│ └── manage.py # typer: create-tenant, create-key, ...
├── tests/
│ ├── conftest.py # testcontainers fixtures
│ ├── unit/
│ │ ├── test_hashing.py
│ │ ├── test_translate.py
│ │ ├── test_token_counter.py
│ │ ├── test_discovery.py
│ │ ├── test_allowlist.py
│ │ └── test_sliding_window.py
│ ├── integration/
│ │ ├── test_auth_flow.py
│ │ ├── test_rate_limit.py
│ │ ├── test_budget.py
│ │ ├── test_proxy_stream.py
│ │ ├── test_openai_compat.py
│ │ ├── test_revocation.py
│ │ └── mock_ollama.py # FastAPI mock with NDJSON/SSE
│ └── load/
│ └── locustfile.py
└── docs/
├── ARCHITECTURE.md
├── DEPLOYMENT.md
├── API.md
├── THREAT_MODEL.md
└── OPERATIONS.md # runbook: revoke key, rotate, check usage
9. Non-Functional Requirements
- Performance: p50 overhead < 5 ms over direct Ollama call (auth + ratelimit + audit); p99 < 25 ms (excluding upstream latency)
- Streaming: Time-to-first-byte must not be degraded by gateway logic — audit write happens after stream close
- Memory: Steady-state RSS < 200 MiB per gateway worker under 100 concurrent streams
- Concurrency: Handle 200 concurrent connections per worker; 4 workers per instance default
- Test coverage: ≥ 85% line coverage on `src/neuronetz_gateway/` excluding `__main__` and CLI; 100% on `auth/`, `ratelimit/`, `budget/`
- Security: No `eval`, no `exec`, no shell-out, no `pickle`. Bandit clean. `pip-audit` clean on every CI run.
- Type safety: `mypy --strict` clean
- Lint: `ruff check` clean with project ruleset (E, F, I, B, UP, S, ASYNC)
10. Tooling
- Python 3.12
- `uv` for dependency management (pyproject.toml + uv.lock)
- FastAPI ≥ 0.115, uvicorn[standard], httpx ≥ 0.27, SQLAlchemy 2.0 (async), asyncpg, redis ≥ 5.0 (with hiredis), structlog, pydantic ≥ 2.9, pydantic-settings, argon2-cffi, typer, prometheus-client
- Test: pytest, pytest-asyncio, pytest-cov, testcontainers, httpx (test client), respx (mock), locust
- Lint/format: ruff, mypy --strict, bandit, pip-audit
- CI: GitHub Actions workflow (lint, type, test with coverage, build image, push on tag)
11. Bootstrap CLI (Typer)
neuronetz-gateway create-tenant --name "acme" [--rpm 60] [--tpm 100000]
neuronetz-gateway create-key --tenant acme --name "prod-server-1" [--scopes chat,embeddings]
neuronetz-gateway revoke-key --prefix nz_abc12345
neuronetz-gateway list-keys --tenant acme
neuronetz-gateway show-usage --tenant acme [--period day|month|total]
neuronetz-gateway set-budget --key nz_abc12345 --daily 1000000 --monthly 30000000
neuronetz-gateway set-models --tenant acme --models llama3.1:8b,mistral:7b
neuronetz-gateway set-models --tenant acme --allow-all # opt into allow_all_models
neuronetz-gateway set-models --tenant acme --no-allow-all # back to explicit allowlist
neuronetz-gateway list-models [--tenant acme] # show live-discovered models
# (and the tenant's effective set)
create-tenant accepts --allow-all-models / --no-allow-all-models (default off).
list-models reads the discovery cache (§4.6); with --tenant it also shows that tenant's
resolved effective set.
Key format: nz_<12-char-prefix><32-char-random>. Prefix is stored; full key is hashed (argon2id). On creation, the full key is printed exactly once.
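Key generation in that format can be sketched with the standard `secrets` module. Two points are assumptions rather than spec: the alphabet (letters and digits), and that the 12-character prefix is the segment after the `nz_` tag rather than the first 12 characters of the whole key. Argon2id hashing of the full key is left out, and the function names are hypothetical.

```python
import secrets
import string
from typing import Optional, Tuple

KEY_TAG = "nz_"
PREFIX_LEN = 12   # stored in gateway.api_keys.prefix
SECRET_LEN = 32   # never stored, only its hash
ALPHABET = string.ascii_letters + string.digits  # assumed alphabet


def generate_api_key() -> Tuple[str, str]:
    """Return (full_key, prefix); the full key is shown to the user once."""
    prefix = "".join(secrets.choice(ALPHABET) for _ in range(PREFIX_LEN))
    secret = "".join(secrets.choice(ALPHABET) for _ in range(SECRET_LEN))
    return KEY_TAG + prefix + secret, prefix


def extract_prefix(full_key: str) -> Optional[str]:
    """Return the lookup prefix, or None if the key is malformed."""
    expected_len = len(KEY_TAG) + PREFIX_LEN + SECRET_LEN
    if not full_key.startswith(KEY_TAG) or len(full_key) != expected_len:
        return None
    body = full_key[len(KEY_TAG):]
    if any(ch not in ALPHABET for ch in body):
        return None
    return body[:PREFIX_LEN]
```

Rejecting malformed keys before any lookup keeps obviously invalid tokens away from Redis and Postgres.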
12. Acceptance Criteria
The build is "done" when every box below is checked. The orchestrator must verify each before declaring v0.1.0.
- [ ] `docker compose up` from a clean checkout produces a running stack with TLS via Caddy (self-signed in dev, Let's Encrypt-ready in prod).
- [ ] CLI creates tenant and key; printed key successfully authenticates an `/api/chat` call.
- [ ] Unauthenticated request returns 401 with no Ollama details leaked.
- [ ] Request to `/api/pull` returns 403 with generic error message.
- [ ] Streaming `/api/chat` works end-to-end; first byte arrives within Ollama's own TTFB + < 10 ms gateway overhead.
- [ ] Streaming `/v1/chat/completions` returns valid SSE with `data: [DONE]` terminator.
- [ ] Token counts in audit log match Ollama's reported `prompt_eval_count` + `eval_count` exactly.
- [ ] `/api/tags` and `/v1/models` reflect the live Ollama model set (discovery, §4.6): an `allow_all_models` tenant sees every installed model and a newly-pulled model appears within one refresh interval; a default-deny tenant sees only `allowed_models ∩ discovered`; a request for a model outside the effective set returns a generic 403; with discovery unavailable, requests fail closed (deny), not open.
- [ ] Rate limit triggers at configured RPM with `Retry-After` header.
- [ ] Token budget enforces and blocks at zero remaining with descriptive error.
- [ ] Redis outage causes 503 (fail-closed), not 200.
- [ ] Revocation via `INSERT INTO gateway.revocations` evicts Redis cache within 1 second.
- [ ] `mypy --strict`, `ruff check`, `bandit`, `pip-audit` all clean in CI.
- [ ] Test coverage ≥ 85% overall, 100% in `auth/`, `ratelimit/`, `budget/`.
- [ ] `docs/THREAT_MODEL.md`, `docs/DEPLOYMENT.md`, `docs/OPERATIONS.md` present and accurate.
- [ ] Load test (locust): 100 concurrent users sustained 5 minutes, p99 gateway overhead < 25 ms, zero 5xx outside induced failures.
13. Open Questions (decide during build)
- Embedding cost accounting: Ollama doesn't return `eval_count` for embeddings. Decision: charge based on `prompt_eval_count` only; document as such.
- SSE vs NDJSON heuristic for OpenAI-compat: always SSE per OpenAI spec. NDJSON only on native `/api/*`.
- Prometheus cardinality: do not label by `key_id` (too many series); label by `tenant_id` only; per-key data lives in Postgres.
- Model discovery source: the live model list is `GET /api/tags` on the Ollama backend; there is no separate registry. Cached in Redis + in-process, refreshed every `MODEL_DISCOVERY_REFRESH_S`.
- Discovery failure is fail-closed: empty/expired discovered set ⇒ no model resolves ⇒ deny. Discovery never opens access on error.
- No existence disclosure: a model that is installed-but-unpermitted and a model that is not installed both return the same generic response, to prevent enumeration.
- `allow_all_models` precedence: key-level `allow_all_models` (when non-NULL) overrides the tenant flag; otherwise the tenant flag applies. Same NULL-inherits-tenant rule as the other key limits.
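The embedding accounting decision above reduces to treating a missing `eval_count` as zero. A small sketch, assuming the final Ollama response object is available as a mapping; the function name is hypothetical.

```python
from typing import Any, Mapping, Tuple


def billable_tokens(final_chunk: Mapping[str, Any]) -> Tuple[int, int]:
    """Return (tokens_in, tokens_out) from Ollama's reported counts."""
    tokens_in = int(final_chunk.get("prompt_eval_count") or 0)
    # Embedding responses carry no eval_count, so output cost is zero.
    tokens_out = int(final_chunk.get("eval_count") or 0)
    return tokens_in, tokens_out
```

The two values map directly onto the `tokens_in` and `tokens_out` columns of `gateway.audit_log` and `gateway.budget_usage`.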
14. References
- Ollama API: https://github.com/ollama/ollama/blob/main/docs/api.md
- OpenAI Chat Completions: https://platform.openai.com/docs/api-reference/chat
- Nibiru (sibling console project): https://nibiru-framework.com
- Argon2 RFC 9106