Commit Graph

6 Commits

Author SHA1 Message Date
m17hr1l
2c2ead6149 stage-28 fix: deploy.sh pre-trusts the Gitea SSH host key (first-clone)
A fresh prod box has never SSH'd to gitea.neuronetz.ai before, so the
first 'git clone' failed with 'Host key verification failed'. The
script now parses the git remote URL to extract host+port, and on the
prod box does an ssh-keyscan into ~/.ssh/known_hosts before the clone
when the entry is missing. TOFU — if you want to verify the fingerprint
out-of-band, pre-populate known_hosts manually and the script will see
the entry and skip the scan.

Also: if the clone still fails after the host key is trusted (likely a
missing SSH key on Gitea side), the script now prints a clear hint
pointing at where to register it. Supports both ssh://user@host:port/
and user@host: URL forms.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 15:32:44 +02:00
m17hr1l
61b7b8ef20 stage-28: deploy.sh — idempotent remote deploy + health probe
scripts/deploy.sh pushes the current branch to origin, ssh's into the
prod box (neuronetz@cloud.neuronetz.ai:/home/neuronetz/docker-public/
neuro-psyc by default — overridable via env vars), clones-or-pulls,
ensures the external 'backend' docker network exists, runs docker
compose up -d --build (+ --profile gpu if PSYC_PROD_GPU=1), and then
verifies the cockpit is healthy both on prod-internal :8767 and at the
public URL — so the script ends knowing whether the page is up.

Refuses to touch prod's .env (warns + copies .env.example if missing,
so you can edit it manually). Never transfers data/ or adapters
(gitignored; prod fetches its own corpus). Color output, idempotent,
safe to re-run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 14:51:47 +02:00
m17hr1l
2a9c0bf34a stage-6: model inference server
scripts/serve_model.py — FastAPI in the CUDA container, loads base Qwen3.5-4B
+ a psyc adapter once and serves POST /infer. Lets the cockpit (no torch in
its venv) put a real fine-tuned model behind a Worker Mesh bot over HTTP.
Dockerfile.train gains a fastapi + uvicorn layer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 21:05:16 +02:00
m17hr1l
c6655853ac stage-3d: cockpit /train page — datasets + adapters + training metadata
New /train route lists built JSONL datasets (examples, size) and trained
adapters with their base model, hyperparameters, dataset provenance, and
loss history. train_qlora.py now records train_loss + per-step loss_history
into training_meta.json so future runs surface a loss curve in the cockpit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 15:16:46 +02:00
m17hr1l
b95e3e02bd stage-3c: working QLoRA training + eval — pytorch base, Qwen3.5 slug, SFTConfig
Training and eval now run clean on the unsloth 2026.5.2 / transformers v5 /
torch 2.10 stack. Fixes: pytorch/pytorch base image (sidesteps the nvidia/cuda
apt-signature failure and the torch download), correct base-model slug
unsloth/Qwen3.5-4B, TRL SFTConfig API. Adds scripts/eval_adapter.py — runs
dataset rows through base+adapter with structured (transformers-v5) message
content and Qwen3.5 thinking-mode stripping.

First v1 adapter: loss 2.10 -> 0.32 over 3 epochs. Eval surfaced an ill-posed
ioc_extraction dataset (output URL not present in input) — to be fixed in the
ExampleBuilder before the next training run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 14:16:22 +02:00
m17hr1l
f1ab11f89d stage-3c: unsloth QLoRA training scaffold for Qwen3.5
Dockerfile.train builds a CUDA 12.4 + unsloth container that consumes the
Trainline JSONL datasets and emits a LoRA adapter at data/adapters/<run>/final.
Defaults target a 24 GB GPU (Qwen3.5-4B-Instruct-bnb-4bit, r=16, bf16, 3 epochs,
effective batch 8). README documents the build + run workflow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-14 14:17:14 +02:00