Commit Graph

10 Commits

Author SHA1 Message Date
m17hr1l
77e4cb6ab9 deploy-all: redirect deploy.sh stdin from /dev/null so loop doesn't drop hosts 2026-06-07 02:06:45 +02:00
m17hr1l
2c7f71eff8 deploy: scripts/deploy-all.sh + hosts.example for multi-node federation rollouts 2026-06-07 01:03:31 +02:00
m17hr1l
9c3447723a stage-28 fix: deploy.sh — auto-trust Gitea host (TOFU), never touch identity keys
Reinstating the auto known_hosts entry on first deploy. Clear scope:
host trust (TOFU known_hosts entry) is automated — same as
'ssh -o StrictHostKeyChecking=accept-new' would do; identity keypairs
(~/.ssh/id_*) are never generated/copied/modified by deploy.sh.
PSYC_SKIP_HOST_TRUST=1 disables the auto-trust step if you'd rather
verify fingerprints manually.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:36:18 +02:00
m17hr1l
9edd56e28b stage-28 fix: deploy.sh — read-only SSH preflight, no key/known_hosts edits
User asked the script not to touch their SSH config. Reverted the
auto-ssh-keyscan; the script now only READS ~/.ssh/known_hosts (via
ssh-keygen -F) and, when the entry is missing, exits with explicit
manual instructions for verifying the host key and registering an
identity key in Gitea. Identical behavior on the happy path; clearer
diagnostics on the unhappy path; zero modification of ~/.ssh anywhere.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 15:39:06 +02:00
m17hr1l
2c2ead6149 stage-28 fix: deploy.sh pre-trusts the Gitea SSH host key (first-clone)
A fresh prod box has never SSH'd to gitea.neuronetz.ai before, so the
first 'git clone' failed with 'Host key verification failed'. The
script now parses the git remote URL to extract host+port, and on the
prod box does an ssh-keyscan into ~/.ssh/known_hosts before the clone
when the entry is missing. TOFU — if you want to verify the fingerprint
out-of-band, pre-populate known_hosts manually and the script will see
the entry and skip the scan.

Also: if the clone still fails after the host key is trusted (likely a
missing SSH key on Gitea side), the script now prints a clear hint
pointing at where to register it. Supports both ssh://user@host:port/
and user@host: URL forms.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 15:32:44 +02:00
m17hr1l
61b7b8ef20 stage-28: deploy.sh — idempotent remote deploy + health probe
scripts/deploy.sh pushes the current branch to origin, ssh's into the
prod box (neuronetz@cloud.neuronetz.ai:/home/neuronetz/docker-public/
neuro-psyc by default — overridable via env vars), clones-or-pulls,
ensures the external 'backend' docker network exists, runs docker
compose up -d --build (+ --profile gpu if PSYC_PROD_GPU=1), and then
verifies the cockpit is healthy both on prod-internal :8767 and at the
public URL — so the script ends knowing whether the page is up.

Refuses to touch prod's .env (warns + copies .env.example if missing,
so you can edit it manually). Never transfers data/ or adapters
(gitignored; prod fetches its own corpus). Color output, idempotent,
safe to re-run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 14:51:47 +02:00
m17hr1l
2a9c0bf34a stage-6: model inference server
scripts/serve_model.py — FastAPI in the CUDA container, loads base Qwen3.5-4B
+ a psyc adapter once and serves POST /infer. Lets the cockpit (no torch in
its venv) put a real fine-tuned model behind a Worker Mesh bot over HTTP.
Dockerfile.train gains a fastapi + uvicorn layer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 21:05:16 +02:00
m17hr1l
c6655853ac stage-3d: cockpit /train page — datasets + adapters + training metadata
New /train route lists built JSONL datasets (examples, size) and trained
adapters with their base model, hyperparameters, dataset provenance, and
loss history. train_qlora.py now records train_loss + per-step loss_history
into training_meta.json so future runs surface a loss curve in the cockpit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 15:16:46 +02:00
m17hr1l
b95e3e02bd stage-3c: working QLoRA training + eval — pytorch base, Qwen3.5 slug, SFTConfig
Training and eval now run clean on the unsloth 2026.5.2 / transformers v5 /
torch 2.10 stack. Fixes: pytorch/pytorch base image (sidesteps the nvidia/cuda
apt-signature failure and the torch download), correct base-model slug
unsloth/Qwen3.5-4B, TRL SFTConfig API. Adds scripts/eval_adapter.py — runs
dataset rows through base+adapter with structured (transformers-v5) message
content and Qwen3.5 thinking-mode stripping.

First v1 adapter: loss 2.10 -> 0.32 over 3 epochs. Eval surfaced an ill-posed
ioc_extraction dataset (output URL not present in input) — to be fixed in the
ExampleBuilder before the next training run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 14:16:22 +02:00
m17hr1l
f1ab11f89d stage-3c: unsloth QLoRA training scaffold for Qwen3.5
Dockerfile.train builds a CUDA 12.4 + unsloth container that consumes the
Trainline JSONL datasets and emits a LoRA adapter at data/adapters/<run>/final.
Defaults target a 24 GB GPU (Qwen3.5-4B-Instruct-bnb-4bit, r=16, bf16, 3 epochs,
effective batch 8). README documents the build + run workflow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-14 14:17:14 +02:00