stage-3e: well-posed ioc_extraction dataset + clearer /train page

ioc_extraction ExampleBuilder now embeds every IOC into the advisory text so
the extraction task is answerable from the input (v1 asked the model to
"extract" a URL that was never given). /train page distinguishes trained /
training… / not-started, and renders a per-step loss bar chart. Dockerfile no
longer bakes the training script — scripts/ is mounted at run time so edits
take effect without a 21 GB rebuild (this is why psyc-v2's loss capture was
silently skipped on its first run).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
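
A minimal sketch of the well-posedness property described above: every IOC an example expects must appear verbatim in the advisory text it is extracted from. The record keys `input` and `iocs`, and both function names, are hypothetical; the actual ExampleBuilder schema is not shown in this commit.

```python
import json


def missing_iocs(advisory_text, iocs):
    """Return the IOCs that do not appear verbatim in the advisory text."""
    return [ioc for ioc in iocs if ioc not in advisory_text]


def ill_posed_examples(jsonl_lines, text_key="input", iocs_key="iocs"):
    """Yield (line_number, missing) for each example with unanswerable IOCs.

    text_key and iocs_key are assumed field names, not the real schema.
    """
    for line_number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        missing = missing_iocs(record[text_key], record[iocs_key])
        if missing:
            yield line_number, missing
```

Run over an open dataset file (for example `ill_posed_examples(open(path))`), a check like this would be expected to flag the v1 examples whose URL was never given and to yield nothing for v2.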
m17hr1l
2026-05-17 18:09:37 +02:00
parent c6655853ac
commit b4c66c2e87
5 changed files with 80 additions and 37 deletions


@@ -124,15 +124,15 @@ To fine-tune Qwen3.5-4B with QLoRA in an NVIDIA Docker container:
# 2. build the training image (pytorch 2.6/CUDA 12.4 base + unsloth + Qwen3.5)
docker build -t psyc-trainer -f Dockerfile.train .
-# 3. fine-tune (mount host data/ so adapters land there)
-docker run --gpus all --rm \
--v $(pwd)/data:/data \
-psyc-trainer \
---dataset /data/datasets/ioc_extraction-v1.jsonl \
---dataset /data/datasets/severity_classification-v1.jsonl \
---dataset /data/datasets/routing_decision-v1.jsonl \
---dataset /data/datasets/tlp_assignment-v1.jsonl \
---output /data/adapters/psyc-v1
+# 3. fine-tune — scripts/ + data/ are mounted, so script edits need no rebuild
+docker run --gpus all --rm --entrypoint python \
+-v $(pwd)/data:/data -v $(pwd)/scripts:/scripts \
+psyc-trainer /scripts/train_qlora.py \
+--dataset /data/datasets/ioc_extraction-v2.jsonl \
+--dataset /data/datasets/severity_classification-v2.jsonl \
+--dataset /data/datasets/routing_decision-v2.jsonl \
+--dataset /data/datasets/tlp_assignment-v2.jsonl \
+--output /data/adapters/psyc-v2
```
Defaults target a 24 GB consumer GPU (3090/4090): `unsloth/Qwen3.5-4B` at 4-bit,
@@ -150,10 +150,13 @@ docker run --gpus all --rm \
--entrypoint python \
-v $(pwd)/data:/data -v $(pwd)/scripts:/scripts \
psyc-trainer /scripts/eval_adapter.py \
---adapter /data/adapters/psyc-v1/final \
---dataset /data/datasets/ioc_extraction-v1.jsonl --n 5
+--adapter /data/adapters/psyc-v2/final \
+--dataset /data/datasets/ioc_extraction-v2.jsonl --n 5
```
+The cockpit `/train` page lists every built dataset and trained adapter with
+its base model, hyperparameters, dataset provenance, and a per-step loss chart.
## Status
Day 2 of a 48h build. Shipped: Scoutline (URLhaus) → Classifyline → Mapline