stage-3e: well-posed ioc_extraction dataset + clearer /train page

ioc_extraction ExampleBuilder now embeds every IOC into the advisory text, so the extraction task is answerable from the input (v1 asked the model to "extract" a URL that was never given).

/train page distinguishes trained / training… / not-started, and renders a per-step loss bar chart.

Dockerfile no longer bakes in the training script. scripts/ is mounted at run time, so edits take effect without a 21 GB rebuild (the baked-in copy is why psyc-v2's loss capture was silently skipped on its first run).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 18:09:37 +02:00
parent c6655853ac
commit b4c66c2e87
5 changed files with 80 additions and 37 deletions
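
Reviewer note: the well-posedness property described in the commit message (every labelled IOC must appear in the advisory text) could be checked with something like the sketch below. It is not code from this commit. The field names `input` and `iocs`, and the helper names `missing_iocs` and `check_dataset`, are assumptions made for illustration; the real dataset schema may differ.

```python
import json


def missing_iocs(example: dict) -> list[str]:
    """Return the labelled IOCs that do not appear verbatim in the input text.

    Assumes a hypothetical schema: example["input"] is the advisory text and
    example["iocs"] is a list of IOC strings the model is expected to extract.
    """
    text = example["input"]
    return [ioc for ioc in example["iocs"] if ioc not in text]


def check_dataset(lines):
    """Yield (line_number, missing) for each JSONL example that is not answerable."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        missing = missing_iocs(json.loads(line))
        if missing:
            yield number, missing


if __name__ == "__main__":
    # v1-style example: the target URL is never given in the input.
    ill_posed = {
        "input": "Advisory: a phishing URL was observed.",
        "iocs": ["http://example.com/payload"],
    }
    # v2-style example: every IOC is embedded in the advisory text.
    well_posed = {
        "input": "Advisory: beaconing to 198.51.100.7 and http://example.com/payload.",
        "iocs": ["198.51.100.7", "http://example.com/payload"],
    }
    sample = [json.dumps(ill_posed), json.dumps(well_posed)]
    for number, missing in check_dataset(sample):
        print(f"line {number}: not answerable, missing {missing}")
```

A plain substring test is deliberately strict: it would also flag IOCs that appear only in defanged form (for example `hxxp://`), which may or may not be what the dataset intends.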
--- a/Dockerfile.train
+++ b/Dockerfile.train
@@ -3,19 +3,20 @@
 # Build:
 #   docker build -t psyc-trainer -f Dockerfile.train .
 #
-# Run (24 GB GPU, mounts host data/ for datasets + adapter output):
-#   docker run --gpus all --rm \
-#       -v $(pwd)/data:/data \
-#       psyc-trainer \
-#       --dataset /data/datasets/ioc_extraction-v1.jsonl \
-#       --dataset /data/datasets/severity_classification-v1.jsonl \
-#       --dataset /data/datasets/routing_decision-v1.jsonl \
-#       --dataset /data/datasets/tlp_assignment-v1.jsonl \
-#       --output /data/adapters/psyc-v1
+# Run (24 GB GPU; mounts host data/ + scripts/ so script edits need no rebuild):
+#   docker run --gpus all --rm --entrypoint python \
+#       -v $(pwd)/data:/data -v $(pwd)/scripts:/scripts \
+#       psyc-trainer /scripts/train_qlora.py \
+#       --dataset /data/datasets/ioc_extraction-v2.jsonl \
+#       --dataset /data/datasets/severity_classification-v2.jsonl \
+#       --dataset /data/datasets/routing_decision-v2.jsonl \
+#       --dataset /data/datasets/tlp_assignment-v2.jsonl \
+#       --output /data/adapters/psyc-v2
 #
 # Base image already ships Python 3.11 + torch 2.6 + CUDA 12.4 + cuDNN9, so
 # there is no apt step and no torch download. Qwen3.5 needs transformers v5 —
-# unsloth pulls it automatically.
+# unsloth pulls it automatically. The training/eval scripts are MOUNTED at run
+# time (not baked in) so editing scripts/*.py never needs an image rebuild.

 FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel

@@ -27,6 +28,6 @@ RUN pip install --upgrade pip && \
    pip install unsloth unsloth_zoo trl datasets

 WORKDIR /workspace
-COPY scripts/train_qlora.py /workspace/train_qlora.py

-ENTRYPOINT ["python", "/workspace/train_qlora.py"]
+# Scripts are mounted at run time (-v $(pwd)/scripts:/scripts), never baked in.
+ENTRYPOINT ["python"]