stage-3c: working QLoRA training + eval — pytorch base, Qwen3.5 slug, SFTConfig

Training and eval now run clean on the unsloth 2026.5.2 / transformers v5 /
torch 2.10 stack. Fixes: pytorch/pytorch base image (sidesteps the nvidia/cuda
apt-signature failure and the torch download), correct base-model slug
unsloth/Qwen3.5-4B, TRL SFTConfig API. Adds scripts/eval_adapter.py — runs
dataset rows through base+adapter with structured (transformers-v5) message
content and Qwen3.5 thinking-mode stripping.

First v1 adapter: loss 2.10 -> 0.32 over 3 epochs. Eval surfaced an ill-posed
ioc_extraction dataset (output URL not present in input) — to be fixed in the
ExampleBuilder before the next training run.
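
A minimal sketch of a pre-training check for the ill-posed rows described above. It assumes each JSONL row carries `input` and `output` fields (as `scripts/eval_adapter.py` reads them) and that URLs can be found with a simple regex; `missing_urls` and `find_ill_posed_rows` are hypothetical names, not part of ExampleBuilder. Defanged indicators such as `hxxp://` would need their own handling.

```python
import json
import re
from typing import Dict, Iterable, List

# Assumption: a plain http(s) regex is good enough to spot URLs in both fields.
URL_RE = re.compile(r"""https?://[^\s"'<>\[\],]+""")


def missing_urls(row: Dict[str, str]) -> List[str]:
    """Return URLs that appear in row['output'] but nowhere in row['input']."""
    return [u for u in URL_RE.findall(row["output"]) if u not in row["input"]]


def find_ill_posed_rows(lines: Iterable[str]) -> List[int]:
    """Return 1-based line numbers of JSONL rows whose output cites unseen URLs."""
    bad: List[int] = []
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue
        if missing_urls(json.loads(line)):
            bad.append(lineno)
    return bad
```

Passing an open file handle for `ioc_extraction-v1.jsonl` to `find_ill_posed_rows` would list the rows worth dropping or rebuilding before the next run.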

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
m17hr1l
2026-05-17 14:16:22 +02:00
parent f1ab11f89d
commit b95e3e02bd
4 changed files with 121 additions and 35 deletions


@@ -12,36 +12,19 @@
 # --dataset /data/datasets/routing_decision-v1.jsonl \
 # --dataset /data/datasets/tlp_assignment-v1.jsonl \
 # --output /data/adapters/psyc-v1
+#
+# Base image already ships Python 3.11 + torch 2.6 + CUDA 12.4 + cuDNN9, so
+# there is no apt step and no torch download. Qwen3.5 needs transformers v5;
+# unsloth pulls it automatically.
-FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
+FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel
-ENV DEBIAN_FRONTEND=noninteractive \
-    PYTHONUNBUFFERED=1 \
+ENV PYTHONUNBUFFERED=1 \
     PIP_NO_CACHE_DIR=1 \
     HF_HOME=/data/.hf-cache
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    python3.11 python3.11-venv python3-pip \
-    git curl ca-certificates \
-    && rm -rf /var/lib/apt/lists/* \
-    && ln -sf /usr/bin/python3.11 /usr/local/bin/python \
-    && ln -sf /usr/bin/python3.11 /usr/local/bin/python3
-RUN python -m pip install --upgrade pip wheel setuptools && \
-    python -m pip install \
-    torch==2.5.1 \
-    --index-url https://download.pytorch.org/whl/cu124
-RUN python -m pip install \
-    "unsloth @ git+https://github.com/unslothai/unsloth.git" \
-    transformers>=4.46 \
-    datasets>=3.0 \
-    peft>=0.13 \
-    trl>=0.12 \
-    accelerate>=1.1 \
-    bitsandbytes>=0.44 \
-    sentencepiece \
-    protobuf
+RUN pip install --upgrade pip && \
+    pip install unsloth unsloth_zoo trl datasets
 WORKDIR /workspace
 COPY scripts/train_qlora.py /workspace/train_qlora.py
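
The new `RUN` line installs unpinned packages, so the versions that end up in the image depend on when it is built. The commit message mentions torch 2.10 while the Dockerfile comment mentions torch 2.6, which may mean pip replaces the bundled torch; that is an inference, not something this diff confirms. A minimal sketch for reporting what was actually resolved, using only the standard library (`resolved_versions` is a hypothetical helper, not part of the repo):

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Packages named in the Dockerfile or the commit message.
PACKAGES = ("torch", "transformers", "trl", "unsloth", "unsloth_zoo", "datasets")


def resolved_versions(names: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    out: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            out[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            out[name] = None
    return out


if __name__ == "__main__":
    for name, version in resolved_versions().items():
        print(f"{name:14s} {version or 'not installed'}")
```

Run inside the container, the output could be saved next to `training_meta.json` so an adapter can be traced back to the stack that produced it.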


@@ -121,7 +121,7 @@ To fine-tune Qwen3.5-4B with QLoRA in an NVIDIA Docker container:
 # 1. build datasets (one-off; re-run after ingestion changes)
 .venv/bin/psyc train-build-all
-# 2. build the training image (CUDA 12.4 + unsloth + Qwen3.5)
+# 2. build the training image (pytorch 2.6/CUDA 12.4 base + unsloth + Qwen3.5)
 docker build -t psyc-trainer -f Dockerfile.train .
 # 3. fine-tune (mount host data/ so adapters land there)
@@ -135,14 +135,25 @@ docker run --gpus all --rm \
   --output /data/adapters/psyc-v1
``` ```
-Defaults target a 24 GB consumer GPU (3090/4090): Qwen3.5-4B-Instruct at 4-bit,
+Defaults target a 24 GB consumer GPU (3090/4090): `unsloth/Qwen3.5-4B` at 4-bit,
 LoRA `r=16`/`alpha=16`, bf16, 3 epochs, effective batch size 8. For A100-40/80
-bump `--base-model unsloth/Qwen3.5-9B-Instruct-bnb-4bit` and raise
-`--batch-size` + `--max-seq-length`.
+bump `--base-model unsloth/Qwen3.5-9B` and raise `--batch-size` +
+`--max-seq-length`.
 Output: `data/adapters/psyc-v1/final/` (adapter weights) + `training_meta.json`
 (base model, hyperparameters, dataset list).
+Evaluate the adapter against held-out dataset rows:
+```bash
+docker run --gpus all --rm \
+  --entrypoint python \
+  -v $(pwd)/data:/data -v $(pwd)/scripts:/scripts \
+  psyc-trainer /scripts/eval_adapter.py \
+  --adapter /data/adapters/psyc-v1/final \
+  --dataset /data/datasets/ioc_extraction-v1.jsonl --n 5
+```
 ## Status
 Day 2 of a 48h build. Shipped: Scoutline (URLhaus) → Classifyline → Mapline
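
The "effective batch size 8" quoted in the README hunk above is the product of the per-device batch size and the gradient-accumulation steps. The diff does not show the default split in `train_qlora.py`, so the 2 x 4 used below is a hypothetical combination; the helper names are illustrative too. The step count is an approximation, since exact rounding depends on the trainer version.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus


def optimizer_steps(num_examples: int, per_device: int, grad_accum: int,
                    epochs: int, num_gpus: int = 1) -> int:
    """Approximate number of optimizer steps for a full run."""
    per_epoch = math.ceil(num_examples / effective_batch_size(per_device, grad_accum, num_gpus))
    return per_epoch * epochs


# Hypothetical split: batch size 2 with 4 accumulation steps on one GPU.
print(effective_batch_size(2, 4))       # 8
print(optimizer_steps(100, 2, 4, 3))    # 39
```

With only a few dozen optimizer steps, the fixed `warmup_steps=5` is a noticeable share of the schedule, which is worth keeping in mind when raising `--batch-size` on a larger GPU.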

scripts/eval_adapter.py (new file, 93 lines)

@@ -0,0 +1,93 @@
"""Evaluate a psyc QLoRA adapter — run held-out dataset rows through the model.

Run inside the psyc training container (override the entrypoint):

    docker run --gpus all --rm --entrypoint python \
        -v $(pwd)/data:/data -v $(pwd)/scripts:/scripts \
        psyc-trainer /scripts/eval_adapter.py \
        --adapter /data/adapters/psyc-v1/final \
        --dataset /data/datasets/ioc_extraction-v1.jsonl --n 5

Sanity check, not a benchmark: for `--n` rows it prints the prompt, the model's
generation, and the dataset's reference output side by side. With a tiny
dataset the model has seen these rows, so this verifies the adapter learned the
output FORMAT and task shape, not generalization.
"""
from __future__ import annotations

# unsloth must be imported BEFORE transformers.
from unsloth import FastLanguageModel  # noqa: I001

import argparse
import json
import re
from pathlib import Path
from typing import Dict, List


def strip_think(text: str) -> str:
    """Drop Qwen3.5 thinking-mode blocks so exact-match compares the answer only."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()


def load_examples(path: Path, n: int) -> List[Dict[str, str]]:
    out: List[Dict[str, str]] = []
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            out.append(json.loads(line))
            if len(out) >= n:
                break
    return out


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--adapter", required=True, help="path to adapter final/ dir")
    parser.add_argument("--base-model", default="unsloth/Qwen3.5-4B")
    parser.add_argument("--dataset", required=True, help="JSONL to sample test rows from")
    parser.add_argument("--n", type=int, default=5)
    parser.add_argument("--max-seq-length", type=int, default=4096)
    parser.add_argument("--max-new-tokens", type=int, default=256)
    args = parser.parse_args()

    examples = load_examples(Path(args.dataset), args.n)
    if not examples:
        raise SystemExit("no examples loaded")

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=args.adapter,
        max_seq_length=args.max_seq_length,
        dtype=None,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)

    correct = 0
    for i, ex in enumerate(examples, 1):
        prompt = f"{ex['instruction']}\n\n{ex['input']}"
        messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
        inputs = tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
            enable_thinking=False,
        ).to(model.device)
        out = model.generate(input_ids=inputs, max_new_tokens=args.max_new_tokens, do_sample=False)
        generated = strip_think(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
        expected = ex["output"].strip()
        match = generated == expected
        correct += int(match)
        print(f"\n===== example {i}/{len(examples)} [{ex.get('task', '?')}] {'MATCH' if match else 'DIFF'} =====")
        print(f"-- prompt --\n{prompt[:600]}")
        print(f"-- expected --\n{expected[:600]}")
        print(f"-- generated --\n{generated[:600]}")
    print(f"\n[psyc-eval] exact-match {correct}/{len(examples)}")


if __name__ == "__main__":
    main()
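
The thinking-mode stripping and exact-match comparison in the script above can be exercised without a GPU. The sketch below copies the `strip_think` regex from `scripts/eval_adapter.py`; `exact_match` is a hypothetical wrapper around the comparison done in the loop, and the sample strings are made up for illustration.

```python
import re


def strip_think(text: str) -> str:
    """Same regex as eval_adapter.py: drop complete <think>...</think> blocks."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()


def exact_match(generated: str, expected: str) -> bool:
    """Compare the stripped generation with the whitespace-trimmed reference."""
    return strip_think(generated) == expected.strip()


raw = "<think>\nThe URL is on the second line.\n</think>\n\nhttp://example.test/payload"
print(strip_think(raw))                                   # http://example.test/payload
print(exact_match(raw, " http://example.test/payload\n"))  # True

# An unterminated block (for example, generation cut off by --max-new-tokens)
# does not match the regex, so the text is left as-is and counts as a DIFF.
print(strip_think("<think>cut off"))                      # <think>cut off
```

Because the match is character-exact, outputs that differ only in formatting (key order, spacing) also count as DIFF, which fits the script's stated role as a format sanity check.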


@@ -22,8 +22,7 @@ from pathlib import Path
 from typing import Dict, List
 from datasets import Dataset
-from transformers import TrainingArguments
-from trl import SFTTrainer
+from trl import SFTConfig, SFTTrainer
 def load_examples(paths: List[Path]) -> List[Dict[str, str]]:
@@ -44,7 +43,7 @@ def load_examples(paths: List[Path]) -> List[Dict[str, str]]:
 def main() -> None:
     parser = argparse.ArgumentParser(description=__doc__)
     parser.add_argument("--dataset", action="append", required=True, help="JSONL path (repeatable)")
-    parser.add_argument("--base-model", default="unsloth/Qwen3.5-4B-Instruct-bnb-4bit")
+    parser.add_argument("--base-model", default="unsloth/Qwen3.5-4B")
     parser.add_argument("--output", default="/data/adapters/psyc-v1")
     parser.add_argument("--epochs", type=int, default=3)
     parser.add_argument("--lr", type=float, default=2e-4)
@@ -96,9 +95,9 @@ def main() -> None:
         model=model,
         tokenizer=tokenizer,
         train_dataset=dataset,
-        dataset_text_field="text",
-        max_seq_length=args.max_seq_length,
-        args=TrainingArguments(
+        args=SFTConfig(
+            dataset_text_field="text",
+            max_seq_length=args.max_seq_length,
             per_device_train_batch_size=args.batch_size,
             gradient_accumulation_steps=args.grad_accum,
             warmup_steps=5,