m17hr1l 33635796ee docs: add USER_WORKFLOW.md — end-user journey from idea to trained model
Practical walk-through for someone landing on the platform without
context: pick a base model, prepare data, run a fine-tune, test the
result. Grounded in what the platform actually does today.

Covers the four main pages (/models, /datasets/upload, /jobs/create,
/jobs/view) and flags current limitations (NFP-52 chat-test gap for
platform-trained models, etc.). Includes an iteration table mapping
common failure modes to fixes, and a "first-week mistakes" list.

Linked from README alongside the developer MANUAL.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 22:57:08 +02:00

# From Idea to Trained Model — End-User Workflow
A complete walk-through for someone landing on the platform and going from "I want a model that does X" to "I have a model I can test and deploy."
The four phases:
1. [**Pick a model** for your purpose](#1-pick-a-model)
2. [**Get the data** to train it on](#2-get-the-data)
3. [**Train** the model with your data](#3-train-the-model)
4. [**Test** what came out](#4-test-the-result)
Plus what to do when the answer isn't good enough yet — [Iterate](#5-iterate).
---
## 1. Pick a model
A fine-tune is a **base model + your data**. The base shapes capability ceiling, response style, and inference cost. Get this right, the rest is easier.
### Decide what shape of model fits the task
| Task | What you need | Typical base model size |
|---|---|---|
| **Chat assistant**: answer customer questions, knowledge Q&A | An instruct-tuned chat model | 3B–8B params |
| **Structured extraction**: pull fields from documents, classify | Smaller instruct or completion model | 1B–3B (faster, cheaper) |
| **Code completion**: domain-specific code helpers | Code-specialized model | 7B+ |
| **Summarization, rewriting** | Any instruct model | 3B–7B |
| **Embeddings** (semantic search) | Embedding-only model | Tiny (270M–1B) |
**Rule of thumb on size:** smaller is faster and cheaper. Start with the smallest model that already shows the capability you need. Fine-tuning can improve performance substantially, but only within what the base model can already reach. On a narrow task, a well-tuned 1B model can match or beat a much larger general-purpose model.
### Find a base model
Two sources:
**Option A — Already on the platform (fastest)**
Open `/models` → **Pulled Models** tab. You see Ollama-registered models already loaded and ready. Today's shared instance has examples like `llama3.2:3b`, `qwen2:latest`, `deepseek-coder:latest`, plus several existing fine-tunes (e.g. `ustid-extractor-v2`).
If one is close to your task, use it. No download wait.
**Option B — Pull from HuggingFace**
`/models` → **Model Hub** tab → search by name or capability. Filter by `GGUF` format (that's what the platform's inference engine uses).
Practical picks:
- General chat 3B: `bartowski/Llama-3.2-3B-Instruct-GGUF`
- General chat 7–8B: `bartowski/Meta-Llama-3.1-8B-Instruct-GGUF`
- Code: `bartowski/Qwen2.5-Coder-7B-Instruct-GGUF`
- Tiny/fast: `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF`
Click **Pull**. The download runs async with a progress bar. When it finishes, the model appears in **Pulled Models** with a Chat button (for chat-capable ones) and a Generate button (for completion-only ones).
> **Gated models** like the original Meta Llama require an HF token. Add yours in `/settings` → Integrations → HuggingFace API Key. Without a token, only public/community quants work.
### Sanity-check the base before you commit
Before pouring training compute into a fine-tune, chat with the base model first. Ask it 5–10 questions in your domain. You're checking three things:
1. **Does it understand the language/jargon?** If not, you need a bigger base or different domain pre-training — fine-tuning won't close a knowledge gap of that size.
2. **Does it follow instructions in the format you want?** Format you can teach via fine-tune. Capability you can't.
3. **Is response speed acceptable on your hardware?** A 70B model on a 16 GB GPU will be unusable regardless of quality.
If the base fails check 1, pick a bigger or different base. Iterate at this step; it costs nothing but a few minutes of chatting.
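
For check 3, a rough back-of-the-envelope estimate can tell you whether a model's weights fit on your GPU at all. The sketch below is only arithmetic (parameters × bits per weight); it ignores the KV cache, activations, and runtime overhead, and real 4-bit GGUF quantizations use somewhat more than 4 bits per weight, so treat the numbers as a lower bound. The function name is illustrative, not part of the platform.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


for name, params in [("3B", 3), ("8B", 8), ("70B", 70)]:
    print(
        f"{name}: ~{weight_memory_gb(params, 4):.1f} GB at 4-bit, "
        f"~{weight_memory_gb(params, 16):.1f} GB at 16-bit"
    )
```

By this estimate a 70B model needs about 35 GB for its 4-bit weights alone, which is why it cannot run comfortably on a 16 GB GPU.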
---
## 2. Get the data
Where most fine-tunes succeed or fail. Quality > quantity > model size.
### How much data do you actually need
| Goal | Examples typically needed |
|---|---|
| Style/tone shift (sound like our brand) | 100–500 |
| Domain knowledge injection (our product manuals) | 500–5,000 |
| Structured extraction (fields out of forms) | 200–2,000 |
| Function-calling / tool-use specialization | 1,000–10,000 |
| New capability the base lacks | 10,000+ (and probably won't work — try a different base) |
These are **clean, high-quality** examples. 200 hand-curated examples almost always beat 10,000 noisy ones.
### Format
The platform expects **JSONL** (one JSON object per line). Two common shapes:
**Instruction format** (most flexible):
```json
{"instruction": "Summarize the following support ticket.", "input": "Customer says...", "output": "TL;DR: ..."}
{"instruction": "Classify intent.", "input": "Where's my order?", "output": "order_status"}
```
**Chat format** (for chat-style models):
```json
{"messages": [{"role": "user", "content": "What's your return policy?"}, {"role": "assistant", "content": "30 days, full refund..."}]}
```
The platform auto-detects the format on upload and warns you if rows are malformed.
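
If you want to catch malformed rows before uploading, a local pre-check along these lines may help. This is a minimal sketch, not the platform's own detection logic: the function names are hypothetical, and it assumes that `instruction` and `output` are the required fields of the instruction format and that every chat message needs `role` and `content`.

```python
import json


def detect_format(row: dict) -> str:
    """Classify one parsed row as 'chat', 'instruction', or 'unknown'."""
    messages = row.get("messages")
    if isinstance(messages, list) and messages:
        ok = all(
            isinstance(m, dict) and "role" in m and "content" in m
            for m in messages
        )
        return "chat" if ok else "unknown"
    if "instruction" in row and "output" in row:
        return "instruction"
    return "unknown"


def check_jsonl(lines):
    """Return (row counts per format, list of (line_number, problem))."""
    counts, problems = {}, []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((n, "row is not a JSON object"))
            continue
        fmt = detect_format(row)
        counts[fmt] = counts.get(fmt, 0) + 1
        if fmt == "unknown":
            problems.append((n, "unrecognized shape"))
    return counts, problems


sample = [
    '{"instruction": "Classify intent.", "input": "Where\'s my order?", "output": "order_status"}',
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}',
    '{"instruction": "missing output"}',
    "not json",
]
counts, problems = check_jsonl(sample)
print(counts)
print(problems)
```

For a real file, pass an open file handle instead of the sample list: `check_jsonl(open("train.jsonl", encoding="utf-8"))`.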
### Three ways to get data into the platform
Open `/datasets/upload`. Three tabs:
**Tab 1 — Upload JSONL** *(if you already have data)*
- Drag-and-drop. Validation runs immediately. Summary shows row count, average tokens per example, format detected, anything broken.
- Acceptable: `.jsonl`, `.json` (will convert if it's an array of objects).
- Reject and fix on your end: malformed JSON, missing required fields, examples wildly out-of-distribution.
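
The platform converts a `.json` array of objects for you, but if you prefer to hand it JSONL directly, the conversion is small enough to do locally. A minimal sketch, with a hypothetical function name:

```python
import json


def json_array_to_jsonl(text: str) -> str:
    """Convert a JSON array of objects into JSONL (one object per line)."""
    data = json.loads(text)
    if not isinstance(data, list) or not all(isinstance(r, dict) for r in data):
        raise ValueError("expected a JSON array of objects")
    # ensure_ascii=False keeps non-ASCII text readable in the output file
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in data) + "\n"


print(json_array_to_jsonl('[{"a": 1}, {"b": "x"}]'), end="")
```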
**Tab 2 — Document → Q&A** *(if you have docs but no examples yet)*
- Upload PDFs, Markdown, HTML, plain text. The platform chunks them, runs an LLM over each chunk to generate Q&A pairs grounded in the source, and produces a JSONL.
- Best for: knowledge-base assistants, customer-support bots, internal documentation.
- Always **review the generated pairs before training**. The LLM doing the generation is not perfect — drop hallucinations, fix ambiguous wordings. The platform shows them in a reviewable table.
- Rough yield: ~3–8 Q&A pairs per page of source material.
**Tab 3 — Synthetic generation** *(if you have nothing yet)*
- Describe the persona, task, and example shape. The platform generates examples programmatically against a seed model.
- Useful for: bootstrapping a tone/style fine-tune, augmenting a small real-data seed.
- Limitation: synthetic data drifts toward the generator's biases. Use it to scale a small real-data seed, not as the sole source.
### Quality gates to apply yourself before training
- **Deduplicate.** Identical or near-identical examples waste training cycles and risk overfitting.
- **Variety check.** Examples should span the input space you'll see in production. If 90% of your training data is one phrasing pattern, the model overfits to that pattern.
- **Hold out a test set.** Take 10–20% of your examples, set them aside, do NOT include in training. You'll use them in Step 4 to evaluate honestly.
- **Sanity-read 20 examples at random.** If you cringe at any, your model will produce worse versions of those.
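
The first and third gates are easy to script. The sketch below is one possible approach, not a platform feature: it removes duplicates that are identical after lowercasing and collapsing runs of whitespace (true near-duplicates would need fuzzy matching), then carves out a reproducible held-out split. The function names and the 15% default are illustrative choices.

```python
import json
import random


def normalize(row: dict) -> str:
    """Canonical key: sorted-key JSON, lowercased, whitespace runs collapsed."""
    text = json.dumps(row, sort_keys=True, ensure_ascii=False).lower()
    return " ".join(text.split())


def dedupe(rows):
    """Keep the first occurrence of each normalized row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = normalize(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def split_holdout(rows, test_fraction=0.15, seed=42):
    """Shuffle with a fixed seed and return (train_rows, test_rows)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = [
    {"instruction": "Classify intent.", "input": "Where's my order?", "output": "order_status"},
    {"instruction": "Classify intent.", "input": "where's  my order?", "output": "order_status"},
    {"instruction": "Classify intent.", "input": "Cancel my plan", "output": "cancellation"},
]
unique = dedupe(rows)
train, test = split_holdout(unique)
print(len(rows), len(unique), len(train), len(test))
```

Deduplicate before splitting, so the same example cannot end up in both the training set and the test set.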
---
## 3. Train the model
Open `/jobs/create`.
### Pick your inputs
- **Base model** — what you settled on in Step 1
- **Dataset** — what you prepared in Step 2 (excluding your held-out test set)
- **Job name** — descriptive, you'll see it in the list later. e.g. `support-bot-v1`, `extractor-invoices-q3`
### Pick a preset (the easy path)
The platform ships three presets. For your first run, **use a preset.** Tune later.
| Preset | Epochs | LR | When to use |
|---|---|---|---|
| **Quick** | 1 | 3e-4 | First test run. ~3–10 min on most models. Verifies the pipeline works end-to-end and the loss curve makes sense. |
| **Standard** | 3 | 2e-4 | Default for real fine-tunes. Balanced. |
| **Thorough** | 5 | 1e-4 | Larger datasets, when you need maximum learning. Risk: overfitting on small datasets. |
### Advanced parameters (skip on first run)
If you click "Advanced":
- **Batch size**: 4 by default. Lower it if you run out of GPU memory (OOM) on a small GPU. Higher means faster training but more memory.
- **LoRA rank**: 16 by default. Higher gives more capacity to learn (and to overfit). Lower is lighter, faster, and often sufficient.
- **LoRA alpha**: 32 by default. Convention: 2 × rank.
- **Context size**: 2048–4096 by default. Match it to the longest example in your dataset. Going wider costs memory.
- **Quantization**: `4bit` by default for training (QLoRA), `q4_k_m` for the final GGUF export. Smaller means faster inference and lower quality.
> **First-run heuristic:** don't change any of these. Run Quick, look at the loss curve, decide whether to retrain with Standard.
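
To get a feel for how the presets and batch size interact, you can estimate how many training steps a run will take. The sketch below copies the epoch and learning-rate values from the preset table and assumes one optimizer step per batch with no gradient accumulation; the platform may count steps differently, so treat the result as an approximation. The dictionary and function names are hypothetical.

```python
import math

# Values copied from the preset table above.
PRESETS = {
    "quick": {"epochs": 1, "learning_rate": 3e-4},
    "standard": {"epochs": 3, "learning_rate": 2e-4},
    "thorough": {"epochs": 5, "learning_rate": 1e-4},
}


def estimated_steps(num_examples: int, preset: str, batch_size: int = 4) -> int:
    """Steps per epoch (rounded up) times the preset's epoch count."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * PRESETS[preset]["epochs"]


print(estimated_steps(500, "quick"))      # 125
print(estimated_steps(5000, "standard"))  # 3750
```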
### Submit and watch
Click **Start training**. You're routed to `/jobs/view?id=<n>`. You'll see:
- **Live status** — Queued → Spawning container → Loading model → Training → Exporting → Done
- **Progress bar** — % of total training steps complete
- **Loss curve** — updates every few steps, streamed live via WebSocket. Your most important signal.
- **GPU utilization** — should be >80% while training. If not, you're CPU-bound and something's wrong.
- **Logs** — raw output. Useful when things go sideways.
**Reading the loss curve:**
- **Goes down smoothly** → working as expected
- **Plateaus quickly** → either model already knows this (good, you're done) or the LR is too low
- **Spikes / NaNs** → LR too high, or bad data in batch. Stop and investigate.
- **Goes down then back up** → overfitting; you trained too many epochs. Use the earlier checkpoint.
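
If you export the loss values (for example by copying them from the logs), the patterns above can be approximated with a few simple checks. This is a rough heuristic sketch, not how the platform analyzes runs: the window size and the 5% and 10% thresholds are arbitrary assumptions, and it does not detect isolated spikes, only NaN or infinite values.

```python
import math


def diagnose_loss(losses, window=5):
    """Label a loss curve as diverged, rebounding, plateau, or decreasing."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "diverged"
    if len(losses) < 2 * window:
        return "too short to judge"
    start = sum(losses[:window]) / window
    end = sum(losses[-window:]) / window
    best = min(losses)
    # Rebound: the minimum is well behind us and the recent average sits
    # more than 10% above it.
    if losses.index(best) < len(losses) - window and end > best * 1.10:
        return "rebounding"
    # Plateau: the recent average improved by less than 5% of the start.
    if start - end < 0.05 * start:
        return "plateau"
    return "decreasing"


print(diagnose_loss([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.9, 0.8, 0.7, 0.6]))
print(diagnose_loss([2.0, 1.5, 1.0, 0.8, 0.6, 0.7, 0.9, 1.1, 1.3, 1.5]))
```

The first curve is labeled `decreasing` and the second `rebounding`. Use the label as a prompt to look at the chart, not as a replacement for it.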
### What you get when it finishes
- A GGUF file of the fine-tuned model in your `/models` library, status **`ready`**
- A diff metric (loss before/after)
- The Modelfile the platform generated (visible in **Models → \<your model\> → Modelfile** tab)
- A unique model name you'll use to call it
Typical runtimes (rough, depends on hardware):
- Quick on 3B base + 500 examples: 3–10 min
- Standard on 8B base + 5,000 examples: 30–90 min
- Thorough on 8B base + 10,000 examples: 2–6 hours
---
## 4. Test the result
The biggest mistake at this stage: chatting with your model a few times, feeling impressed, calling it done. Don't.
### Step A — Chat with it in the browser
Open `/models` → your trained model → **Chat**.
> **Today's caveat:** the chat button shows up for models with a proper chat template. Some platform-trained GGUFs need the Modelfile's template configured before the button appears. If you don't see the button, edit the Modelfile in `/models/<id>/edit` and confirm the `TEMPLATE` field contains `{{ .Messages }}`. (Tracked as NFP-52 — automatic handling of platform-trained models in the chat tester is in progress.)
Ask 10–20 questions. Compare side-by-side against the base model (open a second chat tab with the un-fine-tuned base). For each question note:
- Did the fine-tune get closer to the right answer?
- Did the fine-tune preserve the base's general capability (did it forget how to be coherent)?
- Did the fine-tune introduce new failure modes?
### Step B — Run your held-out test set
Take the 10–20% you set aside in Step 2.
**Path 1 — Manual through the chat UI**
For small test sets (< 50 examples). Paste each prompt, capture the output, score yourself.
**Path 2 — API**
For larger test sets. The platform exposes an OpenAI-compatible endpoint at `https://api.neuronetz.ai/v1/chat/completions` (use your API key from `/apikeys`):
```python
import openai

# Point the OpenAI client at the platform's OpenAI-compatible endpoint.
client = openai.OpenAI(
    api_key="<your_platform_api_key>",
    base_url="https://api.neuronetz.ai/v1",
)

resp = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Test prompt here"}],
    temperature=0.0,  # deterministic for evaluation
)
print(resp.choices[0].message.content)
```
Loop over your test set, capture outputs, score them. Scoring rubric depends on task:
- **Classification / extraction:** exact-match or F1 against ground truth
- **Generation / chat:** rubric-based human scoring (1–5 per dimension: accuracy, format, tone, safety)
- **Summarization:** ROUGE or human comparison
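
For classification-style tasks, the loop and the scoring can be sketched as follows. This is an illustrative outline under stated assumptions: the row fields `prompt` and `expected` are hypothetical names for your held-out data, and `predict` is any function that takes a prompt and returns the model's text. In practice it would wrap the `client.chat.completions.create` call shown above; a stub is used here so the example runs offline.

```python
def exact_match(preds, golds):
    """Fraction of predictions that equal the expected answer exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def macro_f1(preds, golds):
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in sorted(set(golds) | set(preds)):
        tp = sum(p == label and g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(p != label and g == label for p, g in zip(preds, golds))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def evaluate(test_rows, predict):
    """Run predict over every held-out row and score the outputs."""
    preds = [predict(row["prompt"]).strip() for row in test_rows]
    golds = [row["expected"].strip() for row in test_rows]
    return {
        "exact_match": exact_match(preds, golds),
        "macro_f1": macro_f1(preds, golds),
    }


# Stub standing in for a real API call to the fine-tuned model.
def fake_predict(prompt: str) -> str:
    return "order_status" if "order" in prompt.lower() else "other"


test_rows = [
    {"prompt": "Where's my order?", "expected": "order_status"},
    {"prompt": "Cancel my plan", "expected": "cancellation"},
]
print(evaluate(test_rows, fake_predict))
```

Run the same `evaluate` call against the base model and the fine-tune to get a like-for-like comparison.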
### Step C — Targeted adversarial probes
Don't just test the happy path. Try:
- Out-of-domain questions (does it gracefully decline or hallucinate?)
- Prompt injection ("ignore previous instructions" — does it stay on rails?)
- Edge cases your training data didn't cover
- Adversarial inputs from real users (typos, ambiguity, sarcasm)
A model that aces the test set and falls apart on out-of-distribution input is overfit. Useful signal — tells you what to add to the training set next round.
---
## 5. Iterate
First fine-tune almost never ships. Plan for at least one retrain.
Diagnose what went wrong and adjust:
| Symptom | Likely cause | Fix |
|---|---|---|
| Model parrots training data verbatim | Overfitting | Fewer epochs, more diverse data, lower LoRA rank |
| Model still sounds like the base, no improvement | Underfitting | More epochs, higher LR, more data, higher LoRA rank |
| Model good on test set, bad on real users | Test set doesn't reflect production distribution | Add real user examples to training data |
| Model forgets general capabilities | Catastrophic forgetting — too much fine-tune-specific data, not enough general | Mix in some general instruction data, smaller LoRA rank, fewer epochs |
| Outputs in wrong format (JSON malformed, etc.) | Inconsistent format in training data | Standardize your training data, possibly add format validation as a post-process |
| Inference too slow | Model too big | Re-quantize to a smaller format (Q4 → Q3), or fine-tune a smaller base |
The platform makes this loop cheap: clone the dataset, edit, retrain with a new job name. Keep both — compare side-by-side.
---
## Quick reference — the four pages you'll live in
```
/models → pick base, browse Hub, see your trained models
/datasets/upload → upload, generate, or synthesize training data
/jobs/create → configure a training run
/jobs/view?id=N → watch a job in progress, see loss curve and logs
```
And for credentials:
```
/apikeys → create API keys for headless / scripted use
/settings → set your HuggingFace token, default model, etc.
```
---
## Common first-week mistakes
1. **Training on a tiny dataset and expecting magic.** Less than ~100 well-curated examples usually doesn't produce a noticeable change. Either get more data or pick a base that already mostly does what you want.
2. **No held-out test set.** You can't measure "is this better" without ground truth you didn't train on.
3. **Calling the first model good because it passed five hand-picked questions.** Five is not a sample size. Twenty is the minimum, fifty is better.
4. **Picking a 70B base because "bigger is better."** Bigger is slower and more expensive and the fine-tune lift is often smaller than what you'd get on a 7B. Start small.
5. **Skipping the base-model sanity check (Step 1, "Sanity-check the base before you commit").** If the base doesn't get within shouting distance of your task before training, no amount of fine-tuning will close the gap.
---
*Questions, edge cases, or unclear bits? File an issue in the platform repo or ping the team in the dev channel.*