m17hr1l/finetuning-plattform-setup-delta

Files

m17hr1l 33635796ee docs: add USER_WORKFLOW.md — end-user journey from idea to trained model

Practical walk-through for someone landing on the platform without
context: pick a base model, prepare data, run a fine-tune, test the
result. Grounded in what the platform actually does today.

Covers the four main pages (/models, /datasets/upload, /jobs/create,
/jobs/view) and flags current limitations (NFP-52 chat-test gap for
platform-trained models, etc.). Includes an iteration table mapping
common failure modes to fixes, and a "first-week mistakes" list.

Linked from README alongside the developer MANUAL.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-20 22:57:08 +02:00

15 KiB

Raw Blame History

From Idea to Trained Model — End-User Workflow

A complete walk-through for someone landing on the platform and going from "I want a model that does X" to "I have a model I can test and deploy."

The four phases:

Pick a model for your purpose
Get the data to train it on
Train the model with your data
Test what came out

Plus what to do when the answer isn't good enough yet — Iterate.

1. Pick a model

A fine-tune is a base model + your data. The base shapes capability ceiling, response style, and inference cost. Get this right, the rest is easier.

Decide what shape of model fits the task

Task	What you need	Typical base model size
Chat assistant — answer customer questions, knowledge Q&A	An instruct-tuned chat model	3B–8B params
Structured extraction — pull fields from documents, classify	Smaller instruct or completion model	1B–3B (faster, cheaper)
Code completion — domain-specific code helpers	Code-specialized model	7B+
Summarization, rewriting	Any instruct model	3B–7B
Embeddings (semantic search)	Embedding-only model	Tiny (270M–1B)

Rule of thumb on size: smaller is faster and cheaper. Start at the smallest model that demonstrates the capability — fine-tune lifts performance dramatically, but only within the base's reachable space. A 1B model fine-tuned well on a narrow task often beats a 70B general model.

Find a base model

Two sources:

Option A — Already on the platform (fastest)

Open /models → Pulled Models tab. You see Ollama-registered models already loaded and ready. Today's shared instance has examples like llama3.2:3b, qwen2:latest, deepseek-coder:latest, plus several existing fine-tunes (e.g. ustid-extractor-v2).

If one is close to your task, use it. No download wait.

Option B — Pull from HuggingFace

/models → Model Hub tab → search by name or capability. Filter by GGUF format (that's what the platform's inference engine uses).

Practical picks:

General chat 3B: bartowski/Llama-3.2-3B-Instruct-GGUF
General chat 7–8B: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
Code: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF
Tiny/fast: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF

Click Pull. The download runs async with a progress bar. When it finishes, the model appears in Pulled Models with a Chat button (for chat-capable ones) and a Generate button (for completion-only ones).

Gated models like the original Meta Llama require an HF token. Add yours in /settings → Integrations → HuggingFace API Key. Without a token, only public/community quants work.

Sanity-check the base before you commit

Before pouring training compute into a fine-tune, chat with the base model first. Ask it 5–10 questions in your domain. You're checking three things:

Does it understand the language/jargon? If not, you need a bigger base or different domain pre-training — fine-tuning won't close a knowledge gap of that size.
Does it follow instructions in the format you want? Format you can teach via fine-tune. Capability you can't.
Is response speed acceptable on your hardware? A 70B model on a 16 GB GPU will be unusable regardless of quality.

If the base fails (1), pick a bigger or different base. Iterate at this step — it's free.

2. Get the data

Where most fine-tunes succeed or fail. Quality > quantity > model size.

How much data do you actually need

Goal	Examples typically needed
Style/tone shift (sound like our brand)	100–500
Domain knowledge injection (our product manuals)	500–5,000
Structured extraction (fields out of forms)	200–2,000
Function-calling / tool-use specialization	1,000–10,000
New capability the base lacks	10,000+ (and probably won't work — try a different base)

These are clean, high-quality examples. 200 hand-curated examples beat 10,000 noisy ones. Always.

Format

The platform expects JSONL (one JSON object per line). Two common shapes:

Instruction format (most flexible):

{"instruction": "Summarize the following support ticket.", "input": "Customer says...", "output": "TL;DR: ..."}
{"instruction": "Classify intent.", "input": "Where's my order?", "output": "order_status"}

Chat format (for chat-style models):

{"messages": [{"role": "user", "content": "What's your return policy?"}, {"role": "assistant", "content": "30 days, full refund..."}]}

The platform auto-detects the format on upload and warns you if rows are malformed.

Three ways to get data into the platform

Open /datasets/upload. Three tabs:

Tab 1 — Upload JSONL (if you already have data)

Drag-and-drop. Validation runs immediately. Summary shows row count, average tokens per example, format detected, anything broken.
Acceptable: .jsonl, .json (will convert if it's an array of objects).
Reject and fix on your end: malformed JSON, missing required fields, examples wildly out-of-distribution.

Tab 2 — Document → Q&A (if you have docs but no examples yet)

Upload PDFs, Markdown, HTML, plain text. The platform chunks them, runs an LLM over each chunk to generate Q&A pairs grounded in the source, and produces a JSONL.
Best for: knowledge-base assistants, customer-support bots, internal documentation.
Always review the generated pairs before training. The LLM doing the generation is not perfect — drop hallucinations, fix ambiguous wordings. The platform shows them in a reviewable table.
Rough yield: ~3–8 Q&A pairs per page of source material.

Tab 3 — Synthetic generation (if you have nothing yet)

Describe the persona, task, and example shape. The platform generates examples programmatically against a seed model.
Useful for: bootstrapping a tone/style fine-tune, augmenting a small real-data seed.
Limitation: synthetic data drifts toward the generator's biases. Use it to scale a small real-data seed, not as the sole source.

Quality gates to apply yourself before training

Deduplicate. Identical or near-identical examples waste training cycles and risk overfitting.
Variety check. Examples should span the input space you'll see in production. If 90% of your training data is one phrasing pattern, the model overfits to that pattern.
Hold out a test set. Take 10–20% of your examples, set them aside, do NOT include in training. You'll use them in Step 4 to evaluate honestly.
Sanity-read 20 examples at random. If you cringe at any, your model will produce worse versions of those.

3. Train the model

Open /jobs/create.

Pick your inputs

Base model — what you settled on in Step 1
Dataset — what you prepared in Step 2 (excluding your held-out test set)
Job name — descriptive, you'll see it in the list later. e.g. support-bot-v1, extractor-invoices-q3

Pick a preset (the easy path)

The platform ships three presets. For your first run, use a preset. Tune later.

Preset	Epochs	LR	When to use
Quick	1	3e-4	First test run. ~3–10 min on most models. Verifies the pipeline works end-to-end and the loss curve makes sense.
Standard	3	2e-4	Default for real fine-tunes. Balanced.
Thorough	5	1e-4	Larger datasets, when you need maximum learning. Risk: overfitting on small datasets.

Advanced parameters (skip on first run)

If you click "Advanced":

Batch size — 4 default. Lower if you OOM on a small GPU. Higher = faster training but more memory.
LoRA rank — 16 default. Higher = more capacity to learn (and overfit). Lower = lighter, faster, often sufficient.
LoRA alpha — 32 default. Convention: 2 × rank.
Context size — 2048–4096 default. Match this to the longest example in your dataset. Going wider costs memory.
Quantization — 4bit default for training (QLoRA), q4_k_m for the final GGUF export. Smaller = faster inference, lower quality.

First-run heuristic: don't change any of these. Run Quick, look at the loss curve, decide whether to retrain with Standard.

Submit and watch

Click Start training. You're routed to /jobs/view?id=<n>. You'll see:

Live status — Queued → Spawning container → Loading model → Training → Exporting → Done
Progress bar — % of total training steps complete
Loss curve — updates every few steps, streamed live via WebSocket. Your most important signal.
GPU utilization — should be >80% while training. If not, you're CPU-bound and something's wrong.
Logs — raw output. Useful when things go sideways.

Reading the loss curve:

Goes down smoothly → working as expected
Plateaus quickly → either model already knows this (good, you're done) or the LR is too low
Spikes / NaNs → LR too high, or bad data in batch. Stop and investigate.
Goes down then back up → overfitting; you trained too many epochs. Use the earlier checkpoint.

What you get when it finishes

A GGUF file of the fine-tuned model in your /models library, status ready
A diff metric (loss before/after)
The Modelfile the platform generated (visible in Models → <your model> → Modelfile tab)
A unique model name you'll use to call it

Typical runtimes (rough, depends on hardware):

Quick on 3B base + 500 examples: 3–10 min
Standard on 8B base + 5,000 examples: 30–90 min
Thorough on 8B base + 10,000 examples: 2–6 hours

4. Test the result

The biggest mistake at this stage: chatting with your model a few times, feeling impressed, calling it done. Don't.

Step A — Chat with it in the browser

Open /models → your trained model → Chat.

Today's caveat: the chat button shows up for models with a proper chat template. Some platform-trained GGUFs need the Modelfile's template configured before the button appears. If you don't see the button, edit the Modelfile in /models/<id>/edit and confirm the TEMPLATE field contains {{ .Messages }}. (Tracked as NFP-52 — automatic handling of platform-trained models in the chat tester is in progress.)

Ask 10–20 questions. Compare side-by-side against the base model (open a second chat tab with the un-fine-tuned base). For each question note:

Did the fine-tune get closer to the right answer?
Did the fine-tune preserve the base's general capability (did it forget how to be coherent)?
Did the fine-tune introduce new failure modes?

Step B — Run your held-out test set

Take the 10–20% you set aside in Step 2.

Path 1 — Manual through the chat UI For small test sets (< 50 examples). Paste each prompt, capture the output, score yourself.

Path 2 — API For larger test sets. The platform exposes an OpenAI-compatible endpoint at https://api.neuronetz.ai/v1/chat/completions (use your API key from /apikeys):

import openai

client = openai.OpenAI(
    api_key="<your_platform_api_key>",
    base_url="https://api.neuronetz.ai/v1",
)

resp = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Test prompt here"}],
    temperature=0.0,  # deterministic for evaluation
)
print(resp.choices[0].message.content)

Loop over your test set, capture outputs, score them. Scoring rubric depends on task:

Classification / extraction: exact-match or F1 against ground truth
Generation / chat: rubric-based human scoring (1–5 per dimension: accuracy, format, tone, safety)
Summarization: ROUGE or human comparison

Step C — Targeted adversarial probes

Don't just test the happy path. Try:

Out-of-domain questions (does it gracefully decline or hallucinate?)
Prompt injection ("ignore previous instructions" — does it stay on rails?)
Edge cases your training data didn't cover
Adversarial inputs from real users (typos, ambiguity, sarcasm)

A model that aces the test set and falls apart on out-of-distribution input is overfit. Useful signal — tells you what to add to the training set next round.

5. Iterate

First fine-tune almost never ships. Plan for at least one retrain.

Diagnose what went wrong and adjust:

Symptom	Likely cause	Fix
Model parrots training data verbatim	Overfitting	Fewer epochs, more diverse data, lower LoRA rank
Model still sounds like the base, no improvement	Underfitting	More epochs, higher LR, more data, higher LoRA rank
Model good on test set, bad on real users	Test set doesn't reflect production distribution	Add real user examples to training data
Model forgets general capabilities	Catastrophic forgetting — too much fine-tune-specific data, not enough general	Mix in some general instruction data, smaller LoRA rank, fewer epochs
Outputs in wrong format (JSON malformed, etc.)	Inconsistent format in training data	Standardize your training data, possibly add format validation as a post-process
Inference too slow	Model too big	Re-quantize to a smaller format (Q4 → Q3), or fine-tune a smaller base

The platform makes this loop cheap: clone the dataset, edit, retrain with a new job name. Keep both — compare side-by-side.

Quick reference — the four pages you'll live in

/models           → pick base, browse Hub, see your trained models
/datasets/upload  → upload, generate, or synthesize training data
/jobs/create      → configure a training run
/jobs/view?id=N   → watch a job in progress, see loss curve and logs

And for credentials:

/apikeys          → create API keys for headless / scripted use
/settings         → set your HuggingFace token, default model, etc.

Common first-week mistakes

Training on a tiny dataset and expecting magic. Less than ~100 well-curated examples usually doesn't produce a noticeable change. Either get more data or pick a base that already mostly does what you want.
No held-out test set. You can't measure "is this better" without ground truth you didn't train on.
Calling the first model good because it passed five hand-picked questions. Five is not a sample size. Twenty is the minimum, fifty is better.
Picking a 70B base because "bigger is better." Bigger is slower and more expensive and the fine-tune lift is often smaller than what you'd get on a 7B. Start small.
Skipping the base-model sanity check (Step 1 last paragraph). If the base doesn't get within shouting distance of your task before training, no amount of fine-tuning will close the gap.

Questions, edge cases, or unclear bits? File an issue in the platform repo or ping the team in the dev channel.

15 KiB Raw Blame History Unescape Escape