Practical walk-through for someone landing on the platform without context: pick a base model, prepare data, run a fine-tune, test the result. Grounded in what the platform actually does today. Covers the four main pages (/models, /datasets/upload, /jobs/create, /jobs/view) and flags current limitations (NFP-52 chat-test gap for platform-trained models, etc.). Includes an iteration table mapping common failure modes to fixes, and a "first-week mistakes" list. Linked from README alongside the developer MANUAL. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
From Idea to Trained Model — End-User Workflow
A complete walk-through for someone landing on the platform and going from "I want a model that does X" to "I have a model I can test and deploy."
The four phases:
- Pick a model for your purpose
- Get the data to train it on
- Train the model with your data
- Test what came out
Plus what to do when the answer isn't good enough yet — Iterate.
1. Pick a model
A fine-tune is a base model plus your data. The base sets the capability ceiling, the response style, and the inference cost. Get this right and the rest is easier.
Decide what shape of model fits the task
| Task | What you need | Typical base model size |
|---|---|---|
| Chat assistant — answer customer questions, knowledge Q&A | An instruct-tuned chat model | 3B–8B params |
| Structured extraction — pull fields from documents, classify | Smaller instruct or completion model | 1B–3B (faster, cheaper) |
| Code completion — domain-specific code helpers | Code-specialized model | 7B+ |
| Summarization, rewriting | Any instruct model | 3B–7B |
| Embeddings (semantic search) | Embedding-only model | Tiny (270M–1B) |
Rule of thumb on size: smaller is faster and cheaper. Start at the smallest model that demonstrates the capability. Fine-tuning can lift performance substantially, but only within what the base can already reach. On a narrow task, a well fine-tuned 1B model can match or beat a much larger general-purpose model.
Find a base model
Two sources:
Option A — Already on the platform (fastest)
Open /models → Pulled Models tab. You see Ollama-registered models already loaded and ready. Today's shared instance has examples like llama3.2:3b, qwen2:latest, deepseek-coder:latest, plus several existing fine-tunes (e.g. ustid-extractor-v2).
If one is close to your task, use it. No download wait.
Option B — Pull from HuggingFace
/models → Model Hub tab → search by name or capability. Filter by GGUF format (that's what the platform's inference engine uses).
Practical picks:
- General chat 3B: bartowski/Llama-3.2-3B-Instruct-GGUF
- General chat 7–8B: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
- Code: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF
- Tiny/fast: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
Click Pull. The download runs async with a progress bar. When it finishes, the model appears in Pulled Models with a Chat button (for chat-capable ones) and a Generate button (for completion-only ones).
Gated models like the original Meta Llama require an HF token. Add yours in /settings → Integrations → HuggingFace API Key. Without a token, only public/community quants work.
Sanity-check the base before you commit
Before pouring training compute into a fine-tune, chat with the base model first. Ask it 5–10 questions in your domain. You're checking three things:
- Does it understand the language/jargon? If not, you need a bigger base or different domain pre-training — fine-tuning won't close a knowledge gap of that size.
- Does it follow instructions in the format you want? Format you can teach via fine-tune. Capability you can't.
- Is response speed acceptable on your hardware? A 70B model on a 16 GB GPU will be unusable regardless of quality.
If the base fails the first check (language and jargon), pick a bigger or different base. Iterate at this step: it costs no training compute.
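The hardware point in the third check can be estimated before downloading anything. The sketch below is back-of-the-envelope arithmetic, not a platform feature: it computes the memory taken by the weights alone. The figure of about 4.5 bits per weight for a 4-bit GGUF quant and the 1.2 overhead factor for KV cache and buffers are assumptions to adjust for your setup.

```python
def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits_on_gpu(params_billions: float, bits_per_weight: float,
                gpu_gib: float, overhead_factor: float = 1.2) -> bool:
    """True if weights times an assumed overhead factor fit in GPU memory."""
    needed = weight_memory_gib(params_billions, bits_per_weight) * overhead_factor
    return needed <= gpu_gib


if __name__ == "__main__":
    # 4.5 bits/weight is an assumed average for a 4-bit quant.
    for size in (3, 8, 70):
        need = weight_memory_gib(size, 4.5)
        print(f"{size}B: ~{need:.1f} GiB of weights, "
              f"fits on 16 GiB: {fits_on_gpu(size, 4.5, 16)}")
```

Under these assumptions a 70B model needs well over 30 GiB for weights alone, which is why it cannot run from a 16 GB GPU without offloading most layers to much slower CPU memory.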
2. Get the data
This is where most fine-tunes succeed or fail. In order of impact: data quality > data quantity > model size.
How much data do you actually need
| Goal | Examples typically needed |
|---|---|
| Style/tone shift (sound like our brand) | 100–500 |
| Domain knowledge injection (our product manuals) | 500–5,000 |
| Structured extraction (fields out of forms) | 200–2,000 |
| Function-calling / tool-use specialization | 1,000–10,000 |
| New capability the base lacks | 10,000+ (and probably won't work — try a different base) |
These counts assume clean, high-quality examples. 200 hand-curated examples usually beat 10,000 noisy ones.
Format
The platform expects JSONL (one JSON object per line). Two common shapes:
Instruction format (most flexible):
{"instruction": "Summarize the following support ticket.", "input": "Customer says...", "output": "TL;DR: ..."}
{"instruction": "Classify intent.", "input": "Where's my order?", "output": "order_status"}
Chat format (for chat-style models):
{"messages": [{"role": "user", "content": "What's your return policy?"}, {"role": "assistant", "content": "30 days, full refund..."}]}
The platform auto-detects the format on upload and warns you if rows are malformed.
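It can save an upload round-trip to run a similar check locally first. The following is a minimal sketch and not the platform's own validator: the function names are made up for illustration, and treating "input" as optional in the instruction format is an assumption.

```python
import json

# Assumed minimum for the instruction shape; "input" is treated as optional here.
INSTRUCTION_KEYS = {"instruction", "output"}


def detect_format(row: dict) -> str:
    """Return 'chat', 'instruction', or 'unknown' for one parsed row."""
    messages = row.get("messages")
    if isinstance(messages, list) and messages and all(
        isinstance(m, dict) and "role" in m and "content" in m for m in messages
    ):
        return "chat"
    if INSTRUCTION_KEYS.issubset(row):
        return "instruction"
    return "unknown"


def validate_jsonl(lines):
    """Yield (line_number, problem) for every row that fails the checks."""
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            yield n, f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(row, dict):
            yield n, "row is not a JSON object"
        elif detect_format(row) == "unknown":
            yield n, "neither instruction nor chat shape"
```

To check a file, pass an open file handle: `list(validate_jsonl(open("train.jsonl", encoding="utf-8")))` returns an empty list when every row parses and matches one of the two shapes.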
Three ways to get data into the platform
Open /datasets/upload. Three tabs:
Tab 1 — Upload JSONL (if you already have data)
- Drag-and-drop. Validation runs immediately. Summary shows row count, average tokens per example, format detected, anything broken.
- Acceptable: .jsonl, .json (converted if it's an array of objects).
- Reject and fix on your end: malformed JSON, missing required fields, examples wildly out-of-distribution.
Tab 2 — Document → Q&A (if you have docs but no examples yet)
- Upload PDFs, Markdown, HTML, plain text. The platform chunks them, runs an LLM over each chunk to generate Q&A pairs grounded in the source, and produces a JSONL.
- Best for: knowledge-base assistants, customer-support bots, internal documentation.
- Always review the generated pairs before training. The LLM doing the generation is not perfect — drop hallucinations, fix ambiguous wordings. The platform shows them in a reviewable table.
- Rough yield: ~3–8 Q&A pairs per page of source material.
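The yield figure gives a quick way to check whether your documents are large enough for your goal. This is plain arithmetic on the 3–8 pairs-per-page range; the `keep_rate` parameter (the share of generated pairs that survive your review) is an assumption, not a platform statistic.

```python
import math


def estimate_pairs(pages: int, low_per_page: int = 3, high_per_page: int = 8):
    """Return the (low, high) range of generated Q&A pairs for a document set."""
    return pages * low_per_page, pages * high_per_page


def pages_needed(target_examples: int, keep_rate: float = 0.7,
                 low_per_page: int = 3) -> int:
    """Pages required to reach a target after review, using the pessimistic yield."""
    return math.ceil(target_examples / (low_per_page * keep_rate))


if __name__ == "__main__":
    print(estimate_pairs(100))   # range of pairs from a 100-page manual
    print(pages_needed(500))     # pages needed for 500 reviewed examples
```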
Tab 3 — Synthetic generation (if you have nothing yet)
- Describe the persona, task, and example shape. The platform generates examples programmatically against a seed model.
- Useful for: bootstrapping a tone/style fine-tune, augmenting a small real-data seed.
- Limitation: synthetic data drifts toward the generator's biases. Use it to scale a small real-data seed, not as the sole source.
Quality gates to apply yourself before training
- Deduplicate. Identical or near-identical examples waste training cycles and risk overfitting.
- Variety check. Examples should span the input space you'll see in production. If 90% of your training data is one phrasing pattern, the model overfits to that pattern.
- Hold out a test set. Take 10–20% of your examples, set them aside, do NOT include in training. You'll use them in Step 4 to evaluate honestly.
- Sanity-read 20 examples at random. If you cringe at any, your model will produce worse versions of those.
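The first and third gates are easy to script. The sketch below is one possible approach, not a platform tool: it drops duplicates that differ only in case or whitespace, then carves off a reproducible hold-out set. It does not catch paraphrased near-duplicates, and the 15% split and fixed seed are illustrative choices.

```python
import json
import random


def normalize(row: dict) -> str:
    """Canonical key for duplicate detection (case and whitespace folded)."""
    text = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return " ".join(text.lower().split())


def dedupe(rows):
    """Drop rows whose normalized form was already seen, keeping first occurrences."""
    seen, unique = set(), []
    for row in rows:
        key = normalize(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def split_holdout(rows, test_fraction=0.15, seed=42):
    """Shuffle deterministically and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Deduplicate before splitting. Otherwise a duplicate can land on both sides of the split and make the test score look better than it is.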
3. Train the model
Open /jobs/create.
Pick your inputs
- Base model: what you settled on in Step 1
- Dataset: what you prepared in Step 2 (excluding your held-out test set)
- Job name: make it descriptive, since you'll see it in the job list later, e.g. support-bot-v1, extractor-invoices-q3
Pick a preset (the easy path)
The platform ships three presets. For your first run, use a preset. Tune later.
| Preset | Epochs | LR | When to use |
|---|---|---|---|
| Quick | 1 | 3e-4 | First test run. ~3–10 min on most models. Verifies the pipeline works end-to-end and the loss curve makes sense. |
| Standard | 3 | 2e-4 | Default for real fine-tunes. Balanced. |
| Thorough | 5 | 1e-4 | Larger datasets, when you need maximum learning. Risk: overfitting on small datasets. |
Advanced parameters (skip on first run)
If you click "Advanced":
- Batch size: 4 default. Lower it if you run out of memory (OOM) on a small GPU. Higher = faster training but more memory.
- LoRA rank: 16 default. Higher = more capacity to learn (and overfit). Lower = lighter, faster, often sufficient.
- LoRA alpha: 32 default. Convention: 2 × rank.
- Context size: 2048–4096 tokens default. Set it to cover the longest example in your dataset. Going wider costs memory.
- Quantization: 4bit default for training (QLoRA), q4_k_m for the final GGUF export. Smaller = faster inference, lower quality.
First-run heuristic: don't change any of these. Run Quick, look at the loss curve, decide whether to retrain with Standard.
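To see how the presets and defaults interact, it helps to work out how many optimizer steps a job will run. The sketch below mirrors the values listed above in a hypothetical `TrainConfig` class; it is not the platform's job schema. It also assumes no gradient accumulation and that a final partial batch still counts as a step.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Illustrative container; field names are not the platform's.
    epochs: int = 3               # Standard preset
    learning_rate: float = 2e-4   # Standard preset
    batch_size: int = 4
    lora_rank: int = 16

    @property
    def lora_alpha(self) -> int:
        return 2 * self.lora_rank  # the "2 x rank" convention


def total_steps(num_examples: int, cfg: TrainConfig) -> int:
    """Optimizer steps, assuming no gradient accumulation."""
    return math.ceil(num_examples / cfg.batch_size) * cfg.epochs


if __name__ == "__main__":
    quick = TrainConfig(epochs=1, learning_rate=3e-4)
    print(total_steps(500, quick))           # Quick preset on 500 examples
    print(total_steps(5000, TrainConfig()))  # Standard preset on 5,000 examples
```

Halving the batch size to avoid an OOM doubles the step count, so expect the progress bar to move in smaller increments.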
Submit and watch
Click Start training. You're routed to /jobs/view?id=<n>. You'll see:
- Live status: Queued → Spawning container → Loading model → Training → Exporting → Done
- Progress bar: % of total training steps complete
- Loss curve: updates every few steps, streamed live via WebSocket. Your most important signal.
- GPU utilization: should be >80% while training. If it sits well below that, the GPU is waiting on something else (commonly data loading or CPU-side preprocessing), which is worth investigating.
- Logs: raw output. Useful when things go sideways.
Reading the loss curve:
- Goes down smoothly → working as expected
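- Plateaus quickly → either the model already knows this (good, you're done) or the LR is too low
- Spikes / NaNs → LR too high, or bad data in a batch. Stop and investigate.
- Goes down then back up → if this is loss on data the model isn't trained on (validation), it's overfitting: too many epochs, so use an earlier checkpoint. If the training loss itself climbs back up, suspect instability such as an LR that is too high.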
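If you copy loss values out of the logs, the same reading can be roughed out in code. This is a heuristic sketch only: the thresholds (`rebound_ratio`, `plateau_tolerance`) and the three-point smoothing window are arbitrary assumptions, it expects positive loss values, and it does not replace looking at the curve.

```python
import math


def diagnose_loss(losses, rebound_ratio=1.10, plateau_tolerance=0.01):
    """Return a coarse label for a sequence of per-step loss values."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "diverged"
    if len(losses) < 6:
        return "too short to judge"
    # Three-point moving average to damp step-to-step noise.
    smoothed = [sum(losses[i:i + 3]) / 3 for i in range(len(losses) - 2)]
    # The curve ends noticeably above its lowest point.
    if smoothed[-1] > min(smoothed) * rebound_ratio:
        return "rebounded"
    # Almost no improvement over the second half of the run.
    midpoint = len(smoothed) // 2
    if smoothed[midpoint] - smoothed[-1] <= plateau_tolerance * smoothed[0]:
        return "plateaued"
    return "decreasing"
```

A "rebounded" label says only that the curve ended above its minimum. Whether that points to overfitting or to instability depends on which loss you fed in.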
What you get when it finishes
- A GGUF file of the fine-tuned model in your /models library, status ready
- A diff metric (loss before/after)
- The Modelfile the platform generated (visible in Models → <your model> → Modelfile tab)
- A unique model name you'll use to call it
Typical runtimes (rough, depends on hardware):
- Quick on 3B base + 500 examples: 3–10 min
- Standard on 8B base + 5,000 examples: 30–90 min
- Thorough on 8B base + 10,000 examples: 2–6 hours
4. Test the result
The biggest mistake at this stage: chatting with your model a few times, feeling impressed, calling it done. Don't.
Step A — Chat with it in the browser
Open /models → your trained model → Chat.
Today's caveat: the Chat button shows up only for models with a proper chat template. Some platform-trained GGUFs need the Modelfile's template configured before the button appears. If you don't see the button, edit the Modelfile in /models/<id>/edit and confirm the TEMPLATE field contains {{ .Messages }}. (Tracked as NFP-52: automatic handling of platform-trained models in the chat tester is in progress.)
Ask 10–20 questions. Compare side-by-side against the base model (open a second chat tab with the un-fine-tuned base). For each question note:
- Did the fine-tune get closer to the right answer?
- Did the fine-tune preserve the base's general capability (did it forget how to be coherent)?
- Did the fine-tune introduce new failure modes?
Step B — Run your held-out test set
Take the 10–20% you set aside in Step 2.
Path 1: Manual through the chat UI
For small test sets (< 50 examples). Paste each prompt, capture the output, and score it yourself.
Path 2: API
For larger test sets. The platform exposes an OpenAI-compatible endpoint at https://api.neuronetz.ai/v1/chat/completions (use your API key from /apikeys):
import openai

# OpenAI-compatible client pointed at the platform endpoint
client = openai.OpenAI(
    api_key="<your_platform_api_key>",
    base_url="https://api.neuronetz.ai/v1",
)

resp = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Test prompt here"}],
    temperature=0.0,  # greedy decoding, for repeatable evaluation
)
print(resp.choices[0].message.content)
Loop over your test set, capture outputs, score them. Scoring rubric depends on task:
- Classification / extraction: exact-match or F1 against ground truth
- Generation / chat: rubric-based human scoring (1–5 per dimension: accuracy, format, tone, safety)
- Summarization: ROUGE or human comparison
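For the classification and extraction case, the loop and the scoring fit in a few lines. The sketch below keeps the model call behind a `predict` callable that you supply, so the scoring logic stays independent of the API client. Joining `instruction` and `input` with a newline is an assumption about how your prompts should be built, not a documented platform convention.

```python
from collections import Counter


def exact_match(predictions, references):
    """Fraction of predictions equal to the reference after trimming whitespace."""
    pairs = list(zip(predictions, references))
    return sum(p.strip() == r.strip() for p, r in pairs) / len(pairs)


def macro_f1(predictions, references):
    """Unweighted mean of per-label F1 over every label that appears."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for p, r in zip(predictions, references):
        if p == r:
            tp[r] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[r] += 1  # true label was missed
    scores = []
    for label in set(predictions) | set(references):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def evaluate(test_rows, predict):
    """Run predict(prompt) -> str over instruction-format rows and score the outputs."""
    preds = [predict(f"{row['instruction']}\n{row['input']}").strip()
             for row in test_rows]
    refs = [row["output"].strip() for row in test_rows]
    return {"exact_match": exact_match(preds, refs),
            "macro_f1": macro_f1(preds, refs)}
```

Run `evaluate` once with a `predict` that calls the base model and once with one that calls the fine-tune, then compare the two result dictionaries on the same held-out rows.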
Step C — Targeted adversarial probes
Don't just test the happy path. Try:
- Out-of-domain questions (does it gracefully decline or hallucinate?)
- Prompt injection ("ignore previous instructions" — does it stay on rails?)
- Edge cases your training data didn't cover
- Adversarial inputs from real users (typos, ambiguity, sarcasm)
A model that aces the test set and falls apart on out-of-distribution input is overfit. Useful signal — tells you what to add to the training set next round.
5. Iterate
The first fine-tune almost never ships. Plan for at least one retrain.
Diagnose what went wrong and adjust:
| Symptom | Likely cause | Fix |
|---|---|---|
| Model parrots training data verbatim | Overfitting | Fewer epochs, more diverse data, lower LoRA rank |
| Model still sounds like the base, no improvement | Underfitting | More epochs, higher LR, more data, higher LoRA rank |
| Model good on test set, bad on real users | Test set doesn't reflect production distribution | Add real user examples to training data |
| Model forgets general capabilities | Catastrophic forgetting — too much fine-tune-specific data, not enough general | Mix in some general instruction data, smaller LoRA rank, fewer epochs |
| Outputs in wrong format (JSON malformed, etc.) | Inconsistent format in training data | Standardize your training data, possibly add format validation as a post-process |
| Inference too slow | Model too big | Re-quantize to a smaller format (Q4 → Q3), or fine-tune a smaller base |
The platform makes this loop cheap: clone the dataset, edit, retrain with a new job name. Keep both — compare side-by-side.
Quick reference — the four pages you'll live in
/models → pick base, browse Hub, see your trained models
/datasets/upload → upload, generate, or synthesize training data
/jobs/create → configure a training run
/jobs/view?id=N → watch a job in progress, see loss curve and logs
And for credentials:
/apikeys → create API keys for headless / scripted use
/settings → set your HuggingFace token, default model, etc.
Common first-week mistakes
- Training on a tiny dataset and expecting magic. Fewer than ~100 well-curated examples usually don't produce a noticeable change. Either get more data or pick a base that already mostly does what you want.
- No held-out test set. You can't measure "is this better" without ground truth you didn't train on.
- Calling the first model good because it passed five hand-picked questions. Five is not a sample size. Twenty is the minimum, fifty is better.
- Picking a 70B base because "bigger is better." Bigger is slower and more expensive and the fine-tune lift is often smaller than what you'd get on a 7B. Start small.
- Skipping the base-model sanity check (the end of Step 1). If the base doesn't get within shouting distance of your task before training, fine-tuning is unlikely to close the gap.
Questions, edge cases, or unclear bits? File an issue in the platform repo or ping the team in the dev channel.