diff --git a/README.md b/README.md
index bddee4b..002e3da 100644
--- a/README.md
+++ b/README.md
@@ -3,6 +3,7 @@
 Drop-in setup for new developers joining the platform.
 
 → **Start with [`MANUAL.md`](MANUAL.md)** — full developer manual with quick-start at the top.
+→ End-user perspective: **[`USER_WORKFLOW.md`](USER_WORKFLOW.md)** — idea → data → train → test, in plain language.
 
 ## Quick orientation
 
diff --git a/USER_WORKFLOW.md b/USER_WORKFLOW.md
new file mode 100644
index 0000000..391dac2
--- /dev/null
+++ b/USER_WORKFLOW.md
@@ -0,0 +1,301 @@
+# From Idea to Trained Model — End-User Workflow
+
+A complete walk-through for someone landing on the platform and going from "I want a model that does X" to "I have a model I can test and deploy."
+
+The four phases:
+
+1. [**Pick a model** for your purpose](#1-pick-a-model)
+2. [**Get the data** to train it on](#2-get-the-data)
+3. [**Train** the model with your data](#3-train-the-model)
+4. [**Test** what came out](#4-test-the-result)
+
+Plus what to do when the answer isn't good enough yet — [Iterate](#5-iterate).
+
+---
+
+## 1. Pick a model
+
+A fine-tune is a **base model + your data**. The base shapes capability ceiling, response style, and inference cost. Get this right, the rest is easier.
+
+### Decide what shape of model fits the task
+
+| Task | What you need | Typical base model size |
+|---|---|---|
+| **Chat assistant** — answer customer questions, knowledge Q&A | An instruct-tuned chat model | 3B–8B params |
+| **Structured extraction** — pull fields from documents, classify | Smaller instruct or completion model | 1B–3B (faster, cheaper) |
+| **Code completion** — domain-specific code helpers | Code-specialized model | 7B+ |
+| **Summarization, rewriting** | Any instruct model | 3B–7B |
+| **Embeddings** (semantic search) | Embedding-only model | Tiny (270M–1B) |
+
+**Rule of thumb on size:** smaller is faster and cheaper. Start at the smallest model that demonstrates the capability — fine-tune lifts performance dramatically, but only within the base's reachable space. A 1B model fine-tuned well on a narrow task often beats a 70B general model.
+
+### Find a base model
+
+Two sources:
+
+**Option A — Already on the platform (fastest)**
+
+Open `/models` → **Pulled Models** tab. You see Ollama-registered models already loaded and ready. Today's shared instance has examples like `llama3.2:3b`, `qwen2:latest`, `deepseek-coder:latest`, plus several existing fine-tunes (e.g. `ustid-extractor-v2`).
+
+If one is close to your task, use it. No download wait.
+
+**Option B — Pull from HuggingFace**
+
+`/models` → **Model Hub** tab → search by name or capability. Filter by `GGUF` format (that's what the platform's inference engine uses).
+
+Practical picks:
+- General chat 3B: `bartowski/Llama-3.2-3B-Instruct-GGUF`
+- General chat 7–8B: `bartowski/Meta-Llama-3.1-8B-Instruct-GGUF`
+- Code: `bartowski/Qwen2.5-Coder-7B-Instruct-GGUF`
+- Tiny/fast: `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF`
+
+Click **Pull**. The download runs async with a progress bar. When it finishes, the model appears in **Pulled Models** with a Chat button (for chat-capable ones) and a Generate button (for completion-only ones).
+
+> **Gated models** like the original Meta Llama require an HF token. Add yours in `/settings` → Integrations → HuggingFace API Key. Without a token, only public/community quants work.
+
+### Sanity-check the base before you commit
+
+Before pouring training compute into a fine-tune, chat with the base model first. Ask it 5–10 questions in your domain. You're checking three things:
+
+1. **Does it understand the language/jargon?** If not, you need a bigger base or different domain pre-training — fine-tuning won't close a knowledge gap of that size.
+2. **Does it follow instructions in the format you want?** Format you can teach via fine-tune. Capability you can't.
+3. **Is response speed acceptable on your hardware?** A 70B model on a 16 GB GPU will be unusable regardless of quality.
+
+If the base fails (1), pick a bigger or different base. Iterate at this step — it's free.
+
+---
+
+## 2. Get the data
+
+Where most fine-tunes succeed or fail. Quality > quantity > model size.
+
+### How much data do you actually need
+
+| Goal | Examples typically needed |
+|---|---|
+| Style/tone shift (sound like our brand) | 100–500 |
+| Domain knowledge injection (our product manuals) | 500–5,000 |
+| Structured extraction (fields out of forms) | 200–2,000 |
+| Function-calling / tool-use specialization | 1,000–10,000 |
+| New capability the base lacks | 10,000+ (and probably won't work — try a different base) |
+
+These are **clean, high-quality** examples. 200 hand-curated examples beat 10,000 noisy ones. Always.
+
+### Format
+
+The platform expects **JSONL** (one JSON object per line). Two common shapes:
+
+**Instruction format** (most flexible):
+```json
+{"instruction": "Summarize the following support ticket.", "input": "Customer says...", "output": "TL;DR: ..."}
+{"instruction": "Classify intent.", "input": "Where's my order?", "output": "order_status"}
+```
+
+**Chat format** (for chat-style models):
+```json
+{"messages": [{"role": "user", "content": "What's your return policy?"}, {"role": "assistant", "content": "30 days, full refund..."}]}
+```
+
+The platform auto-detects the format on upload and warns you if rows are malformed.
+
+### Three ways to get data into the platform
+
+Open `/datasets/upload`. Three tabs:
+
+**Tab 1 — Upload JSONL** *(if you already have data)*
+- Drag-and-drop. Validation runs immediately. Summary shows row count, average tokens per example, format detected, anything broken.
+- Acceptable: `.jsonl`, `.json` (will convert if it's an array of objects).
+- Reject and fix on your end: malformed JSON, missing required fields, examples wildly out-of-distribution.
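Catching malformed rows before upload saves a round trip. The following is a minimal local pre-check, sketched under assumptions: the function names are hypothetical, and the choice of `instruction` and `output` as the required keys (with `input` optional) is a guess based on the two shapes shown under Format, not the platform's actual validation rules.

```python
import json

# Assumed required keys for the instruction shape; the platform's real rules may differ.
INSTRUCTION_KEYS = {"instruction", "output"}


def detect_format(row):
    """Return 'chat', 'instruction', or None for one parsed JSONL row."""
    if not isinstance(row, dict):
        return None
    messages = row.get("messages")
    if isinstance(messages, list) and messages:
        # Every message needs a string role and string content.
        ok = all(
            isinstance(m, dict)
            and isinstance(m.get("role"), str)
            and isinstance(m.get("content"), str)
            for m in messages
        )
        return "chat" if ok else None
    if INSTRUCTION_KEYS.issubset(row) and all(
        isinstance(row[k], str) for k in INSTRUCTION_KEYS
    ):
        return "instruction"
    return None


def validate_jsonl_lines(lines):
    """Return (row counts per format, list of (line_number, problem))."""
    counts = {"instruction": 0, "chat": 0}
    problems = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped, not reported
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"malformed JSON: {exc.msg}"))
            continue
        fmt = detect_format(row)
        if fmt is None:
            problems.append((line_no, "missing or invalid required fields"))
        else:
            counts[fmt] += 1
    return counts, problems


if __name__ == "__main__":
    sample = [
        '{"instruction": "Classify intent.", "input": "Where\'s my order?", "output": "order_status"}',
        '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}',
        '{"instruction": "No output field"}',
        "{not json}",
    ]
    counts, problems = validate_jsonl_lines(sample)
    print(counts)
    for line_no, problem in problems:
        print(line_no, problem)
```

To check a real file, pass an open file handle: `validate_jsonl_lines(open("train.jsonl", encoding="utf-8"))`. A file that mixes both shapes is reported in `counts` rather than rejected, since the source does not say whether the platform allows mixing.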
+
+**Tab 2 — Document → Q&A** *(if you have docs but no examples yet)*
+- Upload PDFs, Markdown, HTML, plain text. The platform chunks them, runs an LLM over each chunk to generate Q&A pairs grounded in the source, and produces a JSONL.
+- Best for: knowledge-base assistants, customer-support bots, internal documentation.
+- Always **review the generated pairs before training**. The LLM doing the generation is not perfect — drop hallucinations, fix ambiguous wordings. The platform shows them in a reviewable table.
+- Rough yield: ~3–8 Q&A pairs per page of source material.
+
+**Tab 3 — Synthetic generation** *(if you have nothing yet)*
+- Describe the persona, task, and example shape. The platform generates examples programmatically against a seed model.
+- Useful for: bootstrapping a tone/style fine-tune, augmenting a small real-data seed.
+- Limitation: synthetic data drifts toward the generator's biases. Use it to scale a small real-data seed, not as the sole source.
+
+### Quality gates to apply yourself before training
+
+- **Deduplicate.** Identical or near-identical examples waste training cycles and risk overfitting.
+- **Variety check.** Examples should span the input space you'll see in production. If 90% of your training data is one phrasing pattern, the model overfits to that pattern.
+- **Hold out a test set.** Take 10–20% of your examples, set them aside, do NOT include in training. You'll use them in Step 4 to evaluate honestly.
+- **Sanity-read 20 examples at random.** If you cringe at any, your model will produce worse versions of those.
+
+---
+
+## 3. Train the model
+
+Open `/jobs/create`.
+
+### Pick your inputs
+
+- **Base model** — what you settled on in Step 1
+- **Dataset** — what you prepared in Step 2 (excluding your held-out test set)
+- **Job name** — descriptive, you'll see it in the list later. e.g. `support-bot-v1`, `extractor-invoices-q3`
+
+### Pick a preset (the easy path)
+
+The platform ships three presets. For your first run, **use a preset.** Tune later.
+
+| Preset | Epochs | LR | When to use |
+|---|---|---|---|
+| **Quick** | 1 | 3e-4 | First test run. ~3–10 min on most models. Verifies the pipeline works end-to-end and the loss curve makes sense. |
+| **Standard** | 3 | 2e-4 | Default for real fine-tunes. Balanced. |
+| **Thorough** | 5 | 1e-4 | Larger datasets, when you need maximum learning. Risk: overfitting on small datasets. |
+
+### Advanced parameters (skip on first run)
+
+If you click "Advanced":
+
+- **Batch size** — 4 default. Lower if you OOM on a small GPU. Higher = faster training but more memory.
+- **LoRA rank** — 16 default. Higher = more capacity to learn (and overfit). Lower = lighter, faster, often sufficient.
+- **LoRA alpha** — 32 default. Convention: 2 × rank.
+- **Context size** — 2048–4096 default. Match this to the longest example in your dataset. Going wider costs memory.
+- **Quantization** — `4bit` default for training (QLoRA), `q4_k_m` for the final GGUF export. Smaller = faster inference, lower quality.
+
+> **First-run heuristic:** don't change any of these. Run Quick, look at the loss curve, decide whether to retrain with Standard.
+
+### Submit and watch
+
+Click **Start training**. You're routed to `/jobs/view?id=`. You'll see:
+
+- **Live status** — Queued → Spawning container → Loading model → Training → Exporting → Done
+- **Progress bar** — % of total training steps complete
+- **Loss curve** — updates every few steps, streamed live via WebSocket. Your most important signal.
+- **GPU utilization** — should be >80% while training. If not, you're CPU-bound and something's wrong.
+- **Logs** — raw output. Useful when things go sideways.
+
+**Reading the loss curve:**
+
+- **Goes down smoothly** → working as expected
+- **Plateaus quickly** → either model already knows this (good, you're done) or the LR is too low
+- **Spikes / NaNs** → LR too high, or bad data in batch. Stop and investigate.
+- **Goes down then back up** → overfitting; you trained too many epochs. Use the earlier checkpoint.
+
+### What you get when it finishes
+
+- A GGUF file of the fine-tuned model in your `/models` library, status **`ready`**
+- A diff metric (loss before/after)
+- The Modelfile the platform generated (visible in **Models → \ → Modelfile** tab)
+- A unique model name you'll use to call it
+
+Typical runtimes (rough, depends on hardware):
+- Quick on 3B base + 500 examples: 3–10 min
+- Standard on 8B base + 5,000 examples: 30–90 min
+- Thorough on 8B base + 10,000 examples: 2–6 hours
+
+---
+
+## 4. Test the result
+
+The biggest mistake at this stage: chatting with your model a few times, feeling impressed, calling it done. Don't.
+
+### Step A — Chat with it in the browser
+
+Open `/models` → your trained model → **Chat**.
+
+> **Today's caveat:** the chat button shows up for models with a proper chat template. Some platform-trained GGUFs need the Modelfile's template configured before the button appears. If you don't see the button, edit the Modelfile in `/models//edit` and confirm the `TEMPLATE` field contains `{{ .Messages }}`. (Tracked as NFP-52 — automatic handling of platform-trained models in the chat tester is in progress.)
+
+Ask 10–20 questions. Compare side-by-side against the base model (open a second chat tab with the un-fine-tuned base). For each question note:
+
+- Did the fine-tune get closer to the right answer?
+- Did the fine-tune preserve the base's general capability (did it forget how to be coherent)?
+- Did the fine-tune introduce new failure modes?
+
+### Step B — Run your held-out test set
+
+Take the 10–20% you set aside in Step 2.
+
+**Path 1 — Manual through the chat UI**
+For small test sets (< 50 examples). Paste each prompt, capture the output, score yourself.
+
+**Path 2 — API**
+For larger test sets. The platform exposes an OpenAI-compatible endpoint at `https://api.neuronetz.ai/v1/chat/completions` (use your API key from `/apikeys`):
+
+```python
+import openai
+
+client = openai.OpenAI(
+    api_key="",
+    base_url="https://api.neuronetz.ai/v1",
+)
+
+resp = client.chat.completions.create(
+    model="your-model-name",
+    messages=[{"role": "user", "content": "Test prompt here"}],
+    temperature=0.0,  # deterministic for evaluation
+)
+print(resp.choices[0].message.content)
+```
+
+Loop over your test set, capture outputs, score them. Scoring rubric depends on task:
+
+- **Classification / extraction:** exact-match or F1 against ground truth
+- **Generation / chat:** rubric-based human scoring (1–5 per dimension: accuracy, format, tone, safety)
+- **Summarization:** ROUGE or human comparison
+
+### Step C — Targeted adversarial probes
+
+Don't just test the happy path. Try:
+
+- Out-of-domain questions (does it gracefully decline or hallucinate?)
+- Prompt injection ("ignore previous instructions" — does it stay on rails?)
+- Edge cases your training data didn't cover
+- Adversarial inputs from real users (typos, ambiguity, sarcasm)
+
+A model that aces the test set and falls apart on out-of-distribution input is overfit. Useful signal — tells you what to add to the training set next round.
+
+---
+
+## 5. Iterate
+
+First fine-tune almost never ships. Plan for at least one retrain.
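Deciding whether a retrain actually helped means scoring each run on the same held-out set. Below is a rough sketch of the exact-match and F1 scoring mentioned in Step B for classification or extraction tasks. It assumes model outputs have already been collected as plain strings; the function names and the strip-and-lowercase normalization are illustrative choices, not platform features.

```python
from collections import Counter


def normalize(text):
    """Assumed normalization: trim whitespace and lowercase before comparing."""
    return text.strip().lower()


def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def macro_f1(predictions, references):
    """Unweighted mean of per-label F1 over the labels present in references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for p, r in zip(predictions, references):
        p, r = normalize(p), normalize(r)
        if p == r:
            tp[r] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[r] += 1  # true label r was missed
    labels = sorted({normalize(r) for r in references})
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy data: ground-truth labels and the outputs of two hypothetical runs.
    references = ["order_status", "refund", "refund", "other"]
    runs = {
        "support-bot-v1": ["order_status", "other", "refund", "other"],
        "support-bot-v2": ["order_status", "refund", "refund", "other"],
    }
    for name, preds in runs.items():
        print(name, exact_match(preds, references), round(macro_f1(preds, references), 3))
```

Macro averaging weights every label equally, so a rare class that the model keeps missing drags the score down even when overall accuracy looks fine.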
+
+Diagnose what went wrong and adjust:
+
+| Symptom | Likely cause | Fix |
+|---|---|---|
+| Model parrots training data verbatim | Overfitting | Fewer epochs, more diverse data, lower LoRA rank |
+| Model still sounds like the base, no improvement | Underfitting | More epochs, higher LR, more data, higher LoRA rank |
+| Model good on test set, bad on real users | Test set doesn't reflect production distribution | Add real user examples to training data |
+| Model forgets general capabilities | Catastrophic forgetting — too much fine-tune-specific data, not enough general | Mix in some general instruction data, smaller LoRA rank, fewer epochs |
+| Outputs in wrong format (JSON malformed, etc.) | Inconsistent format in training data | Standardize your training data, possibly add format validation as a post-process |
+| Inference too slow | Model too big | Re-quantize to a smaller format (Q4 → Q3), or fine-tune a smaller base |
+
+The platform makes this loop cheap: clone the dataset, edit, retrain with a new job name. Keep both — compare side-by-side.
+
+---
+
+## Quick reference — the four pages you'll live in
+
+```
+/models            → pick base, browse Hub, see your trained models
+/datasets/upload   → upload, generate, or synthesize training data
+/jobs/create       → configure a training run
+/jobs/view?id=N    → watch a job in progress, see loss curve and logs
+```
+
+And for credentials:
+
+```
+/apikeys           → create API keys for headless / scripted use
+/settings          → set your HuggingFace token, default model, etc.
+```
+
+---
+
+## Common first-week mistakes
+
+1. **Training on a tiny dataset and expecting magic.** Less than ~100 well-curated examples usually doesn't produce a noticeable change. Either get more data or pick a base that already mostly does what you want.
+2. **No held-out test set.** You can't measure "is this better" without ground truth you didn't train on.
+3. **Calling the first model good because it passed five hand-picked questions.** Five is not a sample size. Twenty is the minimum, fifty is better.
+4. **Picking a 70B base because "bigger is better."** Bigger is slower and more expensive and the fine-tune lift is often smaller than what you'd get on a 7B. Start small.
+5. **Skipping the base-model sanity check (Step 1 last paragraph).** If the base doesn't get within shouting distance of your task before training, no amount of fine-tuning will close the gap.
+
+---
+
+*Questions, edge cases, or unclear bits? File an issue in the platform repo or ping the team in the dev channel.*