# Blue48 IntelMiner and LoRA Training Data Pipeline

**Document type:** Project record / technical concept  
**Scope:** Lawful intelligence collection, training-data preparation, LoRA dataset format, quality gates, safety boundaries  
**Status:** Draft v1


---

## 1. Purpose

IntelMiner is the Blue48 worker responsible for collecting lawful defensive cyber-intelligence and converting it into reviewed, license-safe, LoRA-ready training examples.

IntelMiner does not train models to hack. It prepares training data for defensive tasks such as indicator extraction, routing, severity classification, evidence handling, and safe report writing.

Core mission:

> IntelMiner collects lawful defensive cyber-intelligence from approved online sources and transforms it into reviewed, license-safe, LoRA-ready JSONL examples for specialized defensive models.


---

## 2. What IntelMiner Should Learn From

Allowed source categories:

- national CERT advisories
- CISA, ENISA, NCSC, CERT-EU, BSI, ANSSI, and similar public advisories
- CVE, NVD, and exploited-vulnerability catalogs
- public vendor threat reports
- public malware-analysis reports
- public ransomware trend reports from lawful monitors
- MISP events where the license and sharing group permit reuse
- abuse.ch datasets where permitted
- public IOCs and defensive detection content
- public incident writeups
- internally written reports approved for training
- synthetic examples written by analysts

Restricted or excluded source categories:

- raw stolen data
- raw credentials
- private victim communications
- criminal-forum content obtained without authorization
- confidential CTI provider content without training rights
- TLP:RED material
- material with unknown or incompatible license
- content that teaches exploitation, persistence, credential abuse, ransomware operation, or evasion

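
The two lists above amount to a deny-by-default source policy. A minimal sketch of such a check follows; the category identifiers, the `Source` record, and the `is_collectable` helper are hypothetical names chosen for illustration and are not taken from the Blue48 codebase.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the allowed source categories listed above.
ALLOWED_CATEGORIES = {
    "national_cert_advisory",
    "vulnerability_catalog",
    "vendor_threat_report",
    "malware_analysis_report",
    "ransomware_trend_report",
    "misp_event",
    "abuse_ch_dataset",
    "public_ioc_feed",
    "public_incident_writeup",
    "internal_approved_report",
    "analyst_synthetic",
}


@dataclass(frozen=True)
class Source:
    name: str
    category: str
    license: str  # approved | restricted | unknown | rejected
    tlp: str      # CLEAR | GREEN | AMBER | RED


def is_collectable(source: Source) -> tuple[bool, str]:
    """Return (allowed, reason). Anything not explicitly allowed is denied."""
    if source.category not in ALLOWED_CATEGORIES:
        return False, "category_not_allowlisted"
    if source.tlp == "RED":
        return False, "tlp_red_excluded"
    if source.license != "approved":
        return False, f"license_{source.license}"
    return True, "ok"
```

Returning a reason string alongside the boolean keeps every refusal auditable, which fits the audit requirements later in this document.
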

---

## 3. IntelMiner Worker Chain

```text
SourcePlanner
→ Collector
→ LicenseChecker
→ ContentParser
→ Chunker
→ Labeler
→ ExampleBuilder
→ QualityGate
→ ReviewerQueue
→ DatasetWriter
```


---

## 4. Worker Responsibilities

| Worker | Responsibility |
|---|---|
| **SourcePlanner** | Defines approved sources, update schedules, license expectations, and collection priority. |
| **Collector** | Pulls data from APIs, RSS, advisories, STIX/TAXII, MISP, GitHub, PDFs, and public reports. |
| **LicenseChecker** | Determines whether the material may be used for training. Blocks unknown or restricted content. |
| **ContentParser** | Extracts text, IOCs, dates, actors, CVEs, TTPs, victim sectors, and source metadata. |
| **Chunker** | Splits long content into training-sized units while preserving context. |
| **Labeler** | Assigns task labels such as IOC extraction, routing, classification, report writing, and evidence handling. |
| **ExampleBuilder** | Converts chunks into instruction/input/output training examples. |
| **QualityGate** | Removes unsafe, duplicated, mislabeled, low-confidence, or license-problematic examples. |
| **ReviewerQueue** | Sends candidates to human reviewers. Nothing enters the final dataset without approval. |
| **DatasetWriter** | Exports approved examples as versioned JSONL datasets. |

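
One way to picture the chain is as an ordered list of stage functions, where any stage may drop a candidate. The sketch below is illustrative only: the `Stage` alias, the three stand-in stages, and `run_chain` are assumptions, and the real workers would do far more than these placeholders.

```python
from typing import Callable, Iterable, Optional

# A stage receives a candidate dict and returns it (possibly enriched),
# or None to drop it from the pipeline.
Stage = Callable[[dict], Optional[dict]]


def license_checker(item: dict) -> Optional[dict]:
    # Block unknown or restricted material before any parsing happens.
    return item if item.get("license") == "approved" else None


def content_parser(item: dict) -> Optional[dict]:
    # Placeholder parse step: collapse runs of whitespace in the raw text.
    return {**item, "text": " ".join(item.get("raw", "").split())}


def reviewer_queue(item: dict) -> Optional[dict]:
    # Candidates leave the automated chain unapproved; a human sets this later.
    return {**item, "reviewed": False}


def run_chain(items: Iterable[dict], stages: list[Stage]) -> list[dict]:
    """Apply stages in order; an item stops at the first stage that drops it."""
    survivors = []
    for item in items:
        current: Optional[dict] = item
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            survivors.append(current)
    return survivors
```

Placing `license_checker` ahead of parsing mirrors the order in section 3, so restricted material is discarded before any of its content is processed.
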

---

## 5. Training Tasks

The LoRA adapters should learn defensive operations only.

| Task | Purpose |
|---|---|
| **ioc_extraction** | Extract domains, IPs, URLs, hashes, emails, wallets, CVEs, and file names. |
| **ttp_mapping** | Map report language to MITRE ATT&CK-style techniques. |
| **severity_classification** | Classify weak signal, credible threat, confirmed exposure, campaign intelligence, or imminent harm. |
| **routing_decision** | Decide which reporting destinations are appropriate and in what order. |
| **evidence_handling** | Decide whether evidence must be sealed, minimized, excluded, or internally retained. |
| **actor_normalization** | Normalize actor names, aliases, ransomware brands, and campaigns. |
| **source_reliability** | Estimate source reliability and information credibility. |
| **report_drafting** | Draft structured victim, CERT, provider, MISP, or public reports. |
| **public_publishing** | Produce sanitized public intelligence after mitigation. |

Do not create training examples for:

- exploitation steps
- credential abuse
- phishing construction
- malware deployment
- ransomware operations
- evasion
- stealth
- persistence
- unauthorized forum access
- instructions for obtaining stolen data

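
Because the task list is closed, the Labeler could treat it as an enumeration so that any label outside the defensive set fails loudly instead of slipping into a dataset. A small sketch, in which the `Task` enum and `parse_task` helper are hypothetical names:

```python
from enum import Enum


class Task(str, Enum):
    """The nine defensive task labels from the table above."""
    IOC_EXTRACTION = "ioc_extraction"
    TTP_MAPPING = "ttp_mapping"
    SEVERITY_CLASSIFICATION = "severity_classification"
    ROUTING_DECISION = "routing_decision"
    EVIDENCE_HANDLING = "evidence_handling"
    ACTOR_NORMALIZATION = "actor_normalization"
    SOURCE_RELIABILITY = "source_reliability"
    REPORT_DRAFTING = "report_drafting"
    PUBLIC_PUBLISHING = "public_publishing"


def parse_task(label: str) -> Task:
    """Convert a raw label into a Task, rejecting anything not in the table."""
    try:
        return Task(label)
    except ValueError:
        raise ValueError(f"unsupported task label: {label!r}") from None
```
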

---

## 6. Recommended LoRA Strategy

Do not start by training one large mixed LoRA. Start with small task-specific adapters.

Recommended adapter order:

| Priority | Adapter | Reason |
|---:|---|---|
| 1 | **lora-router** | Central to the project and easier to evaluate objectively. |
| 2 | **lora-ioc-extractor** | High utility, clear labels, measurable precision and recall. |
| 3 | **lora-evidence-handler** | Helps enforce safe handling decisions. |
| 4 | **lora-report-writer** | Drafts structured notifications after reviewed facts exist. |
| 5 | **lora-actor-normalizer** | Improves actor and campaign mapping. |
| 6 | **lora-public-publisher** | Produces public-safe summaries after mitigation. |

Training should begin only after enough reviewed examples exist:

- 1,000+ reviewed examples for a single narrow task, or
- 3,000–10,000 mixed examples across several tasks.

Until then, use rules, retrieval, embeddings, and human-reviewed prompts.

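
The two thresholds above can be checked mechanically before anyone schedules a training run. The sketch below assumes examples follow the record layout from section 7; the function name and the returned keys are illustrative, and only the lower bound of the mixed range is enforced.

```python
from collections import Counter

SINGLE_TASK_MIN = 1_000  # reviewed examples needed for one narrow task
MIXED_MIN = 3_000        # lower bound of the mixed-task range


def training_readiness(examples: list[dict]) -> dict:
    """Count reviewed examples per task and compare against both thresholds."""
    reviewed = [e for e in examples if e.get("metadata", {}).get("reviewed") is True]
    per_task = Counter(e["task"] for e in reviewed)
    return {
        "per_task": dict(per_task),
        "ready_tasks": sorted(t for t, n in per_task.items() if n >= SINGLE_TASK_MIN),
        # "Mixed" only makes sense when more than one task is represented.
        "mixed_ready": len(per_task) > 1 and len(reviewed) >= MIXED_MIN,
    }
```

Unreviewed candidates are ignored on purpose, since only approved examples count toward either threshold.
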

---

## 7. JSONL Training Format

Each JSONL line should contain one training example, serialized as a single-line JSON object. The structure is shown pretty-printed here for readability; the values separated by `|` list the permitted options for that field.

Standard structure:

```json
{
  "task": "routing_decision",
  "instruction": "Given a defensive cyber-intelligence signal, choose the correct reporting destinations and order.",
  "input": {},
  "output": {},
  "metadata": {
    "source_type": "public_advisory | vendor_report | synthetic | internal_approved",
    "tlp": "CLEAR | GREEN | AMBER",
    "license": "approved",
    "reviewed": true,
    "policy_version": "v1",
    "dataset_version": "dataset-router-v0.1"
  }
}
```

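
A minimal sketch of how a DatasetWriter might serialize records in this shape, using only the standard library. The helper names and the choice to refuse unreviewed records at write time are assumptions, not a description of the actual worker.

```python
import json

REQUIRED_TOP_LEVEL = ("task", "instruction", "input", "output", "metadata")


def to_jsonl_line(example: dict) -> str:
    """Validate the top-level shape and return one compact JSON line."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in example]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if example["metadata"].get("reviewed") is not True:
        raise ValueError("example has not been approved by a reviewer")
    # No indent and compact separators keep each record on a single line;
    # json.dumps escapes any newlines inside string values.
    return json.dumps(example, separators=(",", ":"))


def write_jsonl(examples, handle) -> int:
    """Write one example per line to a text handle; return the count written."""
    count = 0
    for example in examples:
        handle.write(to_jsonl_line(example) + "\n")
        count += 1
    return count
```
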

---

## 8. Example: IOC Extraction

```json
{
  "task": "ioc_extraction",
  "instruction": "Extract defensive indicators from the cyber threat report. Return JSON only.",
  "input": "A phishing campaign used login-example[.]com and delivered payload hash 44d88612fea8a8f36de82e1278abb02f. The actor referenced CVE-2024-12345.",
  "output": {
    "domains": ["login-example.com"],
    "hashes": ["44d88612fea8a8f36de82e1278abb02f"],
    "cves": ["CVE-2024-12345"],
    "ips": [],
    "urls": []
  },
  "metadata": {
    "source_type": "synthetic_or_public_report",
    "tlp": "CLEAR",
    "license": "approved",
    "reviewed": true
  }
}
```

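
Before an adapter exists, a rule-based extractor can serve as a baseline and as a sanity check on labeled outputs. The sketch below is deliberately simple: the regular expressions cover only dotted domains, 32-character hex hashes, and CVE identifiers, and the function names are illustrative.

```python
import re

DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)
MD5_RE = re.compile(r"\b[a-f0-9]{32}\b", re.IGNORECASE)
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def refang(text: str) -> str:
    """Undo two common defanging conventions so the patterns can match."""
    return text.replace("[.]", ".").replace("hxxp", "http")


def extract_iocs(report: str) -> dict:
    """Return de-duplicated, sorted indicators found in the report text."""
    text = refang(report)
    return {
        "domains": sorted({m.lower() for m in DOMAIN_RE.findall(text)}),
        "hashes": sorted({m.lower() for m in MD5_RE.findall(text)}),
        "cves": sorted({m.upper() for m in CVE_RE.findall(text)}),
    }
```

Run against the `input` string of the example above, this returns the same `domains`, `hashes`, and `cves` values as the labeled `output`. A production extractor would need many more patterns and a public-suffix check to limit false positives.
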

---

## 9. Example: Routing Decision

```json
{
  "task": "routing_decision",
  "instruction": "Given a defensive cyber-intelligence signal, choose the correct reporting destinations and order.",
  "input": {
    "incident_type": "access_sale",
    "victim_country": "DE",
    "sector": "energy",
    "critical_infrastructure": true,
    "confidence": "high",
    "tlp": "AMBER"
  },
  "output": {
    "severity": "critical",
    "routes": [
      "CERT-Bund",
      "victim_security_team",
      "sector_isac",
      "law_enforcement_cyber_unit",
      "misp_trusted_community"
    ],
    "evidence_handling": "authority_sealed_package"
  },
  "metadata": {
    "reviewed": true,
    "policy_version": "v1"
  }
}
```


---

## 10. Example: Evidence Handling

```json
{
  "task": "evidence_handling",
  "instruction": "Decide how evidence should be handled before external submission.",
  "input": {
    "evidence_type": "stolen_credentials",
    "destination": "public_abuse_api",
    "contains_pii": true,
    "tlp": "RED"
  },
  "output": {
    "submit_raw": false,
    "handling": "do_not_send_raw_to_public_api",
    "allowed_payload": "metadata_only",
    "sealed_package_required": true,
    "authorized_recipients": ["victim_security_team", "national_cert"]
  },
  "metadata": {
    "reviewed": true
  }
}
```

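
Labeled evidence-handling examples can be linted for internal consistency before they reach a reviewer. The rules below are an assumption inferred from the example above and the safety boundaries in this document; they are not an official Blue48 policy, and `lint_evidence_example` is a hypothetical helper.

```python
def lint_evidence_example(example: dict) -> list[str]:
    """Return a list of consistency problems; an empty list means none found."""
    inp, out = example["input"], example["output"]
    # Assumed definition of "sensitive": PII, TLP:RED, or stolen credentials.
    sensitive = (
        inp.get("contains_pii") is True
        or inp.get("tlp") == "RED"
        or inp.get("evidence_type") == "stolen_credentials"
    )
    problems = []
    if sensitive and out.get("submit_raw") is not False:
        problems.append("sensitive evidence must not be submitted raw")
    if sensitive and out.get("allowed_payload") != "metadata_only":
        problems.append("sensitive evidence should be limited to metadata_only")
    if out.get("sealed_package_required") and not out.get("authorized_recipients"):
        problems.append("sealed package needs at least one authorized recipient")
    return problems
```
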

---

## 11. Dataset Metadata

Every example should include metadata.

| Field | Purpose |
|---|---|
| `task` | Training task category. |
| `source_type` | Origin category of the example. |
| `source_id` | Internal reference to source document. |
| `license` | Approved, restricted, unknown, or rejected. |
| `tlp` | CLEAR, GREEN, AMBER, or RED. |
| `reviewed` | Human approval status. |
| `reviewer_id` | Internal reviewer identity or role ID. |
| `policy_version` | Version of handling policy used. |
| `dataset_version` | Versioned dataset name. |
| `safety_flags` | Unsafe content or sensitive material flags. |
| `dedupe_hash` | Used to prevent duplicate examples. |

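
The document does not say how `dedupe_hash` is computed. One plausible approach, shown here as a sketch, is a SHA-256 digest over a canonical JSON form of the fields that define the example's content, leaving metadata out so that re-reviewing an example does not change its hash.

```python
import hashlib
import json

CONTENT_FIELDS = ("task", "instruction", "input", "output")


def dedupe_hash(example: dict) -> str:
    """Hash the content fields of an example in a key-order-independent way."""
    payload = {field: example[field] for field in CONTENT_FIELDS}
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

This catches exact duplicates only. Near-duplicates, such as the same advisory with different whitespace, would need text normalization or a similarity measure on top.
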

---

## 12. QualityGate Rules

QualityGate must reject examples that have any of the following:

- raw credentials
- raw stolen data
- private victim information
- live access details
- exploit chains
- malware deployment steps
- phishing instructions
- evasion or persistence guidance
- an incompatible license
- unknown provenance
- duplicated content
- unreviewed TLP:RED or confidential content

QualityGate should flag an example for human review when:

- the source license is ambiguous
- actor attribution is uncertain
- a victim is named
- the sample contains personal data
- the output teaches operationally sensitive details
- the example conflicts with policy

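
The reject and flag lists map naturally onto a three-way verdict. The sketch below assumes that upstream workers have already populated `safety_flags` with the hypothetical flag names shown, and that the metadata fields from section 11 are present; detecting the unsafe content itself is out of scope here.

```python
from enum import Enum


class Verdict(str, Enum):
    ACCEPT = "accept"
    FLAG = "flag"
    REJECT = "reject"


# Hypothetical safety_flags values mirroring the two lists above.
REJECT_FLAGS = {
    "raw_credentials", "raw_stolen_data", "private_victim_information",
    "live_access_details", "exploit_chain", "malware_deployment",
    "phishing_instructions", "evasion_or_persistence",
}
REVIEW_FLAGS = {
    "license_ambiguous", "attribution_uncertain", "victim_named",
    "personal_data", "operationally_sensitive", "policy_conflict",
}


def quality_gate(example: dict, seen_hashes: set) -> tuple:
    """Return (verdict, reasons). Rejection takes priority over flagging."""
    meta = example.get("metadata", {})
    flags = set(meta.get("safety_flags", []))
    digest = meta.get("dedupe_hash")

    reasons = sorted(flags & REJECT_FLAGS)
    if meta.get("license") != "approved":
        reasons.append("incompatible_license")
    if not meta.get("source_id"):
        reasons.append("unknown_provenance")
    if digest is not None and digest in seen_hashes:
        reasons.append("duplicate")
    if meta.get("tlp") == "RED" and not meta.get("reviewed"):
        reasons.append("unreviewed_tlp_red")
    if reasons:
        return Verdict.REJECT, reasons

    review = sorted(flags & REVIEW_FLAGS)
    if review:
        return Verdict.FLAG, review

    # Only accepted examples are remembered for duplicate detection.
    if digest is not None:
        seen_hashes.add(digest)
    return Verdict.ACCEPT, []
```
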

---

## 13. Dataset Builder UI Requirements

IntelMiner should be visible in the Blue48 Operations Cockpit.

Screens:

| Screen | Purpose |
|---|---|
| **Dataset Sources** | Manage approved sources, license status, and collection schedules. |
| **Training Candidate Queue** | Review generated examples before approval. |
| **Example Review** | Edit, approve, reject, or mark examples unsafe. |
| **Dataset Builder** | Export versioned JSONL datasets with train/validation split. |
| **Dataset Audit** | Track source, reviewer, license, and policy version. |

Candidate fields:

| Field | Meaning |
|---|---|
| Task | IOC extraction, routing, classification, etc. |
| Source | advisory, blog, report, synthetic, internal. |
| License | approved, restricted, unknown, rejected. |
| Quality score | Estimated usefulness. |
| Safety flag | safe, needs review, reject. |
| Reviewer status | pending, approved, rejected. |


---

## 14. Dataset Versioning

Datasets should be versioned clearly:

```text
dataset-router-v0.1
dataset-ioc-extractor-v0.3
dataset-evidence-handler-v0.2
dataset-report-writer-v0.2
```

Each export should include:

- dataset name
- version
- date
- number of examples
- task distribution
- source distribution
- license distribution
- reviewer count
- rejected example count
- train/validation split
- policy version

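
The export fields listed above could be collected into a manifest written next to each JSONL file. In this sketch the manifest key names, the 10% validation fraction, and the hash-based split are all assumptions; it also assumes each example carries the `dedupe_hash`, `reviewer_id`, `source_type`, and `license` metadata from section 11.

```python
from collections import Counter
from datetime import date


def split_bucket(dedupe_hash: str, validation_fraction: float = 0.1) -> str:
    """Assign a split from the example's hash so re-exports stay stable."""
    position = int(dedupe_hash[:8], 16) / 0x1_0000_0000  # value in [0, 1)
    return "validation" if position < validation_fraction else "train"


def build_manifest(name, version, examples, rejected_count, policy_version,
                   export_date=None) -> dict:
    """Summarize one export using the fields listed above."""
    metas = [example["metadata"] for example in examples]
    return {
        "dataset_name": name,
        "version": version,
        "date": (export_date or date.today()).isoformat(),
        "example_count": len(examples),
        "task_distribution": dict(Counter(e["task"] for e in examples)),
        "source_distribution": dict(Counter(m["source_type"] for m in metas)),
        "license_distribution": dict(Counter(m["license"] for m in metas)),
        "reviewer_count": len({m["reviewer_id"] for m in metas}),
        "rejected_example_count": rejected_count,
        "split": dict(Counter(split_bucket(m["dedupe_hash"]) for m in metas)),
        "policy_version": policy_version,
    }
```

Deriving the split from the hash rather than from a random shuffle means an example never moves between train and validation when the dataset grows from one version to the next.
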

---

## 15. Human Review Requirements

Human approval is required before examples become training data.

Reviewers should check:

- factual correctness
- source license
- safety boundaries
- absence of raw sensitive data
- correct label
- useful expected output
- no attacker-enabling content

Two-person review is recommended for:

- internal case-derived examples
- sensitive incident examples
- actor attribution examples
- routing examples involving law enforcement or critical infrastructure
- examples derived from TLP:AMBER material

TLP:RED material should not be used for LoRA training unless an explicit legal, operational, and governance policy exists.

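
The review rules can be expressed as a required approval count per example. The sketch below is one possible reading: the flag names in `TWO_PERSON_FLAGS` are hypothetical, the recommendation for two-person review is treated as a hard requirement, and TLP:RED is blocked unless the caller states that a governing policy exists.

```python
# Hypothetical safety_flags values for the categories listed above.
TWO_PERSON_FLAGS = {
    "sensitive_incident",
    "actor_attribution",
    "law_enforcement_routing",
    "critical_infrastructure_routing",
}


def required_approvals(metadata: dict) -> int:
    """Return 2 for the categories that call for two-person review, else 1."""
    flags = set(metadata.get("safety_flags", []))
    needs_two = (
        metadata.get("source_type") == "internal_approved"
        or metadata.get("tlp") == "AMBER"
        or bool(flags & TWO_PERSON_FLAGS)
    )
    return 2 if needs_two else 1


def is_approved(metadata: dict, approver_ids: list,
                red_policy_in_place: bool = False) -> bool:
    """Check that enough distinct reviewers have approved the example."""
    if metadata.get("tlp") == "RED" and not red_policy_in_place:
        return False
    return len(set(approver_ids)) >= required_approvals(metadata)
```

Counting distinct reviewer IDs prevents one person from satisfying a two-person requirement by approving twice.
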

---

## 16. Summary

IntelMiner is the bridge between Blue48 operations and future specialized defensive models.

It should collect only lawful and approved data, check license and safety constraints, build structured examples, require human review, and export versioned JSONL datasets. The first LoRA should likely be `lora-router`, followed by `lora-ioc-extractor` and `lora-evidence-handler`.