m17hr1l e04c6c96d8 init: scaffold psyc — defensive CTI routing & evidence-sealing platform
Stage-1 vertical slice: Pydantic Case model, SQLAlchemy Core persistence,
URLhaus Scoutline fetcher, FastAPI/Jinja cockpit (cases list + detail),
flat Typer CLI, Result[T, E] type module, structlog config.
Architecture in docs/dossier.md; 12-fold style guide in docs/style.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-14 12:43:47 +02:00

# Blue48 IntelMiner and LoRA Training Data Pipeline
**Document type:** Project record / technical concept
**Scope:** Lawful intelligence collection, training-data preparation, LoRA dataset format, quality gates, safety boundaries
**Status:** Draft v1
---
## 1. Purpose
IntelMiner is the Blue48 worker responsible for collecting lawful defensive cyber-intelligence and converting it into reviewed, license-safe, LoRA-ready training examples.
IntelMiner does not train models to hack. It prepares training data for defensive tasks such as indicator extraction, routing, severity classification, evidence handling, and safe report writing.
Core mission:
> IntelMiner collects lawful defensive cyber-intelligence from approved online sources and transforms it into reviewed, license-safe, LoRA-ready JSONL examples for specialized defensive models.
---
## 2. What IntelMiner Should Learn From
Allowed source categories:
- national CERT advisories
- CISA, ENISA, NCSC, CERT-EU, BSI, ANSSI, and similar public advisories
- CVE, NVD, and exploited-vulnerability catalogs
- public vendor threat reports
- public malware-analysis reports
- public ransomware trend reports from lawful monitors
- MISP events where the license and sharing group permit reuse
- abuse.ch datasets where permitted
- public IOCs and defensive detection content
- public incident writeups
- internally written reports approved for training
- synthetic examples written by analysts
Restricted or excluded source categories:
- raw stolen data
- raw credentials
- private victim communications
- criminal-forum content obtained without authorization
- confidential CTI provider content without training rights
- TLP:RED material
- material with unknown or incompatible license
- content that teaches exploitation, persistence, credential abuse, ransomware operation, or evasion
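
The two lists above amount to an allowlist plus a few hard blocks. A minimal sketch of that admission check follows; the category identifiers, the `Source` dataclass, and `admit_source` are hypothetical names chosen for illustration, not part of any existing Blue48 module.

```python
from dataclasses import dataclass

# Hypothetical identifiers mirroring the allowed source categories above.
ALLOWED_CATEGORIES = {
    "cert_advisory", "vulnerability_catalog", "vendor_report",
    "malware_analysis", "ransomware_trend_report", "misp_event",
    "abuse_ch_dataset", "public_ioc", "incident_writeup",
    "internal_approved", "synthetic",
}


@dataclass(frozen=True)
class Source:
    name: str
    category: str
    license: str  # "approved", "restricted", "unknown", or "rejected"
    tlp: str      # "CLEAR", "GREEN", "AMBER", or "RED"


def admit_source(source: Source) -> tuple[bool, str]:
    """Return (admitted, reason). Anything not explicitly allowed is blocked."""
    if source.category not in ALLOWED_CATEGORIES:
        return False, "category not on allowlist"
    if source.license != "approved":
        return False, f"license is {source.license}"
    if source.tlp == "RED":
        return False, "TLP:RED material is excluded"
    return True, "ok"
```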
---
## 3. IntelMiner Worker Chain
```text
SourcePlanner
→ Collector
→ LicenseChecker
→ ContentParser
→ Chunker
→ Labeler
→ ExampleBuilder
→ QualityGate
→ ReviewerQueue
→ DatasetWriter
```
---
## 4. Worker Responsibilities
| Worker | Responsibility |
|---|---|
| **SourcePlanner** | Defines approved sources, update schedules, license expectations, and collection priority. |
| **Collector** | Pulls data from APIs, RSS, advisories, STIX/TAXII, MISP, GitHub, PDFs, and public reports. |
| **LicenseChecker** | Determines whether the material may be used for training. Blocks unknown or restricted content. |
| **ContentParser** | Extracts text, IOCs, dates, actors, CVEs, TTPs, victim sectors, and source metadata. |
| **Chunker** | Splits long content into training-sized units while preserving context. |
| **Labeler** | Assigns task labels such as IOC extraction, routing, classification, report writing, and evidence handling. |
| **ExampleBuilder** | Converts chunks into instruction/input/output training examples. |
| **QualityGate** | Removes unsafe, duplicated, mislabeled, low-confidence, or license-problematic examples. |
| **ReviewerQueue** | Sends candidates to human reviewers. Nothing enters the final dataset without approval. |
| **DatasetWriter** | Exports approved examples as versioned JSONL datasets. |
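
One way to picture the chain is as a sequence of generator stages, each consuming and yielding records. The sketch below implements only three simplified stages to show the composition; the record shape (plain dicts with `text` and `license` keys) and the fixed-size chunking are assumptions, and a real Chunker would split on sentence or section boundaries to preserve context.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def license_checker(records: Iterable[Record]) -> Iterator[Record]:
    """Drop anything whose license is not explicitly approved."""
    for record in records:
        if record.get("license") == "approved":
            yield record


def chunker(records: Iterable[Record], max_chars: int = 200) -> Iterator[Record]:
    """Naive fixed-size chunking (illustrative only)."""
    for record in records:
        text = record["text"]
        for start in range(0, len(text), max_chars):
            yield {**record, "text": text[start:start + max_chars]}


def reviewer_queue(records: Iterable[Record]) -> Iterator[Record]:
    """Mark every candidate as pending; nothing is approved automatically."""
    for record in records:
        yield {**record, "reviewed": False, "status": "pending"}


def run_chain(records: Iterable[Record], stages: list[Stage]) -> list[Record]:
    """Feed the output of each stage into the next, in order."""
    for stage in stages:
        records = stage(records)
    return list(records)
```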
---
## 5. Training Tasks
The LoRA adapters should learn defensive operations only.
| Task | Purpose |
|---|---|
| **ioc_extraction** | Extract domains, IPs, URLs, hashes, emails, wallets, CVEs, and file names. |
| **ttp_mapping** | Map report language to MITRE ATT&CK-style techniques. |
| **severity_classification** | Classify weak signal, credible threat, confirmed exposure, campaign intelligence, or imminent harm. |
| **routing_decision** | Decide which reporting destinations are appropriate and in what order. |
| **evidence_handling** | Decide whether evidence must be sealed, minimized, excluded, or internally retained. |
| **actor_normalization** | Normalize actor names, aliases, ransomware brands, and campaigns. |
| **source_reliability** | Estimate source reliability and information credibility. |
| **report_drafting** | Draft structured victim, CERT, provider, MISP, or public reports. |
| **public_publishing** | Produce sanitized public intelligence after mitigation. |
Do not create training examples for:
- exploitation steps
- credential abuse
- phishing construction
- malware deployment
- ransomware operations
- evasion
- stealth
- persistence
- unauthorized forum access
- instructions for obtaining stolen data
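
A closed set of task labels makes the boundary above enforceable in code: a label outside the set cannot be assigned at all. The following is a minimal sketch using the task names from the table; the `Task` enum and `parse_task` helper are hypothetical.

```python
from enum import Enum


class Task(str, Enum):
    IOC_EXTRACTION = "ioc_extraction"
    TTP_MAPPING = "ttp_mapping"
    SEVERITY_CLASSIFICATION = "severity_classification"
    ROUTING_DECISION = "routing_decision"
    EVIDENCE_HANDLING = "evidence_handling"
    ACTOR_NORMALIZATION = "actor_normalization"
    SOURCE_RELIABILITY = "source_reliability"
    REPORT_DRAFTING = "report_drafting"
    PUBLIC_PUBLISHING = "public_publishing"


def parse_task(label: str) -> Task:
    """Accept only the defensive task labels; reject everything else."""
    try:
        return Task(label)
    except ValueError:
        raise ValueError(f"unsupported task label: {label!r}") from None
```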
---
## 6. Recommended LoRA Strategy
Do not start by training one large mixed LoRA. Start with small task-specific adapters.
Recommended adapter order:
| Priority | Adapter | Reason |
|---:|---|---|
| 1 | **lora-router** | Central to the project and easier to evaluate objectively. |
| 2 | **lora-ioc-extractor** | High utility, clear labels, measurable precision and recall. |
| 3 | **lora-evidence-handler** | Helps enforce safe handling decisions. |
| 4 | **lora-report-writer** | Drafts structured notifications after reviewed facts exist. |
| 5 | **lora-actor-normalizer** | Improves actor and campaign mapping. |
| 6 | **lora-public-publisher** | Produces public-safe summaries after mitigation. |
Training should begin only after enough reviewed examples exist:
- 1,000+ reviewed examples for a single narrow task, or
- 3,000 to 10,000 mixed examples across several tasks.
Until then, use rules, retrieval, embeddings, and human-reviewed prompts.
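
The thresholds above can be checked mechanically against the reviewed pool. This sketch assumes the pool is available as a list of task labels, one per reviewed example, and reads "several tasks" as more than one; both are assumptions, and `training_readiness` is a hypothetical helper.

```python
from collections import Counter

SINGLE_TASK_MIN = 1_000  # reviewed examples needed for one narrow task
MIXED_MIN = 3_000        # lower bound of the mixed-dataset range


def training_readiness(reviewed_tasks: list[str]) -> dict:
    """Report which adapters have enough reviewed examples to start training."""
    counts = Counter(reviewed_tasks)
    single_ready = sorted(t for t, n in counts.items() if n >= SINGLE_TASK_MIN)
    mixed_ready = len(counts) > 1 and sum(counts.values()) >= MIXED_MIN
    return {"single_task_ready": single_ready, "mixed_ready": mixed_ready}
```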
---
## 7. JSONL Training Format
Each JSONL line should contain one training example.
Standard structure:
```json
{
  "task": "routing_decision",
  "instruction": "Given a defensive cyber-intelligence signal, choose the correct reporting destinations and order.",
  "input": {},
  "output": {},
  "metadata": {
    "source_type": "public_advisory | vendor_report | synthetic | internal_approved",
    "tlp": "CLEAR | GREEN | AMBER",
    "license": "approved",
    "reviewed": true,
    "policy_version": "v1",
    "dataset_version": "dataset-router-v0.1"
  }
}
```
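
The structure is shown pretty-printed for readability; on disk each example occupies a single line. A minimal sketch of how a writer might serialize and read back such lines is given here, using only the standard library. The helper names and the choice to refuse unreviewed examples at write time are assumptions.

```python
import json

REQUIRED_KEYS = {"task", "instruction", "input", "output", "metadata"}


def to_jsonl_line(example: dict) -> str:
    """Serialize one example to a single JSON line, refusing unreviewed input."""
    missing = REQUIRED_KEYS - example.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if example["metadata"].get("reviewed") is not True:
        raise ValueError("only reviewed examples may be written")
    # json.dumps without indent never emits a raw newline.
    return json.dumps(example, ensure_ascii=False, sort_keys=True)


def read_jsonl(text: str) -> list[dict]:
    """Parse JSONL content, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```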
---
## 8. Example: IOC Extraction
```json
{
  "task": "ioc_extraction",
  "instruction": "Extract defensive indicators from the cyber threat report. Return JSON only.",
  "input": "A phishing campaign used login-example[.]com and delivered payload hash 44d88612fea8a8f36de82e1278abb02f. The actor referenced CVE-2024-12345.",
  "output": {
    "domains": ["login-example.com"],
    "hashes": ["44d88612fea8a8f36de82e1278abb02f"],
    "cves": ["CVE-2024-12345"],
    "ips": [],
    "urls": []
  },
  "metadata": {
    "source_type": "synthetic_or_public_report",
    "tlp": "CLEAR",
    "license": "approved",
    "reviewed": true
  }
}
```
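
Before an adapter exists, a rule-based baseline can produce the same output shape and later serve as a comparison point for precision and recall. The sketch below is one such baseline using regular expressions; the patterns are deliberately simple and not exhaustive (for example, file names such as `invoice.pdf` would be misread as domains), and every function name is hypothetical.

```python
import re

URL_RE = re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE)
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")  # does not range-check octets
HASH_RE = re.compile(r"\b(?:[a-f0-9]{64}|[a-f0-9]{40}|[a-f0-9]{32})\b", re.IGNORECASE)
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\b", re.IGNORECASE
)


def _unique(items) -> list:
    """De-duplicate while keeping first-seen order."""
    return list(dict.fromkeys(items))


def refang(text: str) -> str:
    """Undo two common defanging conventions: '[.]' and 'hxxp'."""
    return text.replace("[.]", ".").replace("hxxp", "http")


def extract_iocs(report: str) -> dict:
    text = refang(report)
    urls = _unique(u.rstrip(".,;)") for u in URL_RE.findall(text))
    # Remove URLs first so their hosts are not also reported as bare domains.
    remainder = URL_RE.sub(" ", text)
    return {
        "domains": _unique(d.lower() for d in DOMAIN_RE.findall(remainder)),
        "hashes": _unique(h.lower() for h in HASH_RE.findall(remainder)),
        "cves": _unique(c.upper() for c in CVE_RE.findall(remainder)),
        "ips": _unique(IP_RE.findall(remainder)),
        "urls": urls,
    }
```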
---
## 9. Example: Routing Decision
```json
{
  "task": "routing_decision",
  "instruction": "Given a defensive cyber-intelligence signal, choose the correct reporting destinations and order.",
  "input": {
    "incident_type": "access_sale",
    "victim_country": "DE",
    "sector": "energy",
    "critical_infrastructure": true,
    "confidence": "high",
    "tlp": "AMBER"
  },
  "output": {
    "severity": "critical",
    "routes": [
      "CERT-Bund",
      "victim_security_team",
      "sector_isac",
      "law_enforcement_cyber_unit",
      "misp_trusted_community"
    ],
    "evidence_handling": "authority_sealed_package"
  },
  "metadata": {
    "reviewed": true,
    "policy_version": "v1"
  }
}
```
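
Because the expected output is a severity plus an ordered list of routes, a router's predictions can be scored objectively against reviewed examples. The following sketch shows one possible scoring function; the metric names and the decision to report order separately from membership are assumptions.

```python
def score_routing(expected: dict, predicted: dict) -> dict:
    """Compare a predicted routing output with the reviewed one."""
    exp_routes, pred_routes = expected["routes"], predicted["routes"]
    exp_set, pred_set = set(exp_routes), set(pred_routes)
    overlap = exp_set & pred_set
    return {
        "severity_match": expected["severity"] == predicted["severity"],
        # Share of predicted destinations that were correct.
        "route_precision": len(overlap) / len(pred_set) if pred_set else 0.0,
        # Share of required destinations that were predicted.
        "route_recall": len(overlap) / len(exp_set) if exp_set else 0.0,
        # Strict check: same destinations in the same order.
        "order_match": exp_routes == pred_routes,
    }
```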
---
## 10. Example: Evidence Handling
```json
{
  "task": "evidence_handling",
  "instruction": "Decide how evidence should be handled before external submission.",
  "input": {
    "evidence_type": "stolen_credentials",
    "destination": "public_abuse_api",
    "contains_pii": true,
    "tlp": "RED"
  },
  "output": {
    "submit_raw": false,
    "handling": "do_not_send_raw_to_public_api",
    "allowed_payload": "metadata_only",
    "sealed_package_required": true,
    "authorized_recipients": ["victim_security_team", "national_cert"]
  },
  "metadata": {
    "reviewed": true
  }
}
```
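
Examples of this task can be sanity-checked before review: an output that permits raw submission of sensitive evidence contradicts the handling shown above. The sketch below is one such consistency check. The definition of "sensitive" and the rule that sensitive evidence always requires a sealed package are assumptions inferred from the single example, not a stated policy.

```python
# Hypothetical evidence types treated as sensitive regardless of other fields.
SENSITIVE_EVIDENCE = {"stolen_credentials", "stolen_data"}


def handling_problems(example: dict) -> list[str]:
    """Return policy inconsistencies found in an evidence_handling example."""
    inp, out = example["input"], example["output"]
    sensitive = (
        inp.get("contains_pii", False)
        or inp.get("tlp") == "RED"
        or inp.get("evidence_type") in SENSITIVE_EVIDENCE
    )
    problems = []
    if sensitive and out.get("submit_raw", False):
        problems.append("raw submission of sensitive evidence")
    if sensitive and not out.get("sealed_package_required", False):
        problems.append("sealed package not required for sensitive evidence")
    return problems
```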
---
## 11. Dataset Metadata
Every example should include metadata.
| Field | Purpose |
|---|---|
| `task` | Training task category. |
| `source_type` | Origin category of the example. |
| `source_id` | Internal reference to source document. |
| `license` | Approved, restricted, unknown, or rejected. |
| `tlp` | CLEAR, GREEN, AMBER, or RED. |
| `reviewed` | Human approval status. |
| `reviewer_id` | Internal reviewer identity or role ID. |
| `policy_version` | Version of handling policy used. |
| `dataset_version` | Versioned dataset name. |
| `safety_flags` | Unsafe content or sensitive material flags. |
| `dedupe_hash` | Used to prevent duplicate examples. |
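
The `dedupe_hash` field can be derived from the example content itself. A minimal sketch, assuming the hash covers only `task`, `instruction`, `input`, and `output` (so that metadata changes such as a new reviewer do not alter it) and that SHA-256 over canonical JSON is acceptable:

```python
import hashlib
import json

CONTENT_KEYS = ("task", "instruction", "input", "output")


def dedupe_hash(example: dict) -> str:
    """Hash the content fields of an example in a key-order-independent way."""
    core = {key: example[key] for key in CONTENT_KEYS}
    canonical = json.dumps(core, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_duplicates(examples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct example."""
    seen: set[str] = set()
    unique = []
    for example in examples:
        digest = dedupe_hash(example)
        if digest not in seen:
            seen.add(digest)
            unique.append(example)
    return unique
```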
---
## 12. QualityGate Rules
QualityGate must reject examples that contain:
- raw credentials
- raw stolen data
- private victim information
- live access details
- exploit chains
- malware deployment steps
- phishing instructions
- evasion or persistence guidance
- incompatible license
- unknown provenance
- duplicated content
- unreviewed TLP:RED or confidential content
QualityGate should flag for human review when:
- source license is ambiguous
- actor attribution is uncertain
- victim identity is named
- sample contains personal data
- output exposes operationally sensitive details
- example conflicts with policy
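
The two rule sets map naturally onto a three-way verdict: reject, flag for review, or pass. The sketch below assumes that upstream detectors have already populated `safety_flags` with the hypothetical flag names shown, and that metadata fields are available at the top level of the candidate dict; how unsafe content is detected in the first place is out of scope here.

```python
from enum import Enum


class Verdict(str, Enum):
    PASS = "pass"
    FLAG = "needs_review"
    REJECT = "reject"


# Hypothetical flag names, one per rule in the lists above.
REJECT_FLAGS = {
    "raw_credentials", "raw_stolen_data", "private_victim_information",
    "live_access_details", "exploit_chain", "malware_deployment",
    "phishing_instructions", "evasion_or_persistence",
}
REVIEW_FLAGS = {
    "ambiguous_license", "uncertain_attribution", "named_victim",
    "personal_data", "operationally_sensitive", "policy_conflict",
}


def quality_gate(candidate: dict, seen_hashes: set[str]) -> tuple[Verdict, list[str]]:
    """Apply reject rules first, then review rules; return verdict and reasons."""
    flags = set(candidate.get("safety_flags", []))
    reasons = sorted(flags & REJECT_FLAGS)
    if candidate.get("license") != "approved":
        reasons.append("incompatible_license")
    if not candidate.get("source_id"):
        reasons.append("unknown_provenance")
    if candidate.get("dedupe_hash") in seen_hashes:
        reasons.append("duplicate")
    if candidate.get("tlp") == "RED" and not candidate.get("reviewed"):
        reasons.append("unreviewed_tlp_red")
    if reasons:
        return Verdict.REJECT, reasons
    review = sorted(flags & REVIEW_FLAGS)
    if review:
        return Verdict.FLAG, review
    return Verdict.PASS, []
```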
---
## 13. Dataset Builder UI Requirements
IntelMiner should be visible in the Blue48 Operations Cockpit.
Screens:
| Screen | Purpose |
|---|---|
| **Dataset Sources** | Manage approved sources, license status, and collection schedules. |
| **Training Candidate Queue** | Review generated examples before approval. |
| **Example Review** | Edit, approve, reject, or mark examples unsafe. |
| **Dataset Builder** | Export versioned JSONL datasets with train/validation split. |
| **Dataset Audit** | Track source, reviewer, license, and policy version. |
Candidate fields:
| Field | Meaning |
|---|---|
| Task | IOC extraction, routing, classification, etc. |
| Source | advisory, blog, report, synthetic, internal. |
| License | approved, restricted, unknown, rejected. |
| Quality score | Estimated usefulness. |
| Safety flag | safe, needs review, reject. |
| Reviewer status | pending, approved, rejected. |
---
## 14. Dataset Versioning
Datasets should be versioned clearly:
```text
dataset-router-v0.1
dataset-ioc-extractor-v0.3
dataset-evidence-handler-v0.2
dataset-report-writer-v0.2
```
Each export should include:
- dataset name
- version
- date
- number of examples
- task distribution
- source distribution
- license distribution
- reviewer count
- rejected example count
- train/validation split
- policy version
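
The export summary can be generated alongside the JSONL file. The sketch below builds such a manifest and assigns the train/validation split deterministically from each example's `dedupe_hash`, so an example stays in the same split across re-exports. The 10% validation share, the manifest key names, and the assumption that each example carries `source_type`, `license`, `reviewer_id`, and `dedupe_hash` in its metadata are all illustrative.

```python
from collections import Counter
from datetime import date


def split_bucket(dedupe_hash: str, validation_percent: int = 10) -> str:
    """Derive the split from the hash instead of a random draw."""
    bucket = int(dedupe_hash[:8], 16) % 100
    return "validation" if bucket < validation_percent else "train"


def build_manifest(name: str, version: str, approved: list[dict],
                   rejected_count: int, policy_version: str) -> dict:
    """Summarize an export using the fields listed above."""
    meta = [example["metadata"] for example in approved]
    return {
        "dataset": f"dataset-{name}-v{version}",
        "version": version,
        "date": date.today().isoformat(),
        "example_count": len(approved),
        "task_distribution": dict(Counter(e["task"] for e in approved)),
        "source_distribution": dict(Counter(m["source_type"] for m in meta)),
        "license_distribution": dict(Counter(m["license"] for m in meta)),
        "reviewer_count": len({m["reviewer_id"] for m in meta}),
        "rejected_count": rejected_count,
        "split": dict(Counter(split_bucket(m["dedupe_hash"]) for m in meta)),
        "policy_version": policy_version,
    }
```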
---
## 15. Human Review Requirements
Human approval is required before examples become training data.
Reviewers should check:
- factual correctness
- source license
- safety boundaries
- absence of raw sensitive data
- correct label
- useful expected output
- no attacker-enabling content
Two-person review is recommended for:
- internal case-derived examples
- sensitive incident examples
- actor attribution examples
- routing examples involving law enforcement or critical infrastructure
- examples derived from TLP:AMBER material
TLP:RED material should not be used for LoRA training unless an explicit legal, operational, and governance policy exists.
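
The review rules above can be encoded as the number of distinct approvals an example needs before it may enter a dataset. In this sketch, TLP:RED is treated as blocked outright (no explicit policy is assumed to exist), and the flag names `sensitive_incident` and `actor_attribution` as well as the substring test for law-enforcement routes are hypothetical conventions.

```python
from typing import Optional


def required_approvals(example: dict) -> Optional[int]:
    """Return 1 or 2 required approvals, or None if the example is blocked."""
    meta = example.get("metadata", {})
    if meta.get("tlp") == "RED":
        return None  # blocked unless an explicit governance policy exists
    flags = set(meta.get("safety_flags", []))
    sensitive_routing = False
    if example.get("task") == "routing_decision":
        routes = example["output"].get("routes", [])
        sensitive_routing = bool(
            example["input"].get("critical_infrastructure", False)
            or any("law_enforcement" in route for route in routes)
        )
    two_person = (
        meta.get("source_type") == "internal_approved"
        or meta.get("tlp") == "AMBER"
        or bool(flags & {"sensitive_incident", "actor_attribution"})
        or sensitive_routing
    )
    return 2 if two_person else 1


def is_approved(example: dict, approver_ids: list[str]) -> bool:
    """Count distinct reviewers only; one person approving twice is not enough."""
    needed = required_approvals(example)
    return needed is not None and len(set(approver_ids)) >= needed
```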
---
## 16. Summary
IntelMiner is the bridge between Blue48 operations and future specialized defensive models.
It should collect only lawful and approved data, check license and safety constraints, build structured examples, require human review, and export versioned JSONL datasets. The first LoRA should likely be `lora-router`, followed by `lora-ioc-extractor` and `lora-evidence-handler`.