# Blue48 IntelMiner and LoRA Training Data Pipeline

- **Document type:** Project record / technical concept
- **Scope:** Lawful intelligence collection, training-data preparation, LoRA dataset format, quality gates, safety boundaries
- **Status:** Draft v1

---

## 1. Purpose

IntelMiner is the Blue48 worker responsible for collecting lawful defensive cyber-intelligence and converting it into reviewed, license-safe, LoRA-ready training examples.

IntelMiner does not train models to hack. It prepares training data for defensive tasks such as indicator extraction, routing, severity classification, evidence handling, and safe report writing.

Core mission:

> IntelMiner collects lawful defensive cyber-intelligence from approved online sources and transforms it into reviewed, license-safe, LoRA-ready JSONL examples for specialized defensive models.

---

## 2. What IntelMiner Should Learn From

Allowed source categories:

- national CERT advisories
- CISA, ENISA, NCSC, CERT-EU, BSI, ANSSI, and similar public advisories
- CVE, NVD, and exploited-vulnerability catalogs
- public vendor threat reports
- public malware-analysis reports
- public ransomware trend reports from lawful monitors
- MISP events where the license and sharing group permit reuse
- abuse.ch datasets where permitted
- public IOCs and defensive detection content
- public incident writeups
- internally written reports approved for training
- synthetic examples written by analysts

Restricted or excluded source categories:

- raw stolen data
- raw credentials
- private victim communications
- criminal-forum content obtained without authorization
- confidential CTI provider content without training rights
- TLP:RED material
- material with unknown or incompatible license
- content that teaches exploitation, persistence, credential abuse, ransomware operation, or evasion

---

## 3. IntelMiner Worker Chain

```text
SourcePlanner
→ Collector
→ LicenseChecker
→ ContentParser
→ Chunker
→ Labeler
→ ExampleBuilder
→ QualityGate
→ ReviewerQueue
→ DatasetWriter
```

---

## 4. Worker Responsibilities

| Worker | Responsibility |
|---|---|
| **SourcePlanner** | Defines approved sources, update schedules, license expectations, and collection priority. |
| **Collector** | Pulls data from APIs, RSS, advisories, STIX/TAXII, MISP, GitHub, PDFs, and public reports. |
| **LicenseChecker** | Determines whether the material may be used for training. Blocks unknown or restricted content. |
| **ContentParser** | Extracts text, IOCs, dates, actors, CVEs, TTPs, victim sectors, and source metadata. |
| **Chunker** | Splits long content into training-sized units while preserving context. |
| **Labeler** | Assigns task labels such as IOC extraction, routing, classification, report writing, and evidence handling. |
| **ExampleBuilder** | Converts chunks into instruction/input/output training examples. |
| **QualityGate** | Removes unsafe, duplicated, mislabeled, low-confidence, or license-problematic examples. |
| **ReviewerQueue** | Sends candidates to human reviewers. Nothing enters the final dataset without approval. |
| **DatasetWriter** | Exports approved examples as versioned JSONL datasets. |
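
One way to picture the chain is as an ordered list of stages, each of which either passes a candidate on or drops it. The following is a minimal Python sketch under that assumption. The function names, the `Candidate` dictionary shape, and the `review_status` field are illustrative and do not describe existing Blue48 code; only three of the ten workers are stubbed.

```python
from typing import Callable, Optional

Candidate = dict
Stage = Callable[[Candidate], Optional[Candidate]]


def license_checker(candidate: Candidate) -> Optional[Candidate]:
    # Block anything whose license is not explicitly approved.
    return candidate if candidate.get("license") == "approved" else None


def labeler(candidate: Candidate) -> Optional[Candidate]:
    # Hypothetical placeholder: keep an existing label or mark the candidate unlabeled.
    return {**candidate, "task": candidate.get("task", "unlabeled")}


def reviewer_queue(candidate: Candidate) -> Optional[Candidate]:
    # Candidates wait here; only a human reviewer may later set reviewed=True.
    return {**candidate, "reviewed": False, "review_status": "pending"}


def run_chain(candidate: Candidate, stages: list[Stage]) -> Optional[Candidate]:
    """Run stages in order; stop as soon as one stage drops the candidate."""
    current: Optional[Candidate] = candidate
    for stage in stages:
        current = stage(current)
        if current is None:
            return None
    return current


STAGES: list[Stage] = [license_checker, labeler, reviewer_queue]
```

Returning `None` to drop a candidate keeps every stage independent, so a blocked license never reaches the later workers.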

---

## 5. Training Tasks

The LoRA adapters should learn defensive operations only.

| Task | Purpose |
|---|---|
| **ioc_extraction** | Extract domains, IPs, URLs, hashes, emails, wallets, CVEs, and file names. |
| **ttp_mapping** | Map report language to MITRE ATT&CK-style techniques. |
| **severity_classification** | Classify a signal as weak signal, credible threat, confirmed exposure, campaign intelligence, or imminent harm. |
| **routing_decision** | Decide which reporting destinations are appropriate and in what order. |
| **evidence_handling** | Decide whether evidence must be sealed, minimized, excluded, or internally retained. |
| **actor_normalization** | Normalize actor names, aliases, ransomware brands, and campaigns. |
| **source_reliability** | Estimate source reliability and information credibility. |
| **report_drafting** | Draft structured victim, CERT, provider, MISP, or public reports. |
| **public_publishing** | Produce sanitized public intelligence after mitigation. |

Do not create training examples for:

- exploitation steps
- credential abuse
- phishing construction
- malware deployment
- ransomware operations
- evasion
- stealth
- persistence
- unauthorized forum access
- instructions for obtaining stolen data

---

## 6. Recommended LoRA Strategy

Do not start by training one large mixed LoRA. Start with small task-specific adapters.

Recommended adapter order:

| Priority | Adapter | Reason |
|---:|---|---|
| 1 | **lora-router** | Central to the project and easy to evaluate objectively. |
| 2 | **lora-ioc-extractor** | High utility, clear labels, measurable precision and recall. |
| 3 | **lora-evidence-handler** | Helps enforce safe handling decisions. |
| 4 | **lora-report-writer** | Drafts structured notifications after reviewed facts exist. |
| 5 | **lora-actor-normalizer** | Improves actor and campaign mapping. |
| 6 | **lora-public-publisher** | Produces public-safe summaries after mitigation. |

Training should begin only after enough reviewed examples exist:

- 1,000+ reviewed examples for a single narrow task, or
- 3,000–10,000 reviewed examples mixed across several tasks.

Until then, use rules, retrieval, embeddings, and human-reviewed prompts.
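
The thresholds above can be checked mechanically before any training run is scheduled. The sketch below is one possible reading of them: it counts only examples whose `metadata.reviewed` is `true`, and it assumes "mixed" means at least two distinct tasks. The function name and the constants are illustrative.

```python
from collections import Counter

SINGLE_TASK_MIN = 1_000  # reviewed examples needed for one narrow task
MIXED_MIN = 3_000        # lower bound for a mixed multi-task dataset


def training_readiness(examples: list[dict]) -> dict:
    """Count reviewed examples per task and compare against the thresholds."""
    counts = Counter(
        example["task"]
        for example in examples
        if example.get("task")
        and example.get("metadata", {}).get("reviewed") is True
    )
    ready_tasks = sorted(task for task, n in counts.items() if n >= SINGLE_TASK_MIN)
    # Assumption: "mixed" means the reviewed examples span at least two tasks.
    mixed_ready = len(counts) >= 2 and sum(counts.values()) >= MIXED_MIN
    return {"counts": dict(counts), "ready_tasks": ready_tasks, "mixed_ready": mixed_ready}
```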

---

## 7. JSONL Training Format

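Each line of a JSONL file holds exactly one training example as a single JSON object. The structures in this document are pretty-printed across several lines for readability only.

Standard structure (a `|` inside a value separates the allowed options; a real example contains exactly one of them):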

```json
{
  "task": "routing_decision",
  "instruction": "Given a defensive cyber-intelligence signal, choose the correct reporting destinations and order.",
  "input": {},
  "output": {},
  "metadata": {
    "source_type": "public_advisory | vendor_report | synthetic | internal_approved",
    "tlp": "CLEAR | GREEN | AMBER",
    "license": "approved",
    "reviewed": true,
    "policy_version": "v1",
    "dataset_version": "dataset-router-v0.1"
  }
}
```
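
A small validator can catch malformed lines before they reach the dataset. The sketch below checks the top-level keys shown in the standard structure and three of its metadata fields. The function name `validate_line` and the wording of the problem messages are assumptions made for illustration.

```python
import json

REQUIRED_KEYS = {"task", "instruction", "input", "output", "metadata"}
ALLOWED_TLP = {"CLEAR", "GREEN", "AMBER"}


def validate_line(line: str) -> list[str]:
    """Return a list of problems; an empty list means the line passed."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(example, dict):
        return ["top-level value is not a JSON object"]

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - example.keys())]
    metadata = example.get("metadata", {})
    if not isinstance(metadata, dict):
        return problems + ["metadata is not a JSON object"]

    tlp = metadata.get("tlp")
    if not isinstance(tlp, str) or tlp not in ALLOWED_TLP:
        problems.append("tlp is missing or not one of CLEAR, GREEN, AMBER")
    if metadata.get("license") != "approved":
        problems.append("license is not approved")
    if metadata.get("reviewed") is not True:
        problems.append("example has not been reviewed")
    return problems
```

Returning a list of problems instead of raising on the first one lets a reviewer see every defect in a line at once.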

---

## 8. Example: IOC Extraction

```json
{
  "task": "ioc_extraction",
  "instruction": "Extract defensive indicators from the cyber threat report. Return JSON only.",
  "input": "A phishing campaign used login-example[.]com and delivered payload hash 44d88612fea8a8f36de82e1278abb02f. The actor referenced CVE-2024-12345.",
  "output": {
    "domains": ["login-example.com"],
    "hashes": ["44d88612fea8a8f36de82e1278abb02f"],
    "cves": ["CVE-2024-12345"],
    "ips": [],
    "urls": []
  },
  "metadata": {
    "source_type": "synthetic",
    "tlp": "CLEAR",
    "license": "approved",
    "reviewed": true
  }
}
```
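
Because the expected output is a set of typed indicators, an extractor's answer can be scored against it with precision and recall. The sketch below uses exact string matching on `(type, value)` pairs, which is a simplification, and treats an empty set as a perfect score by convention. Both function names are illustrative.

```python
def flatten(iocs: dict) -> set:
    """Turn {"domains": [...], "hashes": [...]} into a set of (type, value) pairs."""
    return {(kind, value) for kind, values in iocs.items() for value in values}


def precision_recall(expected: dict, predicted: dict) -> tuple:
    """Score one prediction against the reviewed expected output."""
    expected_set, predicted_set = flatten(expected), flatten(predicted)
    true_positives = len(expected_set & predicted_set)
    # Convention (an assumption): an empty set scores 1.0 instead of dividing by zero.
    precision = true_positives / len(predicted_set) if predicted_set else 1.0
    recall = true_positives / len(expected_set) if expected_set else 1.0
    return precision, recall
```

For the example above, a prediction that finds the domain and the hash, misses the CVE, and adds one spurious IP address scores 2/3 on both measures.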

---

## 9. Example: Routing Decision

```json
{
  "task": "routing_decision",
  "instruction": "Given a defensive cyber-intelligence signal, choose the correct reporting destinations and order.",
  "input": {
    "incident_type": "access_sale",
    "victim_country": "DE",
    "sector": "energy",
    "critical_infrastructure": true,
    "confidence": "high",
    "tlp": "AMBER"
  },
  "output": {
    "severity": "critical",
    "routes": [
      "CERT-Bund",
      "victim_security_team",
      "sector_isac",
      "law_enforcement_cyber_unit",
      "misp_trusted_community"
    ],
    "evidence_handling": "authority_sealed_package"
  },
  "metadata": {
    "reviewed": true,
    "policy_version": "v1"
  }
}
```

---

## 10. Example: Evidence Handling

```json
{
  "task": "evidence_handling",
  "instruction": "Decide how evidence should be handled before external submission.",
  "input": {
    "evidence_type": "stolen_credentials",
    "destination": "public_abuse_api",
    "contains_pii": true,
    "tlp": "RED"
  },
  "output": {
    "submit_raw": false,
    "handling": "do_not_send_raw_to_public_api",
    "allowed_payload": "metadata_only",
    "sealed_package_required": true,
    "authorized_recipients": ["victim_security_team", "national_cert"]
  },
  "metadata": {
    "reviewed": true
  }
}
```

---

## 11. Dataset Metadata

Every example should include the following metadata. The examples in sections 8 to 10 show only a subset of these fields.

| Field | Purpose |
|---|---|
| `task` | Training task category. |
| `source_type` | Origin category of the example. |
| `source_id` | Internal reference to source document. |
| `license` | `approved`, `restricted`, `unknown`, or `rejected`. |
| `tlp` | `CLEAR`, `GREEN`, `AMBER`, or `RED`. |
| `reviewed` | Human approval status. |
| `reviewer_id` | Internal reviewer identity or role ID. |
| `policy_version` | Version of handling policy used. |
| `dataset_version` | Versioned dataset name. |
| `safety_flags` | Unsafe content or sensitive material flags. |
| `dedupe_hash` | Used to prevent duplicate examples. |
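
The document does not prescribe how `dedupe_hash` is computed. One plausible approach, sketched below, hashes a canonical JSON rendering of the fields that define the example's content and ignores `metadata`, so the same example reviewed twice still collides. The choice of SHA-256 and of the four hashed fields is an assumption.

```python
import hashlib
import json

CONTENT_FIELDS = ("task", "instruction", "input", "output")


def dedupe_hash(example: dict) -> str:
    """Hash the content-bearing fields of an example in a key-order-independent way."""
    core = {field: example.get(field) for field in CONTENT_FIELDS}
    # sort_keys and fixed separators give one canonical string per logical value.
    canonical = json.dumps(core, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

This only catches exact duplicates. Near-duplicates, such as the same report with different whitespace, would need normalization before hashing.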

---

## 12. QualityGate Rules

QualityGate must reject examples that contain:

- raw credentials
- raw stolen data
- private victim information
- live access details
- exploit chains
- malware deployment steps
- phishing instructions
- evasion or persistence guidance
- incompatible license
- unknown provenance
- duplicated content
- unreviewed TLP:RED or confidential content

QualityGate should flag for human review when:

- source license is ambiguous
- actor attribution is uncertain
- victim identity is named
- sample contains personal data
- output reveals operationally sensitive details
- example conflicts with policy
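
The two rule lists map naturally onto a three-way decision: reject, flag for review, or pass on to ReviewerQueue. The sketch below assumes that an upstream detector has already written findings into `metadata.safety_flags`; the flag names are hypothetical labels for the bullets above, and a missing `source_id` is treated as unknown provenance.

```python
REJECT_FLAGS = {
    "raw_credentials", "raw_stolen_data", "private_victim_info",
    "live_access_details", "exploit_chain", "malware_deployment",
    "phishing_instructions", "evasion_or_persistence",
}
REVIEW_FLAGS = {
    # "ambiguous_license" is assumed to mark an approval with open questions.
    "ambiguous_license", "uncertain_attribution", "named_victim",
    "personal_data", "operationally_sensitive", "policy_conflict",
}


def quality_gate(example: dict, seen_hashes: set) -> str:
    """Return "reject", "review", or "pass"; records the hash of kept examples."""
    meta = example.get("metadata", {})
    flags = set(meta.get("safety_flags") or [])

    if flags & REJECT_FLAGS:
        return "reject"
    if meta.get("license") != "approved":
        return "reject"  # incompatible or unknown license
    if not meta.get("source_id"):
        return "reject"  # unknown provenance
    if meta.get("tlp") == "RED" and meta.get("reviewed") is not True:
        return "reject"  # unreviewed TLP:RED content

    digest = meta.get("dedupe_hash")
    if digest is not None:
        if digest in seen_hashes:
            return "reject"  # duplicated content
        seen_hashes.add(digest)

    return "review" if flags & REVIEW_FLAGS else "pass"
```

A result of `"pass"` here does not mean approval. The example still goes to a human reviewer before it can enter a dataset.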

---

## 13. Dataset Builder UI Requirements

IntelMiner should be visible in the Blue48 Operations Cockpit.

Screens:

| Screen | Purpose |
|---|---|
| **Dataset Sources** | Manage approved sources, license status, and collection schedules. |
| **Training Candidate Queue** | Review generated examples before approval. |
| **Example Review** | Edit, approve, reject, or mark examples unsafe. |
| **Dataset Builder** | Export versioned JSONL datasets with train/validation split. |
| **Dataset Audit** | Track source, reviewer, license, and policy version. |

Candidate fields:

| Field | Meaning |
|---|---|
| Task | IOC extraction, routing, classification, etc. |
| Source | advisory, blog, report, synthetic, internal. |
| License | approved, restricted, unknown, rejected. |
| Quality score | Estimated usefulness. |
| Safety flag | safe, needs review, reject. |
| Reviewer status | pending, approved, rejected. |

---

## 14. Dataset Versioning

Datasets should be versioned clearly:

```text
dataset-router-v0.1
dataset-ioc-extractor-v0.3
dataset-evidence-handler-v0.2
dataset-report-writer-v0.2
```

Each export should include:

- dataset name
- version
- date
- number of examples
- task distribution
- source distribution
- license distribution
- reviewer count
- rejected example count
- train/validation split
- policy version
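
Most of these export fields can be derived directly from the approved examples. The sketch below builds such a manifest and assigns each example to the train or validation split by hashing its content, so the assignment stays stable across exports. The manifest key names, the 10% validation share, and the hashing scheme are assumptions.

```python
import hashlib
import json
from collections import Counter
from datetime import date


def split_of(example: dict, validation_percent: int = 10) -> str:
    """Assign a split deterministically from the example's own content."""
    key = json.dumps(example, sort_keys=True).encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "validation" if bucket < validation_percent else "train"


def build_manifest(name: str, version: str, examples: list[dict],
                   rejected_count: int, policy_version: str) -> dict:
    """Summarize one export using the fields listed above."""
    metas = [example.get("metadata", {}) for example in examples]
    return {
        "dataset_name": name,
        "version": version,
        "date": date.today().isoformat(),
        "example_count": len(examples),
        "task_distribution": dict(Counter(e.get("task") for e in examples)),
        "source_distribution": dict(Counter(m.get("source_type") for m in metas)),
        "license_distribution": dict(Counter(m.get("license") for m in metas)),
        "reviewer_count": len({m["reviewer_id"] for m in metas if m.get("reviewer_id")}),
        "rejected_example_count": rejected_count,
        "split": dict(Counter(split_of(e) for e in examples)),
        "policy_version": policy_version,
    }
```

A content-derived split avoids the case where a re-export moves an example from validation into train and silently leaks evaluation data.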

---

## 15. Human Review Requirements

Human approval is required before examples become training data.

Reviewers should check:

- factual correctness
- source license
- safety boundaries
- absence of raw sensitive data
- correct label
- useful expected output
- no attacker-enabling content

Two-person review is recommended for:

- internal case-derived examples
- sensitive incident examples
- actor attribution examples
- routing examples involving law enforcement or critical infrastructure
- examples derived from TLP:AMBER material

TLP:RED material should not be used for LoRA training unless an explicit legal, operational, and governance policy exists.
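
These review rules can be enforced in code so that an example cannot be approved with too few reviewers. The sketch below maps the recommendations onto fields used elsewhere in this document. Several mappings are assumptions: `internal_approved` stands in for internal case-derived examples, the `actor_normalization` task stands in for actor attribution, `sensitive_incident` is a hypothetical safety flag, and TLP:RED is treated as blocked because no explicit policy is assumed to exist.

```python
def review_requirement(example: dict) -> str:
    """Return "blocked", "two_person", or "single" for one candidate example."""
    meta = example.get("metadata", {})
    data_in = example.get("input")
    data_out = example.get("output")
    task = example.get("task")

    if meta.get("tlp") == "RED":
        return "blocked"  # assumes no explicit TLP:RED training policy exists

    routes = data_out.get("routes", []) if isinstance(data_out, dict) else []
    critical = isinstance(data_in, dict) and data_in.get("critical_infrastructure") is True
    sensitive_routing = task == "routing_decision" and (
        critical or any("law_enforcement" in route for route in routes)
    )
    two_person = (
        meta.get("source_type") == "internal_approved"
        or meta.get("tlp") == "AMBER"
        or "sensitive_incident" in (meta.get("safety_flags") or [])
        or task == "actor_normalization"
        or sensitive_routing
    )
    return "two_person" if two_person else "single"


def is_approved(example: dict, reviewer_ids: list[str]) -> bool:
    """Check that enough distinct reviewers have approved the example."""
    needed = {"blocked": None, "single": 1, "two_person": 2}[review_requirement(example)]
    return needed is not None and len(set(reviewer_ids)) >= needed
```

Counting distinct reviewer IDs, not approvals, prevents one reviewer from satisfying the two-person rule alone.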

---

## 16. Summary

IntelMiner is the bridge between Blue48 operations and future specialized defensive models.

It should collect only lawful and approved data, check license and safety constraints, build structured examples, require human review, and export versioned JSONL datasets. The first LoRA adapter should likely be `lora-router`, followed by `lora-ioc-extractor` and `lora-evidence-handler`.