psyc/docs/archive/intelminer.md
m17hr1l e04c6c96d8 init: scaffold psyc — defensive CTI routing & evidence-sealing platform
Stage-1 vertical slice: Pydantic Case model, SQLAlchemy Core persistence,
URLhaus Scoutline fetcher, FastAPI/Jinja cockpit (cases list + detail),
flat Typer CLI, Result[T, E] type module, structlog config.
Architecture in docs/dossier.md; 12-fold style guide in docs/style.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-14 12:43:47 +02:00


Blue48 IntelMiner and LoRA Training Data Pipeline

Document type: Project record / technical concept
Scope: Lawful intelligence collection, training-data preparation, LoRA dataset format, quality gates, safety boundaries
Status: Draft v1


1. Purpose

IntelMiner is the Blue48 worker responsible for collecting lawful defensive cyber-intelligence and converting it into reviewed, license-safe, LoRA-ready training examples.

IntelMiner does not train models to hack. It prepares training data for defensive tasks such as indicator extraction, routing, severity classification, evidence handling, and safe report writing.

Core mission:

IntelMiner collects lawful defensive cyber-intelligence from approved online sources and transforms it into reviewed, license-safe, LoRA-ready JSONL examples for specialized defensive models.


2. What IntelMiner Should Learn From

Allowed source categories:

  • national CERT advisories
  • CISA, ENISA, NCSC, CERT-EU, BSI, ANSSI, and similar public advisories
  • CVE, NVD, and exploited-vulnerability catalogs
  • public vendor threat reports
  • public malware-analysis reports
  • public ransomware trend reports from lawful monitors
  • MISP events where the license and sharing group permit reuse
  • abuse.ch datasets where permitted
  • public IOCs and defensive detection content
  • public incident writeups
  • internally written reports approved for training
  • synthetic examples written by analysts

Restricted or excluded source categories:

  • raw stolen data
  • raw credentials
  • private victim communications
  • criminal-forum content obtained without authorization
  • confidential CTI provider content without training rights
  • TLP:RED material
  • material with unknown or incompatible license
  • content that teaches exploitation, persistence, credential abuse, ransomware operation, or evasion
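
The two lists above could be encoded as a default-deny admission check. The sketch below is illustrative only: the category identifiers, the `SourceCandidate` type, and the `admit_source` function are hypothetical names, not part of the existing psyc code.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the allowed source categories listed above.
ALLOWED_CATEGORIES = {
    "national_cert_advisory",
    "public_advisory",
    "vulnerability_catalog",
    "vendor_threat_report",
    "malware_analysis_report",
    "ransomware_trend_report",
    "misp_event",
    "abuse_ch_dataset",
    "public_ioc",
    "incident_writeup",
    "internal_approved_report",
    "synthetic",
}


@dataclass(frozen=True)
class SourceCandidate:
    category: str
    license: str  # "approved", "restricted", "unknown", or "rejected"
    tlp: str      # "CLEAR", "GREEN", "AMBER", or "RED"


def admit_source(candidate: SourceCandidate) -> tuple:
    """Return (admitted, reason). Anything not explicitly allowed is blocked."""
    if candidate.category not in ALLOWED_CATEGORIES:
        return False, "category_not_allowed"
    if candidate.license != "approved":
        return False, "license_not_approved"
    if candidate.tlp == "RED":
        return False, "tlp_red"
    return True, "ok"
```

An allow-list is used here rather than a block-list so that a new, unclassified source type is excluded until someone approves it.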

3. IntelMiner Worker Chain

SourcePlanner
→ Collector
→ LicenseChecker
→ ContentParser
→ Chunker
→ Labeler
→ ExampleBuilder
→ QualityGate
→ ReviewerQueue
→ DatasetWriter
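
One way to picture the chain is as an ordered list of stages, each of which either passes an item on or drops it. The following is a minimal sketch under that assumption; `run_chain` and the two placeholder stages are hypothetical and stand in for the real workers.

```python
from typing import Callable, List, Optional, Tuple

# A stage returns the (possibly enriched) item, or None to drop it.
Stage = Callable[[dict], Optional[dict]]


def run_chain(item: dict, stages: List[Tuple[str, Stage]]):
    """Run stages in order and stop at the first stage that drops the item.

    Returns the final item (or None) and the names of the stages that ran.
    """
    trail = []
    current: Optional[dict] = item
    for name, stage in stages:
        current = stage(current)
        trail.append(name)
        if current is None:
            break
    return current, trail


def license_checker(item: dict) -> Optional[dict]:
    # Placeholder: only explicitly approved material continues.
    return item if item.get("license") == "approved" else None


def labeler(item: dict) -> Optional[dict]:
    # Placeholder: a real labeler would choose the task from the content.
    return {**item, "task": "ioc_extraction"}


STAGES = [("LicenseChecker", license_checker), ("Labeler", labeler)]
```

The returned trail records how far an item travelled, which is useful when auditing why a candidate never reached the ReviewerQueue.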

4. Worker Responsibilities

| Worker | Responsibility |
| --- | --- |
| SourcePlanner | Defines approved sources, update schedules, license expectations, and collection priority. |
| Collector | Pulls data from APIs, RSS, advisories, STIX/TAXII, MISP, GitHub, PDFs, and public reports. |
| LicenseChecker | Determines whether the material may be used for training. Blocks unknown or restricted content. |
| ContentParser | Extracts text, IOCs, dates, actors, CVEs, TTPs, victim sectors, and source metadata. |
| Chunker | Splits long content into training-sized units while preserving context. |
| Labeler | Assigns task labels such as IOC extraction, routing, classification, report writing, and evidence handling. |
| ExampleBuilder | Converts chunks into instruction/input/output training examples. |
| QualityGate | Removes unsafe, duplicated, mislabeled, low-confidence, or license-problematic examples. |
| ReviewerQueue | Sends candidates to human reviewers. Nothing enters the final dataset without approval. |
| DatasetWriter | Exports approved examples as versioned JSONL datasets. |

5. Training Tasks

The LoRA adapters should learn defensive operations only.

| Task | Purpose |
| --- | --- |
| ioc_extraction | Extract domains, IPs, URLs, hashes, emails, wallets, CVEs, and file names. |
| ttp_mapping | Map report language to MITRE ATT&CK-style techniques. |
| severity_classification | Classify weak signal, credible threat, confirmed exposure, campaign intelligence, or imminent harm. |
| routing_decision | Decide which reporting destinations are appropriate and in what order. |
| evidence_handling | Decide whether evidence must be sealed, minimized, excluded, or internally retained. |
| actor_normalization | Normalize actor names, aliases, ransomware brands, and campaigns. |
| source_reliability | Estimate source reliability and information credibility. |
| report_drafting | Draft structured victim, CERT, provider, MISP, or public reports. |
| public_publishing | Produce sanitized public intelligence after mitigation. |

Do not build training examples for:

  • exploitation steps
  • credential abuse
  • phishing construction
  • malware deployment
  • ransomware operations
  • evasion
  • stealth
  • persistence
  • unauthorized forum access
  • instructions for obtaining stolen data

6. Adapter Strategy

Do not start by training one large mixed LoRA. Start with small task-specific adapters.

Recommended adapter order:

| Priority | Adapter | Reason |
| --- | --- | --- |
| 1 | lora-router | Central to the project and easier to evaluate objectively. |
| 2 | lora-ioc-extractor | High utility, clear labels, measurable precision and recall. |
| 3 | lora-evidence-handler | Helps enforce safe handling decisions. |
| 4 | lora-report-writer | Drafts structured notifications after reviewed facts exist. |
| 5 | lora-actor-normalizer | Improves actor and campaign mapping. |
| 6 | lora-public-publisher | Produces public-safe summaries after mitigation. |

Training should begin only after enough reviewed examples exist:

  • 1,000+ reviewed examples for a single narrow task, or
  • 3,000 to 10,000 mixed examples across several tasks.

Until then, use rules, retrieval, embeddings, and human-reviewed prompts.
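
The thresholds above can be checked mechanically before any training run is scheduled. This sketch assumes the mixed-task figure is read as a lower bound of 3,000 reviewed examples spanning more than one task; `ready_to_train` and the constant names are hypothetical.

```python
from typing import Optional

NARROW_TASK_MINIMUM = 1_000  # reviewed examples for a single narrow task
MIXED_MINIMUM = 3_000        # assumed lower bound for a mixed-task dataset


def ready_to_train(examples: list, task: Optional[str] = None) -> bool:
    """Check the reviewed-example thresholds for one task or for a mixed set."""
    # Only examples explicitly marked as reviewed count toward a threshold.
    reviewed = [e for e in examples if e.get("metadata", {}).get("reviewed") is True]
    if task is not None:
        matching = sum(1 for e in reviewed if e.get("task") == task)
        return matching >= NARROW_TASK_MINIMUM
    distinct_tasks = {e.get("task") for e in reviewed}
    return len(distinct_tasks) > 1 and len(reviewed) >= MIXED_MINIMUM
```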


7. JSONL Training Format

Each JSONL line should contain one training example.

Standard structure:

{
  "task": "routing_decision",
  "instruction": "Given a defensive cyber-intelligence signal, choose the correct reporting destinations and order.",
  "input": {},
  "output": {},
  "metadata": {
    "source_type": "public_advisory | vendor_report | synthetic | internal_approved",
    "tlp": "CLEAR | GREEN | AMBER",
    "license": "approved",
    "reviewed": true,
    "policy_version": "v1",
    "dataset_version": "dataset-router-v0.1"
  }
}
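
A minimal sketch of writing and reading this structure as JSONL with the Python standard library follows. The helper names are hypothetical, and the required-key check covers only the five top-level keys shown above.

```python
import json

REQUIRED_KEYS = ("task", "instruction", "input", "output", "metadata")


def to_jsonl_line(example: dict) -> str:
    """Serialize one example as a single-line JSON object."""
    missing = [key for key in REQUIRED_KEYS if key not in example]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    # Without indent, json.dumps emits no raw newlines; newlines inside
    # strings are escaped, so each example stays on one line.
    return json.dumps(example, ensure_ascii=False, sort_keys=True)


def write_jsonl(examples, stream) -> int:
    """Write examples to a text stream, one per line, and return the count."""
    count = 0
    for example in examples:
        stream.write(to_jsonl_line(example) + "\n")
        count += 1
    return count


def read_jsonl(stream) -> list:
    """Read examples back, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]
```

Any text stream works as the target, for example a file opened with `open(path, "w", encoding="utf-8")`.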

8. Example: IOC Extraction

{
  "task": "ioc_extraction",
  "instruction": "Extract defensive indicators from the cyber threat report. Return JSON only.",
  "input": "A phishing campaign used login-example[.]com and delivered payload hash 44d88612fea8a8f36de82e1278abb02f. The actor referenced CVE-2024-12345.",
  "output": {
    "domains": ["login-example.com"],
    "hashes": ["44d88612fea8a8f36de82e1278abb02f"],
    "cves": ["CVE-2024-12345"],
    "ips": [],
    "urls": []
  },
  "metadata": {
    "source_type": "synthetic_or_public_report",
    "tlp": "CLEAR",
    "license": "approved",
    "reviewed": true
  }
}
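
Until an adapter exists, expected outputs like the one above can be pre-filled by a rule-based extractor and then corrected by a reviewer. The sketch below is a deliberately simple assumption of how that might look: the regular expressions are illustrative, cover only MD5-length hashes, and do not validate IP octet ranges or top-level domains.

```python
import re

CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
MD5_RE = re.compile(r"\b[a-fA-F0-9]{32}\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
URL_RE = re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE)
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def refang(text: str) -> str:
    """Undo two common defanging conventions: '[.]' and 'hxxp'."""
    return text.replace("[.]", ".").replace("hxxp", "http")


def unique(values) -> list:
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(values))


def extract_iocs(report: str) -> dict:
    """Return indicators found in the report, keyed like the example output."""
    text = refang(report)
    return {
        "domains": unique(m.lower() for m in DOMAIN_RE.findall(text)),
        "hashes": unique(m.lower() for m in MD5_RE.findall(text)),
        "cves": unique(m.upper() for m in CVE_RE.findall(text)),
        "ips": unique(IPV4_RE.findall(text)),
        "urls": unique(URL_RE.findall(text)),
    }
```

Note that this sketch also reports the host part of any URL under `domains`; whether that is desirable is a labeling decision for the reviewers.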

9. Example: Routing Decision

{
  "task": "routing_decision",
  "instruction": "Given a defensive cyber-intelligence signal, choose the correct reporting destinations and order.",
  "input": {
    "incident_type": "access_sale",
    "victim_country": "DE",
    "sector": "energy",
    "critical_infrastructure": true,
    "confidence": "high",
    "tlp": "AMBER"
  },
  "output": {
    "severity": "critical",
    "routes": [
      "CERT-Bund",
      "victim_security_team",
      "sector_isac",
      "law_enforcement_cyber_unit",
      "misp_trusted_community"
    ],
    "evidence_handling": "authority_sealed_package"
  },
  "metadata": {
    "reviewed": true,
    "policy_version": "v1"
  }
}
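
Before a router adapter is trained, a decision like the one above would come from explicit rules. The following sketch reproduces the example output from its input under assumed rules; the rule thresholds, the `needs_analyst_triage` fallback, and the `route` function are hypothetical and do not represent an agreed routing policy.

```python
# Only the mapping that appears in the example; other countries fall back
# to a generic placeholder instead of a guessed CERT name.
NATIONAL_CERTS = {"DE": "CERT-Bund"}
SHAREABLE_TLP = {"CLEAR", "GREEN", "AMBER"}


def route(signal: dict) -> dict:
    """Choose reporting destinations, in order, for one signal."""
    critical = (
        signal.get("critical_infrastructure") is True
        and signal.get("confidence") == "high"
    )
    routes = [
        NATIONAL_CERTS.get(signal.get("victim_country"), "national_cert"),
        "victim_security_team",
    ]
    if critical:
        routes += ["sector_isac", "law_enforcement_cyber_unit"]
    if signal.get("tlp") in SHAREABLE_TLP:
        routes.append("misp_trusted_community")
    return {
        # Anything the rules do not cover is left for a human analyst.
        "severity": "critical" if critical else "needs_analyst_triage",
        "routes": routes,
        "evidence_handling": (
            "authority_sealed_package" if critical else "needs_analyst_triage"
        ),
    }
```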

10. Example: Evidence Handling

{
  "task": "evidence_handling",
  "instruction": "Decide how evidence should be handled before external submission.",
  "input": {
    "evidence_type": "stolen_credentials",
    "destination": "public_abuse_api",
    "contains_pii": true,
    "tlp": "RED"
  },
  "output": {
    "submit_raw": false,
    "handling": "do_not_send_raw_to_public_api",
    "allowed_payload": "metadata_only",
    "sealed_package_required": true,
    "authorized_recipients": ["victim_security_team", "national_cert"]
  },
  "metadata": {
    "reviewed": true
  }
}
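
The same decision can be expressed as a conservative rule. This is a sketch under assumptions: the sets of sensitive evidence types and public destinations contain only the values from the example plus one illustrative addition, and every case the rule does not cover is deferred to an analyst.

```python
# "stolen_data" is an illustrative addition; the example only shows
# "stolen_credentials" and "public_abuse_api".
RAW_SENSITIVE_TYPES = {"stolen_credentials", "stolen_data"}
PUBLIC_DESTINATIONS = {"public_abuse_api"}


def handle_evidence(item: dict) -> dict:
    """Decide how one evidence item may leave the platform."""
    sensitive = (
        item.get("evidence_type") in RAW_SENSITIVE_TYPES
        or item.get("contains_pii") is True
        or item.get("tlp") == "RED"
    )
    if sensitive and item.get("destination") in PUBLIC_DESTINATIONS:
        return {
            "submit_raw": False,
            "handling": "do_not_send_raw_to_public_api",
            "allowed_payload": "metadata_only",
            "sealed_package_required": True,
            "authorized_recipients": ["victim_security_team", "national_cert"],
        }
    # Never default to raw submission; uncovered cases go to a human.
    return {"submit_raw": False, "handling": "needs_analyst_review"}
```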

11. Dataset Metadata

Every example should include metadata.

| Field | Purpose |
| --- | --- |
| task | Training task category. |
| source_type | Origin category of the example. |
| source_id | Internal reference to source document. |
| license | Approved, restricted, unknown, or rejected. |
| tlp | CLEAR, GREEN, AMBER, or RED. |
| reviewed | Human approval status. |
| reviewer_id | Internal reviewer identity or role ID. |
| policy_version | Version of handling policy used. |
| dataset_version | Versioned dataset name. |
| safety_flags | Unsafe content or sensitive material flags. |
| dedupe_hash | Used to prevent duplicate examples. |
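
The document does not say how `dedupe_hash` is computed. One plausible approach, sketched here as an assumption, is a SHA-256 digest over a canonical JSON form of the training-relevant fields, so that metadata changes and key order do not affect the hash.

```python
import hashlib
import json
import unicodedata

# Metadata is excluded so that a re-review or a policy bump does not
# make an otherwise identical example look new.
HASHED_FIELDS = ("task", "instruction", "input", "output")


def dedupe_hash(example: dict) -> str:
    """Return a stable hex digest for the content of one example."""
    payload = {field: example.get(field) for field in HASHED_FIELDS}
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    canonical = unicodedata.normalize("NFC", canonical)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

This only catches exact duplicates after canonicalization; near-duplicates such as reworded reports would need a separate similarity check.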

12. QualityGate Rules

QualityGate must reject examples that contain:

  • raw credentials
  • raw stolen data
  • private victim information
  • live access details
  • exploit chains
  • malware deployment steps
  • phishing instructions
  • evasion or persistence guidance
  • incompatible license
  • unknown provenance
  • duplicated content
  • unreviewed TLP:RED or confidential content

QualityGate should flag for human review when:

  • source license is ambiguous
  • actor attribution is uncertain
  • victim identity is named
  • sample contains personal data
  • output teaches operationally sensitive details
  • example conflicts with policy
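
The reject and flag rules can be combined into a single three-way decision. The sketch below assumes that upstream detectors have already written findings into `safety_flags`; the flag names and the `quality_gate` function are hypothetical. A result of `pass` still means the example goes to the ReviewerQueue, not directly into a dataset.

```python
REJECT_FLAGS = {
    "raw_credentials", "raw_stolen_data", "private_victim_information",
    "live_access_details", "exploit_chain", "malware_deployment_steps",
    "phishing_instructions", "evasion_or_persistence_guidance",
}
REVIEW_FLAGS = {
    "license_ambiguous", "attribution_uncertain", "victim_named",
    "personal_data", "operationally_sensitive", "policy_conflict",
}


def quality_gate(example: dict, seen_hashes: set) -> str:
    """Return 'reject', 'review', or 'pass' for one candidate example."""
    meta = example.get("metadata", {})
    flags = set(meta.get("safety_flags", []))
    if flags & REJECT_FLAGS:
        return "reject"
    # A non-approved license is rejected unless it was explicitly marked
    # ambiguous, in which case a human decides.
    if meta.get("license") != "approved" and "license_ambiguous" not in flags:
        return "reject"
    if not meta.get("source_id"):
        return "reject"  # unknown provenance
    if meta.get("dedupe_hash") in seen_hashes:
        return "reject"  # duplicated content
    if meta.get("tlp") == "RED" and not meta.get("reviewed"):
        return "reject"
    if flags & REVIEW_FLAGS:
        return "review"
    return "pass"
```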

13. Dataset Builder UI Requirements

IntelMiner should be visible in the Blue48 Operations Cockpit.

Screens:

| Screen | Purpose |
| --- | --- |
| Dataset Sources | Manage approved sources, license status, and collection schedules. |
| Training Candidate Queue | Review generated examples before approval. |
| Example Review | Edit, approve, reject, or mark examples unsafe. |
| Dataset Builder | Export versioned JSONL datasets with train/validation split. |
| Dataset Audit | Track source, reviewer, license, and policy version. |

Candidate fields:

| Field | Meaning |
| --- | --- |
| Task | IOC extraction, routing, classification, etc. |
| Source | advisory, blog, report, synthetic, internal. |
| License | approved, restricted, unknown, rejected. |
| Quality score | Estimated usefulness. |
| Safety flag | safe, needs review, reject. |
| Reviewer status | pending, approved, rejected. |

14. Dataset Versioning

Datasets should be versioned clearly:

dataset-router-v0.1
dataset-ioc-extractor-v0.3
dataset-evidence-handler-v0.2
dataset-report-writer-v0.2

Each export should include:

  • dataset name
  • version
  • date
  • number of examples
  • task distribution
  • source distribution
  • license distribution
  • reviewer count
  • rejected example count
  • train/validation split
  • policy version
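
The export fields above could be gathered into a manifest written next to the JSONL file. The sketch below is an assumption about shape and naming: `build_manifest` and `split_for` are hypothetical, every example is assumed to carry `source_type`, `license`, `reviewer_id`, and `dedupe_hash` in its metadata, and the split is derived from the hash so that an example keeps the same split across exports.

```python
from collections import Counter


def split_for(dedupe_hash: str, validation_percent: int = 10) -> str:
    """Assign a split deterministically from the first 8 hex digits."""
    bucket = int(dedupe_hash[:8], 16) % 100
    return "validation" if bucket < validation_percent else "train"


def build_manifest(name, version, export_date, examples,
                   rejected_count, policy_version) -> dict:
    """Summarize one export using the fields listed above."""
    metas = [example["metadata"] for example in examples]
    splits = Counter(split_for(meta["dedupe_hash"]) for meta in metas)
    return {
        "dataset_name": name,
        "version": version,
        "date": export_date,
        "example_count": len(examples),
        "task_distribution": dict(Counter(e["task"] for e in examples)),
        "source_distribution": dict(Counter(m["source_type"] for m in metas)),
        "license_distribution": dict(Counter(m["license"] for m in metas)),
        "reviewer_count": len({m["reviewer_id"] for m in metas}),
        "rejected_example_count": rejected_count,
        "split": {"train": splits["train"], "validation": splits["validation"]},
        "policy_version": policy_version,
    }
```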

15. Human Review Requirements

Human approval is required before examples become training data.

Reviewers should check:

  • factual correctness
  • source license
  • safety boundaries
  • absence of raw sensitive data
  • correct label
  • useful expected output
  • no attacker-enabling content

Two-person review is recommended for:

  • internal case-derived examples
  • sensitive incident examples
  • actor attribution examples
  • routing examples involving law enforcement or critical infrastructure
  • examples derived from TLP:AMBER material

TLP:RED material should not be used for LoRA training unless an explicit legal, operational, and governance policy exists.
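
The review rules above can be turned into a check that tells the ReviewerQueue how many approvals a candidate needs. This is a sketch under assumptions: the field names follow the JSONL examples in this document, the `sensitive_incident` flag and the mapping of "actor attribution examples" to the `actor_normalization` task are guesses, and `required_reviewers` is a hypothetical name.

```python
def required_reviewers(example: dict) -> int:
    """Return 1 or 2 approvals; refuse TLP:RED material outright."""
    meta = example.get("metadata", {})
    if meta.get("tlp") == "RED":
        raise ValueError("TLP:RED material needs an explicit governance policy")

    flags = set(meta.get("safety_flags", []))
    # "input" and "output" may be plain strings for some tasks.
    signal = example.get("input") if isinstance(example.get("input"), dict) else {}
    decision = example.get("output") if isinstance(example.get("output"), dict) else {}

    sensitive_routing = example.get("task") == "routing_decision" and (
        "law_enforcement_cyber_unit" in decision.get("routes", [])
        or signal.get("critical_infrastructure") is True
    )
    two_person = (
        meta.get("source_type") == "internal_approved"
        or "sensitive_incident" in flags
        or example.get("task") == "actor_normalization"
        or sensitive_routing
        or meta.get("tlp") == "AMBER"
    )
    return 2 if two_person else 1
```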


16. Summary

IntelMiner is the bridge between Blue48 operations and future specialized defensive models.

It should collect only lawful and approved data, check license and safety constraints, build structured examples, require human review, and export versioned JSONL datasets. The first LoRA should likely be lora-router, followed by lora-ioc-extractor and lora-evidence-handler.