m17hr1l e04c6c96d8 init: scaffold psyc — defensive CTI routing & evidence-sealing platform

Stage-1 vertical slice: Pydantic Case model, SQLAlchemy Core persistence,
URLhaus Scoutline fetcher, FastAPI/Jinja cockpit (cases list + detail),
flat Typer CLI, Result[T, E] type module, structlog config.
Architecture in docs/dossier.md; 12-fold style guide in docs/style.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-14 12:43:47 +02:00

Blue48 IntelMiner and LoRA Training Data Pipeline

Document type: Project record / technical concept
Scope: Lawful intelligence collection, training-data preparation, LoRA dataset format, quality gates, safety boundaries
Status: Draft v1

1. Purpose

IntelMiner is the Blue48 worker responsible for collecting lawful defensive cyber-intelligence and converting it into reviewed, license-safe, LoRA-ready training examples.

IntelMiner does not train models to hack. It prepares training data for defensive tasks such as indicator extraction, routing, severity classification, evidence handling, and safe report writing.

Core mission:

IntelMiner collects lawful defensive cyber-intelligence from approved online sources and transforms it into reviewed, license-safe, LoRA-ready JSONL examples for specialized defensive models.

2. What IntelMiner Should Learn From

Allowed source categories:

national CERT advisories
CISA, ENISA, NCSC, CERT-EU, BSI, ANSSI, and similar public advisories
CVE, NVD, and exploited-vulnerability catalogs
public vendor threat reports
public malware-analysis reports
public ransomware trend reports from lawful monitors
MISP events where the license and sharing group permit reuse
abuse.ch datasets where permitted
public IOCs and defensive detection content
public incident writeups
internally written reports approved for training
synthetic examples written by analysts

Restricted or excluded source categories:

raw stolen data
raw credentials
private victim communications
criminal-forum content obtained without authorization
confidential CTI provider content without training rights
TLP:RED material
material with unknown or incompatible license
content that teaches exploitation, persistence, credential abuse, ransomware operation, or evasion

3. IntelMiner Worker Chain

SourcePlanner
→ Collector
→ LicenseChecker
→ ContentParser
→ Chunker
→ Labeler
→ ExampleBuilder
→ QualityGate
→ ReviewerQueue
→ DatasetWriter
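
The chain above can be read as an ordered list of stages, where any stage may drop a candidate. The following is a minimal sketch under that assumption, not the project's implementation: `Candidate`, its fields, and the two placeholder stage bodies are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Candidate:
    """Hypothetical unit of work passed from worker to worker."""
    text: str
    license: str = "unknown"
    trail: list[str] = field(default_factory=list)  # names of stages passed


# A stage returns the (possibly modified) candidate, or None to drop it.
Stage = Callable[[Candidate], Optional[Candidate]]


def license_checker(c: Candidate) -> Optional[Candidate]:
    # Placeholder rule: only explicitly approved material continues.
    return c if c.license == "approved" else None


def content_parser(c: Candidate) -> Optional[Candidate]:
    # Placeholder: collapse whitespace; a real parser would extract IOCs etc.
    c.text = " ".join(c.text.split())
    return c


def run_chain(c: Candidate, stages: list[Stage]) -> Optional[Candidate]:
    """Run stages in order and stop at the first one that drops the candidate."""
    for stage in stages:
        result = stage(c)
        if result is None:
            return None
        c = result
        c.trail.append(stage.__name__)
    return c
```

Keeping every worker behind the same signature makes it easy to record which stages a candidate passed, which the audit requirements later in this document depend on.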

4. Worker Responsibilities

Worker	Responsibility
SourcePlanner	Defines approved sources, update schedules, license expectations, and collection priority.
Collector	Pulls data from APIs, RSS, advisories, STIX/TAXII, MISP, GitHub, PDFs, and public reports.
LicenseChecker	Determines whether the material may be used for training. Blocks unknown or restricted content.
ContentParser	Extracts text, IOCs, dates, actors, CVEs, TTPs, victim sectors, and source metadata.
Chunker	Splits long content into training-sized units while preserving context.
Labeler	Assigns task labels such as IOC extraction, routing, classification, report writing, and evidence handling.
ExampleBuilder	Converts chunks into instruction/input/output training examples.
QualityGate	Removes unsafe, duplicated, mislabeled, low-confidence, or license-problematic examples.
ReviewerQueue	Sends candidates to human reviewers. Nothing enters the final dataset without approval.
DatasetWriter	Exports approved examples as versioned JSONL datasets.
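
As one illustration of a worker from the table, the Chunker could split on paragraph boundaries and repeat the last paragraph of each chunk at the start of the next to preserve context. This is a sketch only: the function name, the character budget, and the one-paragraph overlap are assumptions, not values from this document.

```python
def chunk_paragraphs(text: str, max_chars: int = 1200, overlap: int = 1) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    The last `overlap` paragraphs of a chunk are repeated at the start of the
    next one. A single paragraph longer than max_chars is kept whole.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for paragraph in paragraphs:
        # Flush the running chunk if adding this paragraph would exceed the budget.
        if current and len("\n\n".join(current + [paragraph])) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap else []
        current.append(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```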

5. Training Tasks

The LoRA adapters should learn defensive operations only.

Task	Purpose
ioc_extraction	Extract domains, IPs, URLs, hashes, emails, wallets, CVEs, and file names.
ttp_mapping	Map report language to MITRE ATT&CK-style techniques.
severity_classification	Classify a signal as weak signal, credible threat, confirmed exposure, campaign intelligence, or imminent harm.
routing_decision	Decide which reporting destinations are appropriate and in what order.
evidence_handling	Decide whether evidence must be sealed, minimized, excluded, or internally retained.
actor_normalization	Normalize actor names, aliases, ransomware brands, and campaigns.
source_reliability	Estimate source reliability and information credibility.
report_drafting	Draft structured victim, CERT, provider, MISP, or public reports.
public_publishing	Produce sanitized public intelligence after mitigation.

Do not train examples for:

exploitation steps
credential abuse
phishing construction
malware deployment
ransomware operations
evasion
stealth
persistence
unauthorized forum access
instructions for obtaining stolen data
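
One way to enforce the task boundary in code is a closed allow-list, so that any label outside the nine defensive tasks fails loudly. A minimal sketch; the `Task` enum and `parse_task` helper are hypothetical names, while the values are the task identifiers from the table above.

```python
from enum import Enum


class Task(str, Enum):
    """Allow-list of defensive training tasks; anything else is rejected."""
    IOC_EXTRACTION = "ioc_extraction"
    TTP_MAPPING = "ttp_mapping"
    SEVERITY_CLASSIFICATION = "severity_classification"
    ROUTING_DECISION = "routing_decision"
    EVIDENCE_HANDLING = "evidence_handling"
    ACTOR_NORMALIZATION = "actor_normalization"
    SOURCE_RELIABILITY = "source_reliability"
    REPORT_DRAFTING = "report_drafting"
    PUBLIC_PUBLISHING = "public_publishing"


def parse_task(label: str) -> Task:
    """Return the Task for a label, or raise ValueError for unapproved labels."""
    try:
        return Task(label)
    except ValueError:
        raise ValueError(f"task {label!r} is not an approved defensive task") from None
```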

6. Recommended LoRA Strategy

Do not start by training one large mixed LoRA. Start with small task-specific adapters.

Recommended adapter order:

Priority	Adapter	Reason
1	lora-router	Central to the project and easier to evaluate objectively.
2	lora-ioc-extractor	High utility, clear labels, measurable precision and recall.
3	lora-evidence-handler	Helps enforce safe handling decisions.
4	lora-report-writer	Drafts structured notifications after reviewed facts exist.
5	lora-actor-normalizer	Improves actor and campaign mapping.
6	lora-public-publisher	Produces public-safe summaries after mitigation.

Training should begin only after enough reviewed examples exist:

1,000+ reviewed examples for a single narrow task, or
3,000–10,000 mixed examples across several tasks.

Until then, use rules, retrieval, embeddings, and human-reviewed prompts.
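
The thresholds above can be checked mechanically before any training run is scheduled. A small sketch, assuming examples are dicts in the JSONL structure of section 7; the function and constant names are hypothetical, and the mixed threshold uses the lower bound of the 3,000 to 10,000 range.

```python
from collections import Counter

NARROW_TASK_MIN = 1_000  # reviewed examples needed for one narrow task
MIXED_MIN = 3_000        # lower bound for a mixed multi-task dataset


def training_readiness(examples: list[dict]) -> dict:
    """Summarise whether enough reviewed examples exist to start training."""
    reviewed = [e for e in examples if e.get("metadata", {}).get("reviewed") is True]
    per_task = Counter(e.get("task") for e in reviewed)
    return {
        "reviewed_total": len(reviewed),
        "ready_narrow_tasks": sorted(t for t, n in per_task.items() if n >= NARROW_TASK_MIN),
        # "Mixed" only makes sense when more than one task is represented.
        "ready_mixed": len(per_task) > 1 and len(reviewed) >= MIXED_MIN,
    }
```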

7. JSONL Training Format

Each JSONL line should contain one training example.

Standard structure (a pipe-separated string lists the allowed values; each real record carries exactly one of them):

{
  "task": "routing_decision",
  "instruction": "Given a defensive cyber-intelligence signal, choose the correct reporting destinations and order.",
  "input": {},
  "output": {},
  "metadata": {
    "source_type": "public_advisory | vendor_report | synthetic | internal_approved",
    "tlp": "CLEAR | GREEN | AMBER",
    "license": "approved",
    "reviewed": true,
    "policy_version": "v1",
    "dataset_version": "dataset-router-v0.1"
  }
}
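
Writing and reading this format needs only the standard library. The sketch below assumes the five top-level keys shown above are mandatory; the helper names `write_jsonl` and `read_jsonl` are hypothetical.

```python
import io
import json

REQUIRED_KEYS = ("task", "instruction", "input", "output", "metadata")


def write_jsonl(examples, stream) -> int:
    """Write one JSON object per line; return the number of lines written."""
    count = 0
    for example in examples:
        missing = [k for k in REQUIRED_KEYS if k not in example]
        if missing:
            raise ValueError(f"example {count} is missing keys: {missing}")
        # json.dumps emits no raw newlines, so each record stays on one line.
        stream.write(json.dumps(example, ensure_ascii=False, sort_keys=True) + "\n")
        count += 1
    return count


def read_jsonl(stream) -> list:
    """Parse every non-empty line of the stream as one JSON object."""
    return [json.loads(line) for line in stream if line.strip()]


# Usage: round-trip a single record through an in-memory buffer.
record = {
    "task": "routing_decision",
    "instruction": "Choose the correct reporting destinations and order.",
    "input": {},
    "output": {},
    "metadata": {"reviewed": True, "policy_version": "v1"},
}
buffer = io.StringIO()
write_jsonl([record], buffer)
```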

8. Example: IOC Extraction

{
  "task": "ioc_extraction",
  "instruction": "Extract defensive indicators from the cyber threat report. Return JSON only.",
  "input": "A phishing campaign used login-example[.]com and delivered payload hash 44d88612fea8a8f36de82e1278abb02f. The actor referenced CVE-2024-12345.",
  "output": {
    "domains": ["login-example.com"],
    "hashes": ["44d88612fea8a8f36de82e1278abb02f"],
    "cves": ["CVE-2024-12345"],
    "ips": [],
    "urls": []
  },
  "metadata": {
    "source_type": "synthetic",
    "tlp": "CLEAR",
    "license": "approved",
    "reviewed": true
  }
}
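
Until an adapter exists, a rule-based extractor can produce candidate outputs for reviewers or cross-check labelled examples like the one above. The following is a rough baseline, not the project's extractor: the patterns are deliberately simple assumptions (IP octets are not range-checked, file names such as invoice.exe would be reported as domains, and trailing punctuation may stick to URLs).

```python
import re

CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
# MD5 (32), SHA-1 (40) or SHA-256 (64) hex digests.
HASH_RE = re.compile(r"\b(?:[a-f0-9]{64}|[a-f0-9]{40}|[a-f0-9]{32})\b", re.IGNORECASE)
URL_RE = re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE)
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b", re.IGNORECASE
)


def refang(text: str) -> str:
    """Undo two common defanging conventions: '[.]' and 'hxxp'."""
    return text.replace("[.]", ".").replace("hxxp", "http")


def _unique(items) -> list:
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def extract_iocs(report: str) -> dict:
    text = refang(report)
    urls = _unique(URL_RE.findall(text))
    # Blank out URLs before the domain pass so URL hosts are not listed twice.
    without_urls = URL_RE.sub(" ", text)
    return {
        "domains": _unique(d.lower() for d in DOMAIN_RE.findall(without_urls)),
        "hashes": _unique(h.lower() for h in HASH_RE.findall(text)),
        "cves": _unique(c.upper() for c in CVE_RE.findall(text)),
        "ips": _unique(IP_RE.findall(text)),
        "urls": urls,
    }
```

Run on the input string of the example above, this sketch returns the same five lists as the example's `output` object.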

9. Example: Routing Decision

{
  "task": "routing_decision",
  "instruction": "Given a defensive cyber-intelligence signal, choose the correct reporting destinations and order.",
  "input": {
    "incident_type": "access_sale",
    "victim_country": "DE",
    "sector": "energy",
    "critical_infrastructure": true,
    "confidence": "high",
    "tlp": "AMBER"
  },
  "output": {
    "severity": "critical",
    "routes": [
      "CERT-Bund",
      "victim_security_team",
      "sector_isac",
      "law_enforcement_cyber_unit",
      "misp_trusted_community"
    ],
    "evidence_handling": "authority_sealed_package"
  },
  "metadata": {
    "reviewed": true,
    "policy_version": "v1"
  }
}
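
Before a router adapter is trained, routing can be approximated with explicit rules. The sketch below is illustrative only: the rule set is invented to reproduce the single example above and is not the project's routing policy, and `NATIONAL_CERTS` and `plan_routes` are hypothetical names.

```python
# Hypothetical country-to-CERT lookup; only the entry from the example is filled in.
NATIONAL_CERTS = {"DE": "CERT-Bund"}


def plan_routes(signal: dict) -> list[str]:
    """Return reporting destinations in order, using illustrative rules."""
    cert = NATIONAL_CERTS.get(signal.get("victim_country", ""), "national_cert")
    if signal.get("critical_infrastructure"):
        # Assumed rule: critical infrastructure puts the national CERT first.
        routes = [cert, "victim_security_team", "sector_isac"]
    else:
        routes = ["victim_security_team", cert]
    if signal.get("incident_type") == "access_sale" and signal.get("confidence") == "high":
        routes.append("law_enforcement_cyber_unit")
    if signal.get("tlp") != "RED":
        # Assumed rule: community sharing is skipped for TLP:RED signals.
        routes.append("misp_trusted_community")
    return routes
```

Explicit rules like these also give a reference point against which a later lora-router can be evaluated.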

10. Example: Evidence Handling

{
  "task": "evidence_handling",
  "instruction": "Decide how evidence should be handled before external submission.",
  "input": {
    "evidence_type": "stolen_credentials",
    "destination": "public_abuse_api",
    "contains_pii": true,
    "tlp": "RED"
  },
  "output": {
    "submit_raw": false,
    "handling": "do_not_send_raw_to_public_api",
    "allowed_payload": "metadata_only",
    "sealed_package_required": true,
    "authorized_recipients": ["victim_security_team", "national_cert"]
  },
  "metadata": {
    "reviewed": true
  }
}
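
The decision in this example can be expressed as a conservative rule: sensitive evidence bound for a public destination is never sent raw. A minimal sketch under that assumption; the two sets and the `analyst_decision_required` fallback are hypothetical, and every case the rule does not cover is deferred to a human.

```python
RAW_SENSITIVE_TYPES = {"stolen_credentials", "raw_stolen_data"}  # hypothetical labels
PUBLIC_DESTINATIONS = {"public_abuse_api"}  # hypothetical; taken from the example


def handling_decision(item: dict) -> dict:
    """Return a handling decision for evidence before external submission."""
    sensitive = (
        item.get("evidence_type") in RAW_SENSITIVE_TYPES
        or item.get("contains_pii", False)
        or item.get("tlp") == "RED"
    )
    if sensitive and item.get("destination") in PUBLIC_DESTINATIONS:
        return {
            "submit_raw": False,
            "handling": "do_not_send_raw_to_public_api",
            "allowed_payload": "metadata_only",
            "sealed_package_required": True,
            "authorized_recipients": ["victim_security_team", "national_cert"],
        }
    # Anything outside the one encoded rule goes to an analyst.
    return {"handling": "analyst_decision_required"}
```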

11. Dataset Metadata

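Every example should include the following fields. As in the standard structure of section 7, `task` sits at the top level of the record and the remaining fields belong under `metadata`.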

Field	Purpose
`task`	Training task category.
`source_type`	Origin category of the example.
`source_id`	Internal reference to source document.
`license`	Approved, restricted, unknown, or rejected.
`tlp`	CLEAR, GREEN, AMBER, or RED.
`reviewed`	Human approval status.
`reviewer_id`	Internal reviewer identity or role ID.
`policy_version`	Version of handling policy used.
`dataset_version`	Versioned dataset name.
`safety_flags`	Unsafe content or sensitive material flags.
`dedupe_hash`	Used to prevent duplicate examples.
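
This document does not specify how `dedupe_hash` is computed. One plausible approach, shown here as an assumption, is a SHA-256 digest over a canonical JSON rendering of the fields that define the example, ignoring metadata so that re-reviewing an example does not change its hash.

```python
import hashlib
import json


def dedupe_hash(example: dict) -> str:
    """Hash the content-defining fields of an example, independent of key order."""
    core = {k: example.get(k) for k in ("task", "instruction", "input", "output")}
    # sort_keys and fixed separators make the serialisation canonical.
    canonical = json.dumps(core, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

This only catches exact duplicates; near-duplicates with reworded text would need a separate similarity check.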

12. QualityGate Rules

QualityGate must reject examples that contain:

raw credentials
raw stolen data
private victim information
live access details
exploit chains
malware deployment steps
phishing instructions
evasion or persistence guidance

It must also reject examples that have:

an incompatible license
unknown provenance
duplicated content
unreviewed TLP:RED or confidential source material

QualityGate should flag for human review when:

source license is ambiguous
actor attribution is uncertain
victim identity is named
sample contains personal data
output reveals operationally sensitive details
example conflicts with policy
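
The two rule sets can be combined into a single three-way decision. The sketch below assumes that upstream workers have already written detection results into `metadata.safety_flags`; the flag names and the function `quality_gate` are hypothetical. A result of "pass" means the candidate moves on to the ReviewerQueue, not into the dataset.

```python
REJECT_FLAGS = {
    "raw_credentials", "raw_stolen_data", "private_victim_information",
    "live_access_details", "exploit_chain", "malware_deployment",
    "phishing_instructions", "evasion_or_persistence",
}
REVIEW_FLAGS = {
    "license_ambiguous", "attribution_uncertain", "victim_named",
    "personal_data", "operationally_sensitive", "policy_conflict",
}


def quality_gate(example: dict, seen_hashes: set) -> tuple:
    """Return ("reject" | "review" | "pass", reasons) for one candidate."""
    meta = example.get("metadata", {})
    flags = set(meta.get("safety_flags", []))
    reasons = sorted(flags & REJECT_FLAGS)
    if meta.get("license") != "approved":
        reasons.append("license_not_approved")
    if not meta.get("source_id"):
        reasons.append("unknown_provenance")
    if meta.get("dedupe_hash") in seen_hashes:
        reasons.append("duplicate")
    if meta.get("tlp") == "RED" and not meta.get("reviewed"):
        reasons.append("unreviewed_tlp_red")
    if reasons:
        return "reject", reasons
    review = sorted(flags & REVIEW_FLAGS)
    if review:
        return "review", review
    return "pass", []
```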

13. Dataset Builder UI Requirements

IntelMiner should be visible in the Blue48 Operations Cockpit.

Screens:

Screen	Purpose
Dataset Sources	Manage approved sources, license status, and collection schedules.
Training Candidate Queue	Review generated examples before approval.
Example Review	Edit, approve, reject, or mark examples unsafe.
Dataset Builder	Export versioned JSONL datasets with train/validation split.
Dataset Audit	Track source, reviewer, license, and policy version.

Candidate fields:

Field	Meaning
Task	IOC extraction, routing, classification, etc.
Source	advisory, blog, report, synthetic, internal.
License	approved, restricted, unknown, rejected.
Quality score	Estimated usefulness.
Safety flag	safe, needs review, reject.
Reviewer status	pending, approved, rejected.

14. Dataset Versioning

Datasets should be versioned clearly:

dataset-router-v0.1
dataset-ioc-extractor-v0.3
dataset-evidence-handler-v0.2
dataset-report-writer-v0.2

Each export should include:

dataset name
version
date
number of examples
task distribution
source distribution
license distribution
reviewer count
rejected example count
train/validation split
policy version
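
An export manifest with these fields can be derived directly from the approved examples. A sketch under stated assumptions: the manifest key names are hypothetical, each example carries a `dedupe_hash` in its metadata, and the train/validation assignment is made deterministic by hashing that value rather than by random sampling.

```python
import hashlib
from collections import Counter


def split_of(example_key: str, validation_percent: int = 10) -> str:
    """Assign an example to a split deterministically from its key."""
    bucket = int(hashlib.sha256(example_key.encode("utf-8")).hexdigest(), 16) % 100
    return "validation" if bucket < validation_percent else "train"


def build_manifest(name, version, export_date, examples, rejected_count, policy_version):
    """Collect the per-export facts listed above into one dict."""
    metas = [e.get("metadata", {}) for e in examples]
    splits = Counter(split_of(m.get("dedupe_hash", "")) for m in metas)
    return {
        "dataset_name": name,
        "version": version,
        "date": export_date,
        "example_count": len(examples),
        "task_distribution": dict(Counter(e.get("task") for e in examples)),
        "source_distribution": dict(Counter(m.get("source_type") for m in metas)),
        "license_distribution": dict(Counter(m.get("license") for m in metas)),
        "reviewer_count": len({m["reviewer_id"] for m in metas if m.get("reviewer_id")}),
        "rejected_example_count": rejected_count,
        "split": {"train": splits["train"], "validation": splits["validation"]},
        "policy_version": policy_version,
    }
```

A hash-based split keeps an example in the same partition across dataset versions, so validation data does not leak into training when the dataset grows.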

15. Human Review Requirements

Human approval is required before examples become training data.

Reviewers should check:

factual correctness
source license
safety boundaries
absence of raw sensitive data
correct label
useful expected output
absence of attacker-enabling content

Two-person review is recommended for:

internal case-derived examples
sensitive incident examples
actor attribution examples
routing examples involving law enforcement or critical infrastructure
examples derived from TLP:AMBER material

TLP:RED material should not be used for LoRA training unless an explicit legal, operational, and governance policy exists.
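
These review rules can be encoded as a required-approval count per example. The sketch below makes several assumptions that are not defined in this document: `actor_normalization` stands in for "actor attribution examples", a hypothetical `sensitive_incident` safety flag marks sensitive incidents, and `red_policy_in_place` represents the explicit TLP:RED policy.

```python
def required_approvals(example: dict) -> int:
    """Return 2 when two-person review is recommended, otherwise 1."""
    meta = example.get("metadata", {})
    task = example.get("task")
    # "input" and "output" may be plain strings for some tasks; only dicts are inspected.
    inp = example.get("input") if isinstance(example.get("input"), dict) else {}
    out = example.get("output") if isinstance(example.get("output"), dict) else {}
    routing_sensitive = task == "routing_decision" and (
        inp.get("critical_infrastructure") is True
        or "law_enforcement_cyber_unit" in out.get("routes", [])
    )
    two_person = (
        meta.get("source_type") == "internal_approved"
        or meta.get("tlp") == "AMBER"
        or task == "actor_normalization"
        or "sensitive_incident" in meta.get("safety_flags", [])
        or routing_sensitive
    )
    return 2 if two_person else 1


def is_approved(example: dict, approver_ids: set, red_policy_in_place: bool = False) -> bool:
    """Check distinct approvers against the requirement; block TLP:RED by default."""
    if example.get("metadata", {}).get("tlp") == "RED" and not red_policy_in_place:
        return False
    return len(approver_ids) >= required_approvals(example)
```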

16. Summary

IntelMiner is the bridge between Blue48 operations and future specialized defensive models.

It should collect only lawful and approved data, check license and safety constraints, build structured examples, require human review, and export versioned JSONL datasets. The first LoRA should likely be lora-router, followed by lora-ioc-extractor and lora-evidence-handler.
