Commit Graph

5 Commits

Author SHA1 Message Date
m17hr1l
f6fa52839f stage-20: defanging pipeline for IOC-extraction augmentation
Real CTI prose defangs IOCs (1[.]2[.]3[.]4, hxxp://, evil[dot]com) so they
don't auto-link in email/chat. A model trained only on canonical inputs
will fail to extract them.

New lines/defang.py: defang_ip, defang_domain, defang_url, defang_text —
four dot-styles ([.], (.), [dot], {.}) plus protocol defanging
(http→hxxp, https→hxxps). Each occurrence picks its style independently
since real advisories don't keep one style across paragraphs.

train.BuildOptions adds defang_frac (default 0.0) and seed; build()
threads options + a seeded Random through the example builders so
the augmentation is reproducible. Only _ex_ioc_extraction reads it
today — output stays canonical so the model learns messy→canonical.

CLI: train-build and train-build-all gain --defang-frac and --seed.
8 new tests including a frac=1.0 / output-canonical integration check.
The pipeline runs but is dormant at defang_frac=0.0 — psyc-v5 dataset
build will set 0.5 once OTX cases land.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 22:33:52 +02:00
m17hr1l
85830be9fa stage-19-fix: ThreatFox + MalwareBazaar — real API shape
Live test against abuse.ch revealed two issues with the stage-19 wiring:

- ThreatFox returns `ioc` (not `ioc_value`) and `first_seen` (not
  `first_seen_utc`) — older field names from stale docs. Parser now reads
  the real names and falls back to the old aliases defensively. Also
  captures `malware_malpedia` (per-family writeup URL) and
  `threat_type_desc` for richer downstream prose.
- MalwareBazaar's API expects form-encoded bodies, unlike ThreatFox's
  JSON. Extended _http with form_body=; MB fetcher switched to it.

Verified live: 10 ThreatFox cases landed with mixed botnet/malware
classification (4/6 split from threat_type signal — first real
incident-type diversity from a single feed). 10 MalwareBazaar cases
landed with sha256+sha1 hash observables and exe/file_type metadata.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 22:25:56 +02:00
m17hr1l
d87bd710bb stage-19: ThreatFox + MalwareBazaar + OTX Scoutline sources
Three new feeds — biggest near-term data-diversity win. ThreatFox brings
multi-malware IOCs with threat_type signal (botnet_cc → BOTNET,
payload_delivery → MALWARE, phishing → PHISHING). MalwareBazaar brings
file-hash samples with signatures. OTX brings curated multi-source pulses
with paragraph-form descriptions — by far the richest real-prose source.

Auth: THREATFOX_AUTH_KEY (one abuse.ch key covers ThreatFox + MalwareBazaar)
and OTX_API_KEY. fetch-all skips keyed feeds cleanly with where-to-get-it
guidance instead of tracebacking. Proofline reliability table extended;
abuse.ch sources rated B/2, OTX rated C/3 (community-driven).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 22:14:18 +02:00
m17hr1l
994a5c642f stage-18: approval queue — human gate before evidence leaves
CERT-Bund (authority) requires_approval by default; PSYC_REQUIRE_APPROVAL=1
forces every routable submission through the queue. Courier branches at
execute_routes: approval-required → freeze payload + enqueue, no HTTP; else
submit directly as before. Approve dispatches the frozen payload to mock-cert
and writes the ledger row (detail=approved_by=…); reject writes a ledger row
with the reviewer's reason. CLI: queue / approve / reject. Cockpit /queue
page with POST approve / reject and counts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:42:08 +02:00
m17hr1l
e504b3dbcf stage-14: pytest test suite over the worker lines
38 tests covering the pure worker-line logic: Classifyline rules, Routeline
TLP/country/incident-type gates, Sealine seal/unseal round-trip, Proofline
confidence scoring, Mapline CVEResolver escalation, Trainline dataset
well-posedness (the v1/v3 input-signal bugs are now regression-guarded), and
the Scoutline feed parsers. pytest added as a dev extra.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 23:36:41 +02:00