ADR-004: AI Validation Layer for Scanner Findings

Status: ✅ Accepted
Date: 2026-03-30
Decision Makers: Tilak Kumar


Context

Traditional DAST (Dynamic Application Security Testing) scanners have notoriously high false positive rates: 30–60% is common in the industry. Every false positive wastes a security engineer's time and erodes trust in the tool.

ThreatWeaver's 59-agent scanner generates raw findings based on heuristics. Without a second-pass validation, many of these findings would be false positives: payloads that returned a 200 status code but weren't actually exploitable, or endpoints that looked vulnerable based on response patterns but weren't.

The options were to build better heuristics (time-consuming, brittle) or use an AI model to reason about evidence.


Decision

Add a Claude-powered AI validation layer between raw finding detection and final finding storage.

Every raw finding goes through an AI validation step before being surfaced to the user:

  1. Agent detects a potential vulnerability (e.g., IDOR: user A can access user B's resource)
  2. Evidence is assembled: request/response pairs, status codes, response diff, business impact
  3. Claude is called with the evidence and asked to rate confidence (0–10) and provide reasoning
  4. Threshold applied: findings below the confidence threshold are discarded or marked as FP
  5. Validated findings are written to the Blackboard and stored in PostgreSQL
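The five steps above can be sketched as a single validation function. This is a minimal illustration, not ThreatWeaver's actual code: the prompt template, the threshold value, and the `call_model` parameter (standing in for the real Claude API call) are all assumptions.

```python
import json
from dataclasses import dataclass

@dataclass
class RawFinding:
    kind: str        # e.g. "IDOR"
    evidence: dict   # request/response pairs, status codes, response diff

# Hypothetical prompt template; the real prompts are tuned per agent type.
PROMPT = (
    "You are validating a potential {kind} vulnerability.\n"
    "Evidence: {evidence}\n"
    'Reply with JSON: {{"confidence": 0-10, "reasoning": "..."}}'
)

CONFIDENCE_THRESHOLD = 7  # assumed value; the real threshold is tunable

def validate(finding: RawFinding, call_model) -> dict:
    """Assemble evidence, ask the model for a confidence score,
    and apply the threshold (steps 2-4 of the pipeline)."""
    prompt = PROMPT.format(kind=finding.kind,
                           evidence=json.dumps(finding.evidence))
    verdict = json.loads(call_model(prompt))
    verdict["is_false_positive"] = verdict["confidence"] < CONFIDENCE_THRESHOLD
    return verdict

# Stubbed model call for illustration; production code would hit the Claude API.
def fake_model(prompt: str) -> str:
    return json.dumps({"confidence": 9,
                       "reasoning": "Response diff shows user B's record."})

finding = RawFinding("IDOR", {"status": 200, "diff": "user B record returned"})
result = validate(finding, fake_model)
```

Findings that pass the threshold (like the one above) would then be written to the Blackboard and PostgreSQL; discarded findings can still be logged for FP-rate tracking.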

Claude models used:

  • Haiku: fast pre-filter for obvious false positives (cheap, low latency)
  • Sonnet: standard validation for most finding types
  • Opus: reserved for critical/complex chains (SSRF, auth bypass chains, BFLA)
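The three-tier model split can be expressed as a small routing function. The model identifiers and the exact set of "complex" finding kinds below are placeholders, not the production configuration:

```python
# Assumed set of finding kinds routed to the strongest model.
COMPLEX_KINDS = {"SSRF", "AUTH_BYPASS_CHAIN", "BFLA"}

def pick_model(kind: str, prefilter: bool = False) -> str:
    """Choose a model tier: Haiku for the cheap pre-filter pass,
    Opus for critical/complex chains, Sonnet for everything else."""
    if prefilter:
        return "claude-haiku"    # fast first pass over obvious false positives
    if kind in COMPLEX_KINDS:
        return "claude-opus"     # deep reasoning for critical chains
    return "claude-sonnet"       # standard validation
```

Routing this way keeps the expensive model off the common path: most findings only ever touch Haiku and Sonnet.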

Consequences

Positive:

  • False positive rate reduced from ~40% (industry average) to ~15% in testing
  • AI provides human-readable reasoning for each finding, so developers understand why it's a vulnerability
  • AI can reason about business context (e.g., "this endpoint is expected to return data for any user" is not a BOLA)
  • Confidence scores give security engineers a triage priority signal
  • Opus audit of scan logic (used during development) caught systemic bugs before they shipped

Negative / Trade-offs:

  • Each scan incurs AI API costs (Claude API pricing per token)
  • AI validation adds latency per finding (100–800ms depending on model)
  • AI can be wrong: hallucinated "reasoning" for borderline cases must be reviewed
  • Prompt engineering for each agent type requires ongoing maintenance
  • AI API rate limits can slow scans when finding volume is high
  • Cost tracking is required to prevent runaway billing on large scans
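The last trade-off, guarding against runaway billing, can be handled with a per-scan budget cap. A minimal sketch (the per-million-token prices here are illustrative, not actual Claude pricing):

```python
class CostTracker:
    """Accumulate estimated token spend for one scan and enforce a hard budget."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               in_price: float, out_price: float) -> None:
        # Prices are USD per million tokens; caller supplies the current rates.
        self.spent_usd += (input_tokens / 1e6) * in_price \
                        + (output_tokens / 1e6) * out_price

    @property
    def exhausted(self) -> bool:
        return self.spent_usd >= self.budget_usd

# One validation call: 200k input tokens, 50k output tokens at assumed rates.
tracker = CostTracker(budget_usd=5.0)
tracker.charge(200_000, 50_000, in_price=3.0, out_price=15.0)
```

When `exhausted` flips to true mid-scan, the validator can fall back to a cheaper tier or queue remaining findings rather than silently overspending.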

Alternatives Considered

| Option | Why Rejected |
| --- | --- |
| Better heuristics only | Heuristics are brittle: every new application framework requires new rules. High ongoing maintenance. FP rate remains high. |
| Human review of all findings | Doesn't scale. 200+ raw findings per scan × multiple scans per day is an unsustainable analyst workload. |
| Third-party AI validation (e.g., OpenAI) | Claude has superior instruction-following for structured JSON output and security reasoning. Anthropic's model cards align with responsible security use. |
| Rule-based FP suppression lists | Covers known patterns but misses novel FPs. Requires constant updating. |
| No validation (ship everything) | Tested during early rounds. Security engineers rejected the tool due to noise. Trust recovery is very hard once lost. |