Multi-Model Adversarial Review
How Bastion uses independent LLMs to catch what any single model misses — not just at proposal time, but at every layer where AI judgment is involved.
The Problem with Single-Model Security
The Bastion composition engine uses an LLM to discover attack vectors and composite chains. This works because LLMs can reason about causal relationships between vulnerabilities. But every LLM has systematic blind spots:
- Training data gaps — patterns absent from the training corpus are invisible
- Reasoning biases — consistent tendencies to decompose problems in certain ways
- Attention patterns — some code structures get more scrutiny than others
- Severity calibration — models disagree on what constitutes CRITICAL vs. MEDIUM
If a single model both discovers AND reviews, its blind spots are self-reinforcing. A bias that causes a miss also causes the miss to go undetected. The model doesn't know what it doesn't know.
Adversarial review at the proposal stage catches composition errors — chains that don't actually work, severity miscalibrations, false positives. But it does NOT catch discovery misses — primitives that should exist but were never proposed, because the discovering model has a systematic blind spot in that area.
The adversarial review must cover every layer where AI judgment is involved.
Where AI Judgment Is Involved
| Layer | AI Role | What a blind spot misses |
|---|---|---|
| Layer 0: Intelligence | Sweep sources, classify findings | A finding is dismissed as DUPLICATE when it's actually a new variant |
| Layer 1: Deterministic | None — mechanical | N/A (no AI judgment) |
| Layer 2: Domain Auditors | Review code against checklists | A vulnerability pattern not on the checklist is never flagged |
| Layer 2.5: Composition | Reason about cross-domain chains | A valid chain is never proposed because the model doesn't see the causal link |
| Layer 3: Automation | Generate artifacts from accepted proposals | A semgrep rule has a pattern gap that misses a variant |
Layer 1 is deterministic and human review is not an AI step, so neither is affected. Every layer where a model exercises judgment (0, 2, 2.5, and 3) is.
The Multi-Model Architecture
Principle: Independent Discovery, Adversarial Review, Union of Findings
No model reviews its own work. Every model runs independently. The human sees the union.
Layer 0: Intelligence Sweep
Each model sweeps the same source catalog independently and classifies findings.
What multi-model catches:
- Model A dismisses a finding as DUPLICATE; Model B classifies it as VARIANT
- Model A misses a relevant finding in a source; Model B catches it
- Model A and Model B extract different vulnerability patterns from the same audit report
Implementation: Run intelligence-sync with each model. Diff the outputs. Findings classified differently across models get escalated for human review.
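The diff step can be sketched as follows. This is a minimal illustration, not Bastion's actual implementation: it assumes each intelligence-sync run can be reduced to a mapping from a finding ID to a classification label, and the function name, finding IDs, and the `NEW` label are hypothetical (only `DUPLICATE` and `VARIANT` appear in this document).

```python
from collections import defaultdict


def find_classification_conflicts(outputs):
    """Return findings the models classified differently, or that a model missed.

    `outputs` maps a model name to {finding_id: classification}.
    This input shape is an assumption made for the sketch.
    """
    labels = defaultdict(dict)
    for model, findings in outputs.items():
        for finding_id, classification in findings.items():
            labels[finding_id][model] = classification

    escalate = {}
    for finding_id, by_model in labels.items():
        # Models that swept the same sources but never reported this finding.
        missed_by = sorted(set(outputs) - set(by_model))
        if len(set(by_model.values())) > 1 or missed_by:
            escalate[finding_id] = {"labels": by_model, "missed_by": missed_by}
    return escalate


# Hypothetical sweep results from two models.
outputs = {
    "model_a": {"F-1": "DUPLICATE", "F-2": "NEW"},
    "model_b": {"F-1": "VARIANT", "F-2": "NEW", "F-3": "NEW"},
}
conflicts = find_classification_conflicts(outputs)
```

Here `F-1` is escalated because the labels differ, `F-3` because only one model reported it, and `F-2` is not escalated because both models agree.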
Layer 2: Domain Audits
Each model runs domain-specific code review independently with the same checklist.
What multi-model catches:
- Model A considers a code pattern safe; Model B flags it
- Model A focuses on the explicit check and misses an implicit assumption; Model B catches the assumption gap
- Model A's reading of the code misses a subtle control flow path that Model B follows
Implementation: Run each domain auditor (authorization, arithmetic, temporal, state) with each model against the same source files. Unique-to-one-model findings are the highest-value output — they're the blind spots.
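One way to isolate the unique-to-one-model findings is a set difference per model. The sketch below assumes findings have already been normalized to a comparable key of (domain, file, pattern); in practice that normalization is the hard part, since two models rarely phrase a finding identically. The function name, file names, and pattern names are hypothetical.

```python
def unique_to_one_model(findings_by_model):
    """Map each model to the findings that no other model reported."""
    unique = {}
    for model, findings in findings_by_model.items():
        # Everything reported by any model other than this one.
        others = set().union(
            *(f for m, f in findings_by_model.items() if m != model)
        )
        unique[model] = findings - others
    return unique


# Hypothetical normalized findings: (domain, file, pattern).
findings_by_model = {
    "model_a": {
        ("authorization", "Transfer.daml", "missing-signatory-check"),
        ("temporal", "Escrow.daml", "stale-deadline"),
    },
    "model_b": {
        ("authorization", "Transfer.daml", "missing-signatory-check"),
        ("state", "Escrow.daml", "double-archive"),
    },
}
unique = unique_to_one_model(findings_by_model)
```

The shared authorization finding drops out, leaving one candidate blind spot per model for deeper investigation.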
Layer 2.5: Composition
Each model reads all domain vectors and independently proposes composite attack chains.
What multi-model catches:
- Model A composes temporal+auth but never considers arithmetic+state; Model B finds it
- Model A stops at 2-vector chains; Model B finds a 3-vector chain
- Model A's causal reasoning misses a data flow path that Model B traces
Implementation: Run composition-auditor with each model. The union of proposals is at least as large as any single model's output, and larger whenever one model finds a chain the others miss. Proposals duplicated across models are high-confidence, because multiple independent reasoners converged on the same chain.
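A sketch of the merge, assuming each proposed chain can be represented as an ordered list of vector IDs (order matters because a chain means one step enables the next). The function name and the vector IDs below are hypothetical.

```python
from collections import defaultdict


def merge_chain_proposals(proposals_by_model):
    """Union composite-chain proposals and record which models proposed each chain."""
    proposers = defaultdict(set)
    for model, chains in proposals_by_model.items():
        for chain in chains:
            # A tuple keeps the step order and is hashable, so it can be a dict key.
            proposers[tuple(chain)].add(model)
    return dict(proposers)


# Hypothetical proposals: model_b repeats model_a's chain and adds a 3-vector chain.
proposals = {
    "model_a": [["AV-T-1", "AV-AUTH-2"]],
    "model_b": [["AV-T-1", "AV-AUTH-2"], ["AV-A-3", "AV-S-1", "AV-AUTH-2"]],
}
merged = merge_chain_proposals(proposals)
converged = [chain for chain, models in merged.items() if len(models) > 1]
```

`merged` holds two distinct chains, and `converged` contains only the chain that both models proposed independently.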
Layer 3: Artifact Generation
After a human accepts a proposal, the generated artifacts (semgrep rules, test stubs) are reviewed by a different model.
What multi-model catches:
- A semgrep rule pattern that's too narrow (misses variants)
- A test stub that doesn't actually test the attack path
- A mitigation pattern that has its own vulnerability
Implementation: Model A generates artifacts; Model B reviews them before they enter the deterministic layer.
Confidence Scoring
When multiple models independently analyze the same scope, their agreement (or disagreement) is a signal:
| Finding | Claude | Gemini | Codex | Confidence | Action |
|---|---|---|---|---|---|
| AV-AUTH-NEW | FOUND | FOUND | FOUND | High | Fast-track to human review |
| AV-T-NEW | FOUND | FOUND | — | Medium | Standard review |
| AV-C-NEW | FOUND | — | — | Low (or novel) | Investigate — blind spot or false positive? |
| AV-S-NEW | — | FOUND | — | Low (or novel) | Investigate — blind spot or false positive? |
Single-model findings are NOT automatically lower priority. They may be the most valuable — a genuine blind spot that only one model catches. But they warrant deeper investigation before acceptance.
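The table can be read as a small scoring rule. The sketch below is one possible encoding, with a hypothetical function name; generalizing "all models agree" to High and "exactly one model" to Low for model counts other than three is an assumption, not something this document specifies.

```python
def score_finding(found_by, all_models):
    """Return (confidence, action) for one finding, following the table above."""
    hits = set(found_by) & set(all_models)
    if not hits:
        raise ValueError("finding was not reported by any known model")
    if len(hits) == 1:
        # Not automatically lower priority: may be a genuine blind spot.
        return ("Low (or novel)", "Investigate: blind spot or false positive?")
    if len(hits) == len(set(all_models)):
        return ("High", "Fast-track to human review")
    return ("Medium", "Standard review")


MODELS = ("claude", "gemini", "codex")
high = score_finding({"claude", "gemini", "codex"}, MODELS)
medium = score_finding({"claude", "gemini"}, MODELS)
low = score_finding({"gemini"}, MODELS)
```

The single-model case is checked first so that a one-model deployment never reports trivial "agreement" as High.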
The Adversarial Challenge Protocol
When Model B challenges Model A's proposal, it answers these specific questions:
For Domain Vector Proposals
- Is this a real vulnerability? Can you construct a concrete exploit scenario, or is this theoretical?
- Is the severity correct? What's the actual worst-case impact? Is it overstated or understated?
- Is the mitigation sound? Does the proposed mitigation actually prevent the attack, or does it introduce a new weakness?
- What's missing from this domain? Given the code you see, what vulnerabilities did the proposing model NOT flag?
For Composite Vector Proposals
- Is this chain physically possible? Trace the data flow in the actual code — can an attacker actually move from step 1 to step 2?
- Is the composition type correct? Is this really a CHAIN (A enables B), or is it two independent issues mislabeled as related?
- Are there simpler explanations? Could this be a single-domain issue that doesn't require composition?
- What compositions were NOT proposed? Given these primitives, what cross-domain chains did the proposing model miss?
For the Overall Framework
- What security domains are missing? Are there vulnerability categories that don't fit authorization, arithmetic, temporal, state, or composition?
- What source categories are missing? Are there classes of intelligence source not represented in the catalog?
- Where does the deterministic layer have gaps? What types of vulnerabilities can't be caught by semgrep rules or DAML tests?
- Where does the process model fail? Under what conditions does the discovery→review→memorialization loop break down?
Implementation: How to Run Multi-Model Review
One-Time Framework Review
Give each model the complete Bastion documentation and ask it to find holes.
Input for each model:
- website/docs/architecture.md: Overall system design
- website/docs/layers.md: Layer details and agent architecture
- website/docs/composition.md: Compositional learning mechanism
- website/docs/adversarial-review.md: This document (meta-review)
- agents/AGENT_COORDINATION.md: Agent ownership and coordination
- agents/composition-auditor.md: Composition agent definition
- vectors/examples/daml-common.yaml: Reference vector examples
- vectors/examples/composition.yaml: Reference composite examples
- semgrep/daml-security.yaml: Static analysis rules
Prompt for framework review:
You are an adversarial security reviewer. Your job is to find weaknesses
in this security framework's design — not to validate it.
Read all provided documents. Then answer:
1. PROCESS MODEL GAPS: Where does the discovery→review→memorialization
pipeline fail? What types of vulnerabilities slip through?
2. DOMAIN COVERAGE GAPS: What vulnerability categories exist in
DAML/Canton smart contracts that are NOT covered by the 5 domains
(authorization, arithmetic, temporal, state, composition)?
3. COMPOSITION BLIND SPOTS: What types of cross-domain attack chains
does the composition auditor's heuristic list miss? What
composition patterns are NOT in the reference examples?
4. DETERMINISTIC LAYER WEAKNESSES: What types of findings CANNOT be
memorialized as semgrep rules or DAML tests? How does the system
handle vulnerabilities that resist deterministic checking?
5. INTELLIGENCE PIPELINE GAPS: What categories of security
intelligence source are missing? What types of findings would
those sources surface that current sources don't?
6. ADVERSARIAL REVIEW GAPS: How could this multi-model review
process itself be gamed or fail? What are WE not thinking about?
Be specific. Name the gap, explain why it matters, and propose
how to close it. Theoretical concerns without concrete impact
are not useful.
Ongoing Per-Review Multi-Model Protocol
For each security review cycle:
Step 1: Parallel independent discovery
- Run intelligence-sync, domain auditors, and composition auditor on each model
- Each produces its own set of proposals
Step 2: Cross-model challenge
- Each model's proposals are reviewed by a different model
- Challenger answers the adversarial questions above
- Disagreements are flagged
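The only stated constraint on challenger assignment is that no model reviews its own proposals. A round-robin rotation is one simple way to satisfy it; the rotation scheme and the function name are assumptions for illustration.

```python
def assign_challengers(models):
    """Pair each proposing model with a different model as its challenger."""
    if len(set(models)) != len(models):
        raise ValueError("model names must be unique")
    if len(models) < 2:
        raise ValueError("cross-model challenge needs at least two models")
    # Each model is challenged by the next one in the list, wrapping around.
    return {
        proposer: models[(i + 1) % len(models)]
        for i, proposer in enumerate(models)
    }


pairs = assign_challengers(["claude", "gemini", "codex"])
```

With three models this yields claude challenged by gemini, gemini by codex, and codex by claude, so every model both proposes and challenges exactly once.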
Step 3: Merge and triage
- Union of all unique findings
- Multi-model agreement findings → fast-track
- Single-model findings → investigate (could be blind spot or false positive)
- Cross-model disagreements → human arbitrates
Step 4: Human review
- Human sees: all proposals, all challenges, confidence scores
- Human makes accept/reject/revise decision
- Accepted findings → /integrate-vector → deterministic governance
What This Costs
Multi-model review multiplies discovery compute by the number of models and adds one challenge run per proposal. For Bastion, the cost is bounded:
| Activity | Runs per cycle | Models | Total runs |
|---|---|---|---|
| Intelligence sync | 1 | 3 | 3 |
| Domain auditors (4) | 4 | 3 | 12 |
| Composition auditor | 1 | 3 | 3 |
| Cross-model challenge | ~15 proposals | 1 each | ~15 |
| Total | | | ~33 agent runs |
This is a weekly or per-release cost, not per-commit. The deterministic layer (semgrep, tests, coverage) runs on every commit at zero AI cost. The multi-model review is the investment that makes the deterministic layer grow correctly.
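The total in the table follows from simple arithmetic, sketched here so the estimate can be recomputed for a different number of models or auditors. The function and parameter names are hypothetical; the proposal count of 15 is the approximate figure from the table.

```python
def total_agent_runs(models, domain_auditors, proposals):
    """Estimate agent runs per review cycle from the cost table's structure."""
    # Per model: one intelligence sync, one run per domain auditor, one composition run.
    discovery_per_model = 1 + domain_auditors + 1
    # Each proposal is challenged once, by a single different model.
    return models * discovery_per_model + proposals


runs = total_agent_runs(models=3, domain_auditors=4, proposals=15)  # 3 * 6 + 15 = 33
```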
Why Not Just Use the "Best" Model?
There is no best model for security review. Each model has demonstrated strengths:
| Model | Observed strength | Why it matters for security |
|---|---|---|
| Claude | Long-context reasoning, nuanced severity assessment | Can hold entire codebases in context for cross-file analysis |
| Gemini | Broad knowledge retrieval, aggressive pattern matching | Catches patterns from obscure sources Claude may not have trained on |
| Codex/GPT | Code generation fluency, exploit scenario construction | Better at constructing concrete PoC exploit paths |
The strengths don't matter as much as the differences. If all three models had the same blind spots, multi-model review would be useless. They don't. The value is in the disagreement.
Measuring Effectiveness
Track these metrics to evaluate whether multi-model review is finding real value:
| Metric | What it measures | Target |
|---|---|---|
| Unique-to-one-model findings | Blind spot detection rate | At least 10% of findings should be unique to one model |
| Cross-model disagreement rate | How often models disagree on classification | 15-30% (too low = models are too similar; too high = noise) |
| Single-model findings accepted | Blind spots that were real | Above 30% of single-model findings should be accepted |
| Challenge-caught false positives | Adversarial review effectiveness | Challenges should catch at least 20% of false positives before human review |
| Framework review findings implemented | Design-level improvement | Each framework review should produce at least 2 actionable changes |
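Two of these metrics can be computed directly from a log of findings. The sketch below assumes each finding is recorded with the set of models that reported it and whether the human reviewer accepted it; that record shape and the function name are assumptions, and the thresholds are the targets from the table.

```python
def review_metrics(findings):
    """Compute unique-to-one-model and single-model acceptance rates.

    Each finding is a dict with 'found_by' (set of model names)
    and 'accepted' (bool). This record shape is hypothetical.
    """
    single = [f for f in findings if len(f["found_by"]) == 1]
    unique_rate = len(single) / len(findings) if findings else 0.0
    accept_rate = sum(f["accepted"] for f in single) / len(single) if single else 0.0
    return {
        "unique_to_one_model_rate": unique_rate,
        "single_model_accept_rate": accept_rate,
        "meets_unique_target": unique_rate >= 0.10,  # at least 10%
        "meets_accept_target": accept_rate > 0.30,   # above 30%
    }


# Hypothetical cycle: 8 findings seen by all models, 2 seen by one model each.
findings = [{"found_by": {"claude", "gemini", "codex"}, "accepted": True}] * 8 + [
    {"found_by": {"claude"}, "accepted": True},
    {"found_by": {"gemini"}, "accepted": False},
]
metrics = review_metrics(findings)
```

In this example 2 of 10 findings are unique to one model and 1 of those 2 was accepted, so both targets are met.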