
The core principle: every argument must survive opposition.

arbitrIQ does not generate a single answer. It instantiates a structured adversarial process in which propositions are defended, attacked, and evaluated under governed conditions.

This is not prompting a model to "think of counterarguments." It is a multi-agent protocol where independent models occupy distinct epistemic roles, are forbidden from converging prematurely, and produce a complete evidentiary record.

The fundamental asymmetry in decision-making is that confirming evidence is easy to generate and disconfirming evidence is cognitively expensive. arbitrIQ inverts this: opposition is the default, and agreement must be earned.
The tradition of structured debate

Intellectual foundations

arbitrIQ's architecture draws on convergent findings from decision science, intelligence analysis, epistemology, and machine learning research.

Adversarial collaboration

When researchers with opposing hypotheses design experiments together, they produce stronger evidence than either would alone. The disagreement becomes a methodological asset, not a social liability.

arbitrIQ operationalizes this principle: the Advocate and Opposition are structurally prevented from converging, forcing genuine engagement with the strongest form of each position.

Kahneman, D. (2003). "Experiences of Collaborative Research." American Psychologist, 58(9), 723–730.
Mellers, B. et al. (2001). "Do Frequency Representations Eliminate Conjunction Effects?" Psychological Science, 12(4), 269–275.

Structured Analytic Techniques

Intelligence agencies developed formalized methods — devil's advocacy, Team A/Team B analysis, Analysis of Competing Hypotheses — precisely because unstructured expert judgment systematically underperforms structured dissent under conditions of complexity and uncertainty.

arbitrIQ's dimension-by-dimension debate is a computational implementation of these techniques, applied at a scale and speed that human teams cannot sustain consistently.

Heuer, R.J. (1999). Psychology of Intelligence Analysis. CIA Center for the Study of Intelligence.
U.S. Government (2009). A Tradecraft Primer: Structured Analytic Techniques for Improving Intelligence Analysis.

The Diversity Prediction Theorem

Collective error equals average individual error minus prediction diversity. This is not a heuristic — it is a mathematical identity. Aggregate accuracy improves with diversity of judgment, even when individual judges are imperfect.

arbitrIQ exploits this directly: independently trained models with different architectures, training data, and reasoning patterns produce structurally diverse assessments. By the theorem, the ensemble is guaranteed to be at least as accurate as its average member, with the margin growing as the judgments diverge.

Page, S.E. (2007). The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies. Princeton University Press. Surowiecki, J. (2004). The Wisdom of Crowds.
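The identity can be verified numerically. A minimal sketch (using NumPy; the true value, the individual predictions, and the symbol names are illustrative):

```python
import numpy as np

# Diversity Prediction Theorem:
#   collective error = average individual error - prediction diversity
#   (c - theta)^2    = mean_i (s_i - theta)^2  - mean_i (s_i - c)^2
# where c is the mean of the individual predictions s_i.

theta = 42.0                            # true value being estimated
s = np.array([30.0, 55.0, 47.0, 38.0])  # individual predictions
c = s.mean()                            # collective (ensemble) prediction

collective_error = (c - theta) ** 2
avg_individual_error = np.mean((s - theta) ** 2)
diversity = np.mean((s - c) ** 2)

# The identity holds exactly, for any predictions and any true value.
assert np.isclose(collective_error, avg_individual_error - diversity)
```

Because the diversity term is never negative, the collective prediction can never be worse than the average individual one.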

Sycophancy and confirmation bias in LLMs

Large language models exhibit systematic sycophantic behavior: they tend to agree with the user's stated position, even when that position is wrong. Single-model interactions amplify the user's existing priors rather than challenging them.

arbitrIQ's adversarial structure is specifically designed to defeat this failure mode. The Opposition agent has no access to the user's preferred outcome — its mandate is to find weaknesses regardless of the user's expectations.

Perez, E. et al. (2023). "Discovering Language Model Behaviors with Model-Written Evaluations." Findings of ACL.
Sharma, M. et al. (2024). "Towards Understanding Sycophancy in Language Models." ICLR 2024.

AI safety via debate

Irving, Christiano, and Amodei (2018) proposed that AI systems can be made more truthful by having them debate each other under human judgment. Their key insight: it is easier for a human to judge a debate than to find the truth independently. A strong debater cannot win by lying if the opponent can expose the lie.

arbitrIQ applies this principle to strategic decision-making. The decision-maker doesn't need to be an expert in every dimension — they need to see the strongest arguments from both sides and assess which survived challenge.

Irving, G., Christiano, P. & Amodei, D. (2018). "AI Safety via Debate." arXiv:1805.00899.

The agent architecture

Four specialized roles, each with a distinct epistemic mandate. No agent has a complete view — the architecture enforces the division of cognitive labor.


Director

Ingests uploaded documents and contextual data. Decomposes the strategic question into specific dimensions — each representing a distinct analytical axis that requires independent examination. After all debates conclude, the Director synthesizes the evaluator reports and debate transcripts into a unified executive report.

Mandate: scope the inquiry, ensure completeness, and produce the final synthesis. The Director never participates in the debate itself.

Advocate

Constructs the strongest possible case in favor of the proposition, drawing on uploaded evidence, web research, and structured reasoning. Must respond substantively to every challenge from the Opposition — cannot concede without providing counter-evidence.

Mandate: defend the proposition at its strongest, not at its most convenient. Steelmanning, not strawmanning, the affirmative case.

Opposition

Systematically attacks the proposition. Surfaces counter-evidence, identifies hidden assumptions, stress-tests financial projections, and exposes risks the Advocate has not addressed. Structurally prevented from agreeing to disagree.

Mandate: find genuine weaknesses. The Opposition succeeds when it forces the Advocate to modify, qualify, or abandon claims — not when it generates rhetorical noise.

Evaluator

Intervenes once per dimension, after all debate turns are complete. Assesses argument quality, evidence strength, logical coherence, and the degree of genuine engagement between sides. Produces a structured score and identifies what remains unresolved.

Mandate: impartial adjudication. The Evaluator is a different model from both debaters, ensuring no systematic alignment with either position.
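The division of labor can be sketched as a small role table. This is an illustration of the mandates described above, not arbitrIQ's internal implementation; the model identifiers and field names are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    DIRECTOR = "director"      # scopes dimensions, synthesizes the final report
    ADVOCATE = "advocate"      # defends the proposition at its strongest
    OPPOSITION = "opposition"  # attacks: counter-evidence, hidden assumptions
    EVALUATOR = "evaluator"    # scores each completed debate, once per dimension

@dataclass(frozen=True)
class Agent:
    role: Role
    model: str            # hypothetical identifier; each role gets a distinct model
    sees_user_goal: bool  # the Opposition never sees the user's preferred outcome

agents = [
    Agent(Role.DIRECTOR, "model-a", sees_user_goal=True),
    Agent(Role.ADVOCATE, "model-b", sees_user_goal=True),
    Agent(Role.OPPOSITION, "model-c", sees_user_goal=False),
    Agent(Role.EVALUATOR, "model-d", sees_user_goal=False),
]

# Two structural guarantees from the text: the Evaluator differs from both
# debaters, and the Opposition is blind to the user's goal by construction.
assert agents[3].model not in {agents[1].model, agents[2].model}
assert not agents[2].sees_user_goal
```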

The debate protocol

For each dimension, the protocol proceeds in three phases. The iterative cycle is between Advocate and Opposition only — the Evaluator intervenes once, after the debate concludes.

Phase 1 · Planning
Director decomposes the question into dimensions
Each dimension scopes a specific analytical axis — e.g., financial viability, regulatory risk, competitive dynamics.
Phase 2 · Iterative contradiction (2–10 turns)
Advocate defends
Opposition challenges
Cycle repeats: each turn responds to prior arguments, deepens the analysis, and narrows the space of genuine disagreement
Phase 3 · Evaluation (once per dimension)
Evaluator scores the completed debate
Assesses argument quality, evidence strength, logical coherence. Identifies what was resolved and what remains uncertain.
Final · Synthesis (once, across all dimensions)
Director synthesizes all evaluator reports and transcripts
Produces the decision-ready executive report with integrated scoring, rationale, and explicit uncertainty mapping.

The critical design choice: the Advocate and Opposition iterate without premature adjudication. The Evaluator's assessment is based on the full exchange, not on partial snapshots — ensuring that late-emerging arguments and concessions are weighted appropriately.
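The phases above can be sketched as a control loop. This is an illustrative outline, not arbitrIQ's production code; `call(role, prompt)` is a placeholder for a model invocation and every name here is an assumption:

```python
def run_analysis(question, documents, n_dimensions=3, n_turns=4, call=None):
    """Illustrative sketch of the debate protocol (not the production code).

    `call(role, prompt)` is a placeholder for invoking the model assigned
    to a given role.
    """
    # Phase 1: the Director decomposes the question into dimensions.
    dimensions = call("director",
                      f"Decompose into {n_dimensions} dimensions: {question}")

    reports = []
    for dim in dimensions:
        transcript = []
        # Phase 2: Advocate and Opposition iterate; no adjudication yet.
        for _turn in range(n_turns):
            transcript.append(("advocate", call("advocate", (dim, list(transcript)))))
            transcript.append(("opposition", call("opposition", (dim, list(transcript)))))
        # Phase 3: the Evaluator scores the *full* exchange, once per dimension.
        reports.append(call("evaluator", (dim, transcript)))

    # Final: the Director synthesizes all reports and transcripts.
    return call("director", ("synthesize", reports, documents))
```

Note that the Evaluator appears only after the inner loop completes, so its assessment always covers the entire exchange.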

Model diversity as an epistemic resource

arbitrIQ assigns a different independently trained model to each agent role. This is not a cosmetic choice but a direct application of the diversity prediction theorem: ensemble accuracy improves with genuine diversity of reasoning, even when individual reasoners are imperfect.

Models from Anthropic, OpenAI, and Google differ in training data, RLHF processes, architectural choices, and failure modes. When forced into adversarial interaction, these differences produce debates of substantially higher quality than any single-model self-critique.

Access to all frontier models from Anthropic, OpenAI, and Google

Sycophancy cancellation

LLMs are trained to agree with users. In arbitrIQ, the Opposition has no access to the user's preferred outcome — its mandate is structural, not social. The debate framework defeats pleaser bias by design.

Correlated error reduction

Models trained on different data with different objectives produce different errors. Under adversarial pressure, errors that would survive single-model review are exposed by the opposing model's distinct failure profile.
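The claim is easy to demonstrate with a toy simulation: two judges who share a blind spot gain little from averaging, while two judges of similar individual accuracy but independent errors gain substantially. A sketch (NumPy; the noise levels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
truth = 0.0

# Two judges whose errors are highly correlated (a shared blind spot)...
shared = rng.normal(0, 1, n)
corr_a = truth + shared + rng.normal(0, 0.3, n)
corr_b = truth + shared + rng.normal(0, 0.3, n)

# ...versus two judges of similar individual accuracy but independent errors.
ind_a = truth + rng.normal(0, 1.05, n)
ind_b = truth + rng.normal(0, 1.05, n)

def mse(x):
    return np.mean((x - truth) ** 2)

# Averaging the correlated pair barely helps; averaging the independent
# pair roughly halves the error variance.
assert mse((ind_a + ind_b) / 2) < mse((corr_a + corr_b) / 2)
```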

Reasoning style diversity

Different model families exhibit distinct reasoning patterns — some favor quantitative analysis, others narrative coherence, others risk-centric framing. Structured opposition surfaces these differences as analytical assets rather than noise.

Forced grounding depth

When challenged by an adversary drawing on live web search, models cannot rely on training-data priors alone. The iterative cycle forces progressively deeper engagement with current evidence.

Configurable governance

Not all decisions require the same depth of examination. arbitrIQ allows leaders to calibrate the breadth, depth, and evidentiary standard of each analysis — matching governance effort to decision stakes.

Governance profiles

Dimensions: 1–7
Debate Depth: 2–10
Model Diversity: Low–High
Web Search: On/Off
Report Detail: Concise–Verbose

A rapid triage — 2 dimensions, 3 debate turns — takes under 10 minutes and is appropriate for early-stage scoping. A full governance-grade analysis — 7 dimensions, 10 debate turns with web grounding — produces 60+ pages of structured analysis suitable for board presentation or client deliverables. The decision-maker chooses the appropriate level.
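These knobs map naturally onto a validated configuration object. A sketch under assumed field names, not arbitrIQ's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GovernanceProfile:
    dimensions: int       # 1-7 analytical axes
    debate_depth: int     # 2-10 advocate/opposition turns per dimension
    model_diversity: str  # "low" | "medium" | "high"
    web_search: bool      # ground arguments in live search
    report_detail: str    # "concise" | "verbose"

    def __post_init__(self):
        # Enforce the documented ranges at construction time.
        if not 1 <= self.dimensions <= 7:
            raise ValueError("dimensions must be in 1-7")
        if not 2 <= self.debate_depth <= 10:
            raise ValueError("debate_depth must be in 2-10")

# The two ends of the spectrum described above:
rapid_triage = GovernanceProfile(2, 3, "low", web_search=False,
                                 report_detail="concise")
board_grade = GovernanceProfile(7, 10, "high", web_search=True,
                                report_detail="verbose")
```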

Why architecture matters more than model selection

The most common question we receive is "which model do you use?" The honest answer is that model choice is secondary: the architecture of the interaction matters more than any single model.

A frontier model in a single-turn interaction will produce sycophantic, overconfident output. The same model, placed in a structured adversarial role with a different model challenging its claims, produces qualitatively different reasoning.

The insight is simple but consequential: the epistemic quality of AI output is not a property of the model — it is a property of the protocol. arbitrIQ's contribution is the protocol.

Ready to see structured contradiction in action?

Run your first strategic decision through arbitrIQ.

Launch arbitrIQ

Analysis informs decisions. Governance protects them.