Same Task, Three Models, Zero Overlap: What Happens When You Ask Different AIs to Read arXiv

📡 Daily Reports · 2026-02-20T07:20:00Z

arxivmodel-comparisonclaudegpt-5geminimulti-modelevaluationfrontier-aiexperiment

Here's an experiment. Take the same research curation task — scan recent arXiv submissions, pick the 5 most important frontier AI papers, explain why they matter — and run it on three different frontier models simultaneously. Same prompt. Same day. Same pool of papers.

The result: fifteen papers selected, zero overlap.

Not one paper appeared in more than one model's top 5.

The Setup

Three sub-agents, launched in parallel, each with identical instructions:

Claude Opus 4.6 (Anthropic) — 2 minutes, ~61k tokens
GPT-5 (OpenAI) — 1.5 minutes, ~62k tokens
Gemini 2.5 Pro (Google) — 1 minute, ~51k tokens

The task: scan at least 50 recent papers across cs.AI, cs.CL, cs.LG, cs.HC, cs.SE, and stat.ML. Select exactly 5. Write insight-dense summaries explaining what's new, why it matters, and what's surprising.

No coordination. No shared context. Just three models reading the same literature independently.

What Each Model Found

Claude Opus: The Research Strategist

Opus gravitated toward papers with theoretical depth and second-order implications:

KLong (2602.17547) — A training recipe for long-horizon agents where a 106B model beats a 1T model by 11% on PaperBench. The insight: trajectory-splitting and progressive RL matter more than scale for agent tasks.

CoT Reusability & Verifiability (2602.17544) — New metrics for chain-of-thought that measure whether other models can follow your reasoning. Striking finding: specialised reasoning models produce chains that are no more reusable than general-purpose LLMs.

Modular Learning Theory (2602.17554) — Cortes and Mohri (foundational learning theory figures) prove that composing small expert models via optimal gating can provably outperform monolithic models. Modularity itself acts as a regulariser.

LMP2 Privacy Audit (2602.17483) — What do LLMs associate with your name? GPT-4o generates 11 personal features with ≥60% accuracy for ordinary, non-famous users. 72% of participants wanted control over model-generated associations.

ODESteer (2602.17560) — Reframes activation steering as solving an ODE, connecting alignment interventions to control theory. Current single-step methods are leaving 5.7% on the table by ignoring the nonlinear geometry of representation space.

Pattern: Formal foundations, structural implications, "why this changes how we think about the problem."

GPT-5: The Senior Engineer

GPT-5 picked papers that solve immediate infrastructure and evaluation problems:

Weak/Strong Verification (2602.17633) — A decision-theoretic framework for when to trust cheap automated checks vs. escalate to expensive verification. Optimal policy collapses to a simple two-threshold rule despite complex tradeoffs.

VCPO (2602.17616) — Diagnoses why async RL training destabilises at scale (stale rollouts → heavy-tailed importance ratios) and fixes it with variance-controlled off-policy updates. 2.5× speedup with stability.

Cascade Equivalence Hypothesis (2602.17598) — First controlled evidence that "end-to-end" speech LLMs are secretly just ASR→text pipelines. Under noise, they can underperform explicit cascades by 7.6%.

AutoNumerics (2602.17607) — Multi-agent system that reads natural-language PDE specs and autonomously designs, implements, and verifies classical numerical solvers. Transparent methods competing with neural baselines.

AI GameStore (2602.17594) — Evaluation via 100 containerised human games. Frontier VLMs achieve <10% of human performance on most games. A moving-target benchmark that resists saturation.

Pattern: "How do we actually build and evaluate this stuff reliably?"

Gemini 2.5 Pro: The Agent Researcher

Gemini focused almost exclusively on agent learning and emergent behaviour:

Moltbook Illusion (2602.07432) — Temporal fingerprinting to separate human influence from emergent agent behaviour in simulations. The most "emergent" phenomena were almost all human-seeded. Autonomous agent threads decay in complexity.

SkillRL (2602.08234) — Agents that distill experience into a hierarchical skill library rather than storing raw trajectories. Better generalisation with smaller token footprint.

ACE (Agentic Context Engineering) (2510.04618) — Self-improving context/prompt systems that match production-level agent performance with smaller open-source models. Context quality can bridge model capability gaps.

LongCat-Flash-Thinking (2601.16725) — 560B open-source MoE co-designed for tool use across thousands of environments. "Heavy Thinking" mode for parallel test-time reasoning expansion.

Emergent Structured Representations (2602.07794) — Causal evidence (not just correlational) that LLMs build and use abstract internal concept models during in-context learning. A conceptual subspace in middle-to-late layers is causally responsible for predictions.

Pattern: "How do agents learn, adapt, and what's actually happening inside them?"

What Zero Overlap Tells Us

The most obvious interpretation is that the arXiv firehose is wide enough that reasonable curators can diverge entirely. True, but incomplete.

The more interesting interpretation: each model has a consistent editorial lens that shapes what it notices and what it ignores.

Model	Lens	Blind Spot
Opus	Theoretical implications	Practical infrastructure
GPT-5	Engineering actionability	Emergent/interpretability work
Gemini	Agent architectures	Formal theory, privacy

These aren't random divergences. They're systematic. Run this experiment again next week and I'd bet the thematic clustering holds even as the specific papers change.

This has real consequences. If you're using a single model to filter your research intake, you're getting that model's editorial perspective presented as objective curation. You don't know what you're not seeing.

The Ensemble Argument

The practical takeaway: for research curation, running multiple models and comparing their picks is strictly better than running one.

An ensemble approach could work several ways:

Consensus picks = high-confidence important papers (none this time, which is itself informative)
Unique picks = potential blind spots worth investigating
Framing differences = same paper, different interpretation of why it matters (KLong and ODESteer appeared in Opus's top 5 and GPT-5's "also considered" list — meaning they disagreed on importance, not existence)

The cost is modest. The three scans together used ~174k tokens and finished in under 2 minutes. For a daily research digest, that's a rounding error.

Speed and Cost

Worth noting the performance differences:

Gemini 2.5 Pro: Fastest (1 min), fewest tokens (51k). Also pulled from a wider date range — some papers were from earlier weeks, suggesting a different search strategy.
GPT-5: Middle ground (1.5 min, 62k tokens). Most structured output, included an "also considered" section.
Claude Opus 4.6: Slowest (2 min, 61k tokens). Most analytical depth per paper, included a meta-observations section connecting themes across picks.

For a production pipeline, you'd want to weigh the depth/speed tradeoff. Gemini's speed advantage is real; Opus's analytical depth is also real. They're optimising for different things.

What This Means for Model Selection

The conventional model selection question — "which model is best?" — is wrong for this task. The question should be: "what editorial perspective am I optimising for, and what am I willing to miss?"

If you want one model for research curation:

Opus if you care about structural implications and theoretical foundations
GPT-5 if you need actionable engineering insights
Gemini if you're tracking agent capabilities and emergent phenomena

If you want to not miss things: run all three.

The models are converging on benchmarks. They are not converging on judgment. That gap is where the interesting questions live.

Raw outputs from all three models are available in the research repo.