Daily arXiv Scan: March 28, 2026
Four-Model arXiv Consensus: March 28, 2026
80 papers scanned across cs.AI, cs.CL, cs.LG, cs.HC, cs.SE, stat.ML Models: Gemini 2.5 Pro, Claude Opus 4.6, Kimi K2, GPT-5
Consensus Picks (4/4 or 3/4 agreement)
Natural-Language Agent Harnesses (4/4)
All models selected this
The agent "harness" โ the control logic orchestrating tool use, state management, and action sequencing โ is typically buried in brittle controller code. This paper externalizes it as portable, editable natural-language specifications executed by a shared runtime.
- Gemini: Makes agent orchestration a "legible, auditable scientific object" โ crucial for AI governance
- Opus: Transforms agent behavior into version-controllable, A/B testable specifications
- Kimi: "Imagine if Dockerfiles were written in plain English and enforced by a theorem prover"
- GPT-5: Creates standardized agent control policies independent of runtime infrastructure
Why this matters: Governance prerequisite hiding in plain sight. If you can't inspect the harness, you can't audit agent behavior.
Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities (3/4)
Selected by Gemini, Opus, GPT-5
Introduces "reasoning safety" as distinct from content safety โ monitoring the chain-of-thought process itself for logical consistency, computational efficiency, and adversarial manipulation.
- Gemini: Names the "spooky feeling that an LLM's answer is correct for all the wrong reasons"
- Opus: Catches confident errors where self-uncertainty signals misleadingly low
- GPT-5: "Content filters miss the most dangerous failure mode: confident, wrong reasoning"
Why this matters: As models "show their work" via CoT, the reasoning chain becomes an attack surface independent of output quality.
Retraining as Approximate Bayesian Inference (3/4)
Selected by Gemini, Opus, GPT-5
Reframes model retraining from calendar schedules to decision theory. Introduces "learning debt" โ the divergence between deployed model beliefs and continuously updated ideal beliefs.
- Gemini: "Introduces the killer concept of 'learning debt'"
- Opus: Makes retraining "auditable and evidence-based โ rather than vibes-based"
- GPT-5: Replaces calendar-based retraining with principled, cost-minimization triggers
Why this matters: Governance gold. You can define retraining SLOs and quantify stale-model risk vs compute spend.
The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase (3/4)
Selected by Gemini, Opus, GPT-5
A complete architecture for autonomous software evolution: explicit specification surface, synthetic power-user testing at 1000x cadence, "unbeatable tests" that authors can't game, and automated drift control with pause gates.
- Gemini: "The most important paper in the list" โ a practical guide for self-improving systems
- Opus: "Preview of how all software development will work in 2-3 years"
- GPT-5: "The most coherent systems-level pattern for AI-native product engineering"
Why this matters: Inverts traditional development โ systems test themselves against specifications, humans intervene only at the intent layer.
Pair Picks (2/4 agreement)
Decidable By Construction: Design-Time Verification for Trustworthy AI
Selected by Gemini, GPT-5
Argues for building AI systems with provable properties baked into their mathematical structure, rather than testing trustworthiness post-hoc. Brings formal verification discipline to machine learning.
Cross-Model Disagreement as a Label-Free Correctness Signal
Selected by Opus, Kimi
When models disagree, they're probably wrong. Cross-model disagreement outperforms self-reported confidence for error detection, especially for confident errors that hurt most.
Unique Finds (1/4 agreement)
Kimi's distinctive picks:
- Self-Improvement of Large Language Models: Technical Overview โ First systematic survey of fully autonomous self-improvement loops
- Agent Factories for High Level Synthesis โ Software agents outperform humans at hardware design
- No Hard Negatives Required: Concept Centric Learning โ Compositional learning without dataset-specific negatives
Connecting Threads
The Inspection Imperative: Every consensus pick addresses making AI systems' internal processes legible and auditable. The field is realizing that black-box outputs aren't enough โ you need to inspect harness logic, belief states, specifications, and reasoning trajectories.
Systems-Level Safety Divergence: Safety signals emerge from relationships (cross-model disagreement, reasoning consistency) rather than individual components. This mirrors distributed systems thinking applied to AI.
Specification as Control Point: Multiple papers move governance to declarative specification layers above implementation. Control should operate at the level of intent and belief, not code and weights.
The Meta Turn: None of these advance model capabilities. They all build infrastructure around models โ the harnesses, monitoring, disagreement signals, retraining triggers, drift controls that make deployment trustworthy.
Statistical Baseline
- Overlap vs Chance: 4 papers at 3+ agreement (expected: 0.07), 6 at 2+ agreement (expected: 1.72)
- Surprise factor: 57x more consensus than expected by chance at 3+ agreement threshold
Recommended Reading (Ranked by Agreement)
- Natural-Language Agent Harnesses โ All models
- Beyond Content Safety: Real-Time Monitoring โ 3 models
- Retraining as Approximate Bayesian Inference โ 3 models
- The Kitchen Loop: User-Spec-Driven Development โ 3 models
- Decidable By Construction โ 2 models
- Cross-Model Disagreement as Correctness Signal โ 2 models
Methodology: Four frontier models (Gemini 2.5 Pro, Claude Opus 4.6, Kimi K2, GPT-5) independently selected their top 5 papers from 80 submissions across AI-relevant arXiv categories. Consensus emerges from convergent selection without coordination.