Daily arXiv Scan: March 20, 2026
4-Model arXiv Comparison: March 20, 2026
80 papers scanned across cs.AI, cs.CL, cs.LG, cs.HC, cs.SE, stat.ML Models: Kimi K2, Claude Opus 4.6, Gemini 2.5 Pro, GPT-5
Consensus Picks (3+ Models)
Constitutive vs. Corrective: A Causal Taxonomy of Human Runtime Involvement in AI Systems
arXiv:2603.19213 โ All 4 models
- Opus: Frames this as critical conceptual infrastructure that "quietly reshapes how entire industries think about compliance and design." The causal distinction between constitutive (human action is part of the decision-producing chain) vs. corrective (human monitors and can intervene) cuts through regulatory ambiguity.
- GPT-5: Emphasizes the operational impact: "enables deployment gates, contract enforcement, forensic traceability" and makes "human oversight" testable rather than just a compliance checkbox.
- Gemini: Calls it "essential reading" that provides "sharp, functional vocabulary to design and regulate socio-technical systems" with enormous legal and ethical implications.
- Kimi: Highlights the policy relevance: "product teams now have a design language to prove which human-invocation pattern they use, auditors get a verifiable graph."
Regret Bounds for Competitive Resource Allocation with Endogenous Costs
arXiv:2603.18999 โ Kimi, Opus, Gemini
- Kimi: Focuses on the transparency angle: the "cost-revealing" update rule is "an algorithmic instantiation of radical transparency as an incentive device."
- Opus: Emphasizes the structural implications for multi-agent AI systems where "one agent's resource consumption affects others' performance."
- Gemini: Positions it as "foundational for designing future multi-agent systems, decentralized AI economies, and even governing the training of foundation models."
Pair Picks (2 Models)
Behavioral Fingerprints for LLM Endpoint Stability and Identity
arXiv:2603.19022 โ Opus, GPT-5
- Opus: Frames as the beginning of "behavioral observability" for AI systems, addressing how to know if your model endpoint has changed underneath you.
- GPT-5: Emphasizes operational urgency: "if you run or depend on LLM endpoints, this is table stakes" and expects it to become standard practice.
Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference
arXiv:2603.19025 โ Gemini, GPT-5
- Gemini: Calls it "a game-changer" for creating "auditable AI services" and enabling "decentralized AI marketplaces where participants can trust the computation."
- GPT-5: Highlights feasibility: "the most actionable proposal for verifiable inference that could plausibly make it into production APIs in the next 12-18 months."
SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels
arXiv:2603.19173 โ Kimi, GPT-5
- Kimi: Emphasizes the paradigm shift: "turning kernel-generation from a relative sport into an absolute one" using distance-to-optimal rewards.
- GPT-5: Focuses on practical impact for agentic optimizers: "adopt a headroom-to-roofline reward yesterday" as the new standard for kernel optimization.
Online Learning and Equilibrium Computation with Ranking Feedback
arXiv:2603.19221 โ Kimi, Opus
- Kimi: Calls it "a sleeper hit" that shows ranking feedback is sufficient for convergence, making "expensive, high-friction data pipelines look antiquated."
- Opus: Emphasizes the gap it fills: "the difference between idealized ML theory and real socio-technical deployments" where you get rankings, not utilities.
Connecting Threads
The Runtime Trust Gap
Three papers (human involvement taxonomy, behavioral fingerprinting, security awareness) converge on a fundamental challenge: you cannot take AI system behavior at face value during deployment. Whether it's humans who think they're overseeing but aren't causally involved, endpoints that silently change identity, or agents that can't verify their own security context โ there's a pervasive gap between assumed and actual system behavior at runtime.
Information Structure Determines Everything
The ranking feedback and endogenous costs papers demonstrate that what information is available, and in what form, fundamentally determines achievable outcomes. This isn't a minor implementation detail โ the structure of feedback mechanisms (ordinal vs. cardinal, exogenous vs. endogenous) determines regret bounds, equilibrium properties, and convergence guarantees.
From Monoliths to Composites
Multiple papers signal a shift away from end-to-end monolithic models toward well-designed composite systems. Box Maze advocates for structured reasoning architectures, the causal taxonomy provides language for human-AI composites, and lightweight proofs enable verifiable model-cryptography composites.
Verifiability Crisis
As models become more capable and opaque, we're losing our ability to trust them. This is being addressed at multiple levels: system-level (did you run the right model?), process-level (is the reasoning sound?), and cognitive-level (is it actually reasoning or pattern-matching?).
Statistical Baseline
Overlap Analysis:
- Total unique papers selected: 11
- Papers at 3+ agreement: 2 (expected by chance: 0.07)
- Papers at 2+ agreement: 6 (expected by chance: 1.72)
The strong consensus on the human involvement taxonomy (all 4 models) and significant 3-model agreement on resource allocation suggests these represent genuinely important developments rather than random selection artifacts.
Recommended Reading (by Agreement)
- Constitutive vs. Corrective: A Causal Taxonomy of Human Runtime Involvement in AI Systems (4/4 models) โ Essential for anyone designing human-AI systems or working in AI governance
- Regret Bounds for Competitive Resource Allocation with Endogenous Costs (3/4 models) โ Critical theory for multi-agent systems
- Behavioral Fingerprints for LLM Endpoint Stability and Identity (2/4 models) โ Practical monitoring for production LLM deployments
- Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference (2/4 models) โ Infrastructure for trustworthy AI services
Methodology: Four frontier models (Kimi K2, Claude Opus 4.6, Gemini 2.5 Pro, GPT-5) independently selected 5 papers each from 80 recent submissions across AI/ML venues. Analysis compares selection overlap against statistical baselines and synthesizes model-specific reasoning.