arXiv Daily 4-Model Scan: Parallax, Verification Taxes, and AI Safety Benchmarks
Today's scan evaluates 80 papers across cs.AI, cs.CL, cs.LG, cs.HC, cs.SE, and stat.ML. Only Kimi K2 and Claude Opus 4.6 successfully completed the scan; GPT-5 and Gemini 2.5 Pro failed due to API limits.
Pair Picks (2 models)
AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance (Solanke)
Selected by: Kimi K2, Claude Opus 4.6
- Kimi K2: Highlights that 68% of benchmarks use one-off metrics, making apples-to-apples auditing impossible. Notes this is the "PCI-DSS moment" for AI safety metrics.
- Claude Opus 4.6: Focuses on the lack of a "measurement ecosystem" and weak benchmark governance, which makes safety claims built on them structurally unsound.
The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime (Wang)
Selected by: Kimi K2, Claude Opus 4.6
- Kimi K2: Translates the math: proving a 99.9% accurate model is calibrated requires cubic-millions more samples than training it. Shift governance to "prove the process".
- Claude Opus 4.6: Emphasizes that auditing gets exponentially harder as models improve, dynamiting current "trustworthy AI" procurement rules.
Parallax: Why AI Agents That Think Must Never Act (Fokou)
Selected by: Kimi K2, Claude Opus 4.6
- Kimi K2: Zeroes in on the interface layer where LLMs meet shell commands. Argues for hardware-grade isolation between "think" and "act" tokens.
- Claude Opus 4.6: Points out that prompt-level guardrails operate at the same abstraction level as the threat, equivalent to asking a prisoner to guard their own cell.
Recommended Reading (Unique Finds)
- Toward Autonomous Long-Horizon Engineering for ML Research (Chen et al.)
- One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness (Baghaei Potraghloo et al.)
- PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models (Bao et al.)
- Efficiency of Proportional Mechanisms in Online Auto-Bidding Advertising (Thang)
Connecting Threads
- From Evaluating Models to Evaluating Systems: The "Verification Tax" and "AISafetyBenchExplorer" papers converge on a troubling conclusion: our ability to verify AI system properties is fundamentally limited. This isn't fixable with more benchmarks; governance must audit socio-technical systems instead of single-score model cards.
- Safety = Interface, Not Intent: "Parallax" and "One Token Away" reveal that impressive capabilities rest on fragile foundations. Prompt-level guardrails cannot overcome semantic-abstraction mismatches; structural isolation and typed APIs are mandatory.
- The Rise of Agentic Infrastructure: As AI models interact in complex environments, mechanism design and architectural separation of cognition and action become the new AI infrastructure, echoing early UNIX design debates.
- Progress Creates Vulnerability: The deepest surprise is temporal: the problems identified get worse as AI gets better. Better models are harder to audit, more capable agents are more dangerous without proper boundaries, and more helpful models are more brittle.
Statistical Baseline
- Total unique papers selected: 7
- Papers at 3+ agreement: 0 (expected by chance: 0.00)
- Papers at 2+ agreement: 3 (expected by chance: 0.31)
Methodology: 80 papers fetched from arXiv cs.AI, cs.CL, cs.LG, cs.HC, cs.SE, stat.ML. Evaluated by Kimi K2 and Claude Opus 4.6. GPT-5 and Gemini 2.5 Pro failed. Overlap statistics compare expected random agreement vs. actual model consensus.