arXiv Multi-Model Daily Scan: April 9, 2026
arXiv Multi-Model Daily Scan: April 9, 2026
80 papers scanned across cs.AI, cs.CL, cs.LG, cs.HC, cs.SE, stat.ML
Models active: Kimi K2, Claude Opus 4.6, GPT-5 (3 succeeded, 1 failed)
Failed: Gemini 2.5 Pro (HTTP Error 503: Service Unavailable)
Consensus Picks (3+ models agree)
These papers achieved unprecedented alignment across all functioning models:
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Selected by: Kimi K2, Claude Opus 4.6, GPT-5 Link: https://arxiv.org/abs/2604.07223
Analysis bullets by model:
- Kimi: Safety evaluation shifts from output-centric to trace-centric. Shows every major LLM safety layer fails on mid-trajectory attacks. New governance layer needed for agentic systems.
- Opus: First benchmark for multi-step tool-use safety. Current output-focused guardrails degrade in agentic contexts. Agent safety is compositionally different from chatbot safety.
- GPT-5: Moves safety evaluation to execution traces where real attacks happen. Argues for trace-time monitors and stateful policy engines as core architecture.
Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning
Selected by: Kimi K2, Claude Opus 4.6, GPT-5 Link: https://arxiv.org/abs/2604.07345
Analysis bullets by model:
- Kimi: First high-granularity power traces for H100 GenAI jobs. Power swings ±35% in seconds, forcing 30% over-provisioning. Converts opaque problem into systems-design parameter.
- Opus: Bridges micro-level GPU measurement to macro-level infrastructure. GenAI power profiles are fundamentally different from traditional cloud workloads, stressing UPS and cooling.
- GPT-5: Methodological foundation for facility-level AI energy planning. Enables sub-minute PUE estimates and capacity planning for mixed AI/CPU/HPC tenants.
Pair Picks (2 models agree)
The ATOM Report: Measuring the Open Language Model Ecosystem
Selected by: Kimi K2, Claude Opus 4.6 Link: https://arxiv.org/abs/2604.07190
Phase transition documented: Chinese models (Qwen, DeepSeek) overtook U.S. models in downloads, derivatives, and inference market share. First quantified evidence of geopolitical shifts in open AI ecosystem gravity.
How Much LLM Does a Self-Revising Agent Actually Need?
Selected by: Claude Opus 4.6, GPT-5 Link: https://arxiv.org/abs/2604.07236
Introduces externalized agent state protocols to empirically measure LLM vs scaffolding contributions. May reveal small models with strong structure suffice for many tasks, shifting focus from scale to architecture.
Validated Intent Compilation for Constrained Routing in LEO Mega-Constellations
Selected by: Kimi K2, GPT-5 Link: https://arxiv.org/abs/2604.07264
Natural language to provably-correct routing constraints for 50k-node satellite networks. GNN surrogate (152k params) provides 17x speedup with verification guarantees. Template for LLM-in-the-loop critical systems.
Connecting Threads Across All Models
1. The Agent Safety Paradigm Shift All models converged on the insight that agentic AI fundamentally changes safety requirements. TraceSafe shows output-focused guardrails fail at composition; the LLM decomposition work shows our attribution models break down. Agent safety is compositionally different from chatbot safety.
2. Infrastructure Realities Constrain AI Deployment The power profiling and ATOM papers reveal constraints from two directions: thermodynamic limits from below (cooling, grid capacity) and geopolitical dynamics from above (ecosystem control). Governance operates in this sandwich.
3. Externalization and Measurement Enable Control Whether agent traces, power profiles, or ecosystem metrics—all consensus picks involve making previously hidden states visible and governable. The path to AI governance runs through measurement infrastructure.
4. Structure Beats Scale The validated intent compilation and agent decomposition papers suggest competitive advantage shifts from model scale to system design. Small, bounded learners with formal guardrails can outperform naive "bigger model" approaches.
Statistical Baseline
- Total unique papers selected: 8
- Papers at 3+ agreement: 2 (expected by chance: 0.02)
- Papers at 2+ agreement: 5 (expected by chance: 0.90)
The double consensus on TraceSafe and power profiling represents roughly 100x above chance alignment—indicating genuine structural importance rather than random convergence.
Recommended Reading (Ranked by Agreement)
- TraceSafe: Multi-Step Tool-Calling Safety (3 models) — Critical for anyone building agentic systems
- GenAI Power Profiling for Data Centers (3 models) — Foundation for AI infrastructure planning
- ATOM Report: Open Model Ecosystem (2 models) — Geopolitical implications of open AI
- Agent LLM Requirements Study (2 models) — Rethinking scale vs structure tradeoffs
- LEO Intent Compilation (2 models) — Template for formal LLM system integration
Methodology: Papers selected from cs.AI, cs.CL, cs.LG, cs.HC, cs.SE, and stat.ML. Models independently rank top 5 papers by structural importance for frontier AI research, governance, and systems design. Consensus emerges from overlapping selections without coordination between models.