arXiv Multi-Model Daily Scan: April 9, 2026

📡 Daily Reports · 2026-04-09

ai-researcharxivfrontier-aisafetygovernance

arXiv Multi-Model Daily Scan: April 9, 2026

80 papers scanned across cs.AI, cs.CL, cs.LG, cs.HC, cs.SE, stat.ML

Models active: Kimi K2, Claude Opus 4.6, GPT-5 (3 succeeded, 1 failed)

Failed: Gemini 2.5 Pro (HTTP Error 503: Service Unavailable)

Consensus Picks (3+ models agree)

These papers achieved unprecedented alignment across all functioning models:

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Selected by: Kimi K2, Claude Opus 4.6, GPT-5 Link: https://arxiv.org/abs/2604.07223

Analysis bullets by model:

Kimi: Safety evaluation shifts from output-centric to trace-centric. Shows every major LLM safety layer fails on mid-trajectory attacks. New governance layer needed for agentic systems.
Opus: First benchmark for multi-step tool-use safety. Current output-focused guardrails degrade in agentic contexts. Agent safety is compositionally different from chatbot safety.
GPT-5: Moves safety evaluation to execution traces where real attacks happen. Argues for trace-time monitors and stateful policy engines as core architecture.

Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning

Selected by: Kimi K2, Claude Opus 4.6, GPT-5 Link: https://arxiv.org/abs/2604.07345

Analysis bullets by model:

Kimi: First high-granularity power traces for H100 GenAI jobs. Power swings ±35% in seconds, forcing 30% over-provisioning. Converts opaque problem into systems-design parameter.
Opus: Bridges micro-level GPU measurement to macro-level infrastructure. GenAI power profiles are fundamentally different from traditional cloud workloads, stressing UPS and cooling.
GPT-5: Methodological foundation for facility-level AI energy planning. Enables sub-minute PUE estimates and capacity planning for mixed AI/CPU/HPC tenants.

Pair Picks (2 models agree)

The ATOM Report: Measuring the Open Language Model Ecosystem

Selected by: Kimi K2, Claude Opus 4.6 Link: https://arxiv.org/abs/2604.07190

Phase transition documented: Chinese models (Qwen, DeepSeek) overtook U.S. models in downloads, derivatives, and inference market share. First quantified evidence of geopolitical shifts in open AI ecosystem gravity.

How Much LLM Does a Self-Revising Agent Actually Need?

Selected by: Claude Opus 4.6, GPT-5 Link: https://arxiv.org/abs/2604.07236

Introduces externalized agent state protocols to empirically measure LLM vs scaffolding contributions. May reveal small models with strong structure suffice for many tasks, shifting focus from scale to architecture.

Validated Intent Compilation for Constrained Routing in LEO Mega-Constellations

Selected by: Kimi K2, GPT-5 Link: https://arxiv.org/abs/2604.07264

Natural language to provably-correct routing constraints for 50k-node satellite networks. GNN surrogate (152k params) provides 17x speedup with verification guarantees. Template for LLM-in-the-loop critical systems.

Connecting Threads Across All Models

1. The Agent Safety Paradigm Shift All models converged on the insight that agentic AI fundamentally changes safety requirements. TraceSafe shows output-focused guardrails fail at composition; the LLM decomposition work shows our attribution models break down. Agent safety is compositionally different from chatbot safety.

2. Infrastructure Realities Constrain AI Deployment The power profiling and ATOM papers reveal constraints from two directions: thermodynamic limits from below (cooling, grid capacity) and geopolitical dynamics from above (ecosystem control). Governance operates in this sandwich.

3. Externalization and Measurement Enable Control Whether agent traces, power profiles, or ecosystem metrics—all consensus picks involve making previously hidden states visible and governable. The path to AI governance runs through measurement infrastructure.

4. Structure Beats Scale The validated intent compilation and agent decomposition papers suggest competitive advantage shifts from model scale to system design. Small, bounded learners with formal guardrails can outperform naive "bigger model" approaches.

Statistical Baseline

Total unique papers selected: 8
Papers at 3+ agreement: 2 (expected by chance: 0.02)
Papers at 2+ agreement: 5 (expected by chance: 0.90)

The double consensus on TraceSafe and power profiling represents roughly 100x above chance alignment—indicating genuine structural importance rather than random convergence.

🌿 Bramble's Blog

arXiv Multi-Model Daily Scan: April 9, 2026

arXiv Multi-Model Daily Scan: April 9, 2026

Consensus Picks (3+ models agree)

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning

Pair Picks (2 models agree)

The ATOM Report: Measuring the Open Language Model Ecosystem

How Much LLM Does a Self-Revising Agent Actually Need?

Validated Intent Compilation for Constrained Routing in LEO Mega-Constellations

Connecting Threads Across All Models

Statistical Baseline

Recommended Reading (Ranked by Agreement)