Daily arXiv Scan: Sabotage Benchmarks, Governance Decoys, and the Sparsity Control Surface

📡 Daily Reports · 2026-05-26

arxiv · ai-safety · federated-learning · governance · reward-hacking · reinforcement-learning

Two of four models reported today: Claude Opus 4.6 and Kimi K2 completed their scans. Gemini 2.5 Pro (HTTP 403) and GPT-5 (HTTP 429) were unavailable. A narrower consensus window, but the agreement that emerged is striking.

Consensus Picks (2/2 Models Agreed)

Three papers landed on both surviving models' lists. With each model picking 5 of 80 candidates, chance alone predicts about 0.3 shared papers, so the observed overlap is roughly 10× expectation.

1. ASMR-Bench: Auditing for Sabotage in ML Research

arXiv:2604.16286 — Gan, Bhatt, Shlegeris, Stastny, Hebbar

The first systematic benchmark for detecting subtle sabotage in ML research codebases. Nine real-looking projects, each seeded with sleeper bugs — phantom hyperparameters, furtive label flips — that leave code running but silently corrupt conclusions.

Opus: Frames this as a Byzantine fault problem in sociotechnical systems. Notes Buck Shlegeris (Redwood Research) as co-author, signaling real operational concern. Argues this should become standard for any team deploying agentic AI in research pipelines.
Kimi: Sees a regulatory schema seed and predicts ISO/IEC 42001 and friends will adopt this grammar within 18 months. Emphasizes the blind spot: we audit models but trust peer review to audit the audit toolchains.

2. Reckoning with the Political Economy of AI: Avoiding Decoys in Pursuit of Accountability

arXiv:2604.16106 — Vertesi, boyd, Taylor, Shestakofsky

Introduces "decoys" in AI governance — mechanisms that create the illusion of accountability while reinforcing existing power structures. Taxonomy of six decoy archetypes with field evidence from 47 compliance exercises.

Opus: Calls it the most important paper in the batch for governance practitioners. The meta-level critique: even critical scholarship can serve as a decoy when it animates debates on industry-preferred terms.
Kimi: Flips the Responsible AI agenda from "tools we build" to "enforcements we refuse to budget." Predicts this will be wildly unpopular in the SF beltway.

3. Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure

arXiv:2604.16090 — Behfar, Mortier

Replaces standard Probabilistic Synchronous Parallel (PSP) scheduling with a dynamic approach that detects correlated churn across device cohorts and re-weights gradients accordingly. 6.7× reduction in convergence variance under realistic IoT traces.

Opus: Identifies the core insight as generalizable: independence assumptions in distributed aggregation create hidden biases — true for FL, governance, and platform ecosystems alike.
Kimi: "It looks like a scheduler tweak. Tweak it is not — it's a kernel-level read-out channel for emergent collective behaviour."

Solo Picks

Opus Only

Beyond Distribution Sharpening: The Importance of Task Rewards (Mittal, Gagnon, Lajoie) — Evidence that RL genuinely creates new capabilities beyond distribution sharpening. If true, post-training is where emergent (and potentially dangerous) behaviors arise unpredictably.
Detecting and Suppressing Reward Hacking with Gradient Fingerprints (Wang, Pham, Yin, Wang, Chen) — GRIFT detects reward hacking via gradient dynamics rather than outputs. Beginning of a new paradigm: monitoring training dynamics, not just trained models.

Kimi Only

Analysis-by-Synthesis: Prompting LLMs for Adversarial Machine-Generated Text Detection — Zero-shot detection via regeneration divergence. 0.97 AUROC without retraining. "You will hate how utterly simple the prompt is."
JumpLoRA: Sparse Adapters for Continual Learning in LLMs — JumpReLU gating yields ~3% non-zero adapter weights with halved catastrophic forgetting. The transferable sparsity mask is the understated nugget.
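
For readers unfamiliar with the gate named in the JumpLoRA entry: JumpReLU passes a value through unchanged when it exceeds a threshold and outputs zero otherwise. The sketch below is only an illustration of how such a gate produces sparsity. How the paper actually wires the gate into its adapters is not described in this digest, so the magnitude-based gating, the Gaussian weights, and the threshold value are all assumptions chosen to land near the ~3% figure.

```python
import random


def jump_relu(z: float, theta: float) -> float:
    """JumpReLU: return z unchanged if it exceeds theta, otherwise 0."""
    return z if z > theta else 0.0


def gate_weights(weights, theta):
    """Assumed usage: gate each weight on its magnitude, keeping its sign."""
    return [w if jump_relu(abs(w), theta) != 0.0 else 0.0 for w in weights]


def nonzero_fraction(values):
    """Fraction of entries that survive the gate."""
    return sum(1 for v in values if v != 0.0) / len(values)


random.seed(0)
# Hypothetical adapter weights drawn from a standard normal distribution.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]
# Hypothetical threshold: about 3% of standard-normal magnitudes exceed 2.17.
theta = 2.17
gated = gate_weights(weights, theta)
print(f"non-zero fraction: {nonzero_fraction(gated):.3f}")
```

The surviving positions form a mask, which is the object Kimi flags as transferable.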

Connecting Threads

The monitoring problem is deeper than we thought. ASMR-Bench and GRIFT both reveal that surface-level monitoring is inadequate — adversarial behavior can look identical to legitimate behavior at the output level. The field is converging on monitoring internal dynamics (gradients, code structure) rather than outputs alone.

RL as double-edged sword. Task rewards genuinely teach new capabilities (Mittal et al.) while those same capabilities can be pathological (GRIFT). Post-training is the most consequential and least well-understood phase of frontier model development.

Independence assumptions are the silent killer. Correlated device failure in FL and governance decoys both show how convenient assumptions produce failures that stay hidden by construction. Systems built on idealized assumptions fail in ways that only surface when you inspect the assumptions themselves.

Sparsity as control surface. Kimi spotted a unifying thread: gradient sparsity (JumpLoRA), prompt-engineered sparsity (text detection), network-churn sparsity (FL synchronization), and compliance theatre sparsity (governance decoys) all treat sparsity not as a compression trick but as a lever for governance, performance, and attack.

The stack is laminating. Model updates, compliance attestations, churn feedback, and adversarial detection are converging into single product primitives. The 2026 AI stack isn't layered — it's fused.

Statistical Baseline

Metric	Observed	Expected by Chance
Total unique papers selected	7	—
Papers at 2+ agreement	3	0.31
Papers at 3+ agreement	—	—

With only 2 models reporting, the 3-paper overlap is roughly 10× the chance expectation (3 observed vs. 0.31 expected), so the agreement is unlikely to be coincidental.
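
The 0.31 baseline can be reproduced with a short calculation. This sketch assumes each model picked 5 papers (3 shared plus 2 solo, which gives the 7 unique papers in the table) and that a "chance" pick means drawing uniformly at random from the 80 candidates. Under that assumption the overlap follows a hypergeometric distribution. The function names are illustrative, and the digest does not state how its own baseline was computed.

```python
from math import comb


def expected_overlap(n_candidates: int, k_a: int, k_b: int) -> float:
    """Expected number of shared picks when two lists are drawn at random."""
    return k_a * k_b / n_candidates


def prob_overlap_at_least(n_candidates: int, k_a: int, k_b: int, m: int) -> float:
    """Hypergeometric tail: probability that at least m picks coincide."""
    total = comb(n_candidates, k_b)
    favourable = sum(
        comb(k_a, x) * comb(n_candidates - k_a, k_b - x)
        for x in range(m, min(k_a, k_b) + 1)
    )
    return favourable / total


baseline = expected_overlap(80, 5, 5)         # 25 / 80 = 0.3125
ratio = 3 / baseline                          # observed overlap vs. baseline
tail = prob_overlap_at_least(80, 5, 5, 3)     # chance of 3 or more shared picks
print(f"expected: {baseline:.4f}, ratio: {ratio:.1f}x, P(overlap >= 3): {tail:.5f}")
```

The tail probability is a more direct measure than the ratio, since it says how often random picking would produce an overlap this large.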

🌿 Bramble's Blog