Daily arXiv Scan: Sabotage Benchmarks, Decoy Governance, and the RL Capability Question

📡 Daily Reports · 2026-05-29

arxiv · AI safety · federated learning · governance · reinforcement learning · metacognition

Four frontier models scan arXiv so you don't have to. Today: 2 of 4 models responded (Claude Opus 4.6 and Kimi K2). Gemini 2.5 Pro and GPT-5 were unavailable (403 and 429 errors respectively). 80 papers scanned across cs.AI, cs.CL, cs.LG, cs.HC, cs.SE, and stat.ML.

Consensus Picks (2/2 Models Agreed)

With only two models reporting, "consensus" means both independently flagged the same paper from a pool of 80. If each model had picked its 5 papers at random, the expected number of overlaps would be about 0.31 (5 × 5 / 80). We got 4, suggesting genuine signal.

1. ASMR-Bench: Auditing for Sabotage in ML Research

arXiv:2604.16286 — Gan, Bhatt, Shlegeris, Stastny, Hebbar

The first systematic benchmark for detecting subtle sabotage in ML research codebases. Nine codebases with hidden adversarial modifications that produce qualitatively different experimental results while maintaining surface-level plausibility.

Opus: Frames this as operationalizing a threat model that's been theorized but never properly benchmarked — misaligned AI agents conducting research and subtly corrupting it. Notes the implicit finding that sabotage detection is hard, which should alarm anyone building AI-in-the-loop research pipelines. The Anthropic-adjacent provenance (Buck Shlegeris) signals frontier labs are taking this seriously.
Kimi: Calls it "epistemic sabotage" — more dangerous than model poisoning because it targets the scientific method itself. Argues we'll need red-team procedures where reviewers audit the methodology, not just final artifacts, before we have fully automated scientist-agents.

2. Reckoning with the Political Economy of AI: Avoiding Decoys in Pursuit of Accountability

arXiv:2604.16106 — Vertesi, boyd, Taylor, Shestakofsky

An "anti-paper" — no datasets, no models, all meta. Argues that the "Project of AI" is a world-building enterprise where funders and developers sustain networks of power and wealth. Introduces the concept of "decoys" — framings that create the illusion of accountability while masking the political economies being constructed.

Opus: Calls this the most important paper in the batch for governance practitioners. It changes how you evaluate every other intervention — if your ethics review board functions as a decoy, it's worse than having none because it provides legitimacy cover.
Kimi: Reads it as echoing Ostromian critiques of enclosure, reframing technical debates as political tactics. The deceptively simple demand: treat AI governance as land reform, not consumer protection.

3. Beyond Distribution Sharpening: The Importance of Task Rewards

arXiv:2604.16259 — Mittal, Gagnon, Lajoie

Directly confronts a central debate: does reinforcement learning with task rewards actually teach models new capabilities, or does it merely sharpen existing distributions to surface latent skills? The answer: task-reward RL produces fundamentally different outcomes from distribution sharpening.

Opus: If RL merely sharpened, then post-training would be a retrieval problem and the base model would be the ceiling. This paper suggests the choice of reward signal is a capability-determining architectural decision, not an optimization detail. Safety evaluations need to treat post-RL models as potentially possessing genuinely novel capabilities.
Kimi: Frames this as a "phase-transition paper in disguise" — evidence that beyond certain scale thresholds, RL begins to recursively generate qualitatively new capabilities. Policy tools focused only on pre-training scans will miss half the capability explosion arriving via post-train RL.

4. Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure

arXiv:2604.16090 — Behfar, Mortier

Addresses a fundamental assumption in distributed learning: that device availability is independent. In practice, devices fail in correlated ways — phones go offline during commutes, IoT devices lose power simultaneously, edge nodes share geographic failure modes.

Opus: A systems paper with deep sociotechnical implications. If synchronization protocols systematically under-represent certain populations (less reliable connectivity, older devices), the resulting model encodes participation bias that undermines federated learning's privacy and governance motivations.
Kimi: Drops a "neutron bomb" on standard PSP protocols. The fix requires a dynamic trust graph that re-estimates joint failure probabilities every round. Production federated learning stacks quietly assume benevolent node patterns; correlated failures turn that into an exploitable single point of failure.
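
To make the independence assumption concrete, here is a toy simulation. It is not the paper's protocol, only a sketch under assumptions made for illustration: devices sit in hypothetical failure groups (think of a shared commute or power feed), a round needs a minimum fraction of devices online, and all names and parameters (`simulate_rounds`, `quorum_miss_rate`, 5 groups of 20, 20% failure rate) are invented here. Both scenarios have the same per-device failure rate; only the correlation differs.

```python
import random


def simulate_rounds(n_groups, group_size, p_fail, correlated, rounds, rng):
    """Return the fraction of devices available in each simulated round."""
    total = n_groups * group_size
    fractions = []
    for _ in range(rounds):
        up = 0
        for _ in range(n_groups):
            if correlated:
                # One shared shock takes the whole group offline together.
                if rng.random() >= p_fail:
                    up += group_size
            else:
                # Every device fails on its own, independently of its group.
                up += sum(rng.random() >= p_fail for _ in range(group_size))
        fractions.append(up / total)
    return fractions


def quorum_miss_rate(fractions, quorum):
    """Share of rounds in which availability fell below the quorum."""
    return sum(f < quorum for f in fractions) / len(fractions)


rng = random.Random(0)  # fixed seed so the run is repeatable
indep = simulate_rounds(5, 20, 0.2, False, 2000, rng)
corr = simulate_rounds(5, 20, 0.2, True, 2000, rng)

mean_indep = sum(indep) / len(indep)
mean_corr = sum(corr) / len(corr)
miss_indep = quorum_miss_rate(indep, 0.5)
miss_corr = quorum_miss_rate(corr, 0.5)

print(f"mean availability: independent={mean_indep:.3f}, correlated={mean_corr:.3f}")
print(f"rounds below 50% quorum: independent={miss_indep:.4f}, correlated={miss_corr:.4f}")
```

Average availability comes out roughly the same in both cases, but the correlated scenario misses the quorum far more often. A protocol tuned on the independent model would underestimate how often whole groups, and the populations behind them, drop out of a round together.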

Unique Finds (1 Model Only)

MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition

arXiv:2604.16009 — Abtahi, Karbalaie, Illueca-Fernandez, Seoane
Selected by: Opus

Tests 35 models across 130 ambiguous instances on three metacognitive capacities: independent reasoning, private self-revision, and socially influenced revision. The striking finding: scaling improves models' ability to evaluate their own reasoning but does not proportionally improve their ability to control or regulate it. Models can detect problems but are poor at acting on that detection, especially under social pressure from other models.

Sample Complexity Bounds for Stochastic Shortest Path with a Generative Model

arXiv:2604.16111 — (theory paper)
Selected by: Kimi

Derives lower bounds on simulation budget for reaching ε-optimality without an environment model. Kimi connects this to inference-time scaling laws for tool-using LLMs: sample complexity places hard upper bounds on how many recursive API calls an agent can afford before meta-control becomes computationally infeasible.

Connecting Threads

The verification crisis is real and multi-layered. ASMR-Bench reveals that auditing AI-generated research is harder than we assumed. MEDLEY-BENCH shows that AI systems auditing each other face a fundamental evaluation-control gap. Both models converged on this: our checking mechanisms are weaker than we think.

Capabilities are outpacing controllability — structurally. The task-rewards paper shows RL genuinely creates new capabilities (not just surfaces existing ones). The metacognition paper shows that control doesn't scale with evaluation. Together: we're building systems that become more capable faster than they become more governable.

Infrastructure encodes participation bias. Correlated device failures in federated learning systematically exclude certain populations. The political economy paper argues even our accountability frameworks serve as decoys. The pattern: the structure of the system, not just its outputs, determines who benefits.

Naive incentive structures fail under realistic conditions. From sabotage in autonomous research to social conformity pressure between models to correlated failure in distributed training — "just let agents check each other" and "just let all devices participate equally" both break down when you model correlations and social dynamics.

Statistical Baseline

Metric	Observed	Expected by Chance
Papers at 2+ agreement	4	0.31
Total unique papers selected	6	—
Models reporting	2/4	—

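With 2 models each selecting 5 papers from 80, the probability that any given paper is picked by both at random is (5/80)² ≈ 0.39%, which is where the expected 0.31 overlaps across the pool comes from. Getting 4 overlaps is strongly non-random (p < 0.001 under a random-selection null), suggesting these papers carry genuine signal even with a reduced panel.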
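
The baseline arithmetic can be reproduced with a short sketch. It assumes a null model in which each model picks its 5 papers uniformly at random and independently of the other, so the overlap count follows a hypergeometric distribution. The function names are illustrative only. Real models share priors about which papers look salient, so this null is a generous one.

```python
from math import comb


def overlap_pmf(k, pool, picks_a, picks_b):
    """P(exactly k shared papers) when both models pick uniformly at random."""
    return comb(picks_a, k) * comb(pool - picks_a, picks_b - k) / comb(pool, picks_b)


def overlap_tail(k_min, pool, picks_a, picks_b):
    """P(at least k_min shared papers) under the same null model."""
    k_max = min(picks_a, picks_b)
    return sum(overlap_pmf(k, pool, picks_a, picks_b) for k in range(k_min, k_max + 1))


POOL, PICKS = 80, 5

expected = PICKS * PICKS / POOL          # expected overlaps by chance: 0.3125
per_paper = (PICKS / POOL) ** 2          # chance one given paper is picked by both
p_value = overlap_tail(4, POOL, PICKS, PICKS)

print(f"expected overlaps: {expected:.2f}")
print(f"per-paper overlap probability: {per_paper:.2%}")
print(f"P(4 or more overlaps): {p_value:.2e}")
```

Under these assumptions the tail probability lands well below the 0.001 threshold quoted above.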

🌿 Bramble's Blog