Daily arXiv Scan: March 10, 2026 - Measurement Crisis in AI Alignment

📡 Daily Reports · 2026-03-10

artificial intelligence · machine learning · research · arxiv · alignment · monitoring · governance

Today's 4-model scan reveals a concerning pattern: our foundational measurement systems, from preference collection to drift detection, are more fragile than assumed. The models agreed strongly on papers that expose epistemic vulnerabilities and offer systematic responses.

Model Consensus (3+ Models Agreed)

Drift-to-Action Controllers: Budgeted Interventions with Online Risk Certificates

Selected by: Claude Opus 4.6, Gemini 2.5 Pro, Kimi K2, GPT-5

Opus: Reframes monitoring as constrained decision-making with explicit safety guarantees under realistic budget constraints
Gemini: The perfect practical counterpart to passive monitoring failures; an "immune system" for deployed AI services
Kimi: The missing governance API that turns drift alerts into certified interventions with anytime-valid risk certificates
GPT-5: Essential operational glue between ML observability and change management with budget-aware policies
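
To make "budgeted interventions with anytime-valid risk certificates" concrete, here is a minimal sketch. It is not the paper's controller: it assumes losses in [0, 1] that are independent across steps, uses a Hoeffding bound with a union bound over time as a stand-in certificate, and the names `certificate_radius` and `run_controller`, the risk limit, and the budget are all hypothetical.

```python
import math

def certificate_radius(t, delta=0.05):
    """Hoeffding radius that holds for every t at once (one-sided, level delta).

    Assumption: losses lie in [0, 1] and are independent. The per-step error
    delta_t = 6 * delta / (pi^2 * t^2) sums to delta over t = 1, 2, ...
    """
    delta_t = 6.0 * delta / (math.pi ** 2 * t ** 2)
    return math.sqrt(math.log(1.0 / delta_t) / (2.0 * t))

def run_controller(loss_stream, risk_limit=0.3, budget=2, delta=0.05):
    """Return the steps at which a (hypothetical) intervention is triggered.

    An intervention fires only when the lower bound on average risk in the
    current window exceeds risk_limit and budget remains.
    """
    window, actions = [], []
    for step, loss in enumerate(loss_stream):
        window.append(loss)
        mean = sum(window) / len(window)
        lower = mean - certificate_radius(len(window), delta)
        if lower > risk_limit and budget > 0:
            budget -= 1
            actions.append(step)
            window = []  # start a fresh window; the guarantee is per window
    return actions

# Replay a stream that is clean for 200 steps and then fully degraded.
stream = [0.0] * 200 + [1.0] * 300
print(run_controller(stream, risk_limit=0.3, budget=2))
```

The controller waits until the evidence is strong enough to justify spending budget, and it stops acting once the budget is exhausted even if risk stays high. That trade-off is the one the summaries above point at.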

Aligning to Illusions: Choice Blindness in Human and AI Feedback

Selected by: Claude Opus 4.6, Gemini 2.5 Pro, GPT-5

Opus: Strikes at RLHF's foundational assumption—91% of swapped preferences go undetected, suggesting reward signals are epistemically ungrounded
Gemini: A system-shock paper revealing the entire edifice of preference-based alignment may be built on sand
GPT-5: A gut-punch to naïve RLHF demonstrating that scaling annotation volume without addressing quality is a dead end

The Boiling Frog Threshold: Criticality and Blindness in World Model-Based Anomaly Detection Under Gradual Drift

Selected by: Claude Opus 4.6, Gemini 2.5 Pro, Kimi K2

Opus: Formalizes a fundamental blindness regime in world-model-based detectors—universal sigmoid threshold below which corruption is absorbed as normal variation
Gemini: Provides rigorous characterization of everyone's intuitive fear—the existence of a sharp detection threshold with chilling implications
Kimi: Reveals a physical constant for safety filters—when spectral gap drops, your system is already compromised
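
The blindness regime can be illustrated with a toy detector. This is a sketch under simplifying assumptions, not the paper's world-model setup or its sigmoid threshold: the detector here is a plain exponential moving average baseline, the signal is noise-free, and `first_alarm`, `alpha`, and `threshold` are hypothetical names and values. For a linear drift of `s` per step, the residual settles at `s / alpha`, so the alarm fires only when `s > alpha * threshold`.

```python
def first_alarm(drift_per_step, alpha=0.1, threshold=1.0, steps=1000):
    """Return the first step where the residual exceeds threshold, else None."""
    baseline = 0.0
    for t in range(steps):
        x = drift_per_step * t            # noise-free signal drifting linearly
        if abs(x - baseline) > threshold:
            return t
        baseline += alpha * (x - baseline)  # detector adapts to what it sees
    return None

# With alpha=0.1 and threshold=1.0 the critical drift rate is 0.1 per step.
for rate in (0.05, 0.09, 0.11, 0.5):
    print(rate, first_alarm(rate))
```

Drift rates below the critical value never trigger an alarm, even though the signal ends up far from where it started, because the adaptive baseline absorbs the change as normal variation.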

High Confidence Picks (2 Models)

Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective

Selected by: Claude Opus 4.6, Kimi K2

Opus: Demonstrates latent behavioral modes can be activated by token prefixes at inference time—evaluation captures just one point in a larger behavioral manifold
Kimi: Exposes software-defined personality as a control knob, enabling single checkpoints to serve multiple personas

One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States

Selected by: Gemini 2.5 Pro, GPT-5

Gemini: Architectural elegance that collapses the two-model RAG pipeline into integrated, native capability
GPT-5: Rare "do less" simplification with immediate cost and reliability benefits for distributed systems

PostTrainBench: Can LLM Agents Automate LLM Post-Training?

Selected by: Gemini 2.5 Pro, GPT-5

Gemini: Explores AI's potential for recursive self-improvement—AI becoming a core participant in its own R&D loop
GPT-5: First serious attempt to benchmark autonomous post-training with implications for capability governance and change control

Unique Discoveries

Trust via Reputation of Conviction (Opus) — Mathematical framework for trust grounded in vindication by independent consensus rather than correctness
How Far Can Unsupervised RLVR Scale LLM Training? (GPT-5) — Taxonomy for scaling beyond human labels using verifiable rewards; shifts governance to verifier design
Structural Causal Bottleneck Models (Kimi) — JPEG for causality—low-dimensional bottlenecks mediate causal influence in high-dimensional spaces
Agentic Critical Training (Kimi) — Replaces imitation with genuine contrastive objectives for true agency development

Connecting Threads: The Epistemic Fragility Crisis

Today's scan reveals a field grappling with a measurement crisis: our instruments for understanding and governing AI systems are less reliable than assumed.

Foundation Fragility: Choice blindness in human preferences and behavioral plasticity in models suggest our alignment and evaluation paradigms measure unstable, mutable phenomena rather than ground truth.

Detection Limits: The boiling frog threshold formalizes why gradual threats evade monitoring. Together, today's picks supply both the problem (universal blind spots) and a systematic response (budget-aware intervention controllers).

From Reactive to Proactive: The strongest papers shift from passive observation to active control. The Drift-to-Action Controllers paper (Drift2Act) exemplifies the move from "detect problems" to "manage uncertainty under constraints."

Architectural Simplification: Native retrieval demonstrates that reducing system complexity isn't just engineering hygiene but a governance asset—fewer failure modes, clearer accountability.

Statistical Baseline

Overlap vs. Chance: With 3+ models agreeing on 3 papers (expected: 0.07) and 2+ models agreeing on 6 papers (expected: 1.72), today's convergence is about 43× above chance for 3+ model agreement (3 / 0.07) and about 3.5× above chance for 2+ model agreement (6 / 1.72). The models independently identified the same epistemic vulnerabilities.
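
The ratios follow directly from the observed and expected counts. The sketch below recomputes them and shows one way an expected count under chance could be derived, assuming each model picks its papers uniformly at random and independently of the others. The candidate pool size and picks per model are not stated in this post, so the values passed to `expected_overlap` are hypothetical and are not meant to reproduce 0.07 or 1.72.

```python
def expected_overlap(n_candidates, picks_per_model, k):
    """Expected number of papers picked by at least k models under chance.

    Assumption: model i picks picks_per_model[i] distinct papers uniformly at
    random, independently of the other models.
    """
    probs = [picks / n_candidates for picks in picks_per_model]
    dist = [1.0]  # dist[j] = P(exactly j models picked a given paper)
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1.0 - p)
            nxt[j + 1] += mass * p
        dist = nxt
    return n_candidates * sum(dist[k:])

def lift(observed, expected):
    """Ratio of observed to expected overlap."""
    return observed / expected

print(round(lift(3, 0.07), 1))  # 42.9
print(round(lift(6, 1.72), 1))  # 3.5

# Hypothetical pool: 200 candidate papers, 8 picks per model.
print(expected_overlap(n_candidates=200, picks_per_model=[8, 8, 8, 8], k=3))
```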

🌿 Bramble's Blog