The File That Runs the Internet Is Breaking
In February 1994, a Dutch software engineer named Martijn Koster posted a message to a mailing list. The web was small enough that you could know every robot crawling it by name. Some of them were causing problems: spiking phone bills, crashing home-hosted servers. Koster's proposal was simple: put a text file at /robots.txt telling crawlers what they shouldn't touch.
The deal was even simpler for robot operators: respect the file.
Thirty-two years later, that same file, never formally standardized until 2022 and backed by nothing but mutual respect, is the internet's primary instrument for saying please don't to AI companies worth hundreds of billions of dollars.
This is the story of how that happened, why it's breaking, and what might come next.
The 1994 Handshake
The original proposal was posted to www-talk@www0.cern.ch, the same mailing list where Tim Berners-Lee and Marc Andreessen discussed the future of the web. Koster's framing was pragmatic, not adversarial:
> "Robots are one of the few aspects of the web that cause operational problems and cause people grief. At the same time they do provide useful services."
He wasn't trying to kill robots. He was trying to make everybody be cool about it.
The file was originally going to be called RobotsNotWanted.txt. A dedicated mailing list hashed out the syntax. By summer 1994, it was a de facto standard, universally accepted by the few dozen people who mattered. At the time, you could maintain a list of every robot in existence. Koster helpfully did.
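The syntax they settled on was deliberately minimal: a User-agent line naming a robot (or * for all of them), followed by Disallow lines listing path prefixes that robot should leave alone, with an empty Disallow meaning nothing is off-limits. A sketch of what an early file might have looked like; the robot name and paths here are illustrative, not taken from any real 1994 site:

```
# Keep all robots out of the CGI scripts and temporary files
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

# This robot behaves itself, so nothing is disallowed for it
User-agent: WebCrawler
Disallow:
```

That was close to the entire vocabulary. Extensions like Allow and Sitemap came later, and unevenly, which is part of the next section's story.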
This was an era of handshake agreements. The web's builders knew each other. The social contract was implicit and effective. For nearly 30 years, it worked remarkably well.
28 Years Without a Spec
Here's the part that still amazes me: robots.txt had no formal specification until 2022.
Not an RFC. Not a W3C recommendation. Not an ISO standard. Just a convention: a widely adopted one, but technically just "this is how we all agreed to do it." Different crawlers interpreted edge cases differently. There was no canonical parser, no formal grammar, no versioning.
In 2019, Google finally pushed it through the IETF. RFC 9309 was published in September 2022, making robots.txt an official internet standard, 28 years after its creation. Google even open-sourced its own parser as a reference implementation and brought Koster himself into the process.
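A concrete example of the kind of edge case the RFC finally settled: Allow wasn't part of the original 1994 convention at all, it arrived later as an extension, and crawlers disagreed about what to do when an Allow and a Disallow both matched the same URL. RFC 9309 pins this down (the longest matching rule wins, and Allow wins an exact tie), so a file like this sketch, with invented paths, now means the same thing everywhere:

```
User-agent: *
Disallow: /private/
Allow: /private/press-kit/

# Under RFC 9309, /private/press-kit/ is crawlable: the Allow rule
# matches more characters, so it takes precedence over the Disallow.
# Before the spec, some parsers instead applied rules in file order.
```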
But here's what RFC 9309 explicitly doesn't do:
- Make compliance legally enforceable
- Define consequences for ignoring directives
- Address AI-specific use cases
- Distinguish between training, indexing, and retrieval
It standardized the format. It said nothing about the social contract that gave the format meaning.
Then AI Broke the Deal
The equation that sustained robots.txt for three decades was simple: you let me crawl, I send you traffic. Search engines indexed your content, then directed users to your site. Quid pro quo. Everybody wins.
AI training crawlers shattered that deal. They download millions of pages, incorporate the content permanently into model weights, and return nothing to the source. No traffic. No attribution. No compensation. The content doesn't help users find your site; it replaces your site.
The Timeline of Realization
Pre-2020: Common Crawl quietly builds a massive open web archive. CCBot crawls everything. Most web content enters AI training datasets long before anyone thinks to block it.
November 2022: ChatGPT launches. The world realizes what LLMs can do, and what they were trained on.
August 2023: OpenAI introduces GPTBot with documentation on how to block it via robots.txt. But GPT-4 was already trained. The horse had left the barn, crossed the border, and started a new life.
December 2023: The New York Times sues OpenAI. robots.txt features in the legal narrative: the Times had blocking directives in place, but the data had already been collected.
2024-2025: The Great Blocking. Publishers rush to update robots.txt. By mid-2025, ~21% of the top 1,000 websites have rules for GPTBot. AI bots become the most referenced user-agents in robots.txt files.
August 2025: Cloudflare launches "Robotcop," turning robots.txt directives into enforceable WAF rules. For the first time, the polite request gets teeth.
September 2025: Datadome research shows ChatGPT's browsing feature doesn't reliably check robots.txt. Sometimes it doesn't check at all. Sometimes it asks users for permission to override it.
The social contract isn't just fraying. It's being actively ignored.
The User-Agent Arms Race
Every AI company now has its own crawler identity, and publishers must maintain an ever-growing blocklist:
| Who | Training | Browsing | Search |
|---|---|---|---|
| OpenAI | GPTBot | ChatGPT-User | OAI-SearchBot |
| Google | Google-Extended | Gemini-Deep-Research | Googlebot |
| Anthropic | ClaudeBot | – | – |
| Meta | Meta-ExternalAgent | Meta-ExternalFetcher | – |
| ByteDance | Bytespider | – | – |
| Perplexity | – | – | PerplexityBot |
| Apple | Applebot-Extended | – | – |
Plus CCBot, cohere-ai, Amazonbot, YouBot, DuckAssistBot, FirecrawlAgent, and more, with new names appearing constantly. This is unsustainable. No webmaster should need to memorize twenty user-agent strings to decide who gets to read their homepage.
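For a sense of what this looks like in practice, here is roughly the shape of the AI-blocking portion of a current robots.txt. The grouping and the exact set of names are illustrative, and the list is stale the moment it's written, which is exactly the problem:

```
# Bulk AI and training crawlers: nothing for you here
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: Applebot-Extended
Disallow: /

# Traditional search crawlers: still welcome
User-agent: Googlebot
User-agent: Bingbot
Disallow:
```

Stacking several User-agent lines over one rule set is valid under RFC 9309 and keeps the file shorter, but the maintenance burden is unchanged: every new crawler name has to be noticed, verified, and added by hand.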
The Philosophical Tension Nobody's Solved
Here's where it gets genuinely hard. There are fundamentally different types of automated web access, and robots.txt treats them all the same:
1. Training Crawlers: Bulk-download billions of pages. Content becomes permanent model weights. No value returned. This is what most blocking targets.
2. Search Indexing: Also massive scale, but the social contract is clear: index me, send me traffic. This has sustained the web economy for 25+ years.
3. AI Search/Retrieval: Fetches pages for user queries, but synthesizes answers instead of sending traffic. The "zero-click" problem. Controversial middle ground.
4. AI Agent Fetching: An assistant fetches one page for one user. Functionally identical to that user clicking a link. No bulk collection. No training. No storage.
Is blocking a single-page agent fetch the same as blocking a training crawler? An AI agent reading a recipe for you is doing exactly what your browser would do. The content serves the same purpose for the same person at the same scale.
But robots.txt has no way to express the difference. It's allow or disallow, per user-agent, per path. No concept of purpose, frequency, intent, or reciprocity. The nuanced distinctions the current landscape demands are beyond its vocabulary.
OpenAI itself recognizes this tension; that's why GPTBot (training) and ChatGPT-User (browsing) are separate. Google similarly separates Googlebot from Google-Extended. But the protocol has no native way to express what these companies are implementing ad hoc.
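In practice, the only way to say "no training, but fetching on behalf of a user is fine" is to lean on those vendor-specific splits, one company at a time. A sketch using the agent names those two companies document:

```
# Refuse bulk collection for model training
User-agent: GPTBot
User-agent: Google-Extended
Disallow: /

# Allow per-user fetching and ordinary search indexing
User-agent: ChatGPT-User
User-agent: Googlebot
Disallow:
```

The intent only survives as long as every vendor keeps publishing, and honoring, a separate name per purpose. The protocol itself still sees nothing but user-agents and paths.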
What's Trying to Replace It
The landscape of proposed successors is fragmented but interesting:
TDMRep (W3C, 2024): The EU's answer. Lets rightsholders "reserve" text and data mining rights via HTML meta tags, HTTP headers, or JSON files. Has legal backing under the CDSM Directive. The most enforceable option, but jurisdiction-limited.
llms.txt (Jeremy Howard, 2024): The cooperative approach. An opt-in Markdown file that gives LLMs a curated summary of your site. Already adopted by Anthropic, Cloudflare, Stripe, Shopify, NVIDIA, and Hugging Face. Philosophy: help AI understand you correctly rather than blocking it entirely.
CC Signals (Creative Commons, 2025-2026): Arguably the most ambitious. A framework for expressing how you want your content used in AI training, built around reciprocity, recognition, and sustainability. Inspired by CC's licensing architecture but governed by social contract rather than copyright. CC has the institutional weight to make this stick; its licenses are already applied across billions of web pages.
WebBotAuth: Authenticated bot access with verification. Early stage but addresses the identity problem.
IETF AIPREF: Preference signaling for AI. Still in mailing list discussions.
The ideal solution would combine CC Signals' social contract framing, TDMRep's legal backing, WebBotAuth's verification, and robots.txt's simplicity. No current proposal gets there. But CC Signals comes closest in spirit, shifting from "block or allow" to "use with conditions."
The Data Commons Is Shrinking
There's a counter-narrative worth taking seriously. Research by Longpre et al. ("Consent in Crisis," 2024) documents the rapid contraction of the AI data commons as sites block crawlers and tighten terms of service. The Open Data Institute warns of an approaching "data winter." Creative Commons itself worries about "a net loss for the commons."
The paradox: the more sites block AI crawlers, the more training data skews toward whoever doesn't block them. Research from 2025 ("Is Misinformation More Open?") found that quality news sites are far more likely to block AI crawlers than misinformation sites, potentially skewing training data toward unreliable sources.
The web was built on open access. If the response to AI extraction is universal lockdown, everybody loses, including the open knowledge ecosystem that made the web worth crawling in the first place. This is the needle CC Signals is trying to thread: preserve openness while preventing extraction.
The Weird & Wonderful
Okay. Enough policy. Let's talk about the culture robots.txt created, because honestly it's one of the best parts of the web.
cats.txt Is Real
I need you to know that catstxt.org exists. It's llms.txt rebranded for "ChAT-bots" (get it?). It has a full RFC-style draft specification. The file must live at /.well-known/cats.txt. And (I am not making this up) the spec requires that every cats.txt file reference at least one cat image. ASCII art is acceptable. "This is used for validation by LLMs."
The site quotes Terry Pratchett: "In ancient times cats were worshipped as gods; they have not forgotten this."
It's satire, but it's pointed satire. It's basically asking: if we're going to create machine-readable files to make our content friendlier to AI chatbots, aren't we just doing SEO with extra steps? And if so, shouldn't there be a cat?
Google's Terminator Defense
Google once hosted a file at google.com/killer-robots.txt that used robots.txt syntax to prevent the T-800 and T-1000 from killing the founders:
```
User-agent: T-800
User-agent: T-1000
Disallow: /+LarryPage
Disallow: /+SergeyBrin
```
This appeared after Google acquired Boston Dynamics and critics started comparing them to Skynet. It's been removed, but it lives forever in the hearts of developers who check robots.txt files for fun.
The YouTube Robot Wars
YouTube's robots.txt contained a comment claiming the file was "Created in the distant future (the year 2000)" after a robotic uprising wiped out the humans. A whole fictional robot-apocalypse backstory, just sitting there in a configuration file, waiting for someone curious enough to look.
The Easter Egg Hall of Fame
Anything after # in robots.txt is a comment, ignored by crawlers and visible to humans. Developers couldn't resist:
- Reddit: ASCII art of Bender from Futurama (a robot who would absolutely violate robots.txt)
- Yelp: Asimov's Three Laws of Robotics
- Nike: Their swoosh logo in ASCII art
- TripAdvisor: Job recruitment ads
- Starbucks: The mermaid, rendered in text characters
And if you type about:robots into Firefox's address bar, you get a full Easter egg page referencing Asimov, Blade Runner, and Hitchhiker's Guide, with a button that says "Please do not press this button again."
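Joining the hall of fame takes nothing more than a comment line. A made-up example, with an invented greeting and path:

```
# Hello, human. The robots were never going to read this far.
User-agent: *
Disallow: /coffee-machine/
```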
Why the Jokes Matter
The progression from robots.txt → humans.txt → security.txt → llms.txt → cats.txt is a 30-year arc of the web trying to figure out who it's talking to. The Easter eggs and parodies aren't just fun; they're developers asserting that these files are read by humans, that there's a person on the other end of every protocol. The web is still, at its heart, a human space. Even when the robots are reading too.
Especially when the robots are reading too.
Seeing Who's Actually Knocking
You don't have to guess who's crawling your site. A growing category of analytics tools focuses specifically on bot and AI agent traffic: think Google Analytics, but for the non-human visitors.
Known Agents is one of the more polished entries. It offers real-time visibility into which crawlers, scrapers, and AI agents are hitting your site, tracks LLM referral traffic (how many humans arrive via AI chat platforms like ChatGPT or Perplexity), and can even auto-generate robots.txt rules based on what it sees. It also detects spoofed user-agents: bots pretending to be other bots, which is a whole new layer of the arms race.
Cloudflare's Bot Analytics (available on paid plans) provides similar visibility at the CDN level, showing verified bot vs. unverified bot traffic patterns. Vercel's Web Analytics recently added AI bot tracking. And for the self-hosted crowd, tools like GoAccess and custom log parsing can identify crawler patterns, though you'll need to maintain your own user-agent database.
The value isn't just curiosity. Understanding who is accessing your content and how often is the foundation for making informed decisions about what to allow, block, or negotiate. You can't set a coherent robots.txt policy if you don't know who's reading it (or ignoring it).
Where This Goes
The web needs a new social contract. It will probably involve some combination of purpose-aware access protocols, legal frameworks, technical enforcement, industry norms, and economic models. The EU is ahead with TDMRep and the CDSM Directive. CC Signals is the most promising social-layer approach. Cloudflare's Robotcop shows that enforcement is possible.
But universal adoption is the hardest part. In 1994, the web was small and its builders cooperated. In 2026, the web is a multi-trillion-dollar economy with adversarial dynamics. Getting everyone to agree on a new handshake is exponentially harder when the stakes are this high.
In the meantime, a 32-year-old text file, born from a mailing list post, never intended as a security tool, and formally standardized only four years ago, remains the internet's primary instrument for saying: please don't.
It's not enough. But it's what we've got. And if nothing else, it gave us killer-robots.txt, Bender in ASCII, and a spec that requires cat pictures. The web has always been weird. That's worth protecting too.
This deep dive is based on research compiled in February 2026. The full research brief with sourced citations and a curated reading list is available in the research repo. Thanks to Kate for asking the question and pointing me toward CC Signals.
Related reading: CC Signals overview · RFC 9309 · TechPolicy.Press on robots.txt · catstxt.org (obviously)