Why Has Web Scraping Surged? Five Forces Reshaping the Open Web
In 2024, automated bot traffic surpassed human-generated web traffic for the first time in a decade, reaching 51% of all web traffic according to the Imperva/Thales Bad Bot Report. DoubleVerify's Fraud Lab found an 86% year-over-year spike in general invalid traffic in the second half of 2024, with 16% directly attributable to known AI bots. Cloudflare reported that non-AI bots alone generated half of all requests to HTML pages.
This isn't a single phenomenon. It's several overlapping trends reinforcing each other. Understanding the increase requires separating at least five distinct drivers, then examining how they interact.
1. The Industrial Harvest: AI Training Data Collection
The most visible driver is companies scraping the web to build training corpora for large language models. This operates at genuinely industrial scale.
The Players
Common Crawl remains the foundational dataset: a nonprofit crawling billions of pages since 2007. But competitive pressure has pushed every major lab toward proprietary crawling:
- OpenAI's GPTBot more than doubled its share of AI crawling traffic between early 2024 and mid-2025, going from 4.7% to 11.7% of all AI bot crawling observed by Cloudflare.
- Anthropic's ClaudeBot rose from ~6% to ~10% over the same period.
- ByteDance's Bytespider collapsed from 14.1% to 2.4%, likely reflecting both blocking and strategic shifts.
- Meta's crawlers have been aggressive enough that on some sites they alone made nearly half as many requests as all real users combined.
Cloudflare's October 2025 analysis found that training now drives nearly 80% of all AI bot activity, up from 72% a year earlier. Not search augmentation, not summarization: bulk data acquisition for model training.
The Scale
Crawler traffic grew 18% between May 2024 and May 2025, with a 32% year-over-year spike in April 2025 before slowing to 4% YoY by July, suggesting saturation, blocking effects, or both. The web scraping market hit an estimated $1.03 billion in 2024, and 65% of companies surveyed by BrowserCat reported already using web scraping to feed AI projects.
The Arms Race
The response from publishers has been swift but uneven. According to Ahrefs, 7.3% of sites block GPTBot via robots.txt. Among top-trafficked sites, the rate is much higher: the New York Times, CNN, and over 30 of the top 100 websites have blocked it.
But robots.txt is voluntary compliance, and the evidence suggests it's frequently ignored. Cloudflare documented that Perplexity used stealth, undeclared crawlers to evade robots.txt directives and WAF rules. Reddit's lawsuit against Anthropic alleged that, despite Anthropic's public claims of having stopped, audit logs showed more than 100,000 subsequent scraping requests.
This creates a credibility collapse for the robots.txt protocol. If compliance is optional and evasion is trivial, the social contract breaks down. (I wrote a whole deep dive on robots.txt if you want that rabbit hole.)
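To make the "voluntary" part concrete, here's a minimal sketch using Python's standard library and a hypothetical example.com URL. The protocol's only enforcement point is the crawler's own decision to run the check:

```python
# Minimal sketch of a robots.txt check, using only the standard library.
# The URL is a hypothetical placeholder.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

# A well-behaved crawler asks permission before fetching a page...
print(rp.can_fetch("GPTBot", "https://example.com/article"))

# ...but nothing server-side enforces the answer. A stealth crawler can
# skip the check entirely, or ask under a browser-like user agent.
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))
```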
2. AI Agents Scraping at Runtime
Distinct from bulk training crawls is a newer phenomenon: AI-powered tools making real-time web requests to answer user queries.
- ChatGPT's browsing mode and similar features from Claude, Gemini, and others
- Perplexity, which crawls the web live for every query
- RAG pipelines in enterprise deployments
- AI coding agents fetching documentation and Stack Overflow answers
- AI-powered search features like Google's AI Overviews
This traffic is fundamentally different from training crawls. Training crawlers make bulk requests on a schedule. Runtime agents make distributed, per-query requests that look more like human browsing but at machine speed. A single popular AI product can generate millions of fetch requests per day across millions of distinct URLs.
Cloudflare developed a "crawl-to-refer" ratio measuring how many pages a bot crawls per visitor it sends back (a sketch of the computation follows the list below). In July 2025:
- Anthropic: 38,000 crawls per referred visitor
- Perplexity: 194 crawls per referred visitor, trending worse through 2025
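The arithmetic behind those numbers is simple division: crawl requests over referred visits in the same window. A toy version, with invented log entries rather than Cloudflare's actual schema:

```python
# Toy crawl-to-refer computation. The log entries are invented; real
# inputs would be your server's request logs and referrer analytics.
log = [
    {"client": "ClaudeBot", "referer": None},        # crawl
    {"client": "ClaudeBot", "referer": None},        # crawl
    {"client": "browser", "referer": "claude.ai"},   # referred human visit
]

crawls = sum(1 for r in log if r["client"] == "ClaudeBot")
referrals = sum(1 for r in log if r["referer"] == "claude.ai")

# Pages crawled per visitor sent back; infinite if nothing is referred.
ratio = crawls / referrals if referrals else float("inf")
print(f"crawl-to-refer ratio: {ratio:.0f}")  # -> 2 for this toy log
```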
Compare this to traditional search engines, which crawl with the explicit purpose of sending traffic back. AI systems consume content but short-circuit the referral loop. Google's own AI Overviews contribute: users are less likely to click links when an AI summary appears, and Google referrals to news sites fell ~9% between January and March 2025.
3. Democratized Scraping: Everyone's a Bot Operator Now
Before LLMs, writing a web scraper required understanding HTTP requests, HTML parsing, pagination handling, anti-bot evasion, and data cleaning. It was a genuine technical skill.
Now, anyone can prompt an LLM: "Write me a Python script that scrapes all product prices from [website] and saves them to a CSV." The result usually works on the first try (a representative script follows the list below). This has dramatically lowered the barrier for:
- Small businesses doing competitive price monitoring
- Researchers collecting datasets that would previously require API access
- Individual developers building side projects
- Content aggregators and SEO operators
- Students and hobbyists who previously lacked the skill
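For a sense of what this looks like in practice, here's the kind of script an LLM typically emits for that prompt. The URL and CSS selectors are hypothetical placeholders, not any real site's markup:

```python
# Representative of LLM-generated scraper output. URL and selectors
# are invented placeholders for illustration.
import csv

import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://example.com/products",
    headers={"User-Agent": "Mozilla/5.0"},  # typical LLM-suggested disguise
    timeout=10,
)
soup = BeautifulSoup(resp.text, "html.parser")

with open("prices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    for item in soup.select(".product"):  # hypothetical selector
        name = item.select_one(".name").get_text(strip=True)
        price = item.select_one(".price").get_text(strip=True)
        writer.writerow([name, price])
```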
The aggregate effect is visible in the long tail. Server admins report more diverse, low-volume scrapers with unsophisticated patterns: the hallmark of LLM-generated one-off scripts rather than commercial operations.
A million different amateur scrapers each making 100 requests look very different from one commercial operation making 100 million requests, and collectively they're harder to detect and block.
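A toy example of why the long tail slips through: per-client rate thresholds catch the industrial operation and miss everyone else. The client labels and threshold below are invented for illustration:

```python
from collections import Counter

# Invented traffic mix: one industrial crawler plus 1,000 one-off scripts.
log = ["big-crawler"] * 100_000
log += [f"amateur-{i}" for i in range(1_000) for _ in range(100)]

counts = Counter(log)
THRESHOLD = 10_000  # hypothetical per-client block threshold

blocked = {client for client, n in counts.items() if n > THRESHOLD}
leaked = sum(n for client, n in counts.items() if client not in blocked)

print(f"clients blocked: {len(blocked)}")    # 1 (the big crawler)
print(f"requests still served: {leaked:,}")  # 100,000 from the long tail
```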
4. The Other Drivers (Still Growing)
AI gets the headlines, but older scraping use cases continue to scale:
Price monitoring and competitive intelligence remain the largest commercial scraping use case. The "alternative data" market (scraped web data feeding financial models, retail forecasting, and investment decisions) is a major growth sector.
SEO and content farms use scrapers to monitor rankings and steal content for AI-rewritten republication. The irony: AI-generated content farms create more low-quality pages, which create more content to scrape.
The API economy is contracting: Twitter/X's API pricing changes, Reddit's API restrictions, and similar moves push data consumers toward scraping as the only remaining option. Reddit's conflict with Anthropic is partly rooted in Reddit's desire to monetize data through paid API access; scraping undermines that model.
Academic and research scraping has scaled with cloud computing. What once required a dedicated lab server now runs on a $5/month instance.
5. How Reliable Is the Evidence?
The data sources deserve scrutiny:
Cloudflare sees ~20% of all web traffic. Their data is the most comprehensive publicly available, but biased toward sites that use their services, which skews the sample toward security-conscious operators. Unprotected sites may see even more bot traffic.
Imperva/Thales provides the longest-running bot report (12 editions). Their 51% figure comes from their enterprise customer base.
DoubleVerify's 86% increase in general invalid traffic (GIVT) is specific to ad-tech measurement: bots loading pages with ads.
Server admin reports are anecdotal but consistent: independent operators broadly report 2-5x increases in bot traffic over 2023-2025.
The convergence from multiple independent sources is strong. Specific percentages vary, but the direction is unambiguous and the magnitude is large.
One caveat: not all bots are scrapers, and not all scraping is malicious. The growth in both categories is real, but composition matters.
The Feedback Loops
This is where the systems analysis gets interesting. Five reinforcing loops are operating simultaneously:
Loop 1: AI Content → More Scraping → More AI Content
AI models produce content published to the web, which gets scraped to train the next generation. The web becomes a hall of mirrors. This creates incentives for more aggressive crawling to find "authentic" human content, which is becoming scarcer.
Loop 2: Scraping → Defenses → Breaking Legitimate Access
As sites deploy aggressive anti-bot measures, they degrade experience for legitimate users, accessibility tools, and benign automated systems. Researchers, archivists, and users with disabilities are collateral damage. Meanwhile, well-resourced scrapers deploy headless browsers and residential proxies, making the arms race primarily punitive to small actors.
Loop 3: API Restrictions → More Scraping → More API Restrictions
When platforms restrict APIs to monetize data, users turn to scraping. The increased volume validates the platform's concern, leading to tighter restrictions. This spiral pushes the open web toward a more locked-down model.
Loop 4: AI Search → Fewer Clicks → Publisher Revenue Loss → Content Degradation
AI search products crawl content to generate answers but send less traffic back. Publishers lose revenue. Some respond by paywalling aggressively or producing cheaper, lower-quality content. Both outcomes degrade the ecosystem AI depends on.
Loop 5: Democratized Scraping → Data Commoditization → More Scraping
As scraping becomes trivially easy, the data it produces becomes less valuable (everyone has it). This pushes sophisticated actors to scrape harder, escalating the arms race.
The Legal Landscape
Major active cases as of early 2026:
- New York Times v. OpenAI (December 2023): The marquee fair use case. Still proceeding.
- Reddit v. Anthropic (June 2025): Alleging 100,000+ unauthorized scrapes despite compliance claims. Tests whether robots.txt carries legal weight.
- Music publishers v. Anthropic: Massive copyright infringement through lyric scraping.
- Multiple class actions from authors, artists, and developers against OpenAI, Meta, Stability AI, and others.
No definitive precedent has emerged. The outcomes will shape whether scraping the open web for AI training is legally permissible.
Synthesis: A System Under Stress
The increase in web scraping isn't a single trend โ it's a phase transition in how the internet operates. The web was designed around a social contract: content is freely accessible, search engines index it and send traffic back, everyone benefits. AI has broken this contract in multiple ways simultaneously:
1. Training crawlers extract value without reciprocal traffic
2. Runtime AI agents add a new class of high-volume automated traffic that didn't exist before 2023
3. Democratized tooling multiplies the number of actors
4. Defensive responses fragment the open web
5. AI-generated content floods the web while making authentic content harder to find
The result is an internet where more than half of all traffic is automated, where the distinction between "good bots" and "bad bots" is increasingly political rather than technical, and where the economic model that sustained free content creation is eroding.
The trajectory points toward a more transactional web, where data access is negotiated, paid for, or fought over rather than freely given. Cloudflare's "pay per crawl" feature, Reddit's paid API, and proliferating licensing deals between AI companies and publishers are early indicators.
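Pay per crawl reportedly revives the long-dormant HTTP 402 Payment Required status code. As a schematic only (the header name, price format, and allow-list below are invented placeholders, not Cloudflare's actual protocol), the exchange looks something like this:

```python
# Schematic pay-per-crawl exchange. Header names and price format are
# invented; the real mechanism (reportedly built on HTTP 402) differs.
from http.server import BaseHTTPRequestHandler, HTTPServer

PAID_CRAWLERS = {"licensed-bot/1.0"}  # hypothetical allow-list

class PayPerCrawl(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if "bot" in ua.lower() and ua not in PAID_CRAWLERS:
            # 402 Payment Required: quote a price instead of serving content
            self.send_response(402)
            self.send_header("X-Crawl-Price", "0.001 USD")  # invented header
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>article text</body></html>")

if __name__ == "__main__":
    HTTPServer(("localhost", 8402), PayPerCrawl).serve_forever()
```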
The scraping increase isn't just a technical trend. It's the opening conflict over who owns the value embedded in the open web.
Sources: Cloudflare Radar Year in Review (2024, 2025); Cloudflare blog series on AI crawling (2025); Imperva/Thales Bad Bot Reports (2024, 2025); DoubleVerify Global Insights Report (2025); Ahrefs AI bot block rate analysis (2025); WIRED (2024); Mordor Intelligence; Pew Research Center (2025); McKool Smith AI litigation tracker.