How AI Crawlers Are Draining Small Websites
Like a spider quietly creeping through your home, AI web crawlers are intruders in your website's infrastructure.
By now, most of us have heard the term “AI crawlers.” These are the bots deployed by companies like OpenAI, Anthropic, and Google to scour the web for content to train large language models (LLMs). But what’s less discussed (and far more urgent for small businesses) is the hidden cost of this digital harvesting: server strain, bandwidth spikes, and analytics chaos.
This isn't a theoretical problem. It's happening right now, and it could be costing small website owners hundreds, if not thousands, of dollars when their servers get hammered by AI traffic.
The Bandwidth Burden
AI crawlers don’t behave like your average search engine bot. They’re voracious. Some sites report bandwidth consumption of 250GB to 500GB per day from bot traffic alone. For smaller businesses on limited hosting plans, that’s a fast track to overage fees, degraded performance, or even downtime.
The Read the Docs project, for example, saw its crawler traffic drop by 75% from a peak of 800GB per day after it blocked AI crawlers. That's not just a blip; that's a business model under siege.
Beyond the infrastructure hit, AI crawlers are polluting key metrics. Conversion rates, bounce rates, and visitor counts get distorted when bots flood your site. For companies that rely on accurate analytics to drive marketing spend or investor reports, this is more than a nuisance; it's a liability.
What Even Is a Bot, Anyway?
At its core, a bot is just a piece of code that automates actions on the web: browsing pages, clicking links, copying content. Some bots are harmless (and helpful), like Googlebot, which indexes your site to make sure it shows up in search results. Others, like LLM crawlers, have a more voracious appetite.
They don't just visit your homepage and leave. They go deep: through blog archives, image folders, PDFs, even API endpoints, scraping content line by line to feed the ever-hungry data models behind ChatGPT, Claude, Gemini, and more.
And sorry to say, but they often do this without giving anything back.
Bots aren't inherently evil. In fact, they've been foundational to the web as we know it. Without them, search engines wouldn't exist. No search engine optimisation (SEO). No inbound traffic. No brand visibility. In this sense, welcoming some bots is like opening your storefront windows: it lets people know you're open for business.
But the rise of AI bots has twisted this logic. They're not indexing your content to help users find your site; they're extracting it to build products that may never mention you again. Think of it like writing a huge article, only for someone to quote it, paraphrase it, and deliver the insights without linking back to you.
And you still foot the server bill.
This is the paradox: block too many bots, and you risk becoming invisible to the AI-driven internet, never appearing among the sources cited in AI-generated answers. Let them roam freely, and you could be subsidising their business model with your infrastructure.
Why It’s Happening
The rise of generative AI has created an insatiable appetite for training data.
Bots like GPTBot (OpenAI), CCBot (Common Crawl), and Google-Extended are designed to extract as much content as possible, often ignoring traditional safeguards like robots.txt (the plain-text file that tells crawlers which parts of your site they may and may not access). Some even disguise themselves to bypass detection.
And while these bots are supposed to respect opt-out signals, reports show that many don’t. It's a digital arms race, and small websites are the collateral damage.
Should You Block AI Crawlers?
That depends on your priorities. If your site depends on visibility in AI-generated answers (like those from ChatGPT or Perplexity), allowing crawlers might offer some brand exposure. But if you’re more concerned about performance, costs, or content misuse, blocking them is a smart move.
Here are a few key considerations:
Content Control: Do you want your content used to train AI models, possibly without credit or context?
Reputation Risk: Could your content be misrepresented or shown alongside less reputable sources?
Security & Privacy: Are there parts of your site (e.g., customer portals) that should never be scraped?
How to Fight Back
Update Your robots.txt File:
This is the first line of defence. You can disallow specific bots like GPTBot or Google-Extended. Example:
User-agent: GPTBot
Disallow: /
But remember: compliance is voluntary, and bad actors may ignore it.
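For broader coverage, you can list several crawlers in the same file. Here's a minimal sketch using user-agent strings that vendors and public reporting have documented (GPTBot, CCBot, Google-Extended, ClaudeBot, PerplexityBot); these names change over time, so check each vendor's documentation before relying on them:
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /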
Use Cloudflare’s AI Bot Blocking:
If you use Cloudflare, there's a one-click option to block known AI scrapers. Navigate to Security > Bots and toggle "AI Scrapers and Crawlers."
Deploy a Honeypot:
Tools like Cloudflare's AI Labyrinth lure bots into a maze of fake pages, wasting their resources and helping identify malicious behaviour.
Monitor Your Logs:
Regularly audit your traffic logs. Look for spikes in traffic from unfamiliar user agents or IPs. This can help you identify and block rogue crawlers before they do damage.
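For a more systematic audit than eyeballing raw logs, a short script can tally requests and bytes served per suspected AI user agent. Below is a minimal Python sketch, assuming an Nginx or Apache access log in the standard "combined" format; the log path and the agent list are assumptions to adapt to your own setup:

# Tally requests and bandwidth per AI user agent from a "combined"-format access log.
import re
from collections import Counter

AI_AGENTS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]

# Combined log lines end with: "request" status bytes "referer" "user agent"
PATTERN = re.compile(r'"[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"')

hits = Counter()
bytes_served = Counter()

with open("/var/log/nginx/access.log") as log:  # path is an assumption
    for line in log:
        match = PATTERN.search(line)
        if not match:
            continue
        _status, size, agent = match.groups()
        for bot in AI_AGENTS:
            if bot in agent:
                hits[bot] += 1
                bytes_served[bot] += 0 if size == "-" else int(size)
                break

for bot in AI_AGENTS:
    print(f"{bot}: {hits[bot]} requests, {bytes_served[bot] / 1e9:.2f} GB")

Run against a day's log, this gives a per-bot picture of request volume and bandwidth you can weigh against your hosting plan's limits before deciding what to block.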
The Bottom Line
AI crawlers aren’t going away. But that doesn’t mean you have to sit back and let them siphon your resources. Whether you choose to block them, trap them, or tolerate them, the key is awareness.
Check your server logs. Review your bandwidth usage. And most importantly, decide what kind of digital landlord you want to be: one who welcomes all visitors, or one who locks the door when the guests start eating you out of house and home.
Resources and citations
AI Crawlers Are Reportedly Draining Site Resources & Skewing Analytics | Search Engine Journal | March 26, 2025
Cloudflare is luring web-scraping bots into an 'AI Labyrinth' | The Verge | March 22, 2025
Open source devs are fighting AI crawlers with cleverness and vengeance | TechCrunch | March 27, 2025
Bots Compose 42% of Overall Web Traffic; Nearly Two-Thirds Are Malicious | Akamai | June 25, 2024
Like digital locusts, OpenAI and Anthropic AI bots cause havoc and raise costs for websites | Business Insider | September 19, 2024
The Race to Block OpenAI's Scraping Bots Is Slowing Down | Wired | October 2024
How to Stop Your Data From Being Used to Train AI | Wired | October 2024
New Cloudflare Tools Let Sites Detect and Block AI Bots for Free | Wired | September 2024