What Is robots.txt?
A robots.txt file is a plain text file placed in the root of your website:
https://yourdomain.com/robots.txt
It follows the Robots Exclusion Protocol (REP) and tells automated crawlers (bots) which parts of your site they are allowed to access.
```
User-agent: *
Disallow: /private/
```
This tells all bots not to crawl the /private/ directory.
Why AI Bots Matter Now
Traditionally, robots.txt was used to manage search engine crawlers such as Googlebot and Bingbot.
Now AI companies crawl websites to:
- Train large language models (LLMs)
- Improve search AI systems
- Build knowledge graphs
- Enhance chatbot responses
- Collect structured data
If you publish original content you may want to control how it is accessed.
In addition, many publishers are concerned about AI companies harvesting content at scale, then using that material to train models or generate answers without compensating the original creators. This can reduce traffic to the source website, cut advertising revenue, weaken subscription models, and shift value away from the businesses that invested in producing the content. Meanwhile, AI companies may generate significant revenue from products and services built on that harvested material, even though the original creators receive no payment for the content that helped power those systems.
How AI Bots Use robots.txt
Most reputable AI companies claim they respect robots.txt.
When their crawler visits your site, it:
- Fetches /robots.txt
- Checks for rules that match its user agent
- Decides what to crawl or not crawl
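The check a well-behaved crawler performs can be sketched with Python's standard `urllib.robotparser` module. This is illustrative only: the rules are parsed inline rather than fetched from a live site, and the bot and domain names are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Example rules a crawler might receive from /robots.txt
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot is blocked everywhere; other agents only from /private/
print(parser.can_fetch("GPTBot", "https://yourdomain.com/articles/post"))   # False
print(parser.can_fetch("SomeBot", "https://yourdomain.com/articles/post"))  # True
print(parser.can_fetch("SomeBot", "https://yourdomain.com/private/data"))   # False
```

A compliant crawler runs this kind of check before every request; a non-compliant one simply skips it, which is why robots.txt alone is not enforcement.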
```
User-agent: GPTBot
Disallow: /
```
This blocks OpenAI's GPTBot crawler from your entire site.
How to Block AI Bots in robots.txt
You can block a specific AI crawler using:
```
User-agent: [bot-name]
Disallow: /
```
Or block multiple bots individually.
Example: Block Multiple AI Bots
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /
```
Example: Allow Search Engines but Block AI Training Bots
```
# Allow standard search engines
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

# Block AI training bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```
This setup lets Google and Bing crawl your site for search indexing while preventing AI training bots from harvesting your content.
Top 25+ Known AI Bots (2026)
Major AI Company Crawlers
- GPTBot – Used by OpenAI
- ChatGPT-User – OpenAI browsing requests
- ClaudeBot – Used by Anthropic
- Claude-Web – Anthropic web access agent
- anthropic-ai – Anthropic crawler
- Google-Extended – AI training control by Google
- GoogleOther – Google experimental crawler
- PerplexityBot – From Perplexity AI
- CCBot – From Common Crawl
- FacebookBot – From Meta
- Meta-ExternalAgent – Meta AI crawler
- Meta-ExternalFetcher – External Meta content fetcher
- facebookexternalhit – Meta link preview fetcher
- OAI-SearchBot – OpenAI search crawler
- Applebot-Extended – Apple extended crawler for AI and advanced features
- Perplexity-User – Perplexity user browsing requests
AI Search and Assistant Platforms
- YouBot – From You.com
- NeevaBot – From Neeva
- Bytespider – From ByteDance
- Amazonbot – From Amazon
- Applebot – From Apple
- DuckAssistBot – From DuckDuckGo
Data and Training Crawlers
- Diffbot – From Diffbot
- PetalBot – From Huawei
- AhrefsBot – From Ahrefs
- SemrushBot – From Semrush
- MJ12bot – From Majestic
- DotBot – From Moz
- ia_archiver – From Internet Archive
- img2dataset – Used for dataset generation
- Omgilibot – From Omgili data aggregator
- Omgili – Content aggregator crawler
- ImagesiftBot – Image dataset scraper
Emerging and AI Associated Crawlers
- MistralBot – From Mistral AI
- cohere-ai – From Cohere
- AlibabaBot – From Alibaba
- YandexGPTBot – From Yandex
- PanguBot – From Huawei
- FriendlyCrawler – Ethical AI dataset crawler
- Timpibot – Data indexer / AI training crawler
- VelenPublicWebCrawler – Web index and data gatherer
- Webzio-Extended – AI content discovery crawler
- Kangaroo Bot – Emerging AI data bot
- iaskspider2.0 – From China’s iAsk search
- Ai2Bot – From Allen Institute for AI
- Ai2Bot-Dolma – Allen Institute dataset crawler
- ICC-Crawler – Industrial/compliance crawler
- ISSCyberRiskCrawler – Cyber risk data crawler
- Sidetrade indexer bot – AI business data crawler
- DeepAIbot – From DeepAI
Important Considerations
robots.txt Is Not Enforcement
robots.txt is simply a request, not a firewall. Bots can ignore it.
For stronger control, consider:
- Server-level blocking (Nginx or Apache rules)
- Blocking by IP range
- A WAF such as Cloudflare or Akamai
- Requiring login for premium content
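As a sketch of what server-level blocking does, here is a minimal WSGI middleware that rejects requests from matching user agents before they reach your application. The blocklist is hypothetical; in practice you would build it from the bots you actually see in your logs, and most sites would do this at the web server or WAF layer instead.

```python
# Hypothetical blocklist; extend with the bots from your own logs
BLOCKED_AGENTS = ("gptbot", "claudebot", "ccbot", "bytespider")

def block_ai_bots(app):
    """WSGI middleware that returns 403 to blocked user agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bot in ua for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

Unlike robots.txt, this actually refuses the request, though determined scrapers can still spoof their user agent, which is why IP-range blocking and WAFs exist as further layers.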
Blocking May Impact Visibility
Blocking Google-Extended does not affect Google Search indexing; it only controls AI model training.
However, blocking core search bots like Googlebot or Bingbot can remove you from search results.
Logs Are Your Best Friend
Check your server logs to:
- Confirm bot identity
- Detect fake user agents
- Identify high-frequency scrapers
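A quick way to see which bots are hitting you is to count user agents in your access log. A minimal sketch, assuming the common combined log format where the user agent is the last quoted field; the bot list and sample lines below are fabricated for illustration.

```python
import re
from collections import Counter

# In the combined log format, the user agent is the last quoted field
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider")

def count_ai_bots(log_lines):
    """Count requests per known AI bot, matched case-insensitively."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        ua = match.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in ua:
                counts[bot] += 1
    return counts

# Fabricated sample lines for illustration
sample = [
    '1.2.3.4 - - [10/Jan/2026:12:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [10/Jan/2026:12:00:01 +0000] "GET /post HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '9.9.9.9 - - [10/Jan/2026:12:00:02 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
print(count_ai_bots(sample))
```

Remember that user-agent strings are self-reported, so pair this with reverse-DNS or published IP-range checks before trusting what you see.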
Should You Block AI Bots?
| If You Want | Consider |
|---|---|
| Maximum exposure | Allow AI bots |
| Protect proprietary content | Block training bots |
| Monetize content | Use paywalls and server blocking |
| Control usage rights | Update Terms of Service |
Final Thoughts
AI crawling is now a permanent part of the web ecosystem.
The question is not whether bots will visit your site.
It is whether you are controlling them.
Start with:
- A properly configured robots.txt
- A clear AI policy
- Log monitoring
- Server-level protections
In 2026, digital content governance is no longer optional; it is infrastructure.