Robots.txt and Blocking AI Bots: What Website Owners Need to Know in 2026

What Is robots.txt?

A robots.txt file is a plain text file placed in the root of your website:

https://yourdomain.com/robots.txt

It follows the Robots Exclusion Protocol (REP) and tells automated crawlers (bots) which parts of your site they are allowed to access.

User-agent: *
Disallow: /private/

This tells all bots not to crawl the /private/ directory.

Important: robots.txt is voluntary. Legitimate bots respect it; malicious bots do not.
 

Why AI Bots Matter Now

Traditionally, robots.txt was used to manage search engine crawlers such as Googlebot and Bingbot.

Now, AI companies also crawl websites to:

  • Train large language models (LLMs)
  • Improve search AI systems
  • Build knowledge graphs
  • Enhance chatbot responses
  • Collect structured data

If you publish original content, you may want to control how it is accessed.

In addition, many publishers are concerned about AI companies harvesting content at scale, then using that material to train models or generate answers without compensating the original creators. This can reduce traffic to the source website, decrease advertising revenue, weaken subscription models, and shift value away from the businesses that invested in producing the content. Meanwhile, AI companies may generate significant revenue from products and services built on that harvested material, even though the original creators receive no payment for the content that helped power those systems.

How AI Bots Use robots.txt

Most reputable AI companies claim they respect robots.txt.

When their crawler visits your site, it:

  1. Fetches /robots.txt
  2. Checks for rules that match its user agent
  3. Decides what to crawl or not crawl

For example:

User-agent: GPTBot
Disallow: /

This blocks OpenAI's GPTBot crawler from your entire site.
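That matching logic can be sketched with Python's standard urllib.robotparser module. This is a minimal sketch: the rules are parsed from a list mirroring the example above, and the domain and agent names are placeholders; in practice a crawler fetches the live file (e.g. with set_url() and read()).

```python
from urllib.robotparser import RobotFileParser

# Rules equivalent to the example above. A real crawler would fetch them:
#   rp.set_url("https://yourdomain.com/robots.txt"); rp.read()
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# GPTBot is denied everywhere; an agent with no matching group defaults to allowed.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/article"))        # False
print(rp.can_fetch("SomeOtherBot", "https://yourdomain.com/article"))  # True
```

This is also a quick way to sanity-check your own robots.txt before deploying it.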
 

How to Block AI Bots in robots.txt

You can block specific AI crawlers using:

User-agent: [bot-name]
Disallow: /

Or block multiple bots individually.

Example: Block Multiple AI Bots

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

Example: Allow Search Engines but Block AI Training Bots

# Allow standard search engines
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

# Block AI training bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

This setup lets Google and Bing crawl your site for search indexing while preventing AI training bots from harvesting your content.

 

50 Known AI Bots (2026)

Major AI Company Crawlers

  1. GPTBot – Used by OpenAI
  2. ChatGPT-User – OpenAI browsing requests
  3. ClaudeBot – Used by Anthropic
  4. Claude-Web – Anthropic web access agent
  5. anthropic-ai – Anthropic crawler
  6. Google-Extended – AI training control by Google
  7. GoogleOther – Google experimental crawler
  8. PerplexityBot – From Perplexity AI
  9. CCBot – From Common Crawl
  10. FacebookBot – From Meta
  11. Meta-ExternalAgent – Meta AI crawler
  12. Meta-ExternalFetcher – External Meta content fetcher
  13. facebookexternalhit – Meta link preview fetcher
  14. OAI-SearchBot – OpenAI search crawler
  15. Applebot-Extended – Apple extended crawler for AI and advanced features
  16. Perplexity-User – Perplexity user browsing requests

AI Search and Assistant Platforms

  1. YouBot – From You.com
  2. NeevaBot – From Neeva
  3. Bytespider – From ByteDance
  4. Amazonbot – From Amazon
  5. Applebot – From Apple
  6. DuckAssistBot – From DuckDuckGo

Data and Training Crawlers

  1. Diffbot – From Diffbot
  2. PetalBot – From Huawei
  3. AhrefsBot – From Ahrefs
  4. SemrushBot – From Semrush
  5. MJ12bot – From Majestic
  6. DotBot – From Moz
  7. ia_archiver – From Internet Archive
  8. img2dataset – Used for dataset generation
  9. Omgilibot – From Omgili data aggregator
  10. Omgili – Content aggregator crawler
  11. ImagesiftBot – Image dataset scraper

Emerging and AI Associated Crawlers

  1. MistralBot – From Mistral AI
  2. cohere-ai – From Cohere
  3. AlibabaBot – From Alibaba
  4. YandexGPTBot – From Yandex
  5. PanguBot – From Huawei
  6. FriendlyCrawler – Ethical AI dataset crawler
  7. Timpibot – Data indexer / AI training crawler
  8. VelenPublicWebCrawler – Web index and data gatherer
  9. Webzio-Extended – AI content discovery crawler
  10. Kangaroo Bot – Emerging AI data bot
  11. iaskspider2.0 – From China’s iAsk search
  12. Ai2Bot – From Allen Institute for AI
  13. Ai2Bot-Dolma – Allen Institute dataset crawler
  14. ICC-Crawler – Industrial/compliance crawler
  15. ISSCyberRiskCrawler – Cyber risk data crawler
  16. Sidetrade indexer bot – AI business data crawler
  17. DeepAIbot – From DeepAI
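Maintaining Disallow groups for a list this long is tedious by hand. As a sketch, a short script can generate the blanket-deny groups from a bot list; the subset of names below is taken from the list above.

```python
# Minimal sketch: generate one blanket-deny robots.txt group per bot name.
AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot", "CCBot"]

def build_robots_txt(bots):
    """Return a robots.txt body with a 'Disallow: /' group for each bot."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    return "\n\n".join(groups) + "\n"

print(build_robots_txt(AI_BOTS))
```

Regenerating the file from one list keeps the groups consistent as new bots appear.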

 

Important Considerations

robots.txt Is Not Enforcement

robots.txt is simply a request, not a firewall. Bots can ignore it.

For stronger control, consider:

  • Server-level blocking (Nginx or Apache rules)
  • Blocking by IP range
  • Using a WAF such as Cloudflare or Akamai
  • Requiring login for premium content
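As one sketch of server-level blocking, an Nginx rule can reject requests whose User-Agent header matches known AI crawlers. The bot names here are illustrative; verify the directive placement against your own configuration before deploying.

```nginx
# Inside the relevant server { } block: return 403 for User-Agent strings
# matching known AI crawlers (case-insensitive). Spoofed agents still get
# through, so combine this with IP checks or a WAF for real enforcement.
if ($http_user_agent ~* (GPTBot|ClaudeBot|PerplexityBot|CCBot|Bytespider)) {
    return 403;
}
```

You can test a rule like this with curl, e.g. `curl -I -A "GPTBot" https://yourdomain.com/`, which should return 403 once the rule is active.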

Blocking May Impact Visibility

Blocking Google-Extended does not affect Google Search indexing; it only controls whether Google may use your content for AI training.

However, blocking core search bots like Googlebot or Bingbot can remove you from search results.

Logs Are Your Best Friend

Check your server logs to:

  • Confirm bot identity
  • Detect fake user agents
  • Identify high frequency scrapers
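As a starting point for that log review, a few lines of Python can tally AI-bot hits in a combined-format access log. This is a sketch: the bot list and log path are assumptions, and because User-Agent strings can be faked, any match should be confirmed (for example via reverse DNS) before you trust it.

```python
import re
from collections import Counter

# Substrings that identify known AI crawlers in the User-Agent field.
AI_BOT_NAMES = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider"]

# In the combined log format, the User-Agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def count_ai_bot_hits(log_lines):
    """Count requests per AI bot, matched case-insensitively on User-Agent."""
    hits = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for bot in AI_BOT_NAMES:
            if bot.lower() in user_agent:
                hits[bot] += 1
    return hits
```

Typical usage, assuming an Nginx-style log location: `count_ai_bot_hits(open("/var/log/nginx/access.log"))`.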

 

Should You Block AI Bots?

If you want…                      Consider…
Maximum exposure                  Allowing AI bots
Protecting proprietary content    Blocking training bots
Monetizing content                Paywalls and server-level blocking
Controlling usage rights          Updating your Terms of Service

 

Final Thoughts

AI crawling is now a permanent part of the web ecosystem.

The question is not whether bots will visit your site.

It is whether you are controlling them.

Start with:

  • A properly configured robots.txt
  • A clear AI policy
  • Log monitoring
  • Server level protections

In 2026, digital content governance is no longer optional; it is infrastructure.