What Is robots.txt?
A robots.txt file is a plain text file placed in the root of your website:
https://yourdomain.com/robots.txt
It follows the Robots Exclusion Protocol (REP) and tells automated crawlers (bots) which parts of your site they are allowed to access.
```
User-agent: *
Disallow: /private/
```
This tells all bots not to crawl the /private/ directory.
Why AI Bots Matter Now
Traditionally, robots.txt was used to manage search engine crawlers such as Googlebot and Bingbot.
Now AI companies crawl websites to:
- Train large language models (LLMs)
- Improve search AI systems
- Build knowledge graphs
- Enhance chatbot responses
- Collect structured data
If you publish original content you may want to control how it is accessed.
In addition, many publishers are concerned about AI companies harvesting content at scale, then using that material to train models or generate answers without compensating the original creators. This can reduce traffic to the source website, cut advertising revenue, weaken subscription models, and shift value away from the businesses that invested in producing the content. Meanwhile, AI companies may generate significant revenue from products and services built on that harvested material, even though the original creators receive no payment for the content that helped power those systems.
How AI Bots Use robots.txt
Most reputable AI companies claim they respect robots.txt.
When their crawler visits your site, it:
- Fetches /robots.txt
- Checks for rules that match its user agent
- Decides what to crawl or not crawl
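The check a well-behaved crawler performs can be sketched with Python's standard `urllib.robotparser` module. This is illustrative only: the rules are parsed inline rather than fetched from a live site, and the bot and domain names are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Example rules a crawler might receive from /robots.txt
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot is blocked everywhere; other agents only from /private/
print(parser.can_fetch("GPTBot", "https://yourdomain.com/articles/post"))   # False
print(parser.can_fetch("SomeBot", "https://yourdomain.com/articles/post"))  # True
print(parser.can_fetch("SomeBot", "https://yourdomain.com/private/data"))   # False
```

A compliant crawler runs this kind of check before every request; a non-compliant one simply skips it, which is why robots.txt alone is not enforcement.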
```
User-agent: GPTBot
Disallow: /
```
This blocks OpenAI's GPTBot crawler from your entire site.
How to Block AI Bots in robots.txt
You can block a specific AI crawler using:
```
User-agent: [bot-name]
Disallow: /
```
Or block multiple bots individually.
Example: Block Multiple AI Bots
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /
```
Example: Allow Search Engines but Block AI Training Bots
```
# Allow standard search engines
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

# Block AI training bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```
This setup lets Google and Bing crawl your site for search indexing while preventing AI training bots from harvesting your content.
Top 25+ Known AI Bots (2026)
Major AI Company Crawlers
- GPTBot – Used by OpenAI
- ChatGPT-User – OpenAI browsing requests
- ClaudeBot – Used by Anthropic
- Claude-Web – Anthropic web access agent
- anthropic-ai – Anthropic crawler
- Google-Extended – AI training control by Google
- GoogleOther – Google experimental crawler
- PerplexityBot – From Perplexity AI
- CCBot – From Common Crawl
- FacebookBot – From Meta
- Meta-ExternalAgent – Meta AI crawler
- Meta-ExternalFetcher – External Meta content fetcher
- facebookexternalhit – Meta link preview fetcher
- OAI-SearchBot – OpenAI search crawler
- Applebot-Extended – Apple extended crawler for AI and advanced features
- Perplexity-User – Perplexity user browsing requests
AI Search and Assistant Platforms
- YouBot – From You.com
- NeevaBot – From Neeva
- Bytespider – From ByteDance
- Amazonbot – From Amazon
- Applebot – From Apple
- DuckAssistBot – From DuckDuckGo
Data and Training Crawlers
- Diffbot – From Diffbot
- PetalBot – From Huawei
- AhrefsBot – From Ahrefs
- SemrushBot – From Semrush
- MJ12bot – From Majestic
- DotBot – From Moz
- ia_archiver – From Internet Archive
- img2dataset – Used for dataset generation
- Omgilibot – From Omgili data aggregator
- Omgili – Content aggregator crawler
- ImagesiftBot – Image dataset scraper
Emerging and AI Associated Crawlers
- MistralBot – From Mistral AI
- cohere-ai – From Cohere
- AlibabaBot – From Alibaba
- YandexGPTBot – From Yandex
- PanguBot – From Huawei
- FriendlyCrawler – Ethical AI dataset crawler
- Timpibot – Data indexer / AI training crawler
- VelenPublicWebCrawler – Web index and data gatherer
- Webzio-Extended – AI content discovery crawler
- Kangaroo Bot – Emerging AI data bot
- iaskspider2.0 – From China’s iAsk search
- Ai2Bot – From Allen Institute for AI
- Ai2Bot-Dolma – Allen Institute dataset crawler
- ICC-Crawler – Industrial/compliance crawler
- ISSCyberRiskCrawler – Cyber risk data crawler
- Sidetrade indexer bot – AI business data crawler
- DeepAIbot – From DeepAI
Important Considerations
robots.txt Is Not Enforcement
robots.txt is simply a request, not a firewall. Bots can ignore it.
For stronger control, consider:
- Server-level blocking (Nginx or Apache rules)
- Blocking by IP range
- A WAF such as Cloudflare or Akamai
- Requiring login for premium content
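As a sketch of what server-level blocking does, here is a minimal WSGI middleware that rejects requests from matching user agents before they reach your application. The blocklist is hypothetical; in practice you would build it from the bots you actually see in your logs, and most sites would do this at the web server or WAF layer instead.

```python
# Hypothetical blocklist; extend with the bots from your own logs
BLOCKED_AGENTS = ("gptbot", "claudebot", "ccbot", "bytespider")

def block_ai_bots(app):
    """WSGI middleware that returns 403 to blocked user agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bot in ua for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

Unlike robots.txt, this actually refuses the request, though determined scrapers can still spoof their user agent, which is why IP-range blocking and WAFs exist as further layers.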
Blocking May Impact Visibility
Blocking Google-Extended does not affect Google Search indexing; it only controls AI model training.
However, blocking core search bots like Googlebot or Bingbot can remove you from search results.
Logs Are Your Best Friend
Check your server logs to:
- Confirm bot identity
- Detect fake user agents
- Identify high-frequency scrapers
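A quick way to see which bots are hitting you is to count user agents in your access log. A minimal sketch, assuming the common combined log format where the user agent is the last quoted field; the bot list and sample lines below are fabricated for illustration.

```python
import re
from collections import Counter

# In the combined log format, the user agent is the last quoted field
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider")

def count_ai_bots(log_lines):
    """Count requests per known AI bot, matched case-insensitively."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        ua = match.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in ua:
                counts[bot] += 1
    return counts

# Fabricated sample lines for illustration
sample = [
    '1.2.3.4 - - [10/Jan/2026:12:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [10/Jan/2026:12:00:01 +0000] "GET /post HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '9.9.9.9 - - [10/Jan/2026:12:00:02 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
print(count_ai_bots(sample))
```

Remember that user-agent strings are self-reported, so pair this with reverse-DNS or published IP-range checks before trusting what you see.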
Should You Block AI Bots?
| If You Want | Consider |
|---|---|
| Maximum exposure | Allow AI bots |
| Protect proprietary content | Block training bots |
| Monetize content | Use paywalls and server blocking |
| Control usage rights | Update Terms of Service |
Final Thoughts
AI crawling is now a permanent part of the web ecosystem.
The question is not whether bots will visit your site.
It is whether you are controlling them.
Start with:
- A properly configured robots.txt
- A clear AI policy
- Log monitoring
- Server-level protections
In 2026, digital content governance is no longer optional; it is infrastructure.