Every AI Crawler You Need to Know in 2025

A complete map of all major AI crawlers powering ChatGPT, Gemini, Claude, Perplexity, Copilot, Apple Intelligence, and more.

Who crawls your site, why, and how it affects your AI search visibility.

1. OpenAI (ChatGPT / GPT-4.1 / GPT-5)

GPTBot

  • Purpose: Model training data collection
  • Control: User-agent: GPTBot
  • Notes: Used for training, not for retrieval.

OAI-SearchBot

  • Purpose: Fetches content for ChatGPT Search (citations + real-time answers)
  • Control: User-agent: OAI-SearchBot
  • Notes: Not used for training; only for search visibility.

ChatGPT-User

  • Purpose: On-demand real-time fetch when a user asks ChatGPT to load a URL
  • Control: User-agent: ChatGPT-User
  • Notes: Behaves like a browser; session-based.
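
Because the three OpenAI agents serve different purposes, they can be given different rules. Here is a minimal sketch, assuming a hypothetical policy that opts out of training while staying visible in ChatGPT Search. The policy, the helper name allowed_agents, and the example.com URL are illustrative assumptions. The check uses Python's standard urllib.robotparser, which approximates how a crawler reads robots.txt and may not match OpenAI's own parser exactly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy (an assumption, not a recommendation):
# block the training crawler, allow the search and user-triggered agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Report, per user-agent token, whether the rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
    "https://example.com/post",
)
print(result)
```

Running a check like this before deploying a robots.txt change helps catch a rule that accidentally blocks the search agent along with the training crawler.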

2. Anthropic (Claude)

ClaudeBot

  • Purpose: Model training; broad web crawling
  • Control: User-agent: ClaudeBot
  • Notes: Used for improving Claude’s foundation models.

Claude-User

  • Purpose: User-triggered URL fetch inside Claude
  • Control: User-agent: Claude-User
  • Notes: Not for training; similar to ChatGPT-User.

3. Perplexity

PerplexityBot

  • Purpose: Indexing + retrieval for real-time answers
  • Control: User-agent: PerplexityBot
  • Notes: Reported to crawl aggressively; some third-party reports allege it falls back to undeclared user agents when its declared one is blocked.

Perplexity-User

  • Purpose: On-demand fetching during Q&A
  • Notes: Not used for training.
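
To see how often the Perplexity agents actually hit a site, one option is to count requests by declared user agent in the access log. Here is a rough sketch, assuming the common "combined" log format, where the User-Agent is the last quoted field. The sample log lines, IP addresses, and user-agent strings are made up for illustration. This only counts agents that identify themselves, so it cannot detect the impersonation mentioned above.

```python
from collections import Counter

# Tokens from this article; matched as case-insensitive substrings.
AI_TOKENS = ["PerplexityBot", "Perplexity-User"]

# Illustrative log lines (not real traffic; the UA strings are simplified).
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleFetcher/1.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:02 +0000] "GET /b HTTP/1.1" 200 733 "-" "ExampleFetcher/1.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:00:00:03 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleFetcher/1.0 (compatible; Perplexity-User/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:04 +0000] "GET / HTTP/1.1" 200 901 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"',
]


def count_declared_crawlers(log_lines, tokens=AI_TOKENS):
    """Count requests whose User-Agent field contains one of the tokens."""
    counts = Counter()
    for line in log_lines:
        # The last quoted field of a combined-format line is the User-Agent.
        parts = line.rsplit('"', 2)
        if len(parts) < 3:
            continue  # not in the expected format; skip it
        user_agent = parts[1].lower()
        for token in tokens:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts


counts = count_declared_crawlers(SAMPLE_LOG)
print(dict(counts))
```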

4. Google (Gemini, AI Overviews, AI Mode)

Googlebot family

  • Purpose: Primary crawler for Google Search (feeds AI Overviews + AI Mode)
  • Control: User-agent: Googlebot
  • Notes: AI Overviews and AI Mode draw on the regular Search index, so they depend on standard Googlebot crawling.

Google-Extended

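  • Purpose: Opt-out token for use of your content in Gemini model training and grounding
  • Control: User-agent: Google-Extended
  • Notes: A robots.txt token, not a crawler; it never fetches pages. Google states it does not affect inclusion or ranking in Search.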

5. Apple (Apple Intelligence)

Applebot

  • Purpose: Indexing for Siri, Spotlight, Apple services
  • Control: User-agent: Applebot

Applebot-Extended

  • Purpose: Opt-out for Apple’s model training
  • Control: User-agent: Applebot-Extended
  • Notes: A robots.txt token comparable to Google-Extended; it does not crawl pages itself.
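
Google-Extended and Applebot-Extended are tokens, not crawlers, so opting out means adding a rule group for each token while leaving Googlebot and Applebot alone. Here is a small sketch, assuming a blanket "Disallow: /" policy. The helper name build_opt_out_rules and the /pricing path are hypothetical. Python's urllib.robotparser is used only as a sanity check and is not guaranteed to match either vendor's parser.

```python
from urllib.robotparser import RobotFileParser

# Tokens from the sections above; the blanket Disallow is an assumed policy.
TRAINING_OPT_OUT_TOKENS = ["Google-Extended", "Applebot-Extended"]


def build_opt_out_rules(tokens: list[str]) -> str:
    """Return robots.txt text with one 'Disallow: /' group per token."""
    groups = [f"User-agent: {token}\nDisallow: /\n" for token in tokens]
    return "\n".join(groups)


robots_txt = build_opt_out_rules(TRAINING_OPT_OUT_TOKENS)
print(robots_txt)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The search crawlers have no group of their own and there is no "*" group,
# so these rules leave them unaffected.
search_still_allowed = (
    parser.can_fetch("Googlebot", "/pricing")
    and parser.can_fetch("Applebot", "/pricing")
)
training_blocked = not parser.can_fetch("Google-Extended", "/pricing")
```

Keeping the opt-out tokens in their own groups, rather than under a catch-all "User-agent: *" rule, avoids removing the site from regular search indexing by accident.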

6. Microsoft (Bing / Copilot / Edge Assistant)

bingbot

  • Purpose: Core Bing index (feeds Copilot AI answers)
  • Control: User-agent: bingbot

7. You.com

YouBot

  • Purpose: Crawling for You.com’s AI search
  • Control: User-agent: YouBot

8. Cohere

cohere-training-data-crawler

  • Purpose: Training crawler
  • Control: User-agent: cohere-training-data-crawler

cohere-ai

  • Purpose: On-demand fetcher used by Cohere chat products
  • Notes: Observed in server logs; reported behavior is inconsistent.

9. Common Crawl

CCBot

  • Purpose: Builds the openly available Common Crawl web corpus, which is used in many AI model training datasets
  • Control: User-agent: CCBot
  • Notes: Major upstream data source for AI companies.

10. Allen Institute (AI2 / Semantic Scholar)

AI2Bot

  • Purpose: Research crawling; feeds Semantic Scholar
  • Control: User-agent: AI2Bot

11. Meta

FacebookBot / facebookexternalhit / meta-externalagent

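  • Purpose: facebookexternalhit fetches link previews for shared URLs; meta-externalagent is described in Meta's crawler documentation as crawling for uses such as AI model training
  • Notes: Treat the tokens separately. Blocking facebookexternalhit can break social link previews, and none of the three is confirmed here as a real-time AI retrieval bot.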

12. ByteDance (TikTok / Toutiao / CapCut)

Bytespider

  • Purpose: Wide crawl; supports TikTok/AI content features
  • Control: User-agent: Bytespider

13. Amazon

Amazonbot

  • Purpose: Crawling for Amazon properties, potentially AI use
  • Control: User-agent: Amazonbot

14. DuckDuckGo

DuckAssistBot

  • Purpose: Fetching for DuckAssist answer engine
  • Control: User-agent: DuckAssistBot

15. Diffbot

Diffbot

  • Purpose: ML extraction service; often upstream for AI datasets
  • Control: User-agent: Diffbot

16. Omgili / Omgili Bot

omgili

  • Purpose: Scrapes forums + discussions (used in AI pipelines)
  • Control: User-agent: omgili

17. Timpi (Decentralized Search)

Timpibot / TimpiBot

  • Purpose: Distributed search indexer
  • Notes: Increasingly seen in AI startup stacks.
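
The distinction that runs through this list (training vs. search vs. user-triggered fetch) can be captured in a small lookup table. Here is a sketch covering a subset of the agents above, with roles taken from the Purpose lines in this article. The role labels and the function names classify and tokens_with_role are illustrative choices, not vendor terminology.

```python
from typing import Optional

# token -> (vendor, role), as described in the sections above.
CRAWLER_ROLES = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "search"),
    "ChatGPT-User": ("OpenAI", "user-triggered"),
    "ClaudeBot": ("Anthropic", "training"),
    "Claude-User": ("Anthropic", "user-triggered"),
    "PerplexityBot": ("Perplexity", "search"),
    "Perplexity-User": ("Perplexity", "user-triggered"),
    "CCBot": ("Common Crawl", "training"),
}


def classify(user_agent: str) -> Optional[tuple[str, str, str]]:
    """Return (token, vendor, role) for a User-Agent string, or None."""
    ua = user_agent.lower()
    # Check longer tokens first so a short token never shadows a longer one.
    for token in sorted(CRAWLER_ROLES, key=len, reverse=True):
        if token.lower() in ua:
            vendor, role = CRAWLER_ROLES[token]
            return token, vendor, role
    return None


def tokens_with_role(role: str) -> list[str]:
    """List the tokens assigned a given role, sorted alphabetically."""
    return sorted(t for t, (_, r) in CRAWLER_ROLES.items() if r == role)


print(classify("ExampleFetcher/1.0 (compatible; ClaudeBot/1.0)"))
print(tokens_with_role("training"))
```

A table like this makes it easier to decide policy by role, for example blocking every "training" token while leaving "search" agents allowed, instead of reasoning about each bot name one at a time.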