A complete map of all major AI crawlers powering ChatGPT, Gemini, Claude, Perplexity, Copilot, Apple Intelligence, and more.

Who crawls your site, why, and how it affects your AI search visibility.

1. OpenAI (ChatGPT / GPT-4.1 / GPT-5)

GPTBot

Purpose: Model training data collection
Control: User-agent: GPTBot
Notes: Used for training, not for retrieval.

OAI-SearchBot

Purpose: Fetches content for ChatGPT Search (citations + real-time answers)
Control: User-agent: OAI-SearchBot
Notes: Not used for training; only for search visibility.

ChatGPT-User

Purpose: On-demand real-time fetch when a user asks ChatGPT to load a URL
Control: User-agent: ChatGPT-User
Notes: Behaves like a browser; session-based.

2. Anthropic (Claude)

ClaudeBot

Purpose: Model training; broad web crawling
Control: User-agent: ClaudeBot
Notes: Used for improving Claude’s foundation models.

Claude-User

Purpose: User-triggered URL fetch inside Claude
Control: User-agent: Claude-User
Notes: Not for training; similar to ChatGPT-User.

3. Perplexity

PerplexityBot

Purpose: Indexing + retrieval for real-time answers
Control: User-agent: PerplexityBot
Notes: Known to crawl aggressively; some reports of UA impersonation if blocked.

Perplexity-User

Purpose: On-demand fetching during Q&A
Notes: Not used for training.

4. Google (Gemini, AI Overviews, AI Mode)

Googlebot family

Purpose: Primary crawler for Search (feeds AIO + AI Mode)
Control: User-agent: Googlebot
Notes: All generative experiences depend on standard Googlebot retrieval.

Google-Extended

Purpose: Opt-out token for model training & generative features
Control: User-agent: Google-Extended
Notes: Token, not a crawler. Does not fetch.

5. Apple (Apple Intelligence)

Applebot

Purpose: Indexing for Siri, Spotlight, Apple services
Control: User-agent: Applebot

Applebot-Extended

Purpose: Opt-out for Apple’s model training
Control: User-agent: Applebot-Extended
Notes: Token equivalent to Google-Extended.

6. Microsoft (Bing / Copilot / Edge Assistant)

bingbot

Purpose: Core Bing index (feeds Copilot AI answers)
Control: User-agent: bingbot

7. You.com

YouBot

Purpose: Crawling for You.com’s AI search
Control: User-agent: YouBot

8. Cohere

cohere-training-data-crawler

Purpose: Training crawler
Control: User-agent: cohere-training-data-crawler

cohere-ai

Purpose: On-demand fetcher used by Cohere chat products
Notes: Observed in the wild; mixed behavior.

9. Common Crawl

CCBot

Purpose: Open-source crawl used in many AI model training datasets
Control: User-agent: CCBot
Notes: Major upstream data source for AI companies.

10. Allen Institute (AI2 / Semantic Scholar)

AI2Bot

Purpose: Research crawling; feeds Semantic Scholar
Control: User-agent: AI2Bot

11. Meta

FacebookBot / facebookexternalhit / meta-externalagent

Purpose: Social previews; possible use in Meta AI
Notes: Not directly confirmed as AI retrieval bots.

12. ByteDance (TikTok / Toutiao / CapCut)

Bytespider

Purpose: Wide crawl; supports TikTok/AI content features
Control: User-agent: Bytespider

13. Amazon

Amazonbot

Purpose: Crawling for Amazon properties, potentially AI use
Control: User-agent: Amazonbot

14. DuckDuckGo

DuckAssistBot

Purpose: Fetching for DuckAssist answer engine
Control: User-agent: DuckAssistBot

15. Diffbot

Diffbot

Purpose: ML extraction service; often upstream for AI datasets
Control: User-agent: Diffbot

16. Omgili / Omgili Bot

omgili

Purpose: Scrapes forums + discussions (used in AI pipelines)
Control: User-agent: omgili

17. Timpi (Decentralized Search)

Timpibot / TimpiBot

Purpose: Distributed search indexer
Notes: Increasingly seen in AI startup stacks.

Every AI Crawler You Need to Know in 2026

1. OpenAI (ChatGPT / GPT-4.1 / GPT-5)

2. Anthropic (Claude)

3. Perplexity

4. Google (Gemini, AI Overviews, AI Mode)

5. Apple (Apple Intelligence)

6. Microsoft (Bing / Copilot / Edge Assistant)

7. You.com

8. Cohere

9. Common Crawl

10. Allen Institute (AI2 / Semantic Scholar)

11. Meta

12. ByteDance (TikTok / Toutiao / CapCut)

13. Amazon

14. DuckDuckGo

15. Diffbot

16. Omgili / Omgili Bot

17. Timpi (Decentralized Search)

What are SEO Fraggles? The complete guide for 2026

How to fix “crawled currently not indexed” in Google Search Console

Google Search Console errors: how to identify and fix them

HTTP redirect types and codes: A complete technical guide for SEOs

Content decay: How to identify, fix, and prevent traffic loss

How to fix duplicate content SEO issues: A complete guide for 2026

The most useful RegEx for Google Search Console

SEO for lead generation: A complete strategy guide for 2026

Content cannibalization: What it is and how to fix it in 2026

Content marketing strategies: A complete guide for 2026

Comments

Leave a Reply Cancel reply

Every AI Crawler You Need to Know in 2026

1. OpenAI (ChatGPT / GPT-4.1 / GPT-5)

2. Anthropic (Claude)

3. Perplexity

4. Google (Gemini, AI Overviews, AI Mode)

5. Apple (Apple Intelligence)

6. Microsoft (Bing / Copilot / Edge Assistant)

7. You.com

8. Cohere

9. Common Crawl

10. Allen Institute (AI2 / Semantic Scholar)

11. Meta

12. ByteDance (TikTok / Toutiao / CapCut)

13. Amazon

14. DuckDuckGo

15. Diffbot

16. Omgili / Omgili Bot

17. Timpi (Decentralized Search)

Related articles

Comments

Leave a Reply Cancel reply