How to Block — or Allow — AI Crawlers in robots.txt

Every major AI engine fetches the web with a named user-agent, and a few lines in robots.txt decide whether its crawler gets in. This guide gives you a copy-paste template for blocking them, explains the visibility you give up when you do, and shows how to allow the ones you actually want.

The AI crawlers worth knowing

There is no single "AI bot" to allow or block. Each company runs its own crawler with its own user-agent, and they do different jobs. OpenAI uses GPTBot to collect training data, OAI-SearchBot for the ChatGPT Search index, and ChatGPT-User when someone asks ChatGPT to open a link live. Anthropic runs ClaudeBot and Perplexity runs PerplexityBot. Google splits its controls: Googlebot handles search and AI Overviews, while Google-Extended, a robots.txt token rather than a separate crawler, governs Gemini training and grounding. Below those sit the dataset crawlers and other vendor tokens: CCBot from Common Crawl, which feeds many models at once, plus Bytespider, Meta-ExternalAgent, Amazonbot and Applebot-Extended.

Blocking them: the template

robots.txt controls these bots by user-agent. A group is one or more User-agent lines followed by the rules that apply to them, so to shut crawlers out completely you name them in a group and disallow the whole site. Here is a template that blocks the common AI crawlers while leaving normal search untouched:

# Block AI training & answer-engine crawlers
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
Disallow: /

# Keep normal search working
User-agent: Googlebot
Allow: /

User-agent: *
Allow: /

Note that Googlebot is left allowed on purpose. Blocking it would pull you out of Google Search itself, which is almost never what people mean when they say they want to "block AI."
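One way to sanity-check the template before deploying it is Python's standard-library urllib.robotparser. The sketch below is an illustration, not a reproduction of any crawler's own logic: is_allowed is a hypothetical helper, example.com is a placeholder URL, Bingbot stands in for any crawler the template does not name, and Python's parser matches user-agents by simple substring, which may differ in edge cases from how each crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The template from above, pasted in as a string.
TEMPLATE = """\
# Block AI training & answer-engine crawlers
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
Disallow: /

# Keep normal search working
User-agent: Googlebot
Allow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Print a quick allowed/blocked report for a few user-agents.
for bot in ("GPTBot", "ClaudeBot", "Google-Extended", "Googlebot", "Bingbot"):
    status = "allowed" if is_allowed(TEMPLATE, bot) else "blocked"
    print(f"{bot:16} {status}")
```

Under these assumptions the named AI crawlers come back blocked, while Googlebot and any crawler the template does not name fall through to the groups that allow everything.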

The trade-off nobody mentions up front

Blocking is easy; the consequence is the part to think through. When you disallow the answer-engine crawlers, you are not just opting out of model training — you are opting out of being cited in AI search. An engine cannot quote, summarise or link a page it was never allowed to fetch. So if a customer asks ChatGPT or Perplexity for "the best tool for X" and you have blocked their crawlers, you simply are not in the running, while your competitors who left the door open are.

That is a legitimate choice for some publishers who would rather protect their content than be summarised. But it should be a decision you make on purpose, not a default you inherited from a robots.txt template you copied years ago. Sites can end up blocking AI engines without realising it, for instance when a blanket Disallow under User-agent: * applies to every crawler that has no group of its own.
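To see how an inherited file can block AI engines by accident, here is a minimal sketch using Python's standard-library urllib.robotparser. The LEGACY file is a made-up example of an old "Google only" setup, and blocked_bots and AI_BOTS are hypothetical names chosen for illustration. Python's parser is only an approximation of how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical legacy file: written to keep everything except Google out.
LEGACY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# User-agents named in this guide; extend the list as needed.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Google-Extended", "CCBot",
]


def blocked_bots(robots_txt: str, bots: list[str], path: str = "/") -> list[str]:
    """Return the subset of bots that may not fetch path under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, path)]


print(blocked_bots(LEGACY, AI_BOTS))
```

None of the AI crawlers is named in the legacy file, yet every one of them is reported as blocked, because a crawler without its own group falls back to the wildcard group.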

Allowing them: usually the better default

If your goal is visibility, and for most marketing sites it is, the right move is to make sure none of the answer-engine crawlers are disallowed. There is rarely anything to add; you simply remove any rule that blocks them. Allowing access is only the first step, though. Those crawlers also need to actually read your pages, which means your content has to be present in the HTML rather than painted in by JavaScript, and it helps to mark pages up with structured data such as schema.org JSON-LD.
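A rough way to test the "present in the HTML" condition is to parse the raw, unrendered markup and look for a phrase you expect on the page, along with any JSON-LD blocks. The sketch below uses only Python's standard library; PageAudit, audit_html and both sample pages are hypothetical, and it assumes a crawler that does not execute JavaScript, which may not hold for every engine.

```python
import json
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Collects visible text and JSON-LD @type values from unrendered HTML."""

    def __init__(self):
        super().__init__()
        self.text_parts = []    # text found outside <script>/<style>
        self.schema_types = []  # @type values from JSON-LD blocks
        self._in_skip = False   # True while inside <script> or <style>
        self._jsonld = None     # buffer while inside a JSON-LD <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._in_skip = True
            if tag == "script" and dict(attrs).get("type") == "application/ld+json":
                self._jsonld = []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._in_skip = False
        if tag == "script" and self._jsonld is not None:
            self._record_schema("".join(self._jsonld))
            self._jsonld = None

    def handle_data(self, data):
        if self._jsonld is not None:
            self._jsonld.append(data)
        elif not self._in_skip and data.strip():
            self.text_parts.append(data.strip())

    def _record_schema(self, raw):
        try:
            data = json.loads(raw)
        except ValueError:
            return  # malformed JSON-LD is ignored in this sketch
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and "@type" in item:
                self.schema_types.append(item["@type"])


def audit_html(html: str, must_contain: str) -> dict:
    """Report whether must_contain appears as text in the raw HTML."""
    parser = PageAudit()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts).lower()
    return {"text_in_html": must_contain.lower() in text,
            "schema_types": parser.schema_types}


# Hypothetical pages: one server-rendered, one filled in by JavaScript.
SERVER_RENDERED = (
    "<html><body><h1>Acme Widgets</h1><p>Pricing starts at $10.</p>"
    '<script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Product", "name": "Acme Widget"}'
    "</script></body></html>"
)
JS_ONLY = (
    '<html><body><div id="root"></div><script>'
    'document.getElementById("root").textContent = "Pricing starts at $10.";'
    "</script></body></html>"
)

print(audit_html(SERVER_RENDERED, "Pricing starts"))
print(audit_html(JS_ONLY, "Pricing starts"))
```

In the second page the phrase exists only inside a script, so a fetch that does not run JavaScript never sees it as page content.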

The AI readiness checker shows all of this in one pass — which bots are allowed or blocked, whether your homepage renders without JavaScript, and whether your structured data is in place. For the wider llms.txt question, the llms.txt guide covers what that file does and does not do, and the robots.txt tester helps you check individual rules.

Frequently asked questions

How do I block AI crawlers like GPTBot or ClaudeBot?

Add a group to your robots.txt that names the user-agent and disallows everything, for example "User-agent: GPTBot" followed by "Disallow: /". Repeat for each bot you want to keep out — GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot and so on. The major AI crawlers honour these rules.

Does blocking AI crawlers hurt my SEO?

It does not affect classic Google rankings, because Googlebot is separate from Google-Extended. But blocking the answer-engine crawlers removes you from AI search surfaces like ChatGPT Search, Perplexity and, where Google-Extended is involved, parts of Gemini. You are trading AI visibility for content control, so it should be a deliberate decision.

What is the difference between Google-Extended and Googlebot?

Googlebot crawls for Google Search and feeds AI Overviews; blocking it removes you from Google entirely, which almost nobody wants. Google-Extended is a separate control that governs whether your content is used for Gemini grounding and model training. You can allow Googlebot and disallow Google-Extended if you want normal search but not AI training.

Will AI crawlers actually obey robots.txt?

The major, named crawlers — GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, Applebot-Extended, Bytespider and others — publicly state they respect robots.txt user-agent rules. Less reputable scrapers may ignore it, which is a separate problem robots.txt cannot solve on its own.
