AI Crawlers, robots.txt, and Content Signals

Not all AI crawlers are the same. If you want to clearly separate visibility, training, and user-triggered requests, you need more than just a knee-jerk robots.txt block.

This article was last updated on June 18, 2026.

Written by Saskia Teichmann
on June 18, 2026
Image: Humorous 1950s-style advertising poster about AI crawlers, robots.txt, and cleanly separated bot purposes.

As of June 2026. As soon as website operators hear about AI crawlers, one of two things often happens: Either everything is blocked immediately because “the AI shouldn’t just steal everything,” or everything remains open because visibility sounds like a good idea. Both approaches are too simplistic.

The better approach is less dramatic and much more useful: First, understand what each bot does. Search, training, user-triggered retrieval, ad verification, testing tools, and audit crawls are not the same thing. If you lump everything together, you’ll either unnecessarily reduce your visibility or leave things unchecked that actually need to be monitored.

The Summary

  • robots.txt controls crawling, not visibility. A blocked URL can still appear in search results if it is linked to from an external source.
  • robots.txt is not a privacy shield. Private content should be protected by a login, password, or stored on non-public systems—not just by a "disallow" rule.
  • AI crawlers have different tasks. Search bots, training bots, and user-triggered fetches must be evaluated separately.
  • Google-Extended is not a separate, visible crawler. It is a control token in robots.txt and, according to Google, affects Gemini training and grounding, not Google Search rankings.
  • Blocking search bots can result in a loss of AI visibility. Those who allow training bots are making a different choice. It is precisely this distinction that is important.
  • Content signals remain crucial. Clear content, good structure, clean schema data, sitemaps, internal links, and machine-readable versions are more helpful than frantic "bot bingo."

My recommendation: Treat robots.txt like a door sign, not a safe. For visibility, you need accessibility and positive signals. For protection, you need real access control. These are two different areas to focus on.

What robots.txt Actually Does

The file robots.txt is located in the root directory of your website, for example at https://example.com/robots.txt. Reputable crawlers read it before fetching pages. It specifies which areas a particular user-agent is allowed to crawl and which it is not.

Google describes robots.txt in very straightforward terms: The file tells search engine crawlers which URLs they are allowed to access. Its main purpose is to control crawler traffic so that servers aren't unnecessarily overloaded. It is not intended to keep websites out of Google.

That may sound like a small thing, but it’s half the battle. robots.txt is a crawling rule. It answers the question: “Is this bot allowed to access this URL?” It does not automatically answer: “Is this URL allowed to appear in search results?” And it certainly does not answer: “Is this content private?”
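
To make the “is this bot allowed to access this URL?” question tangible, here is a minimal sketch using Python's standard-library urllib.robotparser. The rules and the helper name is_crawl_allowed are invented for illustration. Python's parser is only an approximation of real crawler behavior: it applies the first matching rule in a group, while individual crawlers may resolve Allow/Disallow conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, not a recommendation.
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/

User-agent: GPTBot
Disallow: /
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Answer only the crawling question: may this user-agent fetch this URL?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "OAI-SearchBot"):
        for url in ("https://example.com/blog/", "https://example.com/internal/report"):
            print(agent, url, is_crawl_allowed(ROBOTS_TXT, agent, url))
```

Note what the function does not tell you: whether the URL may appear in search results, and whether the content is private. Those questions need other mechanisms.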

What robots.txt Doesn't Do

The most common mistake is confusing “do not crawl,” “do not index,” “do not display,” and “do not use.” Those are different goals.

  • Do not crawl: A bot should not retrieve a URL. That's what robots.txt is for.
  • Do not index: You don't want a URL to appear in search results. For that, you usually need noindex or actual removal of the page.
  • Not open to the public: Content should remain private. To do this, you'll need a login, password protection, access control, or a non-public storage location.
  • Do not use for training: There is no universal rule for this. However, some providers offer their own user-agent tokens, for example GPTBot, ClaudeBot, or Google-Extended.
  • Do not appear in AI Search: Here, too, there is no universal rule. However, search bots are relevant for some providers, for example OAI-SearchBot, Claude-SearchBot, or PerplexityBot.

What’s particularly tricky is that, according to Google’s own documentation, if you block a page via robots.txt, Google may still find the URL if other pages link to it. In that case, the URL may appear in search results without a snippet. That’s usually not what website owners want.

If something really shouldn't be public, it doesn't just belong in robots.txt. It belongs behind access control. Period. robots.txt is a guideline for polite crawlers, not a security system.
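
The difference between “do not crawl” and “do not index” can be sketched as a small diagnostic. It assumes the noindex signal is delivered via the X-Robots-Tag response header (a robots meta tag in the HTML would serve the same purpose) and that headers arrive as a plain dict with exactly that key. The function name, the return codes, and the rules are hypothetical. The point is the conflict case: a crawler that is not allowed to fetch a page cannot see a noindex on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a thank-you page is hidden via robots.txt only.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /thanks/
"""


def diagnose_noindex(robots_txt: str, user_agent: str, url: str, headers: dict) -> str:
    """Return a short code describing how crawl rules and noindex interact."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable = parser.can_fetch(user_agent, url)
    has_noindex = "noindex" in headers.get("X-Robots-Tag", "").lower()

    if has_noindex and not crawlable:
        # The crawler never fetches the page, so it never sees the noindex.
        return "conflict"
    if has_noindex:
        return "noindex"
    if not crawlable:
        # Crawling is blocked, but the bare URL can still surface via external links.
        return "blocked"
    return "indexable"


if __name__ == "__main__":
    print(diagnose_noindex(EXAMPLE_RULES, "Googlebot",
                           "https://example.com/thanks/", {"X-Robots-Tag": "noindex"}))
```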

Four cases that must be clearly distinguished

For AI Visibility, distinguishing between bot purposes has become more important than the individual bot name. In practice, there are four scenarios:

Case | What it is about | Typical decision
Search crawler | Content is found and linked for search or answer interfaces. | Generally allow for public, important content.
Training crawler | Data can be collected for model training or model improvement. | Decide deliberately; often more restrictive than for search.
User-triggered fetch | A person asks an AI system to retrieve a specific URL or source. | Don't block on autopilot, but protect sensitive areas.
Tool, audit, or product crawler | A service checks, renders, tests, or analyzes pages on behalf of a client. | Allow only if the purpose and origin are plausible.

This is exactly where things get more interesting in 2026 than they used to be. In the past, robots.txt was a secondary SEO consideration for many WordPress websites. Today, that same file can influence whether content is more easily discoverable in ChatGPT Search, Claude Search, Perplexity, or Gemini-like features, and whether it is made available for training. On top of that, WAF rules can accidentally block legitimate AI bots.

Googlebot, Google-Extended, and Google-CloudVertexBot

At Google, this distinction is particularly important because many debates here are surprisingly vague.

Googlebot is the classic Google crawler for Google Search. According to Google, rules for Googlebot affect Google Search, including its search features, as well as other surfaces such as Discover, Google Images, Google Video, and Google News. So if you block Googlebot across the board, you’re not just blocking “AI,” but also your normal visibility on Google.

Google-Extended is something else. According to Google, Google-Extended does not have its own HTTP user-agent. Crawling is performed using existing Google user-agents; Google-Extended is a robots.txt token used for control purposes. It is intended to allow publishers to control whether content already crawled by Google may be used to train future Gemini models and for grounding in Gemini Apps and Vertex AI. Google also explicitly states that Google-Extended neither influences inclusion in Google Search nor is used as a ranking signal in Google Search.

According to Google’s documentation, Google-CloudVertexBot handles crawls that website operators request in order to build Vertex AI agents. This, too, has no effect on Google Search. If an organization builds its own agents using Vertex AI, that bot may be relevant. For a typical WordPress blog, however, it’s not the key factor that determines its visibility on Google.

The practical takeaway: Google isn’t just a single AI switch. Googlebot, Google-Extended, and Google-CloudVertexBot all have different meanings. If you shut everything down across the board out of frustration, you’ll also be making decisions that affect classic search, images, news, Gemini usage, and agent workflows. This isn’t something you should do on a whim.

OpenAI: OAI-SearchBot, GPTBot, and ChatGPT-User

OpenAI distinguishes between these purposes fairly clearly in its own crawler documentation.

  • OAI-SearchBot: for ChatGPT Search. According to OpenAI, sites that block this bot may no longer appear in ChatGPT search answers, although they may still show up as navigational links.
  • GPTBot: for content that can be used to train generative foundation models. A "Disallow" directive for GPTBot indicates that the content should not be used for training.
  • ChatGPT-User: for certain user actions in ChatGPT and Custom GPTs. These requests are user-triggered, not automated web crawling. OpenAI therefore notes that robots.txt rules do not always apply in these cases.

This is a pretty important distinction for website owners. For example, you might say: “I want my content to be discoverable in ChatGPT Search, but I don’t want to make it available for training.” In that case, one possible pattern would be: allow OAI-SearchBot, block GPTBot. Whether this is the right strategic move depends on your website, your content, and your risk tolerance. But at least it's a clear-cut decision.
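
The “allow search, block training” pattern can be written down as an explicit policy and rendered into robots.txt groups. This is a rough sketch, not a generator I would ship unchanged: the function name build_robots_txt and the policy format are assumptions, and it only handles whole-site allow or block per token.

```python
from typing import Dict, Optional


def build_robots_txt(policy: Dict[str, bool], sitemap_url: Optional[str] = None) -> str:
    """Render one robots.txt group per token: True allows the whole site, False blocks it."""
    groups = []
    for token, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    text = "\n\n".join(groups) + "\n"
    if sitemap_url:
        text += f"\nSitemap: {sitemap_url}\n"
    return text


if __name__ == "__main__":
    # The decision from the paragraph above: discoverable in search, not used for training.
    print(build_robots_txt({"OAI-SearchBot": True, "GPTBot": False}))
```

Keeping the decision in a small policy structure like this also makes it easier to document why a token was allowed or blocked.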

What you should avoid: blocking all OpenAI user agents across the board and then wondering why your public expertise doesn’t show up in ChatGPT Search. You can’t say “Please find me” and “Please never show me” at the same time and expect that to result in reliable visibility.

Claude, Perplexity, and Other AI Crawlers

Anthropic also describes several bots for Claude: ClaudeBot for model training or model improvement, Claude-SearchBot for search quality, and Claude-User for user-triggered queries. According to Anthropic, blocking Claude-SearchBot can reduce the visibility and accuracy of Claude search results. Blocking ClaudeBot, on the other hand, signals that future content should not be included in training datasets.

Perplexity distinguishes between PerplexityBot for search and Perplexity-User for user actions. In addition, Perplexity points out that WAF rules should not just blindly check user-agent strings, but ideally also take official IP ranges into account. This is a small detail, but an important one: Anyone can claim to have a particular user-agent string. For reliable bot detection, a plausible-looking name in the log isn’t enough.
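
The idea of checking the IP range in addition to the user-agent string might look roughly like this. The ranges below are placeholders from the reserved documentation address blocks, not Perplexity's real ranges; the actual values would have to come from the provider's official list. The names PUBLISHED_RANGES and is_verified_bot are hypothetical.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation addresses), NOT real provider ranges.
PUBLISHED_RANGES = {
    "PerplexityBot": ["192.0.2.0/24"],
}


def is_verified_bot(user_agent: str, client_ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """True only if the user-agent names a known bot AND the IP is in its listed ranges."""
    ip = ipaddress.ip_address(client_ip)
    ua = user_agent.lower()
    for token, cidrs in ranges.items():
        if token.lower() in ua:
            return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
    # Unknown user-agent: nothing to verify against.
    return False


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"  # invented example string
    print(is_verified_bot(ua, "192.0.2.15"))   # claimed name and listed range match
    print(is_verified_bot(ua, "203.0.113.9"))  # same name, IP outside the listed range
```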

And then there are many other types of requests: SEO tools, monitoring services, preview bots, social bots, security scanners, fraudulent scrapers, internal crawlers, and hosting checks. Not every bot with “AI” in its name is strategically important. Not every unknown bot is harmless. So the task isn’t to memorize a massive list, but to clearly define your own goals.
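
One way to keep purposes rather than names in focus is a small lookup that maps the tokens named in this article to a purpose and treats everything else as unknown. This is a sketch under assumptions: the purpose labels and the function name are my own, the user-agent strings in the example are invented, and a matching name alone proves nothing about who is actually sending the request.

```python
# Tokens as named by the providers; purposes follow the distinctions in this article.
# Google-Extended is missing on purpose: it is a robots.txt token, not a user-agent.
BOT_PURPOSES = {
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "ChatGPT-User": "user-triggered",
    "Claude-User": "user-triggered",
    "Perplexity-User": "user-triggered",
}


def classify_user_agent(user_agent_header: str) -> str:
    """Map a User-Agent header to a purpose, or 'unknown' if no token matches."""
    ua = user_agent_header.lower()
    for token, purpose in BOT_PURPOSES.items():
        if token.lower() in ua:
            return purpose
    return "unknown"


if __name__ == "__main__":
    print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(classify_user_agent("SomeUnknownCrawler/2.0"))
```

Everything that lands in "unknown" is exactly the group that needs a decision based on your own goals rather than on a list.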

Content Signals Instead of Reflexive Blocking

AI Visibility isn't just about who's allowed to crawl. It’s also about what a system finds when it crawls. A website can be technically accessible yet still difficult to understand. In that case, it’s like a store with an open door but no signage, price tags, or lighting. Very accessible, but not very helpful.

In its own guide to generative AI search, Google states that basic SEO work still matters: helpful, unique, well-organized content that doesn’t just recycle what’s already out there. That’s exactly where the key lies. AI systems need not only access, but also actionable signals.

  • Clear page structure: Descriptive titles, meaningful subheadings, and clear paragraphs.
  • Clean Entities: Who is the person, organization, brand, service, or product?
  • Quotable statements: Specific answers, clear definitions, data, examples, and limitations.
  • Current status: Visible publication and update dates, well-maintained content, no "zombie" guides from 2018.
  • Schema data: not as "ranking magic," but as a machine-readable link between content, author, organization, and product.
  • Sitemaps: so that important content remains easy to find and doesn't get lost in the maze of the archive.
  • Internal links: Clusters, pillars, FAQs, product pages, and guides should all explain one another.
  • Machine-readable versions: llms.txt, Markdown or other simplified versions can provide context. However, they are no substitute for an access policy.
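
To illustrate what “a machine-readable link between content, author, organization” can mean, here is a minimal sketch that assembles a schema.org Article object as JSON-LD. The property names are standard schema.org vocabulary; the function name and the publisher value are placeholders, and a real WordPress site would normally have an SEO plugin generate this graph.

```python
import json


def build_article_jsonld(headline: str, author_name: str, date_published: str,
                         date_modified: str, publisher_name: str) -> dict:
    """Assemble a minimal schema.org Article that links content, author, and organization."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": date_published,
        "dateModified": date_modified,
        "author": {"@type": "Person", "name": author_name},
        "publisher": {"@type": "Organization", "name": publisher_name},
    }


if __name__ == "__main__":
    data = build_article_jsonld(
        headline="AI Crawlers, robots.txt, and Content Signals",
        author_name="Saskia Teichmann",
        date_published="2026-06-18",
        date_modified="2026-06-18",
        publisher_name="Example Publisher",  # placeholder, not taken from the article
    )
    # This JSON would go into a <script type="application/ld+json"> element.
    print(json.dumps(data, indent=2))
```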

This also serves as a bridge to the previous article about Schema, Entities, and Citable Content. If a bot is allowed to crawl but finds only conflicting signals, there’s little to gain. If it’s allowed to crawl and finds clear signals, that access at least becomes a viable opportunity.

WordPress Checklist

For WordPress websites, I would take the following pragmatic approach:

  1. Clarify public goals: What content should be searchable on Google, ChatGPT Search, Claude, Perplexity, and other response systems?
  2. Truly protect private content: Customer data, internal documents, staging areas, and unreleased downloads should be protected by a login or password, not just listed in robots.txt.
  3. Decide on training separately: Do you want to allow, block, or handle training crawlers differently?
  4. Don't accidentally block search bots: If AI Search is a goal, check whether search bots such as OAI-SearchBot, Claude-SearchBot, or PerplexityBot are actually allowed.
  5. Don't shut out Googlebot: Don't block Googlebot across the board if maintaining normal visibility on Google is important.
  6. Don't block CSS, JavaScript, and images unnecessarily: If a page becomes difficult to understand without resources, you also make it harder for machines to categorize it.
  7. Use noindex strategically: It’s better to properly set “noindex” on tag archives, thin search pages, internal thank-you pages, and duplicate content than to half-heartedly hide them using robots.txt.
  8. Check sitemaps: Are important posts, pages, products, categories, and media included correctly? Have unimportant sections been excluded?
  9. Check the schema: Are there multiple SEO plugins, e-commerce plugins, or AI plugins that generate conflicting JSON-LD graphs?
  10. Monitor logs: Which bots actually get through? Which ones are blocked by firewalls, caches, security plugins, or hosting rules?
  11. Position llms.txt and Markdown correctly: Use them as a context and orientation layer, not for rights management.
  12. Document changes: robots.txt rules can affect visibility. That's why they belong in a change log, not in a spur-of-the-moment Friday-night decision.

A Useful robots.txt Example

This isn't a one-size-fits-all template to copy, but rather a thought experiment. For many public advice, service, or product websites, a more nuanced structure may make more sense than “all open” or “all closed.”

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://example.com/sitemap_index.xml

What this example illustrates: Traditional crawlers and search bots are allowed to find public content. Training tokens are treated more restrictively. Whether this is the right approach for your website depends on what you publish. A photographer, a legal publisher, a SaaS provider, a WooCommerce store, and a local trades business do not automatically have the same bot policy.

It’s also important to note that some providers distinguish between automated crawling and user-triggered requests. That’s exactly why robots.txt isn’t the only layer of control. WAF rules, IP verification, login protection, consent and data protection issues, server logs, and content strategy all need to be taken into account.

My Thoughts on citelayer®

From the perspective of my citelayer® AI Visibility Audit, robots.txt is only part of the diagnosis. I don’t just want to know whether a bot is theoretically allowed to access the site. I want to know what happens in practice: Do relevant bots actually reach the site? Are they blocked by firewall rules? Do they see the correct content? Do the sitemap, schema, canonicals, internal links, llms.txt, and visible content all align?

Especially with WordPress, I often don’t see a single major problem, but rather many small inconsistencies: the SEO plugin says A, the shop plugin says B, the security plugin blocks C, the cache returns D, and the robots.txt file still contains an old entry from a long-forgotten migration. That’s nothing spectacular. Unfortunately, it’s exactly the kind of mess that causes automated classification to fail.

citelayer® for WordPress bridges precisely this gap between traditional SEO plugins and AI Visibility: machine-readable context layers, llms.txt, Schema context, bot signals, and a better foundation for audits. But the same principle applies here: A plugin can provide structure. The strategic decision regarding which content should be visible, citable, protected, or excluded from training remains an editorial and business decision.

FAQ

Should I block all AI crawlers?

Not across the board. If you want to be visible in AI Search, you shouldn't reflexively block search crawlers. You can evaluate training crawlers separately. Regardless of that, private content should be subject to proper access controls.

Is robots.txt legally binding?

robots.txt is a technical standard—or rather, a convention—governing crawler behavior; it is not a safe or a source of legal advice. Reputable crawlers respect the rules. Others may ignore them. If legal issues are a concern, you’ll need additional legal review and genuine technical safeguards.

What is the difference between GPTBot and OAI-SearchBot?

OpenAI describes GPTBot as a crawler for content that can be used to train generative foundation models. OAI-SearchBot, on the other hand, is designed for ChatGPT Search. So, in theory, you can allow Search while blocking training.

Does Google-Extended affect my Google ranking?

According to Google, no. Its documentation states that Google-Extended affects neither inclusion in Google Search nor ranking there. It controls whether content crawled by Google may be used for certain Gemini and Vertex AI applications.

Does llms.txt replace my robots.txt?

No. robots.txt controls crawling rules. llms.txt serves as a guidance layer for AI systems and agents: important pages, context, summaries, machine-readable entry points. One is more about “where are you allowed to go?”, while the other is more about “this is what’s important here.”

Why should I check bot logs?

Because robots.txt only indicates your intent. Logs show what actually happens: which bots visit your site, which URLs they access, which status codes they receive, which firewall rules are triggered, and which important content is never reached.
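
A first pass over the logs does not need special tooling. The following sketch counts status codes per bot token, assuming access logs in the common “combined” format; the sample lines, the regular expression, and the watch list are illustrative assumptions and will need adjusting to your server's real log format.

```python
import re
from collections import Counter

# Matches the request, status, size, referer, and user-agent parts of a combined-format line.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

WATCHED_TOKENS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot", "PerplexityBot"]

# Invented sample lines for illustration.
SAMPLE_LINES = [
    '192.0.2.10 - - [18/Jun/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.11 - - [18/Jun/2026:10:00:05 +0000] "GET /blog/ HTTP/1.1" 403 310 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.12 - - [18/Jun/2026:10:00:09 +0000] "GET /robots.txt HTTP/1.1" 200 420 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]


def count_bot_statuses(log_lines) -> Counter:
    """Count (bot token, HTTP status) pairs for watched bots; other lines are skipped."""
    counts = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue
        ua = match.group("ua").lower()
        for token in WATCHED_TOKENS:
            if token.lower() in ua:
                counts[(token, int(match.group("status")))] += 1
                break
    return counts


if __name__ == "__main__":
    for (token, status), n in sorted(count_bot_statuses(SAMPLE_LINES).items()):
        print(token, status, n)
```

A search bot that robots.txt allows but that mostly receives 403 responses is a typical sign that a firewall or security plugin is overriding your stated intent.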

Saskia Teichmann

Saskia Teichmann is a certified AI strategist (MMAI®) and full stack web developer. She supports SMEs and industry in integrating AI, GDPR, the EU AI Regulation and modern web technologies into a future-proof, legally compliant digital strategy.

To put it simply:
As a technical reality translator, she works at the interface of AI, web development and operational reality. She develops AI-supported workflows for companies and agencies - with the aim of ensuring that technology not only impresses in the demo, but also works in everyday life.
