{"id":3795,"date":"2026-06-18T17:17:34","date_gmt":"2026-06-18T15:17:34","guid":{"rendered":"https:\/\/isla-stud.io\/?p=3795"},"modified":"2026-06-18T17:19:54","modified_gmt":"2026-06-18T15:19:54","slug":"ai-crawlers-robots-txt-content-signals","status":"publish","type":"post","link":"https:\/\/isla-stud.io\/en\/ai-visibility\/ai-crawler-robots-txt-content-signale\/","title":{"rendered":"AI Crawlers, robots.txt, and Content Signals"},"content":{"rendered":"<p class=\"wp-block-paragraph\"><strong>As of June 2026.<\/strong> As soon as website operators hear about AI crawlers, one of two things often happens: either everything is blocked immediately because \u201cthe AI shouldn\u2019t just steal everything,\u201d or everything remains open because visibility sounds like a good idea. Both approaches are too simplistic.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The better approach is less dramatic and much more useful: First, understand what each bot does. Search, training, user-triggered retrieval, ad verification, testing tools, and audit crawls are not the same thing. 
If you lump everything together, you\u2019ll either reduce your visibility unnecessarily or leave things unchecked that actually need to be monitored.<\/p>\n\n\n\n<div class=\"wp-block-rank-math-toc-block\" id=\"rank-math-toc\"><h2>Table of contents<\/h2><nav><ul><li><a href=\"#kurzfassung\">The Summary<\/a><\/li><li><a href=\"#robots-txt\">What robots.txt Actually Does<\/a><\/li><li><a href=\"#nicht-macht\">What robots.txt Doesn't Do<\/a><\/li><li><a href=\"#vier-faelle\">Four cases that must be clearly distinguished<\/a><\/li><li><a href=\"#google\">Googlebot, Google-Extended, and Google-CloudVertexBot<\/a><\/li><li><a href=\"#openai\">OpenAI: OAI-SearchBot, GPTBot, and ChatGPT-User<\/a><\/li><li><a href=\"#claude-perplexity\">Claude, Perplexity, and Other AI Crawlers<\/a><\/li><li><a href=\"#content-signale\">Content Signals Instead of Reflex Blockage<\/a><\/li><li><a href=\"#wordpress-checkliste\">WordPress Checklist<\/a><\/li><li><a href=\"#beispiel\">A Useful robots.txt Example<\/a><\/li><li><a href=\"#citelayer\">My Thoughts on citelayer\u00ae<\/a><\/li><li><a href=\"#faq\">FAQ<\/a><\/li><li><a href=\"#quellen\">Sources and Verification<\/a><\/li><\/ul><\/nav><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"kurzfassung\">The Summary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>robots.txt controls crawling, not visibility.<\/strong> A blocked URL can still appear in search results if other pages link to it.<\/li>\n<li><strong>robots.txt is not a privacy shield.<\/strong> Private content should sit behind a login or password, or be stored on non-public systems, not just behind a \"Disallow\" rule.<\/li>\n<li><strong>AI crawlers have different tasks.<\/strong> Search bots, training bots, and user-triggered fetches must be evaluated separately.<\/li>\n<li><strong>Google-Extended is not a separate, visible crawler.<\/strong> It is a control token in robots.txt and, according to Google, affects Gemini training and grounding, not 
Google Search rankings.<\/li>\n<li><strong>Blocking search bots can result in a loss of AI visibility.<\/strong> Allowing or blocking training bots is a separate decision. That distinction is what matters.<\/li>\n<li><strong>Content signals remain crucial.<\/strong> Clear content, good structure, clean schema data, sitemaps, internal links, and machine-readable versions are more helpful than frantic \"bot bingo.\"<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">My recommendation: Treat robots.txt like a door sign, not a safe. For visibility, you need accessibility and positive signals. For protection, you need real access control. These are two separate jobs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"robots-txt\">What robots.txt Actually Does<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The file <code>robots.txt<\/code> is located in the root directory of your website, for example at <code>https:\/\/example.com\/robots.txt<\/code>. Reputable crawlers read it before fetching pages. It specifies which areas a particular user-agent is allowed to crawl and which it is not.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Google describes robots.txt in very straightforward terms: The file tells search engine crawlers which URLs they are allowed to access. Its main purpose is to manage crawler traffic so that servers aren't overloaded with requests. It is not a mechanism for keeping web pages out of Google.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That may sound like a small thing, but it\u2019s half the battle. robots.txt is a crawling rule. 
It answers the question: \u201cIs this bot allowed to access this URL?\u201d It does not automatically answer: \u201cIs this URL allowed to appear in search results?\u201d And it certainly does not answer: \u201cIs this content private?\u201d<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"nicht-macht\">What robots.txt Doesn't Do<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The most common mistake is confusing <em>Do not crawl<\/em>, <em>Do not index<\/em>, <em>Do not display<\/em>, and <em>Do not use<\/em>. Those are different goals.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Do not crawl:<\/strong> A bot should not retrieve a URL. That's what robots.txt is for.<\/li>\n<li><strong>Do not index:<\/strong> You don't want a URL to appear in search results. For that, you usually need <code>noindex<\/code> or actual removal of the page.<\/li>\n<li><strong>Do not display publicly:<\/strong> Content should remain private. For that, you need a login, password protection, access control, or a non-public storage location.<\/li>\n<li><strong>Do not use for training:<\/strong> For that, some providers offer their own user-agent tokens, for example <code>GPTBot<\/code>, <code>ClaudeBot<\/code>, or <code>Google-Extended<\/code>.<\/li>\n<li><strong>Do not appear in AI Search:<\/strong> Here, the search bots of some providers are what matter, for example <code>OAI-SearchBot<\/code>, <code>Claude-SearchBot<\/code>, or <code>PerplexityBot<\/code>.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">What\u2019s particularly tricky is that, according to Google\u2019s own documentation, if you block a page via robots.txt, Google may still find the URL if other pages link to it. In that case, the URL may appear in search results without a snippet. That\u2019s usually not what website owners want.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If something really shouldn't be public, a robots.txt rule is not enough. It belongs behind access control. Period. 
robots.txt is a guideline for polite crawlers, not a security system.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"vier-faelle\">Four cases that must be clearly distinguished<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">For AI Visibility, the purpose of a bot has become more important than its individual name. In practice, there are four scenarios:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Case<\/th><th>What it covers<\/th><th>Typical decision<\/th><\/tr><\/thead><tbody><tr><td>Search crawler<\/td><td>Content is found and linked for search or answer interfaces.<\/td><td>Generally allow for public, important content.<\/td><\/tr><tr><td>Training crawler<\/td><td>Data can be collected for model training or model improvement.<\/td><td>Decide deliberately; often more restrictive than for search.<\/td><\/tr><tr><td>User-triggered fetch<\/td><td>A person asks an AI system to retrieve a specific URL or source.<\/td><td>Don't block on autopilot, but protect sensitive areas.<\/td><\/tr><tr><td>Tool, audit, or product crawler<\/td><td>A service checks, renders, tests, or analyzes pages on behalf of a client.<\/td><td>Only allow it if the purpose and source are plausible.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This is exactly where things are more interesting in 2026 than they used to be. In the past, robots.txt was primarily a secondary SEO consideration for many WordPress websites. 
Today, that same file can influence whether content is more easily discoverable in ChatGPT Search, Claude Search, Perplexity, or Gemini-like features, and whether it\u2019s made available for training. On top of that, WAFs can accidentally block legitimate AI bots regardless of what the file says.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"google\">Googlebot, Google-Extended, and Google-CloudVertexBot<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">At Google, this distinction is particularly important because many debates here are surprisingly vague.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Googlebot<\/strong> is the classic Google crawler for Google Search. According to Google, rules for <code>Googlebot<\/code> affect Google Search, including its search features, as well as other surfaces such as Discover, Google Images, Google Video, and Google News. So if you block Googlebot across the board, you\u2019re not just blocking \u201cAI\u201d but also your normal visibility on Google.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Google-Extended<\/strong> is something else. According to Google, Google-Extended does not have its own HTTP user-agent string. Crawling is performed using existing Google user-agents; <code>Google-Extended<\/code> is a robots.txt token used for control purposes. It lets publishers control whether content already crawled by Google may be used to train future Gemini models and for grounding in Gemini Apps and Vertex AI. Google also explicitly states that Google-Extended neither influences inclusion in Google Search nor is used as a ranking signal in Google Search.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Google-CloudVertexBot<\/strong>, according to Google\u2019s documentation, crawls sites at the request of site owners who are building Vertex AI agents. This, too, has no effect on Google Search. If an organization builds its own agents using Vertex AI, that bot may be relevant. 
For a typical WordPress blog, however, it\u2019s not the key factor that determines its visibility on Google.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The practical takeaway: Google doesn\u2019t have a single AI switch. Googlebot, Google-Extended, and Google-CloudVertexBot each mean something different. If you shut everything down across the board out of frustration, you\u2019ll also be making decisions that affect classic search, images, news, Gemini usage, and agent workflows. This isn\u2019t something you should do on a whim.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"openai\">OpenAI: OAI-SearchBot, GPTBot, and ChatGPT-User<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">OpenAI distinguishes between these purposes fairly clearly in its own crawler documentation.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><code>OAI-SearchBot<\/code>:<\/strong> for ChatGPT Search. According to OpenAI, sites that block this bot may be excluded from ChatGPT search answers, although they can still appear as navigational links.<\/li>\n<li><strong><code>GPTBot<\/code>:<\/strong> for content that can be used to train generative foundation models. A \"Disallow\" directive for GPTBot indicates that the content should not be used for training.<\/li>\n<li><strong><code>ChatGPT-User<\/code>:<\/strong> for certain user actions in ChatGPT and Custom GPTs. These requests are user-triggered, not automated web crawling. OpenAI therefore notes that robots.txt rules do not always apply in these cases.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This is a pretty important distinction for website owners. For example, you might say: \u201cI want my content to be discoverable in ChatGPT Search, but I don\u2019t want to make it available for training.\u201d In that case, a possible setup would be to allow <code>OAI-SearchBot<\/code> and block <code>GPTBot<\/code>. Whether this is the right strategic move depends on your website, your content, and your risk tolerance. 
But at least it's a clear-cut decision.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What you should avoid: blocking all OpenAI user agents across the board and then wondering why your public expertise doesn\u2019t show up in ChatGPT Search. You can\u2019t say \u201cPlease find me\u201d and \u201cPlease never show me\u201d at the same time and expect that to result in reliable visibility.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"claude-perplexity\">Claude, Perplexity, and Other AI Crawlers<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Anthropic also describes several bots for Claude: <code>ClaudeBot<\/code> for model training or model improvement, <code>Claude-SearchBot<\/code> for search quality, and <code>Claude-User<\/code> for user-triggered queries. According to Anthropic, blocking Claude-SearchBot can reduce the visibility and accuracy of a site's content in Claude search results. Blocking ClaudeBot, on the other hand, signals that the site's future content should not be included in training datasets.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Perplexity separates <code>PerplexityBot<\/code> for search and <code>Perplexity-User<\/code> for user actions. In addition, Perplexity points out that WAF rules should not just blindly check user-agent strings, but ideally also take official IP ranges into account. This is a small detail, but an important one: anyone can send any user-agent string. For reliable bot detection, a familiar name in the log isn\u2019t enough.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And then there are many other types of requests: SEO tools, monitoring services, preview bots, social bots, security scanners, fraudulent scrapers, internal crawlers, and hosting checks. Not every bot with \u201cAI\u201d in its name is strategically important. Not every unknown bot is harmless. 
So the task isn\u2019t to memorize a massive list, but to clearly define your own goals.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"content-signale\">Content Signals Instead of Reflex Blockage<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">AI Visibility isn't just about who's allowed to crawl. It\u2019s also about what a system finds when it crawls. A website can be technically accessible yet still difficult to understand. In that case, it\u2019s like a store with an open door but no signage, price tags, or lighting. Very accessible, but not very helpful.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In its own guide to generative AI search, Google states that basic SEO work still matters: helpful, unique, well-organized content that doesn\u2019t just recycle what\u2019s already out there. That\u2019s exactly where the key lies. AI systems need not only access, but also actionable signals.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Clear page structure:<\/strong> Descriptive titles, meaningful subheadings, and clear paragraphs.<\/li>\n<li><strong>Clean Entities:<\/strong> Who is the person, organization, brand, service, or product?<\/li>\n<li><strong>Quotable statements:<\/strong> Specific answers, clear definitions, data, examples, and limitations.<\/li>\n<li><strong>Current status:<\/strong> Visible publication and update dates, well-maintained content, no \"zombie\" guides from 2018.<\/li>\n<li><strong>Schema data:<\/strong> not as \"ranking magic,\" but as a machine-readable link between content, author, organization, and product.<\/li>\n<li><strong>Sitemaps:<\/strong> so that important content remains easy to find and doesn't get lost in the maze of the archive.<\/li>\n<li><strong>Internal links:<\/strong> Clusters, pillars, FAQs, product pages, and guides should all explain one another.<\/li>\n<li><strong>Machine-readable versions:<\/strong> <a href=\"https:\/\/isla-stud.io\/en\/advisor\/llms-txt-wordpress\/\">llms.txt<\/a>, Markdown or other 
simplified versions can provide context. However, they are no substitute for an access policy.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This also ties in with the previous article about <a href=\"https:\/\/isla-stud.io\/en\/advisor\/schema-entities-quotable-content\/\">Schema, Entities, and Citable Content<\/a>. If a bot is allowed to crawl but finds only conflicting signals, there\u2019s little to gain. If it\u2019s allowed to crawl and finds clear signals, that access at least becomes a real opportunity.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"wordpress-checkliste\">WordPress Checklist<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">For WordPress websites, I would take the following pragmatic approach:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Clarify public goals:<\/strong> What content should be findable in Google, ChatGPT Search, Claude, Perplexity, and other answer systems?<\/li>\n<li><strong>Truly protect private content:<\/strong> Customer data, internal documents, staging areas, and unreleased downloads should be protected by a login or password, not just by a robots.txt rule.<\/li>\n<li><strong>Decide on training separately:<\/strong> Do you want to allow training crawlers, block them, or treat them selectively?<\/li>\n<li><strong>Don't accidentally block search bots:<\/strong> If AI Search is a goal, check whether search bots such as <code>OAI-SearchBot<\/code>, <code>Claude-SearchBot<\/code>, or <code>PerplexityBot<\/code> are allowed.<\/li>\n<li><strong>Don't lock out Googlebot:<\/strong> Don't block Googlebot across the board if maintaining normal visibility on Google is important.<\/li>\n<li><strong>Don't block CSS, JavaScript, and images unnecessarily:<\/strong> If a page becomes difficult to understand without its resources, you also make it harder for machines to categorize it.<\/li>\n<li><strong>Use <code>noindex<\/code> strategically:<\/strong> It\u2019s better to properly set \u201cnoindex\u201d on tag archives, 
thin search pages, internal thank-you pages, and duplicate content than to half-heartedly hide them using robots.txt.<\/li>\n<li><strong>Check sitemaps:<\/strong> Are important posts, pages, products, categories, and media included correctly? Have unimportant sections been excluded?<\/li>\n<li><strong>Check the schema:<\/strong> Are there multiple SEO plugins, e-commerce plugins, or AI plugins that generate conflicting JSON-LD graphs?<\/li>\n<li><strong>Monitor logs:<\/strong> Which bots actually get through? Which ones are blocked by firewalls, caches, security plugins, or hosting rules?<\/li>\n<li><strong>Organize llms.txt and Markdown:<\/strong> Use them as a context and orientation layer, not for rights management.<\/li>\n<li><strong>Document changes:<\/strong> robots.txt rules can affect visibility. That's why they belong in a change log and should not be a spur-of-the-moment Friday-night decision.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"beispiel\">A Useful robots.txt Example<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This isn't a one-size-fits-all template to copy, but rather a thought experiment. For many public advice, service, or product websites, a more nuanced structure may make more sense than \u201call open\u201d or \u201call closed.\u201d<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Default group: applies to crawlers that have no more specific group below\nUser-agent: *\nDisallow: \/wp-admin\/\nAllow: \/wp-admin\/admin-ajax.php\n\n# Training tokens: signal that content should not be used for model training\nUser-agent: GPTBot\nDisallow: \/\n\nUser-agent: ClaudeBot\nDisallow: \/\n\nUser-agent: Google-Extended\nDisallow: \/\n\n# Search bots: explicitly allowed to crawl public content\n# (a crawler that matches its own group follows only that group, not the * rules)\nUser-agent: OAI-SearchBot\nAllow: \/\n\nUser-agent: Claude-SearchBot\nAllow: \/\n\nUser-agent: PerplexityBot\nAllow: \/\n\nSitemap: https:\/\/example.com\/sitemap_index.xml<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">What this example illustrates: Traditional crawlers and search bots are allowed to find public content. Training tokens are treated more restrictively. Whether this is the right approach for your website depends on what you publish. 
A photographer, a legal publisher, a SaaS provider, a WooCommerce store, and a local trades business do not automatically need the same bot policy.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It\u2019s also important to note that some providers distinguish between automated crawling and user-triggered requests. That\u2019s exactly why robots.txt isn\u2019t the only layer of control. WAF rules, IP verification, login protection, consent and data protection issues, server logs, and content strategy all need to be taken into account.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"citelayer\">My Thoughts on citelayer\u00ae<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">From the perspective of my <a href=\"https:\/\/citelayer-ai.com\/services\/ai-visibility-audit\/\" target=\"_blank\" rel=\"noopener\">citelayer\u00ae AI Visibility Audit<\/a>, robots.txt is only part of the diagnosis. I don\u2019t just want to know whether a bot is theoretically allowed to access the site. I want to know what happens in practice: Do relevant bots actually reach the site? Are they blocked by firewall rules? Do they see the correct content? Do the sitemap, schema, canonicals, internal links, llms.txt, and visible content all align?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Especially with WordPress, I often don\u2019t see a single major problem, but rather many small inconsistencies: the SEO plugin says A, the shop plugin says B, the security plugin blocks C, the cache returns D, and the robots.txt file still contains an old entry from a long-forgotten migration. That\u2019s nothing spectacular. 
Unfortunately, it\u2019s exactly the kind of mess that causes automated classification to fail.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/citelayer.ai\/\" target=\"_blank\" rel=\"noopener\">citelayer\u00ae for WordPress<\/a> addresses precisely this gap between traditional SEO plugins and AI Visibility: machine-readable context layers, llms.txt, Schema context, bot signals, and a better foundation for audits. But the same principle applies here: A plugin can provide structure. The strategic decision regarding which content should be visible, citable, protected, or excluded from training remains an editorial and business decision.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"faq\">FAQ<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Should I block all AI crawlers?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Not across the board. If you want to be visible in AI Search, you shouldn't reflexively block search crawlers. You can evaluate training crawlers separately. Regardless of that, private content should be subject to proper access controls.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Is robots.txt legally binding?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">robots.txt is a technical standard, or rather a convention, governing crawler behavior; it is neither a safe nor a source of legal advice. Reputable crawlers respect the rules. Others may ignore them. If legal issues are a concern, you\u2019ll need additional legal review and genuine technical safeguards.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What is the difference between GPTBot and OAI-SearchBot?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">OpenAI describes GPTBot as a crawler for content that can be used to train generative foundation models. OAI-SearchBot, on the other hand, is designed for ChatGPT Search. 
So, in theory, you can allow search while blocking training.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Does Google-Extended affect my Google ranking?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">No. According to Google's documentation, Google-Extended affects neither inclusion nor ranking in Google Search. It controls whether content crawled by Google may be used for certain Gemini and Vertex AI applications.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Does llms.txt replace my robots.txt?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">No. robots.txt sets crawling rules. llms.txt serves as a guidance layer for AI systems and agents: important pages, context, summaries, machine-readable entry points. One is more about \u201cwhere are you allowed to go?\u201d while the other is more about \u201cthis is what\u2019s important here.\u201d<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Why should I check bot logs?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Because robots.txt only indicates your intent. 
Logs show what actually happens: which bots visit your site, which URLs they access, which status codes they receive, which firewall rules are triggered, and which important content is never reached.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"quellen\">Sources and Verification<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Google Search Central: <a href=\"https:\/\/developers.google.com\/search\/docs\/crawling-indexing\/robots\/intro\" target=\"_blank\" rel=\"noopener\">Introduction to robots.txt<\/a> and the limitations of robots.txt.<\/li>\n<li>Google Crawling Infrastructure: <a href=\"https:\/\/developers.google.com\/crawling\/docs\/crawlers-fetchers\/google-common-crawlers\" target=\"_blank\" rel=\"noopener\">Google's common crawlers<\/a>, specifically Googlebot, Google-CloudVertexBot, and Google-Extended.<\/li>\n<li>Google Search Central: <a href=\"https:\/\/developers.google.com\/search\/docs\/fundamentals\/ai-optimization-guide\" target=\"_blank\" rel=\"noopener\">Optimizing for Generative AI Features on Google Search<\/a>.<\/li>\n<li>OpenAI: <a href=\"https:\/\/developers.openai.com\/api\/docs\/bots\" target=\"_blank\" rel=\"noopener\">Overview of OpenAI Crawlers<\/a>, covering OAI-SearchBot, GPTBot, and ChatGPT-User.<\/li>\n<li>Anthropic Help Center: <a href=\"https:\/\/support.claude.com\/en\/articles\/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler\" target=\"_blank\" rel=\"noopener\">Does Anthropic crawl data from the web?<\/a>, covering ClaudeBot, Claude-SearchBot, and Claude-User.<\/li>\n<li>Perplexity Docs: <a href=\"https:\/\/docs.perplexity.ai\/docs\/resources\/perplexity-crawlers\" target=\"_blank\" rel=\"noopener\">Perplexity Crawlers<\/a>, covering PerplexityBot and Perplexity-User.<\/li>\n<li>Our own citelayer\u00ae audit and plugin practices: recurring patterns from WordPress audits, bot log reviews, Schema\/llms.txt compatibility, and AI visibility tests. 
These observations are used in the article to provide practical context, not as an external primary source.<\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>Not all AI crawlers are the same. If you want to clearly separate visibility, training, and user-triggered requests, you need more than just a knee-jerk robots.txt block.<\/p>","protected":false},"author":1,"featured_media":3796,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_et_pb_use_builder":"","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"categories":[764,754,13],"tags":[],"dipi_cpt_category":[],"class_list":["post-3795","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-visibility","category-ki-b2b","category-ratgeber"],"acf":[],"_links":{"self":[{"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/posts\/3795","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/comments?post=3795"}],"version-history":[{"count":2,"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/posts\/3795\/revisions"}],"predecessor-version":[{"id":3803,"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/posts\/3795\/revisions\/3803"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/media\/3796"}],"wp:attachment":[{"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/media?parent=3795"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/categories?post=3795"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/tags?post=3795"},{"taxon
omy":"dipi_cpt_category","embeddable":true,"href":"https:\/\/isla-stud.io\/en\/wp-json\/wp\/v2\/dipi_cpt_category?post=3795"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}