Robots.txt Guide: Create, Test, and Optimize for SEO
A complete robots.txt guide: create, test, and optimize the file that controls crawlers, plus the disallow vs noindex trap that keeps unwanted pages in the index.
SlapMyWeb Team
Robots.txt is a plain-text file at the root of your domain (example.com/robots.txt) that tells search engine crawlers which URLs they may and may not request. Crawlers check it before fetching anything else on a host, it governs crawling rather than indexing, and one mistaken line, Disallow: /, can quietly wall your entire site off from search. This guide covers how to create a robots.txt file correctly, test it before it goes live, and tune it so crawlers spend their budget on pages that actually matter.
The single most important thing to know up front: robots.txt controls crawling, not indexing. A blocked page can still appear in Google as a bare URL if other sites link to it. If your goal is to remove a page from results, you need noindex, not Disallow. More on that distinction below.
What Robots.txt Is and Why It Matters
Robots.txt is part of the Robots Exclusion Protocol, a standard every major search engine respects and which Google helped formalize as RFC 9309. The file lives at exactly one location, the root of each host, and applies only to that host and protocol. So https://example.com/robots.txt governs https://example.com/, while https://blog.example.com/ and http://example.com/ each need their own file.
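The one-file-per-host rule can be made concrete with a small Python sketch. The helper name `robots_txt_url` is hypothetical; it only shows how the governing robots.txt location follows from a page's scheme and host.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected an absolute URL, got {page_url!r}")
    # Keep scheme and host (netloc includes any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


for url in (
    "https://example.com/blog/post?utm_source=x",
    "https://blog.example.com/post",
    "http://example.com/",
):
    print(url, "->", robots_txt_url(url))
```

Each of the three sample URLs maps to a different robots.txt, which is why a subdomain or a second protocol needs its own file.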
Think of it as a posted notice at your front door rather than a locked gate. It does not stop a determined visitor from requesting a URL (the page is still publicly reachable), but well-behaved crawlers like Googlebot, Bingbot, and most reputable SEO tools read the notice and obey it. Malicious scrapers ignore it entirely, which is why robots.txt is never a security control.
Every site should have one. Even if you want every page crawled, a permissive robots.txt states that intent explicitly and gives you a place to declare your sitemap.
How Crawlers Read the File
When a bot arrives, it requests /robots.txt before crawling anything else. Google caches the response (generally up to 24 hours, longer if your server is unreachable) and applies it to every crawl decision until the cache refreshes. The file is built from a handful of directives:
`User-agent`: names the crawler a block of rules applies to
`Disallow`: a path prefix the bot should not request
`Allow`: a path prefix that overrides a broader Disallow
`Sitemap`: the absolute URL of an XML sitemap (host-independent)
`Crawl-delay`: seconds between requests; respected by Bing and Yandex, ignored by Google
Two rules decide outcomes when directives conflict. First, a crawler obeys only the most specific matching `User-agent` group: if a block names Googlebot, Googlebot ignores the User-agent: * block entirely. Second, within that group, the most specific (longest) path match wins, and on a tie, the less restrictive rule (Allow) wins. Get those two rules wrong and you produce the silent failures that show up in audits every week.
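Those two precedence rules can be sketched in a few lines of Python. This is a simplified model, not any crawler's actual implementation: `pick_group` and `is_allowed` are hypothetical names, group matching is an exact case-insensitive name comparison, and paths are plain prefixes with no wildcards.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # ("allow" or "disallow", path prefix)


def pick_group(groups: Dict[str, List[Rule]], crawler: str) -> List[Rule]:
    """Use the group named after the crawler; otherwise fall back to '*'."""
    by_name = {agent.lower(): rules for agent, rules in groups.items()}
    return by_name.get(crawler.lower(), by_name.get("*", []))


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Longest matching prefix wins; on a tie Allow wins; no match means allowed."""
    best_len, allowed = -1, True
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule value matches nothing
        is_allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


groups = {
    "*": [("disallow", "/")],
    "Googlebot": [("disallow", "/admin/"), ("allow", "/admin/public/")],
}
googlebot_rules = pick_group(groups, "Googlebot")
print(is_allowed(googlebot_rules, "/admin/settings"))         # False
print(is_allowed(googlebot_rules, "/admin/public/help"))      # True
print(is_allowed(googlebot_rules, "/blog/"))                  # True: the * group is ignored
print(is_allowed(pick_group(groups, "OtherBot"), "/blog/"))   # False: falls back to *
```

Note the third result: because Googlebot has its own group, the site-wide block under `*` never applies to it.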
Directive Syntax in Detail
The grammar is small but unforgiving: a stray character changes meaning.

| Directive | Effect |
| --- | --- |
| `User-agent: *` | Rules apply to all crawlers without a more specific block |
| `User-agent: Googlebot` | Rules apply only to Google's main crawler |
| `Disallow:` (empty) | Allow everything: the explicit "open door" |
| `Disallow: /` | Block the entire site |
| `Disallow: /admin/` | Block the /admin/ directory and everything under it |
| `Allow: /admin/public/` | Carve an exception out of a blocked parent |
| `Disallow: /*.pdf$` | Block all URLs ending in .pdf |
| `Sitemap: https://example.com/sitemap.xml` | Declare a sitemap (use the full absolute URL) |
Two wildcards are supported by Google: * matches any sequence of characters, and $ anchors a rule to the end of the URL. So Disallow: /*? blocks any URL containing a query string, while Disallow: /private (no trailing slash) blocks both /private/ and /private-report.html, a classic, expensive mistake.
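One way to reason about these patterns is to translate them into regular expressions. The Python sketch below does that under a simplifying assumption: it models only the `*` and `$` behavior described above, and the helper names `rule_to_regex` and `matches` are hypothetical.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape every literal piece, then join the pieces with "any characters".
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    # re.match only matches at the start of the string, like a path prefix.
    return rule_to_regex(pattern).match(url_path) is not None


print(matches("/private", "/private-report.html"))         # True: prefix match
print(matches("/private/", "/private-report.html"))        # False
print(matches("/*.pdf$", "/docs/guide.pdf"))               # True
print(matches("/*.pdf$", "/docs/guide.pdf?download=1"))    # False: $ anchors the end
print(matches("/*?", "/shop?color=red"))                   # True
```

The first two lines show the trailing-slash trap directly: without the slash, the rule is a bare prefix and catches unrelated URLs.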
1. Start With a Minimal Working File
Every robots.txt begins with at least one User-agent line. The smallest useful file allows everything and points to your sitemap:
```text
# Allow all bots to crawl everything
User-agent: *
Disallow:

# Sitemap location
Sitemap: https://example.com/sitemap.xml
```
Save it as plain UTF-8 text named exactly robots.txt and place it at the web root so it resolves at https://example.com/robots.txt. This is already better than no file at all: it confirms intent and advertises your sitemap. (If you haven't built one yet, start with the XML sitemap guide.)
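A quick local sanity check of the minimal file is possible with Python's standard-library `urllib.robotparser`. Treat it as a rough check only: the stdlib parser applies rules in file order and does not understand wildcards, so it does not mirror Google's matching for more complex files. The page URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

MINIMAL_ROBOTS_TXT = """\
# Allow all bots to crawl everything
User-agent: *
Disallow:

# Sitemap location
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(MINIMAL_ROBOTS_TXT.splitlines())

# An empty Disallow means nothing is blocked for any crawler.
print(parser.can_fetch("Googlebot", "https://example.com/any/page"))
# site_maps() (Python 3.8+) lists the declared sitemap URLs.
print(parser.site_maps())
```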
2. Block Admin, Private, and Duplicate Paths
Most sites have areas that should never burn crawl budget: admin panels, cart and checkout flows, and parameter-driven duplicates such as faceted-filter URLs.
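A group covering those areas could be generated like this. The sketch is in Python, `build_group` is a hypothetical helper, and the paths are borrowed from the WordPress example later in this guide; swap in your own URL structure.

```python
from typing import Iterable


def build_group(user_agent: str, disallow: Iterable[str], allow: Iterable[str]) -> str:
    """Render one robots.txt group as text."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallow]
    lines += [f"Allow: {path}" for path in allow]
    return "\n".join(lines) + "\n"


group = build_group(
    "*",
    # Admin, cart/checkout, internal search, and parameter-driven duplicates
    disallow=["/wp-admin/", "/cart/", "/checkout/", "/search?", "/*?utm_", "/*?session="],
    # Keep rendering resources reachable
    allow=["/wp-admin/admin-ajax.php", "/wp-includes/*.css", "/wp-includes/*.js"],
)
print(group)
```

Printing `group` yields a ready-to-paste block with the Disallow lines first and the Allow exceptions after them.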
The Allow lines for CSS and JS are not optional when a blocked directory also holds those assets. Google renders pages to evaluate them, and a blocked stylesheet or script can make a page look broken to the crawler. If parameter duplicates are a recurring problem, pair these rules with proper canonical tags: robots.txt manages crawl, canonicals manage which version ranks.
3. Decide Your Stance on AI Crawlers
AI training and answer-engine crawlers now make up a real share of bot traffic, and they identify themselves with their own user-agents. You can allow, throttle, or block them independently of search crawlers.
```text
# Block AI training crawlers (a business decision, see note)
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

# Google-Extended controls Gemini training + some AI features
User-agent: Google-Extended
Disallow: /

# Search crawling stays fully open
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```
Blocking AI bots is a trade-off, not a best practice. If you want to be cited in AI answers and Google AI Overviews, blocking Google-Extended and GPTBot can work against you. This is the same tension covered in AEO vs SEO and the guide on getting featured in AI Overviews, and it's why some teams now publish an llms.txt file alongside robots.txt to signal what AI systems may use.
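Before shipping a policy like this, it can help to confirm which bots end up blocked. The Python sketch below uses the standard-library `urllib.robotparser` on a shortened version of the rules; the `/pricing` URL is a placeholder, and the stdlib parser is only an approximation of how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

AI_POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(AI_POLICY.splitlines())

for bot in ("GPTBot", "Google-Extended", "Googlebot", "Bingbot"):
    allowed = parser.can_fetch(bot, "https://example.com/pricing")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Bingbot has no group of its own here and there is no `*` group, so it is allowed by default.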
4. Declare Every Sitemap
Always list your sitemaps, usually at the bottom of the file. This lets crawlers discover them even before you submit anything in Search Console, and Sitemap directives are host-independent: they sit outside any User-agent group and can point to a sitemap on any host.
Each line begins with Sitemap: (capital S by convention) followed by the full absolute URL. Relative paths are not valid here.
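The absolute-URL requirement is easy to check mechanically. Here is a minimal Python sketch; `invalid_sitemap_lines` is a hypothetical helper and the second sample line is deliberately wrong.

```python
from typing import List
from urllib.parse import urlsplit


def invalid_sitemap_lines(robots_txt: str) -> List[str]:
    """Return Sitemap lines whose value is not an absolute http(s) URL."""
    bad = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "sitemap":
            continue
        parts = urlsplit(value.strip())
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(raw.strip())
    return bad


sample = """\
Sitemap: https://example.com/sitemap.xml
Sitemap: /blog-sitemap.xml
"""
print(invalid_sitemap_lines(sample))
```

Only the relative `/blog-sitemap.xml` line is reported; the fix is to write it as a full URL such as https://example.com/blog-sitemap.xml.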
A Complete, Production-Ready Example
Here is a robots.txt for a typical WordPress business site, combining everything above:
```text
# robots.txt for example.com
# Docs: https://slapmyweb.com/blog/robots-txt-guide-create-test-optimize-seo

# Default rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/
Disallow: /thank-you/
Disallow: /search?
Disallow: /*?utm_
Disallow: /*?ref=
Disallow: /*?session=
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /xmlrpc.php
Disallow: /private/
# Required so Google can render pages
Allow: /wp-admin/admin-ajax.php
Allow: /wp-includes/*.js
Allow: /wp-includes/*.css
Allow: /wp-content/uploads/
Allow: /wp-content/themes/*.css
Allow: /wp-content/themes/*.js

# AI training crawlers (remove if you want AI visibility)
User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Throttle aggressive third-party crawlers (Google ignores Crawl-delay)
# A named group replaces the * group for that bot, so repeat any
# Disallow rules here that these crawlers should also follow.
User-agent: AhrefsBot
Crawl-delay: 10

User-agent: SemrushBot
Crawl-delay: 10

# Sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
```
Adapt the paths to your own URL structure. Copying another site's robots.txt blindly is how teams end up blocking directories they actually need crawled.
5. Test Before You Deploy
A syntax slip in robots.txt can deindex pages over days without any error message, so testing is mandatory, not optional. Work through these checks:
Open it in a browser. Visit yoursite.com/robots.txt and confirm it returns the file with a 200 status and Content-Type: text/plain. If it returns a 5xx, Google may treat the whole site as disallowed until the file is reachable again. (See the HTTP status codes guide for why 5xx here is so dangerous.)
Use Google Search Console. The robots.txt report shows the version Google has cached, flags parse errors, and lets you request a refresh after a change.
Inspect real URLs. Run important pages through Search Console's URL Inspection tool to confirm they still show "Crawl allowed"; don't wait for the next natural recrawl.
Run a full crawl audit. Run a free SlapMyWeb audit to see whether robots.txt is accidentally blocking pages or rendering resources your rankings depend on.
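The first check in that list (status code and Content-Type) can be scripted. The following Python sketch uses only the standard library; `check_response` and `check_live` are hypothetical names, and the wording of the problem messages is illustrative.

```python
from typing import List
from urllib.error import HTTPError
from urllib.request import urlopen


def check_response(status: int, content_type: str) -> List[str]:
    """Return a list of problems for a robots.txt HTTP response."""
    problems = []
    if status >= 500:
        problems.append(f"{status}: server error, crawlers may treat the site as disallowed")
    elif status != 200:
        problems.append(f"{status}: expected 200")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"Content-Type is {content_type!r}, expected text/plain")
    return problems


def check_live(url: str) -> List[str]:
    """Fetch url and run check_response on the result (needs network access)."""
    try:
        with urlopen(url, timeout=10) as resp:
            return check_response(resp.status, resp.headers.get("Content-Type", ""))
    except HTTPError as err:
        return check_response(err.code, err.headers.get("Content-Type", ""))


print(check_response(200, "text/plain; charset=utf-8"))  # no problems
print(check_response(503, "text/html"))                  # two problems
```

Against a real site you would call `check_live("https://example.com/robots.txt")`, for example from a deploy pipeline, and fail the build when the returned list is not empty.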
These are the errors that surface again and again in technical audits:
Blocking CSS or JavaScript. Google can't fully render the page, and rankings drop. Never Disallow .css or .js files.
Blocking image directories. You forfeit Google Images traffic for no benefit.
Accidental `Disallow: /`. Usually a staging rule that shipped to production. It blocks the entire site.
Missing trailing slash. Disallow: /blog also blocks /blog-post-title; Disallow: /blog/ blocks only the directory.
Returning `5xx` on the robots.txt URL. Google may stop crawling the site entirely until it's fixed.
Using robots.txt to "hide" a page that's already indexed. Blocking crawl prevents Google from ever seeing the noindex you added, so the page stays in the index.
No sitemap reference. You make discovery slower than it needs to be.
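Several of these mistakes can be caught by a simple lint pass before deployment. The Python sketch below is a heuristic under stated assumptions, not a full parser: `lint_robots` is a hypothetical helper, it checks only four of the errors listed above, and the trailing-slash check will produce false positives on paths that are meant to be prefixes.

```python
from typing import List


def lint_robots(robots_txt: str) -> List[str]:
    """Flag a few common mistakes. A heuristic sketch, not a full parser."""
    warnings: List[str] = []
    agents: List[str] = []
    in_rules = False
    has_sitemap = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if in_rules:  # a rule line closed the previous group
                agents, in_rules = [], False
            agents.append(value)
        elif key == "sitemap":
            has_sitemap = True
        elif key in ("allow", "disallow"):
            in_rules = True
            if key == "allow":
                continue
            if value == "/" and "*" in agents:
                warnings.append("Disallow: / blocks the entire site for all crawlers")
            if value.rstrip("$").endswith((".css", ".js")):
                warnings.append(f"{value} blocks CSS or JavaScript needed for rendering")
            last_segment = value.rsplit("/", 1)[-1]
            if last_segment and not any(ch in last_segment for ch in "*$.?="):
                warnings.append(f"{value} has no trailing slash and matches as a prefix")
    if not has_sitemap:
        warnings.append("no Sitemap line found")
    return warnings


sample = """\
User-agent: *
Disallow: /
Disallow: /blog
Disallow: /*.css$
"""
for warning in lint_robots(sample):
    print(warning)
```

The sample file trips all four checks; a file with only `Disallow: /admin/` and a Sitemap line passes cleanly.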
Robots.txt vs Noindex: The Distinction That Trips Everyone Up
This is the single most misunderstood point in crawl control, so it's worth stating plainly.
`Disallow` (robots.txt) stops the bot from requesting the page. The page is never crawled, but if external sites link to it, Google can still index the URL as a snippetless result.
`noindex` (meta tag or `X-Robots-Tag` header) lets the bot crawl the page, read the directive, and drop it from the index.
The trap: if you want a page gone from search, do not block it in robots.txt. Google has to be able to crawl the page to see the noindex. Block it, and the noindex is invisible, so the URL lingers in results indefinitely.
| Goal | Right tool |
| --- | --- |
| Save crawl budget on large/duplicate sections | Disallow in robots.txt |
| Remove a page from search results | noindex (crawlable, not disallowed) |
| Keep a private file unreachable | Authentication / server rules, not robots.txt |
Use robots.txt for crawl-budget shaping; use noindex for index control. They solve different problems.
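The interaction between the two controls reduces to a four-way case analysis, sketched here in Python. The function `removal_status` is hypothetical and simply restates the behavior described in this section for a URL that is already indexed.

```python
def removal_status(disallowed: bool, has_noindex: bool) -> str:
    """Describe what happens to an already-indexed URL under each combination."""
    if disallowed and has_noindex:
        return "stuck: the crawler cannot fetch the page, so the noindex is never seen"
    if disallowed:
        return "not crawled, but the bare URL can remain indexed"
    if has_noindex:
        return "crawled, noindex read, page dropped from the index"
    return "crawled and eligible for indexing"


for disallowed in (True, False):
    for has_noindex in (True, False):
        print(f"disallowed={disallowed}, noindex={has_noindex}:",
              removal_status(disallowed, has_noindex))
```

Only one of the four combinations removes the page: crawlable and carrying noindex.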
Where Robots.txt Fits in Your Technical SEO
Robots.txt is one lever in the crawl-and-indexation pillar, alongside sitemaps, canonical tags, and status-code hygiene. It pairs most directly with your XML sitemap: disallow the noise, advertise the signal. For the full picture of how these pieces fit together, the complete technical SEO guide maps every component, and the step-by-step SEO audit walkthrough shows how to verify them as a system rather than in isolation.
Frequently Asked Questions
Can robots.txt block a page from appearing in Google search results?
No, not reliably. Robots.txt prevents crawling, but if other pages link to a blocked URL, Google can still index it as a snippetless "URL-only" result. To remove a page from search results, use a noindex meta tag or an X-Robots-Tag HTTP header on a page that is not disallowed, so Google can crawl it and see the directive.
How often does Google check robots.txt?
Google generally caches robots.txt for up to 24 hours, though it may hold the cached copy longer if your server returns errors. Every crawl decision uses the cached version until it refreshes. After an urgent change, open the robots.txt report in Google Search Console and request a recrawl rather than waiting for the cache to expire.
Should I block AI bots like GPTBot in robots.txt?
It depends on your goals. Blocking crawlers such as GPTBot, CCBot, and Google-Extended keeps your content out of AI training and some AI features, but it can also reduce your visibility in AI Overviews and answer engines. If appearing in AI answers matters to you, keep those crawlers allowed and manage usage through other means instead.
What happens if I have no robots.txt file at all?
If no robots.txt exists, crawlers assume they may request everything on the site. Your content still gets found, but bots waste budget on parameter URLs, internal search pages, and other low-value paths, and admin directories aren't kept out of the crawl. A minimal robots.txt that allows everything and declares your sitemap is almost always better than no file.
Where exactly should the robots.txt file be located?
It must sit at the root of each host and be served over the matching protocol: https://example.com/robots.txt controls only https://example.com/. Subdomains like blog.example.com and separate protocols need their own files. A robots.txt placed in a subdirectory (for example /blog/robots.txt) is ignored by crawlers entirely.