robots.txt Cheat Sheet
Every directive, syntax rule, and real-world pattern — in one place. Updated for 2026.
1. Syntax & Structure
A robots.txt file is a plain text file at the root of your domain (https://yourdomain.com/robots.txt). It uses a simple key: value format.
```
# This is a comment
User-agent: *        # Which crawler this rule applies to
Disallow: /private/  # Path the crawler cannot access
Allow: /public/      # Overrides a Disallow for a specific path
Crawl-delay: 10      # Seconds between requests (not supported by Google)
Sitemap: https://yourdomain.com/sitemap.xml
```
• Each group starts with one or more User-agent lines followed by its rules.
• Groups are separated by a blank line.
• Path values are case-sensitive — /Admin and /admin are different.
• Sitemap directives are not group-specific — they apply globally.
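These matching rules can be checked locally with Python's standard-library parser. A minimal sketch (the domain is a placeholder; note that urllib.robotparser does simple prefix matching and does not understand Google's `*`/`$` wildcards):

```python
from urllib import robotparser

# Parse an in-memory robots.txt instead of fetching one over HTTP
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://yourdomain.com/private/page.html"))  # False
print(rp.can_fetch("*", "https://yourdomain.com/public/page.html"))   # True
```

`can_fetch(user_agent, url)` returns whether the given crawler may fetch the URL under the parsed rules, which makes it handy for unit-testing a robots.txt before deploying it.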
2. Directives Reference
| Directive | Required | Description |
|---|---|---|
| User-agent: | Yes | Specifies which crawler the following rules apply to. Use * for all crawlers. |
| Disallow: | No | Path the crawler cannot access. An empty value means allow everything. |
| Allow: | No | Overrides a Disallow rule for a specific sub-path. |
| Sitemap: | No | Full URL to your XML sitemap. Can appear multiple times. |
| Crawl-delay: | No | Seconds to wait between requests. Supported by Bing/Yandex — Google ignores it. |
| Host: | No | Preferred domain (Yandex-specific). Not used by Google or Bing. |
| Clean-param: | No | Query parameters to ignore during crawling (Yandex-specific). |
Common User-agent Values
| User-agent | Crawler |
|---|---|
| * | All crawlers (catch-all) |
| Googlebot | Google (all content) |
| Googlebot-Image | Google Images |
| Googlebot-News | Google News |
| AdsBot-Google | Google Ads quality check |
| Bingbot | Microsoft Bing |
| DuckDuckBot | DuckDuckGo |
| Baiduspider | Baidu |
| YandexBot | Yandex |
| facebookexternalhit | Facebook link previews |
| Twitterbot | Twitter/X link previews |
| LinkedInBot | LinkedIn link previews |
| GPTBot | OpenAI training crawler |
| ClaudeBot | Anthropic training crawler (older agent name: Claude-Web) |
| CCBot | Common Crawl (AI training data) |
| Google-Extended | Google Gemini training data |
3. Wildcard Patterns
Google and Bing support two wildcard characters:
| Character | Meaning | Example | Effect |
|---|---|---|---|
| * | Any sequence of characters | Disallow: /*.pdf | Blocks any URL whose path contains .pdf |
| $ | End of URL | Disallow: /*.json$ | Blocks URLs ending in .json |
| * (mid-path) | Any path segment | Disallow: /user/*/private | Blocks /user/123/private, etc. |
```
# Block all PDFs
Disallow: /*.pdf$

# Block tracking parameters
Disallow: /*?ref=
Disallow: /*?utm_

# Block paginated pages
Disallow: /*?page=

# Allow a sub-path of a blocked parent
User-agent: Googlebot
Disallow: /api/
Allow: /api/public/
```
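Python's urllib.robotparser does not understand these wildcards, so here is a minimal sketch of Google-style pattern matching that translates a robots.txt path pattern into a regular expression (the function name is ours, not a standard API):

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    # '*' matches any character sequence; a trailing '$' anchors the URL end
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = re.escape(pattern).replace(r"\*", ".*")  # keep '*' as a wildcard
    return re.compile("^" + body + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/*.pdf$")
print(bool(rule.match("/docs/manual.pdf")))      # True: URL ends in .pdf
print(bool(rule.match("/docs/manual.pdf?v=2")))  # False: query string after .pdf
```

The second check illustrates why the `$` anchor matters: the matched string includes the query string, so `/*.pdf$` does not catch `.pdf` URLs that carry parameters.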
4. Real-World Examples
Allow all crawlers
```
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Block all crawlers (staging site)
```
User-agent: *
Disallow: /
```
WordPress
```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /?s=
Disallow: /tag/

Sitemap: https://yourdomain.com/sitemap_index.xml
```
E-commerce (Shopify / WooCommerce)
```
User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Block AI training crawlers
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Next.js app
```
User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /login
Allow: /
# Do not block /_next/: Google needs these JS/CSS assets to render pages

Sitemap: https://yourdomain.com/api/sitemap.xml
```
5. Common Patterns Quick Reference
| Goal | Rule |
|---|---|
| Block entire site | Disallow: / |
| Allow entire site | Allow: / (or empty Disallow) |
| Block admin area | Disallow: /admin/ |
| Block login page | Disallow: /login |
| Block search results | Disallow: /search |
| Block all PDFs | Disallow: /*.pdf$ |
| Block any query string | Disallow: /*?* |
| Block specific param | Disallow: /*?ref= |
| Block staging path | Disallow: /staging/ |
| Allow sub-path of blocked parent | Disallow: /api/ Allow: /api/public/ |
| Add sitemap | Sitemap: https://example.com/sitemap.xml |
| Multiple sitemaps | Sitemap: https://example.com/sitemap-posts.xml Sitemap: https://example.com/sitemap-products.xml |
6. Common Mistakes
✗ Disallowing AND noindexing the same page
If Googlebot cannot crawl a page (due to Disallow), it cannot read the noindex tag. The page may still appear in results via external links. Fix: only noindex pages that Google can crawl.
✗ Using robots.txt to hide sensitive content
robots.txt is publicly visible — anyone can read it. Never list secret paths here. Use authentication or server-side access control for sensitive areas.
✗ Missing trailing slash on directory rules
Disallow: /admin is a prefix match: it blocks /admin and /admin/, but also unrelated paths such as /administrator and /admin.html. Disallow: /admin/ restricts the rule to the directory's contents. Use a trailing slash when you mean a directory.
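The difference is easy to demonstrate with Python's standard-library parser (a sketch; example.com is a placeholder):

```python
from urllib import robotparser

no_slash = robotparser.RobotFileParser()
no_slash.parse("User-agent: *\nDisallow: /admin".splitlines())

with_slash = robotparser.RobotFileParser()
with_slash.parse("User-agent: *\nDisallow: /admin/".splitlines())

# Without the slash, the prefix match catches unrelated paths too
print(no_slash.can_fetch("*", "https://example.com/administrator"))    # False (blocked)
# With the slash, only the directory's contents are blocked
print(with_slash.can_fetch("*", "https://example.com/administrator"))  # True (allowed)
print(with_slash.can_fetch("*", "https://example.com/admin/users"))    # False (blocked)
```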
✗ Blocking CSS and JavaScript files
Google needs to crawl your CSS/JS to render your pages correctly. Blocking /_next/ or /static/ prevents proper rendering and hurts rankings.
✗ Using Crawl-delay for Google
Google ignores Crawl-delay in robots.txt, and the old Search Console crawl-rate limiter was retired in early 2024. To slow Googlebot down, temporarily return HTTP 429 or 503 responses, or use Google's crawl-rate problem report form.
✗ Untested wildcard rules
Disallow: /*?* blocks ALL URLs with any query parameter, including pagination and filters. Always test rules with the robots.txt report in Google Search Console (which replaced the old robots.txt Tester) or an open-source robots.txt parser before deploying.
7. FAQ
Does robots.txt prevent a page from appearing in Google?
No. Disallow only stops crawling. If other pages link to a disallowed URL, Google can still index it (showing the URL without a description). To remove a page from results, use noindex on a crawlable page or the URL removal tool in Search Console.
Where must robots.txt be located?
At the root of the domain: https://yourdomain.com/robots.txt. A file at /subfolder/robots.txt is ignored by search engines. Each subdomain needs its own robots.txt.
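As a sketch, the location rule can be expressed with Python's standard URL utilities (the helper name is ours):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # robots.txt lives at the root of the exact scheme + host serving the page
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.yourdomain.com/posts/1?utm_source=x"))
# https://blog.yourdomain.com/robots.txt
```

Because the host is part of the answer, a robots.txt on www.yourdomain.com says nothing about blog.yourdomain.com.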
Can I have multiple User-agent groups?
Yes — separate groups with a blank line. Specific user-agent rules take precedence over the * group. A crawler only follows its specific group, not a combination of its group and *.
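Python's urllib.robotparser implements this group-selection behavior, so the precedence can be verified locally (a sketch; example.com is a placeholder):

```python
from urllib import robotparser

rules = """\
User-agent: Googlebot
Disallow: /api/

User-agent: *
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot follows only its own group: just /api/ is blocked
print(rp.can_fetch("Googlebot", "https://example.com/blog/"))   # True
print(rp.can_fetch("Googlebot", "https://example.com/api/v1"))  # False
# Crawlers without a specific group fall back to *: fully blocked
print(rp.can_fetch("Bingbot", "https://example.com/blog/"))     # False
```

Note that Googlebot is not additionally bound by the `Disallow: /` in the `*` group; its own group replaces the catch-all entirely.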
Is robots.txt case-sensitive?
Directive names (User-agent, Disallow) are case-insensitive. Path values are case-sensitive on most servers — /Admin and /admin are different paths.
How quickly does Google update after a robots.txt change?
Google typically refetches robots.txt within 24 hours. The robots.txt report in Search Console shows the version Google last fetched and lets you request an immediate recrawl.
What happens if there is no robots.txt?
Search engines treat a missing robots.txt (a 404) the same as a file that allows everything: all URLs are crawlable, and the 404 itself is not a problem. A persistent 5xx error on robots.txt, by contrast, can cause Google to treat the whole site as disallowed.
Generate a robots.txt file in seconds
Use our free robots.txt generator to build a correctly formatted file without manual editing.
Open robots.txt Generator