Reference Guide

robots.txt Cheat Sheet

Every directive, syntax rule, and real-world pattern — in one place. Updated for 2026.


1. Syntax & Structure

A robots.txt file is a plain text file at the root of your domain (https://yourdomain.com/robots.txt). It uses a simple key: value format.

# This is a comment
User-agent: *          # Which crawler this rule applies to
Disallow: /private/    # Path the crawler cannot access
Allow: /public/        # Overrides a Disallow for a specific path
Crawl-delay: 10        # Seconds between requests (not supported by Google)

Sitemap: https://yourdomain.com/sitemap.xml

• Each group starts with one or more User-agent lines followed by its rules.

• Groups are separated by a blank line.

• Path values are case-sensitive: /Admin and /admin are different paths.

• Sitemap directives are not group-specific; they apply globally.
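These rules can be sanity-checked locally with Python's built-in urllib.robotparser; a minimal sketch, reusing the illustrative paths from the example above:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules matching the example file above.
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/

Sitemap: https://yourdomain.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://yourdomain.com/private/report.html"))  # False
print(rp.can_fetch("*", "https://yourdomain.com/public/index.html"))    # True
print(rp.site_maps())  # Sitemap URLs from the file (Python 3.8+)
```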

2. Directives Reference

| Directive | Required | Description |
|---|---|---|
| User-agent: | Yes | Specifies which crawler the following rules apply to. Use * for all crawlers. |
| Disallow: | No | Path the crawler cannot access. An empty value means allow everything. |
| Allow: | No | Overrides a Disallow rule for a specific sub-path. |
| Sitemap: | No | Full URL to your XML sitemap. Can appear multiple times. |
| Crawl-delay: | No | Seconds to wait between requests. Supported by Bing/Yandex; Google ignores it. |
| Host: | No | Preferred domain (Yandex-specific). Not used by Google or Bing. |
| Clean-param: | No | Query parameters to ignore during crawling (Yandex-specific). |

Common User-agent Values

| User-agent | Crawler |
|---|---|
| * | All crawlers (catch-all) |
| Googlebot | Google (all content) |
| Googlebot-Image | Google Images |
| Googlebot-News | Google News |
| AdsBot-Google | Google Ads quality check |
| Bingbot | Microsoft Bing |
| DuckDuckBot | DuckDuckGo |
| Baiduspider | Baidu |
| YandexBot | Yandex |
| facebookexternalhit | Facebook link previews |
| Twitterbot | Twitter/X link previews |
| LinkedInBot | LinkedIn link previews |
| GPTBot | OpenAI training crawler |
| Claude-Web | Anthropic training crawler |
| CCBot | Common Crawl (AI training data) |
| Google-Extended | Google Gemini training data |

3. Wildcard Patterns

Google and Bing support two wildcard characters:

| Character | Meaning | Example | Effect |
|---|---|---|---|
| * | Any sequence of characters | Disallow: /*.pdf$ | Blocks all .pdf URLs sitewide |
| $ | End of URL | Disallow: /*.json$ | Blocks URLs ending in .json |
| * (mid-path) | Any path segment | Disallow: /user/*/private | Blocks /user/123/private, etc. |

# Block all PDFs
Disallow: /*.pdf$

# Block tracking parameters
Disallow: /*?ref=
Disallow: /*?utm_

# Block paginated pages
Disallow: /*?page=

# Allow a sub-path of a blocked parent
User-agent: Googlebot
Disallow: /api/
Allow: /api/public/
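Note that Python's standard-library robots parser does not implement the * and $ extensions, so wildcard rules need their own matching logic. A minimal sketch of how a Google-style pattern maps to a regular expression (the rule_to_regex helper is illustrative, not a library API):

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern to a regex.

    '*' matches any character sequence; a trailing '$' anchors the end.
    Simplified sketch: ignores percent-encoding edge cases.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

pdf_rule = rule_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/docs/manual.pdf")))      # True
print(bool(pdf_rule.match("/docs/manual.pdf?v=2")))  # False: '$' anchors the end

mid = rule_to_regex("/user/*/private")
print(bool(mid.match("/user/123/private")))          # True
```

Unanchored patterns are deliberately left as prefix matches (re.match, not fullmatch), mirroring how crawlers treat robots.txt paths.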

4. Real-World Examples

Allow all crawlers

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Block all crawlers (staging site)

User-agent: *
Disallow: /

WordPress

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /?s=
Disallow: /tag/

Sitemap: https://yourdomain.com/sitemap_index.xml

E-commerce (Shopify / WooCommerce)

User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Block AI training crawlers

User-agent: GPTBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Next.js app

User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /login
# Do not block /_next/: Google needs those assets to render pages
Allow: /

Sitemap: https://yourdomain.com/api/sitemap.xml
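If you maintain several of these files across sites or environments, a small script can assemble one from structured data. A minimal sketch (build_robots_txt is a hypothetical helper, not a published API):

```python
def build_robots_txt(groups, sitemaps=()):
    """Assemble a robots.txt string from (user_agent, rules) pairs.

    `groups` is a list of (user-agent, [(directive, path), ...]) tuples.
    Hypothetical helper for illustration only.
    """
    blocks = []
    for agent, rules in groups:
        lines = [f"User-agent: {agent}"]
        lines += [f"{directive}: {path}" for directive, path in rules]
        blocks.append("\n".join(lines))
    # Sitemap directives are global, so they go in their own block.
    for sitemap in sitemaps:
        blocks.append(f"Sitemap: {sitemap}")
    return "\n\n".join(blocks) + "\n"

text = build_robots_txt(
    groups=[
        ("GPTBot", [("Disallow", "/")]),
        ("*", [("Disallow", "/api/"), ("Allow", "/")]),
    ],
    sitemaps=["https://yourdomain.com/sitemap.xml"],
)
print(text)
```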

5. Common Patterns Quick Reference

| Goal | Rule |
|---|---|
| Block entire site | Disallow: / |
| Allow entire site | Allow: / (or empty Disallow) |
| Block admin area | Disallow: /admin/ |
| Block login page | Disallow: /login |
| Block search results | Disallow: /search |
| Block all PDFs | Disallow: /*.pdf$ |
| Block any query string | Disallow: /*?* |
| Block specific param | Disallow: /*?ref= |
| Block staging path | Disallow: /staging/ |
| Allow sub-path of blocked parent | Disallow: /api/ + Allow: /api/public/ |
| Add sitemap | Sitemap: https://example.com/sitemap.xml |
| Multiple sitemaps | Sitemap: https://example.com/sitemap-posts.xml + Sitemap: https://example.com/sitemap-products.xml |

6. Common Mistakes

Disallowing AND noindexing the same page

If Googlebot cannot crawl a page (due to Disallow), it cannot read the noindex tag. The page may still appear in results via external links. Fix: only noindex pages that Google can crawl.

Using robots.txt to hide sensitive content

robots.txt is publicly visible — anyone can read it. Never list secret paths here. Use authentication or server-side access control for sensitive areas.

Missing trailing slash on directory rules

Disallow: /admin is a prefix match: it blocks /admin, /admin/users, and also unrelated paths such as /administrator. Disallow: /admin/ restricts the rule to the directory's contents. Use a trailing slash for directories to avoid over-blocking.
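The prefix behaviour is easy to verify with Python's standard-library parser (the blocked helper and example.com paths are illustrative):

```python
from urllib.robotparser import RobotFileParser

def blocked(rules, path):
    """Return True if `path` is disallowed for all crawlers under `rules`."""
    rp = RobotFileParser()
    rp.parse(rules)
    return not rp.can_fetch("*", "https://example.com" + path)

# Without the slash, the rule is a bare prefix and over-blocks:
no_slash = ["User-agent: *", "Disallow: /admin"]
print(blocked(no_slash, "/admin"))          # True
print(blocked(no_slash, "/administrator"))  # True: unintended match

# With the slash, only the directory contents are blocked:
with_slash = ["User-agent: *", "Disallow: /admin/"]
print(blocked(with_slash, "/admin/users"))    # True
print(blocked(with_slash, "/administrator"))  # False
```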

Blocking CSS and JavaScript files

Google needs to crawl your CSS/JS to render your pages correctly. Blocking /_next/ or /static/ prevents proper rendering and hurts rankings.

Using Crawl-delay for Google

Google ignores Crawl-delay in robots.txt, and the old Search Console crawl-rate setting has been retired. Google adjusts crawl rate automatically based on how your server responds; returning 429 or 5xx status codes signals it to slow down.

Untested wildcard rules

Disallow: /*?* blocks ALL URLs with any query parameter, including pagination and filters. Always check wildcard rules with the robots.txt report in Google Search Console before deploying them.

7. FAQ

Does robots.txt prevent a page from appearing in Google?

No. Disallow only stops crawling. If other pages link to a disallowed URL, Google can still index it (showing the URL without a description). To remove a page from results, use noindex on a crawlable page or the URL removal tool in Search Console.

Where must robots.txt be located?

At the root of the domain: https://yourdomain.com/robots.txt. A file at /subfolder/robots.txt is ignored by search engines. Each subdomain needs its own robots.txt.

Can I have multiple User-agent groups?

Yes — separate groups with a blank line. Specific user-agent rules take precedence over the * group. A crawler only follows its specific group, not a combination of its group and *.
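This "only its own group" behaviour can be observed with Python's urllib.robotparser (the paths and crawler names here are illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /beta/

User-agent: *
Disallow: /internal/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot follows ONLY its own group, so /internal/ stays open to it.
print(rp.can_fetch("Googlebot", "https://example.com/beta/x"))      # False
print(rp.can_fetch("Googlebot", "https://example.com/internal/x"))  # True
# Crawlers without a specific group fall back to the * group.
print(rp.can_fetch("Bingbot", "https://example.com/internal/x"))    # False
```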

Is robots.txt case-sensitive?

Directive names (User-agent, Disallow) are case-insensitive. Path values are case-sensitive on most servers — /Admin and /admin are different paths.

How quickly does Google update after a robots.txt change?

Google typically recrawls robots.txt within 24 hours. The robots.txt report in Search Console shows the last fetched version and lets you request a recrawl after a change.

What happens if there is no robots.txt?

Search engines treat a missing robots.txt the same as one that allows everything: the whole site is crawlable. A 404 on robots.txt is not a problem.

Generate a robots.txt file in seconds

Use our free robots.txt generator to build a correctly formatted file without manual editing.

Open robots.txt Generator