Reference Guide

robots.txt Cheat Sheet

Every directive, syntax rule, and real-world pattern — in one place. Updated for 2026.


1. Syntax & Structure

A robots.txt file is a plain text file at the root of your domain (https://yourdomain.com/robots.txt). It uses a simple key: value format.

# This is a comment
User-agent: *          # Which crawler this rule applies to
Disallow: /private/    # Path the crawler cannot access
Allow: /public/        # Overrides a Disallow for a specific path
Crawl-delay: 10        # Seconds between requests (not supported by Google)

Sitemap: https://yourdomain.com/sitemap.xml

• Each group starts with one or more User-agent lines followed by its rules.

• Groups are separated by a blank line.

• Path values are case-sensitive: /Admin and /admin are different paths.

• Sitemap directives are not group-specific; they apply globally.
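These rules can be sanity-checked locally with Python's built-in urllib.robotparser; a minimal sketch, reusing the illustrative paths from the example above:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules matching the example file above.
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/

Sitemap: https://yourdomain.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://yourdomain.com/private/report.html"))  # False
print(rp.can_fetch("*", "https://yourdomain.com/public/index.html"))    # True
print(rp.site_maps())  # Sitemap URLs from the file (Python 3.8+)
```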

2. Directives Reference

| Directive | Required | Description |
|---|---|---|
| User-agent: | Yes | Specifies which crawler the following rules apply to. Use * for all crawlers. |
| Disallow: | No | Path the crawler cannot access. An empty value means allow everything. |
| Allow: | No | Overrides a Disallow rule for a specific sub-path. |
| Sitemap: | No | Full URL to your XML sitemap. Can appear multiple times. |
| Crawl-delay: | No | Seconds to wait between requests. Supported by Bing/Yandex; Google ignores it. |
| Host: | No | Preferred domain (Yandex-specific). Not used by Google or Bing. |
| Clean-param: | No | Query parameters to ignore during crawling (Yandex-specific). |

Common User-agent Values

| User-agent | Crawler |
|---|---|
| * | All crawlers (catch-all) |
| Googlebot | Google (all content) |
| Googlebot-Image | Google Images |
| Googlebot-News | Google News |
| AdsBot-Google | Google Ads quality check |
| Bingbot | Microsoft Bing |
| DuckDuckBot | DuckDuckGo |
| Baiduspider | Baidu |
| YandexBot | Yandex |
| facebookexternalhit | Facebook link previews |
| Twitterbot | Twitter/X link previews |
| LinkedInBot | LinkedIn link previews |
| GPTBot | OpenAI training crawler |
| Claude-Web | Anthropic training crawler |
| CCBot | Common Crawl (AI training data) |
| Google-Extended | Google Gemini training data |

3. Wildcard Patterns

Google and Bing support two wildcard characters:

| Character | Meaning | Example | Effect |
|---|---|---|---|
| * | Any sequence of characters | Disallow: /*.pdf$ | Blocks all .pdf URLs sitewide |
| $ | End of URL | Disallow: /*.json$ | Blocks URLs ending in .json |
| * (mid-path) | Any path segment | Disallow: /user/*/private | Blocks /user/123/private, etc. |

# Block all PDFs
Disallow: /*.pdf$

# Block tracking parameters
Disallow: /*?ref=
Disallow: /*?utm_

# Block paginated pages
Disallow: /*?page=

# Allow a sub-path of a blocked parent
User-agent: Googlebot
Disallow: /api/
Allow: /api/public/
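Note that Python's standard-library robots parser does not implement the * and $ extensions, so wildcard rules need their own matching logic. A minimal sketch of how a Google-style pattern maps to a regular expression (the rule_to_regex helper is illustrative, not a library API):

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern to a regex.

    '*' matches any character sequence; a trailing '$' anchors the end.
    Simplified sketch: ignores percent-encoding edge cases.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

pdf_rule = rule_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/docs/manual.pdf")))      # True
print(bool(pdf_rule.match("/docs/manual.pdf?v=2")))  # False: '$' anchors the end

mid = rule_to_regex("/user/*/private")
print(bool(mid.match("/user/123/private")))          # True
```

Unanchored patterns are deliberately left as prefix matches (re.match, not fullmatch), mirroring how crawlers treat robots.txt paths.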

4. Real-World Examples

Allow all crawlers

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Block all crawlers (staging site)

User-agent: *
Disallow: /

WordPress

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /?s=
Disallow: /tag/

Sitemap: https://yourdomain.com/sitemap_index.xml

E-commerce (Shopify / WooCommerce)

User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Block AI training crawlers

User-agent: GPTBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Next.js app

User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /login
# Do not block /_next/: Google needs those assets to render pages
Allow: /

Sitemap: https://yourdomain.com/api/sitemap.xml
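If you maintain several of these files across sites or environments, a small script can assemble one from structured data. A minimal sketch (build_robots_txt is a hypothetical helper, not a published API):

```python
def build_robots_txt(groups, sitemaps=()):
    """Assemble a robots.txt string from (user_agent, rules) pairs.

    `groups` is a list of (user-agent, [(directive, path), ...]) tuples.
    Hypothetical helper for illustration only.
    """
    blocks = []
    for agent, rules in groups:
        lines = [f"User-agent: {agent}"]
        lines += [f"{directive}: {path}" for directive, path in rules]
        blocks.append("\n".join(lines))
    # Sitemap directives are global, so they go in their own block.
    for sitemap in sitemaps:
        blocks.append(f"Sitemap: {sitemap}")
    return "\n\n".join(blocks) + "\n"

text = build_robots_txt(
    groups=[
        ("GPTBot", [("Disallow", "/")]),
        ("*", [("Disallow", "/api/"), ("Allow", "/")]),
    ],
    sitemaps=["https://yourdomain.com/sitemap.xml"],
)
print(text)
```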

5. Common Patterns Quick Reference

| Goal | Rule |
|---|---|
| Block entire site | Disallow: / |
| Allow entire site | Allow: / (or empty Disallow) |
| Block admin area | Disallow: /admin/ |
| Block login page | Disallow: /login |
| Block search results | Disallow: /search |
| Block all PDFs | Disallow: /*.pdf$ |
| Block any query string | Disallow: /*?* |
| Block specific param | Disallow: /*?ref= |
| Block staging path | Disallow: /staging/ |
| Allow sub-path of blocked parent | Disallow: /api/ + Allow: /api/public/ |
| Add sitemap | Sitemap: https://example.com/sitemap.xml |
| Multiple sitemaps | Sitemap: https://example.com/sitemap-posts.xml + Sitemap: https://example.com/sitemap-products.xml |

6. Common Mistakes

Disallowing AND noindexing the same page

If Googlebot cannot crawl a page (due to Disallow), it cannot read the noindex tag. The page may still appear in results via external links. Fix: only noindex pages that Google can crawl.

Using robots.txt to hide sensitive content

robots.txt is publicly visible — anyone can read it. Never list secret paths here. Use authentication or server-side access control for sensitive areas.

Missing trailing slash on directory rules

Disallow: /admin is a prefix match: it blocks /admin, /admin/users, and also unrelated paths such as /administrator. Disallow: /admin/ restricts the rule to the directory's contents. Use a trailing slash for directories to avoid over-blocking.
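The prefix behaviour is easy to verify with Python's standard-library parser (the blocked helper and example.com paths are illustrative):

```python
from urllib.robotparser import RobotFileParser

def blocked(rules, path):
    """Return True if `path` is disallowed for all crawlers under `rules`."""
    rp = RobotFileParser()
    rp.parse(rules)
    return not rp.can_fetch("*", "https://example.com" + path)

# Without the slash, the rule is a bare prefix and over-blocks:
no_slash = ["User-agent: *", "Disallow: /admin"]
print(blocked(no_slash, "/admin"))          # True
print(blocked(no_slash, "/administrator"))  # True: unintended match

# With the slash, only the directory contents are blocked:
with_slash = ["User-agent: *", "Disallow: /admin/"]
print(blocked(with_slash, "/admin/users"))    # True
print(blocked(with_slash, "/administrator"))  # False
```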

Blocking CSS and JavaScript files

Google needs to crawl your CSS/JS to render your pages correctly. Blocking /_next/ or /static/ prevents proper rendering and hurts rankings.

Using Crawl-delay for Google

Google ignores Crawl-delay in robots.txt, and the old Search Console crawl-rate setting has been retired. Google adjusts crawl rate automatically based on how your server responds; returning 429 or 5xx status codes signals it to slow down.

Untested wildcard rules

Disallow: /*?* blocks ALL URLs with any query parameter, including pagination and filters. Always check wildcard rules with the robots.txt report in Google Search Console before deploying them.

7. FAQ

Does robots.txt prevent a page from appearing in Google?

No. Disallow only stops crawling. If other pages link to a disallowed URL, Google can still index it (showing the URL without a description). To remove a page from results, use noindex on a crawlable page or the URL removal tool in Search Console.

Where must robots.txt be located?

At the root of the domain: https://yourdomain.com/robots.txt. A file at /subfolder/robots.txt is ignored by search engines. Each subdomain needs its own robots.txt.

Can I have multiple User-agent groups?

Yes — separate groups with a blank line. Specific user-agent rules take precedence over the * group. A crawler only follows its specific group, not a combination of its group and *.
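This "only its own group" behaviour can be observed with Python's urllib.robotparser (the paths and crawler names here are illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /beta/

User-agent: *
Disallow: /internal/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot follows ONLY its own group, so /internal/ stays open to it.
print(rp.can_fetch("Googlebot", "https://example.com/beta/x"))      # False
print(rp.can_fetch("Googlebot", "https://example.com/internal/x"))  # True
# Crawlers without a specific group fall back to the * group.
print(rp.can_fetch("Bingbot", "https://example.com/internal/x"))    # False
```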

Is robots.txt case-sensitive?

Directive names (User-agent, Disallow) are case-insensitive. Path values are case-sensitive on most servers — /Admin and /admin are different paths.

How quickly does Google update after a robots.txt change?

Google typically recrawls robots.txt within 24 hours. The robots.txt report in Search Console shows the last fetched version and lets you request a recrawl after a change.

What happens if there is no robots.txt?

Search engines treat a missing robots.txt the same as one that allows everything: the whole site is crawlable. A 404 on robots.txt is not a problem.

Generate a robots.txt file in seconds

Use our free robots.txt generator to build a correctly formatted file without manual editing.

Open robots.txt Generator