# robots.txt vs noindex
Confusing robots.txt with noindex is one of the most common SEO mistakes. One controls whether Google visits a page; the other controls whether Google shows it in search results.
| Feature | robots.txt | noindex |
|---|---|---|
| What it controls | Whether search engine crawlers can request a page at all. | Whether a crawled page appears in search results. |
| Format | Plain text file at the root of your domain (`/robots.txt`). | HTML meta tag (`<meta name="robots" content="noindex">`) or HTTP header. |
| Scope | Applies to crawlers — blocks the request before the page is read. | Applies to indexing — the page is crawled but excluded from results. |
| Can still be indexed? | Yes — Google can index a disallowed URL if other pages link to it. | No — noindex reliably removes the page from search results. |
| Passes PageRank | Links on blocked pages are not followed — no link equity passed. | Links on noindex pages can still be followed and pass PageRank. |
| Use for sensitive pages | Not reliable — page can still appear in results via external links. | Reliable — Google will remove the page from its index. |
| Response time | Immediate — robots.txt is checked before any crawl request. | Takes effect after Google next crawls the page (days to weeks). |
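Concretely, the two mechanisms look like this (the domain and paths are hypothetical examples, not a recommendation for any specific site):

```
# /robots.txt: keep all crawlers out of the internal search directory
User-agent: *
Disallow: /search/
```

```html
<!-- In the <head> of a page that should not appear in search results -->
<meta name="robots" content="noindex">
```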
## robots.txt Pros & Cons

### Pros
- Saves crawl budget — stops Google wasting time on unimportant pages
- Instantly prevents crawling of entire directories
- Useful for keeping crawlers out of internal search pages and infinite filter URLs
- Simple to implement — one text file
### Cons
- Does NOT prevent indexing — disallowed pages can still appear in Google
- Blocks PageRank flow through links on those pages
- Cannot reliably hide sensitive content from appearing in search results
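Well-behaved crawlers check robots.txt before every request. That check can be sketched with Python's standard `urllib.robotparser` module, here parsing a hypothetical rule set instead of fetching a live file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking an internal search directory
rules = """\
User-agent: *
Disallow: /search/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A compliant crawler skips blocked URLs and fetches everything else
print(rp.can_fetch("Googlebot", "https://example.com/search/widgets"))   # False
print(rp.can_fetch("Googlebot", "https://example.com/products/widget"))  # True
```

Note what this sketch does not do: it never downloads the blocked page, which is exactly why a `noindex` tag placed on a disallowed URL can never be read.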
## noindex Pros & Cons

### Pros
- Reliably removes pages from Google search results
- Links on noindex pages still pass PageRank
- Can be set per-page with fine-grained control
- Works via HTTP header — useful for PDFs and non-HTML content
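The HTTP-header variant is what makes noindex work for PDFs and other non-HTML files. A minimal sketch, assuming an nginx server (the location pattern is a hypothetical example):

```nginx
# Serve every PDF with an X-Robots-Tag header so it is crawled
# but excluded from search results; links inside are still followed
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex, follow";
}
```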
### Cons
- Page must be crawlable for noindex to be read and obeyed
- Takes time — Google needs to recrawl the page to de-index it
- Cannot reduce crawl budget — Google still visits the page
## Verdict
Use robots.txt to save crawl budget on low-value pages (search results, filters, duplicate content). Use noindex to reliably exclude pages from search results (thank-you pages, staging, admin pages). Never block a page in robots.txt AND add noindex — Google cannot read a noindex tag on a page it's not allowed to crawl, so the noindex is ignored.
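The conflict, sketched with a hypothetical path:

```
# robots.txt: this rule stops Googlebot from ever requesting the page...
User-agent: *
Disallow: /thank-you/

# ...so a noindex tag sitting on /thank-you/ is never fetched or read:
#   <meta name="robots" content="noindex">
# To de-index the page, remove the Disallow rule and let the tag be crawled.
```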