Question 1

What is a robots.txt file, and who reads it?

Accepted Answer

A plain-text file at the root of a domain (/robots.txt) that tells well-behaved crawlers which paths they can and can't request. It's a request, not a wall: malicious bots ignore it. Google, Bing, and other reputable crawlers obey it. Rules apply only to the host, scheme, and port where the file lives, so https://example.com/robots.txt does not govern http://example.com/.

Question 2

Why test robots.txt before going live?

Accepted Answer

One stray 'Disallow: /' can deindex an entire site overnight. Migrations, staging leaks, accidental trailing slashes, and wildcard interactions regularly bite teams. This bulk tester and validator replays Google's matching logic across a batch of URLs so you can confirm exactly what will and won't be crawled before committing the change.

Question 3

How do I find out if my robots.txt was changed without my knowledge?

Accepted Answer

Use Wygard's robots.txt monitoring feature so that no change goes unnoticed.

Question 4

Does a Disallow entry also stop Google from indexing the URL?

Accepted Answer

No — Disallow only blocks crawling. Google can still index a disallowed URL (and show it in results without a snippet) based on external links pointing at it. To keep a page out of the index, use a noindex meta tag or the X-Robots-Tag HTTP header. Important catch: if a page is both disallowed in robots.txt and tagged noindex, Googlebot can't crawl the page to see the noindex — so the URL may still appear in results. Pick one, not both.
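
The interaction between Disallow and noindex can be summarised as a small decision table. The sketch below is illustrative only: the function name and the outcome strings are hypothetical, and it encodes just the four cases described in this answer.

```python
def indexing_outcome(disallowed: bool, noindex: bool) -> str:
    """Describe what this answer says happens for each combination.

    disallowed: the URL is blocked by a Disallow rule in robots.txt.
    noindex:    the page carries a noindex meta tag or X-Robots-Tag header.
    """
    if disallowed and noindex:
        # The crawler never fetches the page, so it never sees the noindex.
        return "conflict: noindex is never seen, URL may still be indexed from links"
    if disallowed:
        return "not crawled: URL may still be indexed from links, without a snippet"
    if noindex:
        return "crawled: noindex is seen, page is kept out of the index"
    return "crawled: page is eligible for indexing"


if __name__ == "__main__":
    for disallowed in (False, True):
        for noindex in (False, True):
            print(disallowed, noindex, "->", indexing_outcome(disallowed, noindex))
```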

Question 5

Does robots.txt work across subdomains and protocols?

Accepted Answer

No. Every (protocol, host, port) combination has its own robots.txt file. https://example.com/robots.txt does not govern https://m.example.com/, and it does not govern http://example.com/ either — HTTP and HTTPS are separate scopes. If you run a multi-subdomain site, every subdomain needs its own robots.txt served at its own root.
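
To find which robots.txt file governs a given URL, keep the scheme, host, and port and replace everything else with /robots.txt. A minimal sketch using Python's standard library follows; the helper names are hypothetical, and it deliberately does not normalise default ports or host casing (so https://example.com:443/ and https://example.com/ would compare as different here).

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL whose rules apply to page_url."""
    parts = urlsplit(page_url)
    # netloc carries the host and any explicit port; path, query and
    # fragment are dropped because robots.txt always lives at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def same_scope(url_a: str, url_b: str) -> bool:
    """True when both URLs are governed by the same robots.txt file."""
    return robots_url_for(url_a) == robots_url_for(url_b)


if __name__ == "__main__":
    print(robots_url_for("https://m.example.com/products?page=2"))
    print(same_scope("https://example.com/", "http://example.com/"))
```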

Question 6

How do the * and $ wildcards work in robots.txt?

Accepted Answer

* matches any sequence of characters, including an empty one. $ anchors the match to the end of the URL. 'Disallow: /*.pdf$' blocks every URL ending in .pdf, but not '/report.pdf?v=2'. 'Disallow: /search' blocks /search, /search?q=x, and /search/results — anything starting with that prefix.
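
One way to reason about these wildcards is to translate a rule path into a regular expression. The sketch below is an approximation under stated assumptions, not the tester's actual code: the function names are hypothetical, the path passed in is expected to include the query string, and percent-encoding normalisation is left out.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' so that
    # every '*' matches any sequence of characters, including none.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Without '$' the rule is a prefix match; with it, the match must
    # reach the end of the path.
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """True when the rule pattern matches the path (path plus query)."""
    # re.match anchors at the start of the path, as robots.txt rules do.
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    for path in ("/report.pdf", "/report.pdf?v=2"):
        print(path, matches("/*.pdf$", path))
    for path in ("/search", "/search?q=x", "/search/results"):
        print(path, matches("/search", path))
```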

Question 7

When Allow and Disallow both match, which rule wins?

Accepted Answer

The longer (more specific) rule wins, measured by path-pattern length. 'Disallow: /admin/' (7 chars) and 'Allow: /admin/public/' (14 chars) → /admin/public/page is allowed. When two rules tie in length, Allow wins. This is RFC 9309 §2.2.2.
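
The precedence rule can be sketched in a few lines. To keep the example short it assumes plain prefix rules with no wildcards and measures length in characters, whereas RFC 9309 counts octets; the `Rule` type and `is_allowed` function are hypothetical names.

```python
from typing import List, Tuple

# (directive, path prefix), for example ("Disallow", "/admin/")
Rule = Tuple[str, str]


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Apply longest-match precedence; Allow wins when lengths tie."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        # An empty rule path matches nothing; skip non-matching prefixes.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and is_allow
        if longer or tie_for_allow:
            best_len = len(prefix)
            allowed = is_allow
    return allowed


if __name__ == "__main__":
    rules = [("Disallow", "/admin/"), ("Allow", "/admin/public/")]
    for path in ("/admin/public/page", "/admin/secret", "/blog"):
        print(path, is_allowed(rules, path))
```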

Question 8

Does Google respect the Crawl-delay directive?

Accepted Answer

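No. Crawl-delay is not in RFC 9309, and Googlebot ignores it. The tester flags any Crawl-delay line so you don't rely on it for Google. Some other crawlers, Bing among them, do honour it, so the directive isn't useless, just not Google-facing. Search Console's crawl rate limiter was retired in January 2024; if you need Google to slow down for a short period, Google's documented option is to return 500, 503, or 429 responses, which makes Googlebot reduce its crawl rate.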
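
Flagging Crawl-delay lines is a simple scan over the file. The following is a rough sketch of the idea rather than the tester's actual implementation; the function name is hypothetical.

```python
from typing import List


def find_crawl_delay_lines(robots_txt: str) -> List[int]:
    """Return the 1-based line numbers that carry a Crawl-delay directive."""
    flagged = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        # Drop anything after '#': comments are not directives.
        content = line.split("#", 1)[0]
        if ":" not in content:
            continue
        # Field names are case-insensitive and may have surrounding spaces.
        field = content.split(":", 1)[0].strip().lower()
        if field == "crawl-delay":
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    sample = "User-agent: *\nCrawl-delay: 10\nDisallow: /tmp/\n"
    print(find_crawl_delay_lines(sample))
```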

Question 9

Why do AdsBot crawlers behave differently from Googlebot?

Accepted Answer

A handful of Google crawlers — AdsBot-Google, AdsBot-Google-Mobile, Mediapartners-Google — explicitly opt out of the wildcard 'User-agent: *' group. They require their own named block. This is by design: Google didn't want a blanket 'Disallow: /' to silently break ad-quality checks. The tester mirrors this behaviour so your results match production.
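
Group selection with this opt-out could look roughly like the sketch below. It is a simplification under assumptions: groups are represented as a hypothetical dict from user-agent token to rule lines, tokens are compared by exact case-insensitive match, and the opt-out set contains only the three crawlers named above.

```python
from typing import Dict, List, Optional

# Crawlers that, per the answer above, ignore the 'User-agent: *' group.
WILDCARD_OPT_OUT = {"adsbot-google", "adsbot-google-mobile", "mediapartners-google"}


def select_group(user_agent: str, groups: Dict[str, List[str]]) -> Optional[List[str]]:
    """Pick the rule group that applies to user_agent, or None if none does."""
    token = user_agent.lower()
    named = {name.lower(): rules for name, rules in groups.items()}
    if token in named:
        return named[token]  # a named block always takes priority
    if token in WILDCARD_OPT_OUT:
        return None  # no named block: the '*' group is not applied
    return named.get("*")  # everyone else falls back to the wildcard group


if __name__ == "__main__":
    groups = {"*": ["Disallow: /"], "AdsBot-Google": ["Disallow: /private/"]}
    for agent in ("Googlebot", "AdsBot-Google", "Mediapartners-Google"):
        print(agent, select_group(agent, groups))
```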

Question 10

How fast does a robots.txt change take effect for Google?

Accepted Answer

Not immediately. Google caches robots.txt for up to 24 hours, and may cache it longer when refreshing isn't possible. If you just tightened or loosened a rule, Googlebot may still use the previous version for most of a day. You can influence the cache window by setting a Cache-Control: max-age=… header on the robots.txt response. Search Console's robots.txt report always shows the exact copy Google is currently using — that's the source of truth for 'is my latest version live yet?'.

Question 11

What happens with 4xx, 5xx, or redirects on the robots.txt file itself?

Accepted Answer

4xx (except 429): Google treats the site as fully crawlable, with no restrictions. That includes 404, 410, 401, 403 and any other client error apart from rate-limiting.

429 or 5xx: handled as a transient error. For the first 12 hours Google stops crawling and keeps retrying. For up to 30 days after that, Google falls back to the last cached robots.txt it successfully fetched. Past 30 days, if the rest of the site is reachable Google behaves as if there's no robots.txt; if the site is generally down, Google stops crawling.

3xx: Google follows at least 5 redirect hops, then treats the target as 404. Logical redirects (JavaScript, meta-refresh, HTML frames) are not followed. The tester also caps at 5 redirects.
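
The status handling described above can be condensed into one decision function. This is a sketch of the described behaviour only: the function name, parameters, and outcome labels are hypothetical, the case where no cached copy exists is not modelled, and the redirect cap is fixed at 5 to mirror the tester.

```python
MAX_REDIRECT_HOPS = 5
RETRY_WINDOW_HOURS = 12
CACHE_WINDOW_HOURS = RETRY_WINDOW_HOURS + 30 * 24  # 30 days after the first 12 hours


def robots_fetch_policy(
    status: int,
    hours_failing: float = 0.0,
    site_reachable: bool = True,
    redirect_hops: int = 0,
) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl decision.

    hours_failing: how long 429/5xx responses have persisted.
    redirect_hops: redirects already followed before this response.
    """
    if 200 <= status < 300:
        return "parse"
    if 300 <= status < 400:
        # Past the hop cap the fetch is treated like a 404.
        return "follow-redirect" if redirect_hops < MAX_REDIRECT_HOPS else "allow-all"
    if 400 <= status < 500 and status != 429:
        return "allow-all"
    if status == 429 or 500 <= status < 600:
        if hours_failing <= RETRY_WINDOW_HOURS:
            return "pause-and-retry"
        if hours_failing <= CACHE_WINDOW_HOURS:
            return "use-last-cached"
        return "allow-all" if site_reachable else "stop-crawling"
    raise ValueError(f"unexpected HTTP status: {status}")


if __name__ == "__main__":
    print(robots_fetch_policy(404))
    print(robots_fetch_policy(503, hours_failing=48))
```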

Question 12

Is there a size limit for robots.txt?

Accepted Answer

Yes. RFC 9309 requires parsers to support at least 500 KiB, and Google enforces exactly that limit. Anything past the cutoff is silently ignored by the real Googlebot. If your robots.txt is above 500 KiB, the tester flags it.
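
Applying the cutoff before parsing is straightforward. A minimal sketch, assuming the limit is counted in bytes of the response body; the names are hypothetical.

```python
from typing import Tuple

MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB


def apply_size_limit(body: bytes) -> Tuple[bytes, bool]:
    """Return the part of the body that would be parsed, plus an oversize flag."""
    oversize = len(body) > MAX_ROBOTS_BYTES
    # Everything past the cutoff is dropped, so rules there never apply.
    return body[:MAX_ROBOTS_BYTES], oversize


if __name__ == "__main__":
    kept, oversize = apply_size_limit(b"Disallow: /tmp/\n" * 40000)
    print(len(kept), oversize)
```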

Question 13

Does this tool send my data anywhere?

Accepted Answer

All parsing and matching happens in your browser. The only server round-trip is through our proxy.php, which exists for one reason: browsers block cross-origin requests for robots.txt unless the remote server sends CORS headers. The proxy fetches the public file and returns its body. It doesn't log URLs, results, or anything else. If you paste the robots.txt content instead of fetching, nothing leaves your machine.

Bulk robots.txt tester
