Bulk robots.txt tester

A robots.txt tester, validator, and checker in one. Test up to 100 URLs with Google-faithful parsing — RFC 9309, wildcards, AdsBot quirks, crawl-delay warnings, all batch-tested in one click. Brought to you by Wygard, the best SEO monitoring tool.

1 Robots.txt source

Fetch the file directly from a domain (we proxy the request to bypass CORS), or paste its contents below.

2 User-agent

Pick a crawler to simulate, or enter a custom user-agent token.

3 URLs to test

One URL per line. Up to 100 at a time. Relative paths resolve against the robots.txt host.

Results

Columns: #, URL, Status, Matched rule

Frequently asked questions

A quick checker for the things that usually trip people up.

What is a robots.txt file, and who reads it?

A plain-text file at the root of a domain (/robots.txt) that tells well-behaved crawlers which paths they can and can't request. It's a request, not a wall: malicious bots ignore it, while Google, Bing, and other reputable crawlers obey it. Rules apply only to the host, scheme, and port where the file lives, so https://example.com/robots.txt does not govern http://example.com/.

Why test it before going live?

One stray Disallow: / can deindex an entire site overnight. Migrations, staging leaks, accidental trailing slashes, and wildcard interactions regularly bite teams. This bulk tester and validator replays Google's matching logic across a batch of URLs so you can confirm exactly what will and won't be crawled before committing the change.

How do I find out when my robots.txt changes without my knowledge?

Use Wygard's robots.txt monitoring feature — you won't miss any change.

Does a Disallow entry also stop Google from indexing the URL?

No — Disallow only blocks crawling. Google can still index a disallowed URL (and show it in results without a snippet) based on external links pointing at it. To keep a page out of the index, use a noindex meta tag or the X-Robots-Tag HTTP header. Important catch: if a page is both disallowed in robots.txt and tagged noindex, Googlebot can't crawl the page to see the noindex — so the URL may still appear in results. Pick one, not both.

Does robots.txt work across subdomains and protocols?

No. Every (protocol, host, port) combination has its own robots.txt file. https://example.com/robots.txt does not govern https://m.example.com/, and it does not govern http://example.com/ either — HTTP and HTTPS are separate scopes. If you run a multi-subdomain site, every subdomain needs its own robots.txt served at its own root.
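As a small illustration of that scoping rule, the sketch below derives which robots.txt file governs a given URL. The function name robots_url_for is a hypothetical helper, not part of the tester.

```python
from urllib.parse import urlsplit


def robots_url_for(url: str) -> str:
    """Return the robots.txt URL that governs `url`.

    Scheme, host, and port are kept exactly as given, because each
    combination is a separate scope with its own file.
    """
    parts = urlsplit(url)
    # netloc carries both the host and any explicit port
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url_for("https://m.example.com/shop?page=2"))
print(robots_url_for("http://example.com:8080/a/b"))
```

Two URLs fall under the same rules only if this function returns the same string for both.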

How do * and $ wildcards work?

* matches any sequence of characters, including an empty one. $ anchors the match to the end of the URL. Disallow: /*.pdf$ blocks every URL ending in .pdf, but not /report.pdf?v=2. Disallow: /search blocks /search, /search?q=x, and /search/results — anything starting with that prefix.
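One way to picture the two wildcards is to translate a rule path into a regular expression. This is a minimal sketch, not the tester's actual code: the names pattern_to_regex and path_matches are assumptions, and percent-encoding normalisation and empty patterns are deliberately left out.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' becomes '.*'; every other character is matched literally
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Without a trailing '$' the rule is a prefix match, so no end anchor
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, as robots.txt rules do
    return pattern_to_regex(pattern).match(path) is not None


print(path_matches("/*.pdf$", "/docs/report.pdf"))
print(path_matches("/*.pdf$", "/report.pdf?v=2"))
print(path_matches("/search", "/search/results"))
```

The path passed in should include the query string, since that is what makes /report.pdf?v=2 fall outside the anchored rule.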

When Allow and Disallow both match, which wins?

The longer (more specific) rule wins, measured by path-pattern length. Disallow: /admin/ (7 chars) and Allow: /admin/public/ (14 chars) → /admin/public/page is allowed. When two rules tie in length, Allow wins. This is RFC 9309 §2.2.2.
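The precedence rule can be sketched as a comparison on (pattern length, is-Allow). The code below is an illustrative assumption about how such a matcher might look: decide and _matches are hypothetical names, and rules are passed as simple (kind, pattern) tuples.

```python
import re
from typing import List, Optional, Tuple


def _matches(pattern: str, path: str) -> bool:
    """Prefix match with '*' wildcards and an optional trailing '$' anchor."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), path) is not None


def decide(rules: List[Tuple[str, str]], path: str) -> Tuple[bool, Optional[str]]:
    """Return (allowed, winning pattern) for `path`.

    `rules` holds ("allow" | "disallow", pattern) pairs from one group.
    """
    best = None
    for kind, pattern in rules:
        # An empty pattern restricts nothing, so it never competes
        if not pattern or not _matches(pattern, path):
            continue
        # Longer pattern wins (length in octets); on a tie, Allow wins
        key = (len(pattern.encode("utf-8")), kind == "allow")
        if best is None or key > best[0]:
            best = (key, kind, pattern)
    if best is None:
        return True, None  # no rule matched: crawling is allowed
    return best[1] == "allow", best[2]


rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(decide(rules, "/admin/public/page"))
print(decide(rules, "/admin/secret"))
```

Note that the order of the lines in the file plays no part in the outcome; only length and rule type do.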

Does Google respect Crawl-delay?

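No. Crawl-delay is not part of RFC 9309, and Googlebot ignores it. The tester flags any Crawl-delay line so you don't rely on it for Google. Some other crawlers, Bingbot among them, do honour it, so the directive isn't useless, just not Google-facing. Support varies, so check each crawler's own documentation. Search Console's crawl rate setting was retired in early 2024; if you need Googlebot to slow down temporarily, Google's guidance is to return 500, 503, or 429 responses for a short period.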
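A check of this kind can be as small as a line scan. The following is a sketch under assumptions, not the tester's implementation; find_crawl_delay_lines is a hypothetical name.

```python
from typing import List, Tuple


def find_crawl_delay_lines(robots_txt: str) -> List[Tuple[int, str]]:
    """Return (line number, value) for every Crawl-delay directive."""
    hits = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        # Directive names are case-insensitive
        if sep and key.strip().lower() == "crawl-delay":
            hits.append((number, value.strip()))
    return hits


sample = "User-agent: *\nCrawl-delay: 10\nDisallow: /tmp/\n"
for number, value in find_crawl_delay_lines(sample):
    print(f"line {number}: Crawl-delay {value} is ignored by Googlebot")
```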

Why do AdsBot crawlers behave differently?

A handful of Google crawlers — AdsBot-Google, AdsBot-Google-Mobile, Mediapartners-Google — explicitly opt out of the wildcard User-agent: * group. They require their own named block. This is by design: Google didn't want a blanket Disallow: / to silently break ad-quality checks. The tester mirrors this behaviour so your results match production.
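The group-selection behaviour described above can be sketched as follows. This is a simplification with assumed names (select_group, IGNORES_WILDCARD) and an assumed data layout, a dict keyed by lower-cased user-agent token. It uses exact token matching only, which is cruder than a real parser's most-specific-match logic.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # ("allow" | "disallow", pattern)

# Crawlers named in the text as skipping the "User-agent: *" group
IGNORES_WILDCARD = {
    "adsbot-google",
    "adsbot-google-mobile",
    "mediapartners-google",
}


def select_group(groups: Dict[str, List[Rule]], user_agent: str) -> List[Rule]:
    """Pick the rule group that applies to `user_agent`."""
    token = user_agent.lower()
    if token in groups:
        return groups[token]  # a named block always applies
    if token in IGNORES_WILDCARD:
        return []  # no named block: these crawlers see no rules at all
    return groups.get("*", [])


groups = {"*": [("disallow", "/")]}
print(select_group(groups, "Googlebot"))
print(select_group(groups, "AdsBot-Google"))
```

With only a wildcard Disallow: / in the file, the AdsBot lookup returns an empty rule list, meaning nothing is blocked for it.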

How fast does a robots.txt change take effect for Google?

Not immediately. Google caches robots.txt for up to 24 hours, and may cache it longer when refreshing isn't possible. If you just tightened or loosened a rule, Googlebot may still use the previous version for most of a day. You can influence the cache window by setting a Cache-Control: max-age=… header on the robots.txt response. Search Console's robots.txt report always shows the exact copy Google is currently using — that's the source of truth for "is my latest version live yet?".

What happens with 4xx, 5xx, or redirects on the robots.txt itself?

4xx (except 429): Google treats the site as fully crawlable — no restrictions. That includes 404, 410, 401, 403 and any other client error apart from rate-limiting.

429 or 5xx: handled as a transient error. For the first 12 hours Google stops crawling and keeps retrying. For up to 30 days after that, Google falls back to the last cached robots.txt it successfully fetched. Past 30 days, if the rest of the site is reachable Google behaves as if there's no robots.txt; if the site is generally down, Google stops crawling.

3xx: Google follows at least 5 redirect hops, then gives up and treats the robots.txt as a 404, which means fully crawlable. Logical redirects (JavaScript, meta-refresh, HTML frames) are not followed. The tester also caps at 5 redirects.
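The three cases above can be condensed into one decision function. This is only a sketch of the described behaviour: the function name, parameter names, and the returned labels are all assumptions made for illustration.

```python
MAX_REDIRECT_HOPS = 5
RETRY_WINDOW_HOURS = 12
CACHE_WINDOW_HOURS = RETRY_WINDOW_HOURS + 30 * 24  # 30 days after the first 12 hours


def robots_fetch_policy(
    status: int,
    hours_failing: float = 0.0,
    site_reachable: bool = True,
    hops_followed: int = 0,
) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl policy label."""
    if 200 <= status < 300:
        return "parse-file"
    if 300 <= status < 400:
        # Past the hop limit the file is treated like a 404
        if hops_followed < MAX_REDIRECT_HOPS:
            return "follow-redirect"
        return "allow-all"
    if status == 429 or 500 <= status < 600:
        if hours_failing <= RETRY_WINDOW_HOURS:
            return "pause-crawl-and-retry"
        if hours_failing <= CACHE_WINDOW_HOURS:
            return "use-last-cached-copy"
        return "allow-all" if site_reachable else "stop-crawling"
    if 400 <= status < 500:
        return "allow-all"  # every 4xx except 429
    raise ValueError(f"unexpected HTTP status: {status}")


print(robots_fetch_policy(404))
print(robots_fetch_policy(503, hours_failing=48))
```

The 429 check comes before the generic 4xx branch on purpose, since 429 is the one client error handled as a transient failure.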

Is there a size limit?

Yes. RFC 9309 requires parsers to handle at least 500 KiB and lets them stop reading there, which is exactly what Google does. Anything past that cutoff is silently ignored by the real Googlebot. If your robots.txt is above 500 KiB, the tester flags it.
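A size check along these lines is easy to reproduce. The sketch below is an assumption about how it could be done, with hypothetical names; it works on raw bytes because the limit is a byte count, not a character count.

```python
from typing import Tuple

MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB


def truncate_to_limit(body: bytes, limit: int = MAX_ROBOTS_BYTES) -> Tuple[bytes, bool]:
    """Return the part of `body` a parser would read, plus an over-limit flag."""
    return body[:limit], len(body) > limit


oversized = b"Disallow: /x\n" * 50_000  # 650,000 bytes
kept, too_big = truncate_to_limit(oversized)
print(len(kept), too_big)
```

A cut at a fixed byte offset can land in the middle of a line, so a rule straddling the limit may be read only partially.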

Does this tool send my data anywhere?

All parsing and matching happens in your browser. The only server round-trip is through our proxy.php, which exists for one reason: browsers block cross-origin reads of robots.txt unless the target server sends CORS headers, and most don't. The proxy fetches the public file and returns its body. It doesn't log URLs, results, or anything else. If you paste the robots.txt content instead of fetching, nothing leaves your machine.